The second is event processing: we need to process those events one by one and enrich them. Third, we need to store those events, and finally visualize them in a visualization layer. I am going to discuss these four components one by one.

First is event flow. A general event flow looks something like this: you have a set of producers, you have a set of consumers which are interested in consuming those events, and you need a queuing message broker or some queuing solution in between which can facilitate this event flow. The requirements for this solution are: it should provide low latency and high throughput. It should be fault tolerant, so if there are any failures on the producer or consumer side, it should be able to handle them. It should provide message delivery guarantees, such as ordering of the messages — for example, which event happened before which — since ordering needs to be maintained in some cases. There are also the delivery guarantees of at-least-once, exactly-once or at-most-once, and which one you need varies from use case to use case. The last one is scalability: it needs to scale to handle billions of events per second.

The solution I am going to discuss here for event flow is Apache Kafka, which is more or less the de facto standard in the industry these days and is used by many big companies. The way Kafka works is that it has a set of brokers, and these brokers hold partitions for different topics. Each topic is divided into multiple partitions; there are multiple producers, and each producer can produce events to multiple partitions. Events are stored in a partition in sequence, and each event is identifiable by an offset, which gives you an ordering guarantee within a particular partition. You can also have multiple consumers, which consume events from these partitions in sequence, and there can be multiple consumer groups consuming those events. Each consumer also tracks its own offset, so that if there is a failure on the consumer side it can come back up and start reading messages from the last known good offset.

To summarize, Apache Kafka provides low latency and high throughput. It has at-least-once message delivery guarantees, and the Kafka team also introduced exactly-once guarantees in their latest release last month. It has a reliable design to handle failures, with message acknowledgments between the producers and brokers; you can configure data replication across brokers to handle broker failures; and consumers can read from any desired offset and keep track of their own offsets.

The next layer is event processing, where we want to process the events. There is some source where the events are happening; we want to consume those events, process them, and then produce them to another topic — a consume-process-produce pattern. What we want to achieve from this is to enrich and transform those event streams and apply some business logic: for example, filter nulls in our Wikipedia edit stream case and enrich those events by adding geolocation information. You might also want to apply some windowing logic, or join multiple streams into a single one. This layer also needs to handle failures and scale.
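Before getting to the specific processing frameworks, here is a minimal sketch of that consume-process-produce pattern in Python, using the kafka-python client. The broker address, topic names, event fields and the lookup_geo helper are all assumptions made for illustration; the talk's actual pipeline used Kafka Streams, a Java library, which is discussed next.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

def lookup_geo(ip):
    """Hypothetical IP-to-geolocation lookup; a real pipeline would call a geo database."""
    return {"city": "unknown", "country": "unknown"}

consumer = KafkaConsumer(
    "wikipedia-raw",                              # assumed name of the raw events topic
    bootstrap_servers="localhost:9092",           # assumed broker address
    group_id="wikipedia-enricher",                # consumers in one group share the partitions
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Consume -> process -> produce, one event at a time.
for record in consumer:                           # each record carries its partition and offset
    event = record.value
    if not event.get("page") or not event.get("user_ip"):
        continue                                  # "filter nulls": drop incomplete events
    event.update(lookup_geo(event["user_ip"]))    # enrich with geolocation information
    producer.send("wikipedia-enriched", value=event)  # hand off to the topic the data store reads
```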
There are many, many solutions out there in the market — for example Apache Samza, Spark, Flink, Apex, Kafka Streams and Storm. The one I am going to discuss here is Kafka Streams, which works very well with Kafka. Kafka Streams is a lightweight streaming library that ships with the Kafka releases. It processes one event at a time. It also has operators for stateful processing, like windowing, joining and aggregation operators. It uses a local state, which uses RocksDB underneath to store the events, and this local state is backed by a Kafka changelog topic, which is used for failure recovery: if you restart your Kafka Streams application, or if there is any failure and it gets restarted, it can read that changelog from Kafka, replay those events and rebuild the local state. You can also scale it and run it in a distributed and fault-tolerant manner. If you compare it with a plain Kafka consumer, it is higher level, which makes it faster to build a sophisticated app, at the cost of a little less control over very fine-grained consumption; if you need that, you can use the low-level consumer APIs.

For the Wikipedia edit stream, here is a simple piece of code which does the processing in three steps. The first step reads the data from the raw Wikipedia stream we are getting from Wikipedia and builds a Kafka stream object. In the second step, it parses those events, matches them against a pattern, and parses them into a Java object named WikipediaMessage. It then maps and processes these events, adds geolocation information to them, and filters out any events which are, for example, null or empty strings. Finally, it produces those events to an enriched Wikipedia edits topic, from where our data store can consume them.

The next piece is the data store, which is also one of the most critical pieces in this pipeline, because it has to power an interactive dashboard. It needs to meet multiple requirements. First, it needs to be able to ingest streaming data, because we are talking about creating dashboards on data streams, so it needs to be able to consume events arriving in a streaming fashion, and it needs to make those events available for queries as soon as they are ingested into the data store. The second requirement is that, because we are powering dashboards, it needs to provide sub-second query response times so that those dashboards feel interactive. And because we don't know in advance what the user is going to query, it needs to support arbitrary slicing and dicing of the data. Also, since we are visualizing the data, what we are interested in is providing summarized and aggregated views of the data. And the same requirements of scalability and high availability need to be there.

The solution I am going to discuss for this is Druid, which I work on. Druid is a column-oriented distributed data store which can provide sub-second query times. It has support for both real-time streaming ingestion and batch ingestion, so you can do batch ingestion via Hadoop or Spark, and also do streaming ingestion from multiple data sources.
You can either push data to Druid or pull data from Kafka or any other message broker. It supports arbitrary slicing and dicing of your data. It uses concepts like automatic data summarization and approximate algorithms to provide fast query response times. It has been scaled to petabytes of data in production and is highly available.

Now I am going to discuss the suitable use cases, how Druid handles this much workload, how it works internally and how it is able to provide those sub-second query times. The suitable use cases are powering interactive, user-facing applications, which is what we are discussing in this talk; arbitrary slicing and dicing of large data sets; and user behavior analysis, like measuring distinct counts — how many unique users visited my website in the last week or the last month — retention analysis, like how many of my users were retained this week compared to the previous week, funnel analysis, A/B testing, and those kinds of use cases. What it is not very good at is dumping the entire data set. It is good if you are trying to get an aggregated or summarized view of your data in the form of queries, but if you are trying to dump the entire data set, since it is columnar in nature, it is not very good at that.

Coming to the storage internals and how Druid internally stores the data: Druid stores data in the form of segment files, which are partitioned by time. For example, in this figure I have my segments partitioned by day, so for Monday I have one segment and for Tuesday I will have a different segment. Ideally the segment sizes are smaller than about one gigabyte; if your data for a particular day is larger than that, you can create multiple shards for that day — for example, on Friday we have two different shards. This time-based slicing gives Druid the ability to prune segments based on the time range you are querying. For example, if I am only interested in visualizing events which happened in the last week, it can scan only those segments which correspond to the last week's time range and give you the results, rather than scanning all the segments.

Coming back to our Wikipedia example, this is how our data looks after being enriched, when it is sent to the data store. The data has a timestamp of when the edit happened, which is the first thing. Then it has some dimensions — for example, which page was edited, in which language the edit was made, and in which city and country the edit was made. Then there are some metrics, like how many characters were added and how many characters were deleted in this particular edit event.

Now I will discuss the various techniques that Druid uses in order to provide fast query times. The first technique is data roll-up, or data summarization. By that I mean: if in your use case you are only interested in querying your data aggregated by hour, you can tell Druid at ingestion time that you are only interested in aggregating data by hour or any coarser granularity.
You are never interested in querying data by, say, each minute, or looking at events which happened in each individual minute. In that case, what Druid can do for you is summarize your data. For example, take these first three rows: since they all happened in the same hour, and you are always going to query the data aggregated by an hour or a coarser granularity, it will summarize all three rows into a single row and store that there were three edits made, along with the sum of characters added and deleted, the min and max added, and so on — you can define all the metrics that you want computed at ingestion time. This reduces the data size that needs to be scanned at query time, by many folds depending on how well your data rolls up, and that helps give faster query times.

The second technique it uses is dictionary encoding. Instead of storing the raw columns as they are, it creates a dictionary encoding for these columns: it assigns IDs to each of the values of a column. For example, for this page column we have Justin Bieber, Kesha and Selena Gomez, so it will assign them IDs of 0, 1 and 2. Instead of storing the repeated strings, what it essentially stores is the dictionary and then the column data, which is just a sequence of those IDs, compressed using different compression techniques. Each of the string columns is encoded using this dictionary encoding technique.

The third thing it uses is bitmap indexes. It creates a bitmap index for each value that is stored. For example, Justin Bieber appears in the first three rows, so it will create a bitmap index with the value 1 wherever Justin Bieber is present and 0 in the rows where it is not present. Similarly, it creates these bitmap indexes for all the values in your column. Now, when you try to query and filter your data — for example, I query for any row which has either Justin Bieber or Kesha — since it already knows in which rows Justin Bieber is present and in which rows Kesha is present, it can just do a bitmap OR to compute the final result set, and then it only scans those rows which actually need to be scanned to answer your query. So the data filtering is essentially just some bitmap OR and AND operations instead of row scans. These bitmap indexes are compressed with Concise or Roaring encoding to reduce their size even further. (A toy sketch of the dictionary-encoding and bitmap-index ideas appears below.)

The next technique it uses is approximate sketch columns. Generally you have things like user IDs in your click data, or long page URLs, and in most cases you are not interested in looking at those individual user IDs or what individual users are doing, but in getting some aggregate information, like how many unique users visited my website. For those unique-user and distinct-count or retention-analysis kinds of questions you need not store the exact values. What you can do is store approximate sketches, which can be HyperLogLog or any other data sketch that supports these types of computations. These data sketches reduce the size of your data even further.
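Here is a toy, pure-Python sketch of the dictionary-encoding and bitmap-index ideas just described. It is only meant to illustrate the concept; Druid's real implementation additionally compresses the encoded column and the bitmaps (with Concise or Roaring).

```python
# One string column ("page"), as in the example above.
rows = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Kesha", "Kesha", "Selena Gomez"]

# Dictionary encoding: store each distinct value once and keep the column as small integer ids.
dictionary = {value: i for i, value in enumerate(dict.fromkeys(rows))}
encoded_column = [dictionary[v] for v in rows]

# Bitmap index: one bitmap per distinct value; bit i is 1 when row i holds that value.
bitmaps = {value: [1 if v == value else 0 for v in rows] for value in dictionary}

# Filtering "page = Justin Bieber OR page = Kesha" is just a bitwise OR of two bitmaps;
# only the rows whose bit is set ever need to be scanned.
matches = [a | b for a, b in zip(bitmaps["Justin Bieber"], bitmaps["Kesha"])]
rows_to_scan = [i for i, bit in enumerate(matches) if bit]

print(encoded_column)   # [0, 0, 0, 1, 1, 2]
print(matches)          # [1, 1, 1, 1, 1, 0]
print(rows_to_scan)     # [0, 1, 2, 3, 4]
```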
So instead of storing the raw column values, you just store a sketch object, which can do computations like distinct counts — for example, if you store HyperLogLog you can do approximate distinct counts and things like that. To summarize the approximate-algorithms part: you store sketch objects instead of raw column values, which also gives you better roll-up, because instead of storing individual values you are storing a single aggregated object for that column, so you get reduced data size. The use cases are approximate distinct counts, approximate histograms, and funnel or retention analysis. There are certain limitations to storing these sketch objects: obviously, since you are not storing the raw data, you cannot do exact counts this way, and you also cannot filter on individual raw values if you are using approximate columns.

Coming to the Druid architecture — this is a diagram showing it. First of all, we have streaming data coming in. Then there is an ETL system which is massaging, enriching and transforming those events. Finally, those events are consumed by the real-time indexing tasks. What these real-time indexing tasks do is keep an in-memory state of the events as soon as they receive them: they build a write-optimized in-memory data structure which can be used to serve queries. At this point, if someone asks for the data, it can already be returned in query results. Periodically, the real-time indexing tasks convert that write-optimized data structure into a read-optimized, immutable data structure — the final Druid segments — and hand those segments over to deep storage. That is, they persist the segments to some distributed storage, which can be S3, HDFS or any other network file system; it just needs to be accessible to all the nodes.

There is another set of nodes called the historical nodes, which load these data segments from the deep storage, memory-map them, and serve queries on top of them. So periodically the data is handed over from the real-time nodes to the historical nodes, and the historical nodes then handle the historical data, which is immutable in nature, and serve queries on top of it. The historical nodes are the main workhorses of a Druid cluster.

There is another set of nodes called the broker nodes, which keep track of all the segments and which nodes in the cluster those segments live on. They have the ability to scatter and gather: they break your query up across segments, send the pieces to the relevant real-time and historical nodes, gather the results back from those nodes and return them to your query layer.

Just to trace the flow of an event: an event happens, it is enriched and then sent to the real-time nodes. If a query comes at this point, the event will be served, and visible on your dashboard, via the real-time nodes. After a certain period of time, the real-time nodes hand the segment containing it over to deep storage, from where it is loaded onto the historical nodes.
Now, as soon as the historical nodes load the data, they announce that they have loaded that particular segment, and the real-time nodes drop it from their memory. From then on, if a query comes, it can be served from the historical nodes themselves.

Some quick facts about Druid's performance and scalability in production. The largest cluster is hosted by Metamarkets, which is also the company that originally started Druid; it ingests around 300 billion events per day. Jolata uses Druid for computing around 1 billion metrics per minute. The largest cluster, at Metamarkets, is around 200 nodes. And the largest ingestion rate in terms of data size is reported by Netflix, at around two terabytes per hour. Here is a list of some companies which use Druid in production — a bunch of different companies, at various scales and sizes.

The next layer is the visualization layer. Now that we have the data in a data store which can answer queries in sub-second times, we want to create dashboards, visualize those events and analyze them. The requirements here are: we need a visualization layer with rich dashboarding capabilities. In most organizations we do not have data in a single data source — data is spread across multiple sources — so it should be able to query multiple data sources. Since it is a user-facing application, it needs security and access control. It also needs to be extensible enough that you can customize it for your own use cases.

The solution I am going to discuss for the visualization layer is Superset, which is a Python-based Flask application that has authentication and uses pandas for rich analytics. It uses SQLAlchemy as its SQL toolkit. For the front end it uses React and NVD3, and it also has a deep integration with Druid, so it can generate optimized native Druid queries, get the results back and serve them in the dashboard (a hand-written sketch of such a native query appears a little further below).

Just to show some of the dashboarding capabilities: it has support for various visualizations such as treemaps and sunbursts — it has many, many visualization types. In fact, I can show you some of the other sample dashboards which have those visualizations. For example, this is a World Bank data dashboard, which has world health and population data. You can see there are map visualizations; there is this sunburst visualization showing a breakdown of the rural areas; you can have line charts, tables, stacked charts, bubble charts — lots and lots of visualizations. It also provides an explore view where you can see all the visualizations that are available. These are all the visualizations that Superset supports today, and it is extensible, so if you want you can add your own custom visualizations. It is very easy to create great dashboards with Superset on multiple data sources: you can have your data sources defined and connect multiple of them. For example, I have one SQLite database among the data sources here.
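Picking up the Druid integration point from above: a Druid-backed Superset chart ultimately turns into a native Druid JSON query against the broker. As a rough, hand-written illustration of what such a query looks like — the broker address, datasource name, column names and interval below are assumptions, not values from the talk:

```python
import json
import requests  # assumes the requests library is available

# "Edits per hour from India over one week" as a native Druid timeseries query.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia-edits",                      # assumed datasource name
    "granularity": "hour",
    "intervals": ["2017-06-01/2017-06-08"],
    "filter": {"type": "selector", "dimension": "country", "value": "India"},
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"},
        {"type": "longSum", "name": "chars_added", "fieldName": "added"},
    ],
}

# The broker scatters this to the relevant real-time and historical nodes,
# gathers the partial results and returns a single JSON array of hourly rows.
response = requests.post(
    "http://broker:8082/druid/v2/",                       # assumed broker host; 8082 is the usual port
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for row in response.json():
    print(row["timestamp"], row["result"])
```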
I have these different data sources connected to this particular Superset instance. Coming back to the slides: it has lots and lots of visualizations which you can use and customize for your own use cases to create dashboards.

To summarize how this whole Wikipedia dashboard was working: essentially we had the Wikipedia streams coming in. We were using a simple Java application, built with the Kafka Connect APIs, to dump all the raw events we were getting into a Kafka topic named wikipedia-raw. Then we were using Kafka Streams — we created an IP-to-geolocation processor in Kafka Streams which was reading these events from Kafka and writing them to another topic with the enriched Wikipedia edits. From that topic we were pulling data into Druid using Druid's Kafka indexing service, and finally we were querying Druid to visualize all those events in Superset. These are the links to the project websites; you can go there if you need additional information about any of these projects.

With this I will end my talk. I would also like to announce that there is an off-the-record session on experiences and challenges in working with Druid. So if any of you are using Druid in production and are facing any issues, or are interested in knowing the future roadmap and discussing the challenges you have faced, please join me tomorrow at 3:25 p.m. in Room 1 for further discussion. And now I am open for questions.

Hello — yeah, thanks for the session. Can you hold the mic closer? Okay, is it better now? Yeah, okay, thanks. So Nishant, two questions on Druid really. You mentioned how you would decide on a roll-up frequency as a design element, and the data gets rolled up on that slice you have taken, right? Often there are scenarios where you want the roll-up but you also want access to the raw data. Are you able to model those types of scenarios with Druid as well, where you want the roll-up to happen as well as the raw data? That's correct, yeah.

So essentially it is not a cube-building solution where you build multiple cubes. What you can do is keep your data not rolled up by any granularity, and then, as queries come in, the brokers also have a concept of caching: they can cache your aggregated results for each segment, and if you re-query that particular aggregated data set, they can look it up in the cache rather than actually processing your query again. This caching layer helps with those kinds of scenarios, where you want to query at multiple granularities but do not want to roll up to a particular granularity — though you will pay in storage size for the extra storage needed to keep the raw data. Next question.

How do you choose between Apache Kafka Streams and Flink? So I am not going to compare all these ETL engines, and in particular I have not worked with Flink, so I cannot comment on that. But talking about Kafka Streams, you can have your processing in multiple stages.
That is pretty good if you are moving towards microservices, containers or any of those kinds of solutions — you can use it there.

Take, for example, an e-commerce website with different users visiting it. You get those user IDs in each of the events, and those user IDs are of the order of billions in cardinality. In most cases you are not interested in knowing that this particular user visited this page; what you are interested in knowing is how many users visited my e-commerce website in the last week. Those kinds of questions can be answered by these data sketches. One example is HyperLogLog: you can create HyperLogLog objects and store them. The second is the DataSketches library created by Yahoo. If I compare the two, HyperLogLog is not very good at doing intersections — it is good at unions but not at intersections — and for intersections the DataSketches library contributed by Yahoo works pretty well. Intersections are needed for use cases like retention analysis and things like that.

Hi, I am starting on a similar activity, so this session is really helpful for me. I have two questions. One is: how does it compare to the ELK stack, if I look at the whole framework? Honestly, I have not done any benchmarks myself, but there were some benchmarks done by the team at Netflix against the ELK stack. What they found was that for search kinds of use cases the ELK stack was pretty good, but if you are interested in doing aggregations, it needed many more resources and became much more costly, both in query latency and in hardware requirements. There is a complete post on the Druid user forum on this topic, so you can search for that and for the exact numbers they got.

Okay, and does Superset allow you to change the query — how difficult is it to write a fresh query? It is pretty easy, actually. Superset has this concept of slices: you just select the data source, select your visualization — for example a pie chart — and then pass in some group-by keys; for example, I want to group by country name, and then run that query. That is one way, and the second way is SQL Lab: Superset also has a SQL Lab where you can write your SQL query and get your results.

One last question. Is it good for mutable data — for example transactional data, which has some columns that change? It is good for OLAP use cases, not OLTP use cases where you have transactional flows. Right now it treats your past historical data as immutable, and if you want to make edits or changes, what you essentially need to do is reprocess the data for that particular interval. Since the data is time-sliced, you can replace segments for any selected interval. For example, say there was some bug and I want to reprocess events for the last week: I can just reprocess that in a batch fashion, via a Hadoop MapReduce job or via Spark, and create newer segments which will replace the older ones. That may be good for a one-off situation where you know things have gone bad.
But if something is changing forever — for example, the last, say, two months of data always keeps changing — then maybe this might not be a good fit? Can you define that more? For example, I have a transaction which goes through different statuses — transaction initiated, then in process, success, failure — and I want to know the current state. There are multiple ways to model such a data set. You can model it in terms of event streams: a transaction at this time, with its old state and new state. If you have that as an event stream, it now becomes immutable data, right? If you want to query the final state, you can just query the latest event; by filtering on the latest event you get the latest state, and with this model you also get the historical states. So instead of just storing the current state, you can store the old state and new state in an event-based way — there are ways to model such data sets too. Thank you. And if there are any further questions, we can take them tomorrow in the off-the-record session — yes, we can take this discussion offline. Thanks a lot.

We will now take a break for tea and be back here by 12:05 p.m. sharp. A few short announcements: there are feedback forms at your seats. Your feedback helps us build these conferences better, so please fill in the feedback forms and drop them at the counter; there are bags for the feedback forms. Also, you can pay for your food at the vendors by using Paytm, or you can buy tokens from the token counter. We accept card and cash, but cash is not accepted by the vendors directly, so you can either pay the vendors by Paytm or by token, and you can buy tokens from the HasGeek counters. The merchandise counter in the lobby is open from 10:30 a.m. to 5 p.m.; you can pick up your prepaid t-shirts there. Also, we are inviting flash talks today from 5:10 p.m. to 5:40 p.m. It could be anything — a five-minute presentation about an open source project you have worked on, some tips and tricks, some experience you would like to share with people. The code of conduct is on the HasGeek website and should also have been emailed to you along with your ticket; do go ahead and read it and make sure you are not violating any of it. To submit a flash talk proposal, you can contact me as the hall manager or any of the three people sitting in the front row on the right side. Thank you. We return here at 12:05 p.m. sharp for a talk by Gaurav Godwani from DataKind. He is going to talk about transforming India's budgets into open linked data. Over to Gaurav.

So, good afternoon everyone. Today I am going to discuss how to transform India's budgets into open linked data. Let's see what budgets mean. Budgets reflect the priorities and values of the people of a state. They tell you about the promises of the government, and how far those have been achieved. They have been called moral documents; they have been referred to as moral documents across several geographies. There is one story linked here from Vox.com, a data journalism website, which talks about how in the US they are considered moral documents. But in India the state of budgets is something like this: they are very hard to access and very difficult to comprehend. That makes it really difficult for us to analyze them in time and set our priorities accordingly — in terms of analysis and reporting, and seeing where the government's priorities fit in.
On your left you can see a budget of the BBMP, and on the right is the budget of the Karnataka state government. You can see the disparity in the format and the structure, and the difficulties involved. The major issues with India's budgets are: most of them are unstructured PDF documents, difficult to parse and difficult to analyze. There is limited availability of budgets online — for example, the Tamil Nadu government doesn't publish budgets older than a year on its website, so if I have to go back and see what was happening in the last two years or so, I can't do that; that's a major issue. Inconsistent formats: each of these government bodies keeps changing its formats whenever it changes a vendor, makes a new addition in terms of policy, or there is a major change in terms of politics, and so on. No metadata: none of these websites give any metadata, so we don't have detailed information about the currency or about what's inside a budget document — you really have to go through thousands of pages to understand the key terms related to that particular document, which is a pain. And lastly, inconsistent and incomplete budget codes. You can think of budget codes as unique IDs for your database; if you have inconsistent and changing budget codes, how are you going to map the whole time series and see what the trend has been? Those are the major four or five problems we have been facing.

That's where Open Budgets India comes in. We are a platform to make India's budgets open, usable and easy to comprehend — a community-driven initiative focused on open budget data. From public accounts to trust in government: this is generally a cycle which has been followed across various geographies. Public accounts are where you can see detailed information about how the government is spending. If governments publish the budget data, and it is in an open format, you can then enable fiscal transparency using it: you can see where the priorities have been, what kinds of tenders have been issued, where the money has been going across various departments and ministries. And eventually that can lead to trust in government. For today's talk we will focus on just the open budget data aspect of things.

Let's see how things work. Open budget data is something which is publicly accessible — available online for everyone to use. It is in a reusable format: not just giving the analysis, but giving the hard data points at the maximum disaggregated level possible, so that people can find their own trends and do their own analysis without any restriction. It should be free, and it should be legally open — whenever you see that small C symbol in a circle, it should be missing from the government website: no copyright. It should also be machine readable and editable online; it should be in Excel or CSV to begin with. That is what we say open budget data should ideally be.

As per Tim Berners-Lee, the inventor of the World Wide Web, there are five stars of open data. Number one is PDF, which is what we get at the moment from most government websites. The second is XLS — slightly open, but it requires vendor-driven software to use. The third is CSV: any machine can read the document and understand what is inside. The fourth is RDF, which means you should have a web URI for your data set. And the fifth is linked open data: you are able to interact with these documents in the form of a graph.
You are able to play around from one database to another, link them together, and draw your analysis — that is the fifth level of openness. This is where we want to take India's budgets: to linked open data. We are an open source, community-driven initiative; all our code, designs, algorithms and documentation are available on GitHub.

We will dig into the data pipeline today. This is how it looks — simple steps. Number one, scrape: scrape the documents from various government websites in whatever format they are available — XLS/XLSX in a few cases, like Sikkim and the Union budget website; the rest still give us messy PDFs. Second, parse: parse them into clean, machine-readable data. Third, transform: transform them to make them more usable on a timely basis — try to find the unique IDs and make them completely machine readable and machine consumable. Fourth, publish: publish them online and give a URL to each data set so it can be used via an API. And the last one is analyze. The interesting part is that analyze comes after publish: all the analysis should happen once we have already published the data, so you are not restricting analysis in-house, you are cultivating open analysis of budget data.

Let's focus on the scrape component first. There are around 150 such budget websites which give us information on the various priorities of governments. As you can see, each one follows a different template, which means each one has a different HTML structure. So we developed a utility — a centralized set of functions and methods which can be used for scraping; we call it scraping utils. With it you can download a file, do session management, do cookie management, do XPath selection, and so on. Then, for each particular website, we write a very small plugin so that all the custom logic specific to that website sits there. If tomorrow the website structure changes, we just change the plugin; this is a relatively small amount of code compared to the utility. At the end we get the PDFs and Excels, or whatever format is available on the website.

Just to give a brief on XPath: XPath treats an HTML or XML document as a tree structure, and with its help you can access a particular node or a set of nodes which follow a certain rule — certain parents, certain children, and so on. This is an example of accessing editions in the languages German and French: Wikimedia is the top-level parent, projects come under it, and then project editions — that is the hierarchy.

The second step is parse. From 150 websites we get 150-plus budget document structures, and that's another messy problem to solve. These are the kinds of PDFs we get. As you can see, there is tabular information in there: as humans we can see the lines, the columns, approximately where the cells are — but how do you train a computer to do the same? That's the challenging problem. So, similar to the scrape pipeline, we have created a parsing pipeline. The centralized repository is the PDF-to-CSV code, where most of our munging and parsing code resides, and then there are individual plugins to customize and tailor the handling per state or for the Union.

Let's dig into the parsing algorithm — what are the steps we follow? Step number one is to loop over each page in the PDF and convert it into an image: we convert each page of a particular PDF document into an image. Sometimes these PDFs are rotated, so we un-rotate them and straighten them.
At the same time, we sometimes need to change the format — sometimes they are A3 pages and need to be converted to A4, and so on. These kinds of page layout adjustments happen in this stage. So we loop over all the pages, and for each page we try to identify the set of prominent vertical and horizontal lines in that particular image. This happens using the Hough transform, a very popular computer vision technique for detecting lines. What it does — you can see in the demo there — is work like a lighthouse: it keeps checking all the points available in the vicinity and then tries to move in the direction where more points are present. So it's sort of a moving lighthouse: step by step you progress and start drawing the line. Those are the rankings you can see on the right-hand side of the moving diagram; the higher the ranking in a particular direction, the more that is the direction we move forward in to draw the line.

Once we have drawn the lines, we try to detect the largest contour. This happens using OpenCV, a popular computer vision library usable from Python. We detect the biggest bounding box possible — the largest rectangular contour present — and that gives us the table boundary. Another thing happening in this step is that we also extend the vertical lines to touch the table boundary, so that we ensure the whole table structure is in place. Next, we compute the coordinates of the table and the columns for each page; we call these the table attributes — top-left and bottom-right — plus the column coordinates extracted from the previous step as C1, C2, C3, and so on. (A rough code sketch of these detection steps appears a little further below.)

This is then passed to a popular open source library known as Tabula. It is very good at parsing PDFs, but it requires input from a human, so what we do is give it the boundary boxes and the column coordinates, and it detects the cells out of them. For each particular character in the PDF we get something like this: the information of top, left, height, width and rotation, and given a bounding box we calculate the characters falling into that box. It internally uses Apache PDFBox, a popular Java library for parsing PDFs and extracting information in this form.

Apart from the common munging, we need some specific munging which happens in the plugin, like fixing header values — sometimes headers are inconsistent, and sometimes rows and columns are merged or split to make the document print-friendly, so we either split or merge them based on the logic, to make it more machine readable. We filter out non-UTF characters, because when you are pushing data through an API it becomes a problem to deal with them. And similarly, we do other data sanity checks to make sure the PDFs are converted well into CSVs — totals and cross-checks, those kinds of calculations. Finally, we get something like this out of it, which is much easier to consume.

But there are still some problems with it, and that's what we look at in the transform step, the third step. Specifically for state budgets, this is how the unique ID looks. These are the seven heads you get: the demand number is the department of that particular state government; the major head is a function of that government; the sub-major head is a sub-sector within that function; the minor head is a programme within the sub-sector; the sub-minor head would be a scheme; and the detail heads would be things like salary, office expenses, then objects, and so on. And under objects we would have sub-schemes wherever possible.
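Going back to the table-detection stage for a moment, here is a rough OpenCV sketch of those steps — Hough lines, a line mask, and the largest rectangular contour taken as the table boundary. It is illustrative only, not the project's actual code; the file name and thresholds are placeholders.

```python
import cv2
import numpy as np

# One page of the PDF that has already been rendered to an image.
page = cv2.imread("page_001.png")
gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

# Step 2: probabilistic Hough transform to pick out the prominent straight lines.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=200, maxLineGap=10)

# Draw the detected lines onto a blank mask so the table grid becomes a closed shape.
mask = np.zeros_like(gray)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(mask, (x1, y1), (x2, y2), 255, 2)

# Step 3: the largest rectangular contour on the mask is taken as the table boundary.
contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
table = max(contours, key=cv2.contourArea)
x, y, w, h = cv2.boundingRect(table)
print("table boundary:", (x, y, x + w, y + h))
# These coordinates, plus the detected column x-positions, are what get handed to Tabula.
```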
These are the seven heads, and the demand number is the unique identifier for a particular demand. But this is what we actually get from a state budget — this is Karnataka's budget. You can see here that you just get one code, and all the information is hierarchical in nature: you can see "Urban Health Services - Allopathy" being repeated multiple times. So they are trying to present hierarchical information in a flat structure, which is a pain to deal with. What we do is collect all of this and bucket it with the help of a specific budget code; once these budget codes are ready, they can act as unique IDs, and the data is finally ready to go into a table in your database.

Next we publish. For publishing the data sets we use CKAN, a platform which enables open data publishers to publish their data in a much more structured format. You can add detailed metadata — as you can see, you can have a description, you can see which formats are available for a particular data set, and there are keywords, sources and tags for each data set, which is really helpful for anyone trying to consume this data.

This is what the CKAN architecture looks like. It's a typical MVC architecture. The base is the model, which deals with the system database and a data store, mostly in PostgreSQL; on top of it we have the SQLAlchemy ORM and search via Solr. Then there is the logic layer, the so-called controller, where authentication, sessions, business logic and background tasks — like updating the data sets, generating the sitemap and so on — happen. Views are generally rendered using Python's Jinja2 templates, which is a popular way to publish HTML documents. From the logic layer there is direct access to the API, where most of the information is served as JSON or multipart. On top of it is a simple routing layer, which routes a URL to a particular data set. On the left you can see there is the opportunity to add custom plugins.

Let's see what we can do with the help of plugins. You can add custom libraries — JS libraries for visualizations, and other Python libraries if required. You can add your own Python scripts for the controller (it uses Pylons as the base framework, so you can add custom Python scripts there). You can create your own Jinja templates and change the views and the hierarchy of documents completely, and you can add custom CSS, image files, and so on. What you get from each extension is some added functionality: from some you get visualizations, from others a sitemap, and so on. With the help of the sitemap you allow bots to index all this information. None of the budget websites publish a sitemap at the moment, which is a big problem for making them searchable, so what we do is generate a huge sitemap with detailed information so that these documents are discoverable by users.

And this is how the categorization looks. We categorize based on tiers of government — combined budget, Union budget, state budgets, municipal corporations — and in terms of sectors we have chosen 12 developmental sectors like agriculture, education, drinking water and sanitation, and so on, so that researchers in a specific area can look into the budget directly. We also publish the data sets via an API — this is how the API looks. You can easily access the keys and the corresponding values, and for each data set there is a resource ID, a unique ID used to get other information. All these data sets are under a Creative Commons Attribution 4.0 (CC BY) license.
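Since every published data set gets a resource ID, it can also be pulled programmatically through CKAN's standard Action API. A minimal sketch — the portal URL and resource ID below are placeholders, not real identifiers from the Open Budgets India portal:

```python
import requests

BASE = "https://example-budget-portal.in"   # placeholder: the CKAN-backed portal URL

# List the data sets (packages) published on the portal.
packages = requests.get(BASE + "/api/3/action/package_list").json()["result"]
print(len(packages), "datasets published")

# Search within one published resource; resource_id is the unique ID mentioned above.
resp = requests.get(
    BASE + "/api/3/action/datastore_search",
    params={
        "resource_id": "00000000-0000-0000-0000-000000000000",  # placeholder resource ID
        "q": "hospital",    # free-text search across the resource's records
        "limit": 10,
    },
)
for record in resp.json()["result"]["records"]:
    print(record)
```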
You just have to use it with attribution in your work.

And how does the open linked data part come into the picture? Take the unique ID we discussed a couple of slides back — 2210-01-110, Urban Health Services - Allopathy, Hospitals and Dispensaries. As per the Comptroller and Auditor General, these three codes have to be unique across all the states; there is a mandate. So now, with the help of this unique ID, you can query any state and link the whole information together. You can connect the Karnataka budget for 2017-18 with previous years and compare it with the Sikkim budget at the same time, just with the help of this unique ID. That is the power of open linked data. This kind of analysis earlier used to take months to do, but now, with the help of this open linked data, you are able to do it in 15 minutes or so.

The last component of the data pipeline is analysis. This is how we analyze. You can see a time series of the budget at a glance for the Union government — the central government — from 1994 to 2017-18. You can compare the recovery of loans, the total revenue, the total expenditure of the government and so on, and see how the trends have been. You can also compare the actual accounts, the estimates, the revised estimates and so on. You can also analyze the budget of a particular municipal corporation and see what is happening in your municipality — this one is for Ahmedabad — and see the trends and the priorities of how your municipality has been performing over a couple of years. Also, with the help of open linked data, we are able to produce analysis of the Union budget in less than 15 days: as soon as the budget is out, we are able to create a dynamic tool where you can see the sectoral priorities of the government in the budget, and you can then argue on facts rather than just opinions — you get all the data in one single place.

To extend our efforts, what we are also trying to do at the moment is an aggregated comparison of state budgets. You can see which state has been investing more in which sector — again the same 12 sectors. You can see how Karnataka is doing in agriculture as compared to Tamil Nadu, Madhya Pradesh, Himachal Pradesh and so on. And you can see not just the total expenditure, but also the sector expenditure as a percentage of the state budget, and the per capita expenditure — per capita as per the population of that particular state. You can see that the figures change drastically: maybe the total expenditure is huge, but when it is compared to the population of the state, the per capita expenditure goes down drastically. You can also see the revenue expenditure and the capital expenditure: revenue expenditure is an ongoing cost, while capital expenditure is something like recovery of loans or buying of land, and so on.

Another motive of our initiative is to educate people. Currently it is very difficult to understand what is happening in the budgets — what are these codes? Demystification is a very important aspect of open data, so we are trying to create something known as Budget Basics, where you can get very detailed information on how budgets function, where the money comes from, where it goes, and so on.

For future work, with these budget codes in place, we want to create a public national database of budgets covering all the levels of government — Union budget, state budgets, municipalities, and eventually district treasuries as well.
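As a toy sketch of how such a hierarchical budget code can act as a join key across states and years — the field layout, demand numbers and amounts below are invented purely for illustration and are not the project's actual schema:

```python
from collections import namedtuple

# Illustrative record holding the hierarchical heads that identify one budget line.
BudgetLine = namedtuple("BudgetLine", [
    "state", "year", "demand_no", "major_head", "sub_major_head",
    "minor_head", "sub_minor_head", "detail_head", "amount_in_crores",
])

def budget_code(line):
    """Compose the heads into a single code, e.g. 2210-01-110-..."""
    return "-".join([line.major_head, line.sub_major_head, line.minor_head,
                     line.sub_minor_head, line.detail_head])

ka = BudgetLine("Karnataka", "2017-18", "23", "2210", "01", "110", "01", "059", 120.0)
sk = BudgetLine("Sikkim",    "2017-18", "07", "2210", "01", "110", "02", "011", 3.0)

# The first three heads (major, sub-major, minor) are mandated to be uniform across
# states, so grouping on them lines up the same function in different states and years.
link_key = lambda line: (line.major_head, line.sub_major_head, line.minor_head)
print(budget_code(ka))               # 2210-01-110-01-059
print(link_key(ka) == link_key(sk))  # True: the same function in both states
```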
Such a national database would let us compare, with the help of the unique ID, what is happening in each government body, and see the time series of the same. Suppose I am not aware of what the budget code for hospitals is: what I can simply do here is type in "hospital" and I will see all the relevant data come up. So the index is not just on the code, it is also on the description, and this would also facilitate fuzzy search. For example, Karnataka writes "Sarva Shikshana Abhiyana" while Madhya Pradesh writes "Sarva Shiksha Abhiyan"; whichever of these you type, you should get both results. So fuzzy search would also be in place. That is what we have built so far.

How you can contribute: help us generate more open data on budgets. We are struggling with scaling up, since the number of departments is huge, so help us evolve our algorithms, contribute your ideas and suggestions on what else we can do, and evolve our code base. Everything is in the public domain, everything is on GitHub — so give your suggestions, open some issues, help us find some bugs; that would be really helpful. Cover budget data in your own geography: currently there are only 98 municipal corporations publishing their data, out of more than 300 municipal corporations, so help us cover more of them on the portal and help us convert those PDFs into clean CSVs. Refine our designs and help us do more analysis in time. Help us with more suggestions on how we can make this data searchable and usable — we are open to new ideas, suggestions and feedback. Here is how you can reach us: these are the slide URLs, the code URL, my email address and all our Twitter handles. That's it.

We will take only a couple of questions, as Gaurav and the next speaker, Rakesh, will be available for the OTR session — I think the OTR is at 3 p.m. in Room 1.

Hi, this is Paridhi from American Express, and thank you so much for the presentation; it is one of a kind among those I have seen so far. One quick question: this is one use case, for budgets. But if I want to port this technology, this complete architecture, to banks — for example, I want to know how my rival banks have been performing over a period of time based on financial audits — how well do you think this technology can be ported, and how fast can it run? So, each component of this pipeline is open source. You can just start by picking up the scraping utils from here; you can pick up the PDF-to-CSV code if you are dealing with PDFs; you can take the transform code from here; and the publishing platform is already open source as well. The analysis is something which, I think, would be very custom-tailored to your use case, but the first four components are already in the public domain, so you can start picking them up and playing around with them. How complex the structure of your data is would govern the time it takes to replicate this pipeline for your use case. Was that about the processing, or the development of the pipeline? Development of the pipeline took us almost one and a half years, but now it is open source, so adapting it would take hardly a few weeks — that is what we estimate.

Next question. This is a great initiative — it will make this data available for all of us citizens. (Please put your mic a little closer. Hello?) But I wanted to ask you: are there any initiatives — not related to this, but any initiatives — to help governments generate data in a systematic way, rather than them generating data in PDFs and us going through this complex pipeline?
You give a tool to a municipality, or to a central or state government, where they key in this data in a tabular format and it generates the PDF as well as the CSV. So I think Rakesh will be covering that part in detail, but I will just add a few points here. I think it is currently difficult to convince governments to use open source technology and give open data by default — that is where the struggle has been. Most of these departments are using treasury software, but all of that software is proprietary in nature and suffers from vendor lock-in; those are the major problems. We do have a policy in place, the National Data Sharing and Accessibility Policy, but very few data sets are out in public.

One last question. Hi — does CKAN store data in RDF format? Yes; the data set URLs are also available in RDF form, so you can access it via the API. Okay, and what is the number of triples that you have? Number of...? Triples. Can you explain that? Each entry is a subject-predicate-object triple — what is the total number of those? I think it depends on the structure of the document; some have more, some have less, but on average we deal with at least 15 per object. And the performance so far — of the platform, or of the parsing part? For the platform, so far we are able to serve more than 10,000 requests per hour. Thanks. Gaurav will be taking more questions in Room 1 at 2:05 p.m.; we have an OTR session with Gaurav and Rakesh, so you can continue the discussion there. Thanks, Gaurav. Or you can tweet to me — I am available here as well.

Coming up next is Rakesh Dubbudu of Factly. He is going to talk about open data in government and the challenges around it, and discuss one specific case, the Telangana open data initiative.

Sorry — in the process we also help governments open up data; currently we are working with the government of Telangana, so I thought I would talk in general about the challenges in opening up data and about our experience with the government of Telangana. This is from the preamble of the Right to Information Act. I am sure most of you know the Right to Information Act, which was passed in the year 2005 and hailed as one of the best pieces of legislation this country has ever seen. Most of what we talk about in open data, or information from the government, actually flows from this preamble. It talks about how we need an informed citizenry, and an informed citizenry is not possible without the government supplying information; an informed citizenry will, in the process, help in controlling corruption and in ensuring that all the government instrumentalities — the bureaucracy, the political system — are accountable.

Now, there is a specific provision in the Right to Information Act called Section 4, which talks about proactive disclosure by the government. Right now, people with experience in dealing with Right to Information applications would know that it is mostly demand-based: you make an application and the government supplies the information. But there is also a supply-side provision in the Act which a lot of people do not know about, and that is Section 4. Section 4 talks about proactive disclosure by the government of about 17 items, which covers budgets as well. So when we talk about open data, it actually flows from Section 4: governments should disclose without people asking for it. And we say it shouldn't be the case that data is available but people cannot access it.
With the advent of the Right to Information Act — it is close to 12 years now — I can show you the transformation. This is pre-2005: municipal medical records in a government hospital, compared to the collectorate in Nalgonda. Now, this took a long time, close to 12 years, and I am not saying this is the situation in every office — it has changed. But unfortunately this hasn't taken us all the way: we still deal, like Gaurav was saying, with PDFs and unstructured data. Even when you use RTI, people give you information in physical format. So while information is now open and people are able to access it, it is still, quote-unquote, not machine-readable data, and not in the form that we want. Ideally we want governments to open up data in the sense that, like the previous speaker was saying, every government publishes it in a structured format that people can readily use.

So briefly, let's talk about the government structure; I think that will clarify a lot of the questions people have. Let us not be under the assumption that the government is a monolith — the government is not a monolith; each government department is its own boss. At the central government level there is a ministry called the Ministry of Statistics and Programme Implementation, which generates a lot of data, especially for policy framing. It has three different wings. One is the National Sample Survey Office — you might have heard about the National Sample Surveys. These are surveys done on a certain sample size covering all the states, every once in a while; they don't take 10 years like the census — some of them are done twice every five years, some once every 10 years, and so on. Then there is the Central Statistical Office, which calculates our GDP. You must have heard about this office recently because of the whole debate about whether the government is fudging GDP data; the chief statistician comes from here. They publish GDP data, inflation data and all of that. The third one is the programme implementation wing within the ministry, which monitors a lot of big infrastructure projects; it has an infrastructure and project monitoring division which regularly monitors all the big infra projects.

Within this, the government of India also came up with a guideline some time ago which said that any big project of the government of India, or any scheme for that matter, which has an outlay of more than 150 crores will have to be monitored and will have to have an online system through which the government can track it. That is why you see that a lot of schemes with huge outlays of hundreds of crores usually have a working MIS.

So this is at the central government level. Now, is this the only department that generates data? No. It is one of the departments, but its only job is the generation of data. On the other side you have every department and every ministry at the government of India level generating data — for example, the Ministry of Finance generating budget data year after year, the Ministry of Health generating some other health survey data. It is not as if there is a unified vision of data generation where we say every government entity will generate data in a specified format. That is one issue. At the state government level, we have a department of planning at the state level.
Each state has a department of planning, which acts like MoSPI at the state level. So we have state-level economic surveys that are released during budget time. Within the department of planning, we have the Directorate of Economics and Statistics, which releases an annual report of the state every year; this is more like a statistical snapshot of the state. Most NSS surveys also have a state component in them, in the sense that they also release data at the state level on what is happening there. Like I said, all of the line departments also generate a lot of data for their own need and purpose, and there are independent agencies, quote-unquote within the control of the government, that also generate data based on need. I'm sure the problem is pretty clear now: there is no one standard, no one agency mandating that data should be generated in a certain way. Part of the problem is this. Even a village panchayat, the lowest tier of governance, generates data for its own need.

Now, briefly, about the National Sample Survey. The NSS does surveys on varied subjects, broadly four. The first is socio-economic: this includes surveys on saving, surveys on spending, surveys on travel patterns and various other things. If you go to the MoSPI website, you'll find all these interesting surveys. Every five years they also do a survey on land holding, livestock and agriculture. So you'll find what the average land holding in India is, how many are small farmers, how many are marginal farmers, how many are large farmers. Same with livestock: there is a livestock census that goes along with the normal census of human beings, and the agriculture census. They also do surveys on establishments and enterprises, so the number of establishments in each sector, the number of employees, the amount of labour, and then village surveys; they do a lot of village-level surveys to understand village issues. So broadly these are the subjects under which MoSPI's National Sample Survey Office works.

Now, like I said, how is data generated in the first place? Firstly, a lot of data even today, especially in the large schemes, is still manually entered, and there are multiple issues here. I'll tell you a case study from the NREGS. For most government schemes, for example the NREGS, the key is the muster roll. The muster roll is a sheet on which the names of the people who attended that day's work are written, with their signatures. The muster roll is proof that you attended that day's work and are entitled to that day's wage. So a lot of data at the grassroot level is still generated in physical form and then manually entered; you can see there is scope for error there. Then we have some automated applications, especially wherever GIS tagging is done. And there are various registers which are later converted, so the physical data is converted into machine-readable form. These days, for example, at least in Telangana, one of the departments has been using WhatsApp extensively to collect data: all the school headmasters WhatsApp data in a specified format for five or six parameters.
This then goes to the central office where it is converted into an Excel file. So there are varied formats in which data is collated and generated. And like I said, some of it is automated, especially transaction data: wherever government is involved in transactions, that data is automated.

I'll briefly explain the flow of the NREGS. I'm sure all of you know the National Rural Employment Guarantee Scheme, which was introduced in the year 2005 after the UPA came to power; this was their flagship project. It talks about providing wage employment to all rural people. It doesn't differentiate between a rich man and a poor man: any individual living in a rural area is entitled to get work. This is probably the one scheme with the largest outlay in this country; in the last 10 years, the outlay has been more than 3 lakh crores across the country. If you look at the flow, the first step is registration of the worker. This happens physically: at the Gram Panchayat office there is a form one needs to fill in saying I want to be enrolled as a worker, and you have to submit a few proofs. That is the first step. After this is done, it is entered into the system; PE stands for post-event, so it's not real-time entry into the system. Once those physical forms are filled in, a bunch of them go, and again I'm speaking from the Telangana experience, it could be different in different states, usually to the block office, which has a dedicated place with about 5 to 10 systems with an internet connection. People from the village collect these physical forms, go to the block office, reserve some time and then do this entry. Most offices at the block level use that dedicated space for online entry. Then you have issuance of the job card: once you apply and the form is in order, you're issued a job card. Again, this is physical. Then demand for work. The scheme is a demand-based scheme, not a supply-based scheme; in other words, I as a worker can go and demand work, saying I don't have any other work elsewhere, so I want to work. That demand is captured; in some states it happens in real time, not in all. Then work allocation. How it works is that there is a shelf of works decided by the gram panchayat, a list of works, and ideally the gram panchayat should decide which work should be taken up as priority, with a lot of criteria involved. Then work allocation happens. Then the e-muster. Like I said, the e-muster is not there in all states; some states are doing it. The e-muster is the electronic muster roll: once work happens, you capture the attendance, then the muster is generated, with the names of the workers along with the amount they should be paid. And then, if you see the entire flow, there is the daily attendance and the preparation of the MB, where MB is the measurement book. Once the work is done, a technical engineer comes in and measures the work done, and based on the measurement there is something known as the schedule of rates. Based on the schedule of rates and the amount of work done, the particular wage rate is decided, it could be 150 or 170, and that is transferred to the wage seeker's bank account. Then comes the preparation of the wage list and wage slip, and the FTO, which is nothing but a fund transfer order.
The fund transfer order is what actually moves the money to the wage seeker's account. So this is the entire flow. Now, as you can see, it is very, very complex, and we are only talking about one scheme of the government of India which, in terms of budgetary figures, is less than 5% of the government's overall budget. So you can imagine the complexity we are dealing with. To show you the formats, these are the physical formats: the application for the job card, which somebody has to fill in, and the application for work. Of course, these are translated into local languages; they're not only available in English. Then this is what the muster roll looks like, the physical muster roll, which is then transformed into the electronic muster roll. And this is how it happens on the ground: this is the work site, this is the engineer measuring, and again the engineer measuring. This is the work, a road-laying work; this one, for example, is building a water tank. So you can imagine the complexity. This is a snapshot from the NREGA MIS. Like I said, schemes with a huge outlay have a working MIS. If you go to nrega.nic.in, all this data that I explained to you before is available on that MIS. You can actually go down to the worker level, the wage-seeker level, to understand what kind of works he did in the last so many years, what amount of money was given to him in the last so many years, the kind of works done at a block level, at a village level, the ratio between material and labour, all of that. Now, just to understand the complexity of this data: like I told you, the expenditure in the last 10 years has been more than 3 lakh crore. How many person-days of work were generated? 1,980 crore person-days, which is about 18 times the population. There are 25 crore wage seekers, not all active, but people enrolled into the scheme till date. There are more than 12 crore works that were done on the ground. And on average, every day about 50 lakh person-days of work are generated; in other words, as a ballpark average, about 50 lakh people get work from the government every day. So this is the complexity we're dealing with.

Now, what is the status of data generation and data disclosure in the government? Like I said, for major schemes we do have an MIS at various levels: there's a central government MIS, there'll be a state government MIS, so the data will be available in some structured format, for pensions, PDS, all the schemes with a huge outlay. Then for smaller schemes, we have data that is maintained on local systems. We don't have it on the internet; it will be on the computer of some clerk or some officer, in his system. There could be schemes with an outlay of 5 or 6 crore which don't have an MIS, but 5 or 6 crore is still not a small amount, and they're maintained at a local level. And then there are still, for example, village budgets: though we have Panchayati Raj ministry software to capture village budgets, a lot of villages still follow the manual paper budgeting procedure.

Let me spend some time talking about this. A lot of people think technology is the issue, in the sense that the issue in opening up government data is technology. That's not the case.
The bigger issue in opening up government data is trying to change the culture. They have a culture of secrecy; it's gone on for close to 60 years now, and we see the same kind of resistance even with the Right to Information. If you have ever applied under RTI to get information, you will know the pain. So the first thing anybody has to deal with is the culture: how do we make them understand that they are part of a changing culture, that it is no longer a secret? A lot of government officials derive their power from holding this information, so it is very, very difficult for them to part with it; they feel that everybody will then become powerful and their own power will be taken away. So culture is a huge issue. We might discount this and say, why can't the government just mandate everybody to do it? It's not that easy. The second one is fear. There is a lot of fear within the government bureaucracy about misuse of data. This partly comes from their own understanding of technology: there was an official who told me, if I put an Excel file on the internet, can somebody not take the Excel, change it and upload it back? This was his fear, that somebody would actually change the data. It partly came from his own lack of understanding of technology, and partly because, at the end of the day, they have rules that are very strict and they are liable for anything that happens with the data. So these are two big challenges; wherever you go, if you talk to government officials at various levels, these are the two big fears. The third one is lack of capacity. Frankly speaking, does our government have the capacity to handle data at this scale? Frankly, no. So how do they manage today? Each department has its own vendor. Yes, the government has a technology agency, the National Informatics Centre, but not every state government or every scheme is handled by them. I'll give you an example: the Health Department in Telangana has eight different modules, and each one is developed by a different vendor. Partly because they don't have that big-picture understanding, they tend to do temporary work: I have a need today, I need a solution today, let me develop something. But how does it actually connect to what is already there? How does it connect in the long term? People don't understand that. So these are, I think, the big reasons, the challenges in opening up data. While there is a lot of work on the technology front, I believe that in India especially we are not working enough on the first two: changing the culture, telling them that it's a changed world, no more the 1950s and 60s. We face the same issues with the Right to Information as well, the same culture around sharing information.

So now I'll briefly tell you about the Telangana Open Data Initiative and what we're trying to do. A small video. Yeah, I hope you could all follow it; I'll move to the next slide. What we're trying to do with the Telangana Open Data Policy is actually to address the first two. We're also addressing technology, not that we are discounting it. The government does not believe in going belligerent and opening up all datasets at once, but wants to take it in a phase-wise manner where we talk to officials and explain to them why sharing is important.
How can we change their own internal culture of data sharing? Then we're also focusing on, like I said, the challenges, so we intend to hold regular community events. I'll also give you an example. I'm not sure how many of you know, but the Telangana government has reorganized its districts; the earlier 10 districts are now 31. Even Google, for example, does not have maps today of those 31. The government is finalizing the boundaries and will soon release them. We already have a dedicated data portal called data.telangana.gov; there are very few datasets, but the intention is to make it more active. These are the new districts. The shapefiles are getting ready as the boundaries are finalized, so very soon the government will release them.

I'll give one more case study and then we'll probably end. Mabhoomi is the comprehensive land records portal of the government of Telangana. It has land records down to the village level, down to the individual beneficiary level. In each state this register is called by a different name; in Telangana it's called the pahani, the village pahani. The pahani is nothing but a register that contains the land records. Now, in the land ownership record there are more than 10 parameters in this MIS. For example, in this screenshot from Telangana, it shows the total extent, how much is cultivable, how much is not cultivable, what the source of water is for that particular land holding, who the owner is, and various other things: what kind of land it is, dry land or wet land, and so on. So what can be done with this? We had a community event where a few students developed this. We wanted to develop a village dashboard: this is one dataset, the agriculture dataset, so can we in the same way source more datasets and then prepare a village-level dashboard for planning? For example, this comes from the Mabhoomi website. What we did was try to understand the land ownership in that village: what is the total ownership, the average ownership, how many small farmers and big farmers there are, how much of the land is actually cultivable, what kind of water sources that village has, and so on. Going forward, the idea is to take such case studies and such datasets and involve the community in developing various solutions. I'll probably discuss this in the OTR session, since we don't have enough time here. So what do we expect from the community? Like I said, the government intends to hold regular events; we'll keep informing you, please participate in them. Please provide suggestions or requests for datasets. The data that you want may not be available, but if it is available we'll try and push the departments. Please come up with solutions; there are a lot of problems to be solved. The government is open: if you have an idea that can actually solve a real problem, please come forward, and if you have any ideas, please write to that open data email ID. Thank you.

We can take a few questions; question on this side. There is an already existing portal, data.gov.in, so how different is your approach from that? I see a lot of data being uploaded there and it's still publicly available. Data.gov.in is the government of India portal, and the government of India usually has aggregated data of the states; the data comes from central government departments.
As you know, the really useful data, if you want to call it that, would be hyperlocal data, not the aggregated data. So what the data portal of Telangana intends to do is provide that hyperlocal data. For the aggregated data we are not trying to reinvent the wheel; what is already available on data.gov.in will probably also be available here, but that is not the idea.

Next question: this is regarding the rural employment scheme. Is there insurance for these people who are working? There is no insurance. There is accident compensation, but there is no insurance cover given if something happens to you while you are at work. But isn't that only for the BPL, below poverty line? The rural employment guarantee scheme does not differentiate between BPL and APL; anybody living in a rural area who wants work can demand work from the government. But work is different from insurance. Yes, work is work, but what I read recently is that the accidental insurance is covered only for BPL and not for APL. No, like I said, and I could be a bit wrong, as far as I know there is compensation for accidents on site, but there is no insurance as such.

More questions. This whole model is for Telangana, which we know is a new state, so you can have all the latest technologies and the latest trends implemented in it. Now once you've developed this model, how modular is it so that other states can adapt it? How have you taken that into consideration? Though Telangana is a new state, it has a legacy of the last 60 years, so the new state is only in name; the state still has all the ills that you see in a typical bureaucracy, and in that way the state is not different, but the leadership is very open to change. Now, when we say modular, I won't talk much about the technology, but the data portal of the state is actually developed on an open source platform called DKAN, which is another version of CKAN. What we are intending to do is build a model by which we drive changes in the culture, and not just give them a technology, mandate something, and make them give datasets today, so that tomorrow, when you and I are not there, the datasets stop coming. That is precisely what is happening with data.gov.in today: at the government level it is not a legislation, we have to understand that, so we still ask government departments, request them, to open up data. In terms of our priorities, once the state is successful in opening up a lot of data, actually solving problems and being useful to various stakeholders, the idea is to document all of this and share it with whoever wants it; it will be available on the website as well.

One last question. You talked about the different kinds of data you collect, socio-economic, land holding; I'm curious to know whether there is any geospatial kind of data which you collect, like addresses or bus stops. See, again, it varies from department to department. For example, if bus stops have to be captured, it is the municipal corporation that has to do it. Yes, there is an agency that deals with geospatial data in every government; we have something known as TS TRAC, but their mandate is not to geotag everything; their mandate is to prepare maps and boundaries.
So it is a department's priority: if bus stops have to be mapped, the local municipal corporation has to do it. I wouldn't say geospatial data is being generated at a large scale today, but there is an effort in certain departments to do it. For example, most NREGA works are now geotagged, in terms of where the work was and what kind of work it was. In large schemes it is definitely happening, but in smaller schemes and smaller departments it's not yet done, which is what I'm telling you. Right now the data portal of Telangana is very, very nascent, so you won't find a lot of datasets, only very basic baseline datasets. In the days to come we intend to do that; not right now, but probably in 6 to 12 months you'll see some progress there.

Thanks Rakesh, that was a great talk. People who have more questions can catch Rakesh and Gaurav at the OTR session which starts at 2:05 p.m. in room 1; it will be about using open data in different scenarios and the challenges and opportunities around it. We break now for lunch, which will last for 1 hour; we'll resume sharp at 2:05 p.m. in the same hall. We have some limited dinner tickets available for the conference dinner with the speakers. The cost of a ticket is 3000; it includes the dinner and networking. You can buy the dinner ticket online or at the token counter outside, and it's at the bluO bowling center in Whitefield, at the Phoenix Market City mall. Also, we are inviting proposals for flash talks, which will be from 5:10 p.m. to 5:40 p.m. today. The flash talks need to be strictly of 5 minutes' duration, and you can talk about anything: an open source project you've worked on, some tips and tricks, some experiences you've had. To submit a flash talk proposal, write the topic on a piece of paper or a postcard and submit it to me or any of the three volunteers sitting in the front row on the right side. That's it, you can go for lunch now.

Check, check, is it audible? Let's get started. Hi, good afternoon everyone, I'm Ram Prakash. People, please settle down, we've started with the talk. I'm Ram Prakash, and I hope I make this post-lunch talk interesting. Is it better now? Hello, check. This is my Twitter alias, and I work for a company called Zoho. We make lots of business and productivity applications; we call ourselves the operating system of business. I work for a particular group called Zoho Labs where we design solutions for our product teams and offer them as a service. We work on lots of interesting problems like hardware acceleration using FPGAs and GPUs, we are one of the active contributors to PGSQL, and we have a division that works on machine learning and artificial intelligence; I own the machine learning product stack there. I hope that gives you a good idea about me; I'll get into the talk now.

Let us look at the problem description. Machine learning and artificial intelligence is very snake-oil-ish. I know that term is a bit difficult to digest, but the point is that there are no proper proven results of an ML system autonomously taking decisions; there always has to be a human element which can step in and make the decisions for you. Any ML practitioner would agree with me that this particular field hasn't matured enough for a full-blown business production system. And coming from the B2B world, there is not much differentiation you as an ML service provider can offer, because almost all the algorithms are open.
In the B2C world you have a data advantage, which means: hey, I know you better than the other company does, so my algorithms will work better for you. But that is not the case with B2B companies, and it is not the case in highly regulated industries where the data is well defined in a schema: if it is a CRM system, you are only going to get contacts and leads; if it is a support desk system, you are only going to get support tickets. So there is not much differentiation you can offer over your competitor. And we all use cross-validation techniques to evaluate our ML models, which I don't find very convincing for a production system. Yes, it is a good indicator, but it might not be the best indicator of the accuracy of your ML systems. Problems like these could happen. One is data leakage, where an unintended variable in your training data set has an unintended effect on your result; for example, say a patient ID is heavily correlated with the chance of that patient getting cancer. This is unintended data leakage, and your cross-validation setup might not be able to capture such edge cases. There could also be a dataset shift: it is standard practice in the ML industry to take a snapshot of data from the past, train a model over it, then run other snapshots of data from the near future through it, validate on those, and get the results. But real-life data keeps changing, and in B2B companies you could actually be deploying one particular model and serving it to many customers, so you have to make sure all your customers get their predictions right and tune your models accordingly. These are the problems we face, and there is a solution called a model explainer: what if your model could explain its prediction? That is what happens in real life, right? You ask someone to do something, they won't do it; but if you ask someone to do it and tell them, hey, this is why you have to do this, there is a much better chance that they will do it.

This is where we took our inspiration from. This paper took the industry by storm; there was a lot of interest around it and the interest hasn't faded yet. So we started looking at this paper, and this is what it is all about: local interpretable model-agnostic explanations. I'll break that down into simple sentences for you. We have hosted our own version of it, made it production ready, and it is Apache 2 licensed, so it is a commercially friendly license. We had to make a lot of changes so that we could serve this explainer in a real-time production system, for which the existing implementation from the actual paper was a bit difficult. We are primarily a Java-based company and we use Spark for our machine learning implementation, so we thought it would be better to write it on top of Spark. Coming to local and interpretable: the explanation you give has to be local to the given query (I have an example coming up for that), and that explanation has to be interpretable for an end user to understand. Model agnostic: the actual paper is model agnostic, meaning you have an explanation system which works independently of the underlying machine learning model; it only looks at the parent data set. But we take certain clues from the model, so ours is no longer model agnostic; this is one change we made to make it suitable for production.
The one-liner of this particular framework is: your model can be complex and non-linear, so we try to fit a simple linear model around your query, so that your explanations are interpretable and local. I know I haven't given you the right example for local yet, so here it is. A very famous example in the ML world is the housing price example. You have lots of variables; say you are going to buy a house somewhere in Bangalore or Chennai, so you have variables like size in square feet, whether it is an apartment or a luxury villa, how far it is from the nearest landmark, and so on. A global explanation would say a particular result is because of the size in square feet, since that has the highest correlation with the end result. But look at the second case, a 600 square feet luxury villa: here you cannot give the explanation, hey, this is priced greater than 1 crore because it is a 600 square feet house. That reason doesn't hold good; for this particular query you have to use a different variable, and here the explanation could be that it is a luxury villa, so it is priced that way. For the first case, it is an 1800 square feet apartment, so it is priced this way. I have shown the same thing visually: the features are size in square feet, distance, is-luxury yes or no, and there are three price categories, less than 50 lakhs, 50 lakhs to 1 crore, and greater than 1 crore. So you get the context for locally interpretable. Importantly, local faithfulness does not imply global faithfulness. There are two different ways to explain models: one gives you a global picture, the second gives you a local picture around a given query. The problem we are trying to solve here is the local picture. For the global picture there are lots of ways to do it; it could be as simple as getting the feature weights from your decision tree and printing them out, or visualizing your data set. Those give you global solutions, but local solutions are a bit of a different ball game, and that is what we are trying to solve.

I will just explain the design of the system to you. This is how we lost the model-agnostic feature: to keep the explanation model fully model agnostic, we would have had to look up the training data for each and every query. So we weakened it, saying, okay, we will borrow some data from the model itself and then use it. That was one major change we made; we lost the model-agnostic property, but we have kept it open, so that tomorrow, in case a model-agnostic need arises, we could always roll it back. So this is how it works. There is raw data coming in, that is the training data; you give it to train your machine learning model, and we look at the continuous and categorical columns. For now we are only looking at tabular data sets; we have not ventured into text and images so far, but that is definitely in the pipeline. We sample the categorical columns, and we divide the continuous columns into discrete buckets, so that you reduce your dimensions, and we inject that alongside the model. This is the training-time part; whenever you create your machine learning model, you can do this.
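This training-time preparation is only described verbally in the talk; below is a minimal, hypothetical sketch of the idea in plain Python with numpy and pandas, rather than the Spark implementation the speaker mentions. The column names, bucket count and data are invented purely for illustration.

```python
import numpy as np
import pandas as pd

def prepare_explainer_stats(train_df, continuous_cols, categorical_cols, n_buckets=10):
    """Training-time step: discretize continuous columns into quantile buckets
    and record value frequencies for categorical columns. These statistics are
    stored alongside the trained model and reused at query time."""
    stats = {"bucket_edges": {}, "cat_freqs": {}}
    for col in continuous_cols:
        # Quantile-based edges keep roughly equal mass per bucket.
        edges = np.unique(np.quantile(train_df[col].dropna(),
                                      np.linspace(0, 1, n_buckets + 1)))
        stats["bucket_edges"][col] = edges
    for col in categorical_cols:
        # Sampling probabilities used later when perturbing categorical features.
        stats["cat_freqs"][col] = train_df[col].value_counts(normalize=True).to_dict()
    return stats

# Example with hypothetical housing columns echoing the talk's example.
df = pd.DataFrame({
    "size_sqft": np.random.randint(400, 3000, 500),
    "dist_km": np.random.uniform(0.5, 20, 500),
    "is_luxury": np.random.choice(["yes", "no"], 500, p=[0.1, 0.9]),
})
stats = prepare_explainer_stats(df, ["size_sqft", "dist_km"], ["is_luxury"])
```

The stored bucket edges and category frequencies play the role of what the speaker later calls the toned-down data set that resides along with the model.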
Then at query time, you have a given machine learning model and a prediction query coming in; you are going to run the prediction on the model and return it. Now you also build a binary vector, and this is the place where it opens up for future use cases. The binary vector here, in the case of a categorical or a continuous feature, is a mere presence or absence of that particular variable. In other cases, say tomorrow you are trying to explain images, it could be patches of an image, a contiguous group of pixels; say tomorrow you are doing text analysis, it could be a contiguous set of words that is given in the explanation. So there is a binary vector which captures the presence or absence of each piece. We do some scaling on that, and based on those discretized values we get the sample weights; now we have the weighted data, and we have a feature selector and a regressor, so that we can pull out the features and give them a score, a confidence.

We started off with this primarily to test our own algorithms and configurations. The thing is, in our labs we try to serve other product teams, so we have to give a compelling value-add first to the product managers of the other product teams and then to the users of the other products. This seemed like something without any dependence on the other product teams, so we could get it started on our own, and the value-add would be significant. So we did our pilot in our churn prediction app. Like every org, we have a churn predictor which looks at access logs and tries to create a model of whether a particular subscriber will renew his subscription or churn. We have variables coming in from the access logs, and variables coming in from the help desk: the number of tickets interacted with, the sentiment of the tickets, the number of tickets escalated, and so on. There are many variables that go into this churn prediction system, so we thought we'd hook this explanation engine onto it. These are the results we were getting. This is for a user who is going to renew his subscription; it says he has done lots of bulk insert activities (this is for the campaigns product, so it falls under campaigns), based on the number of campaigns sent there is a high probability that he will buy, and based on the percentage of non-HTTP-200 requests there is a probability that he is going to renew. Those are the explanations for that case. And this is an explanation for a user who is going to leave: this user has a higher probability of leaving because of the value of his percentage of non-HTTP-200 requests, too many support tickets, and his active sessions. We would also be able to print out the values, but I am just giving you a visualization here. We are rolling this out to customer-facing apps in a phased manner, because in the B2B world people mostly turn off automatic recommendations; they want to do it themselves. So we have started rolling this out to customer-facing apps in the form of a notification, "this could be because of this", so that we instill trust in the user and get them to use more of the machine learning features we have in the pipeline. That is the whole of the project.
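None of the query-time code is shown in the talk, so the following is a minimal, hypothetical sketch of the perturb-weight-fit idea in scikit-learn terms rather than the team's Spark code. `predict_fn`, `bucket_ids_fn`, the perturbation scheme and all parameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def explain_locally(query, predict_fn, feature_names, bucket_ids_fn,
                    n_samples=1000, n_features=3, rng=np.random.default_rng(0)):
    """Fit a small weighted linear model around one query.

    query         : 1-D array of raw feature values for the instance to explain
    predict_fn    : black-box model, maps a batch of raw rows to scores
    bucket_ids_fn : maps raw rows to discrete bucket ids (the interpretable space)
    """
    q_buckets = bucket_ids_fn(query[None, :])[0]

    # Perturb the query: randomly keep or jitter each feature (binary mask, LIME-style).
    mask = rng.integers(0, 2, size=(n_samples, len(query)))
    jitter = query * (1 + rng.normal(0, 0.3, (n_samples, len(query))))
    samples = np.where(mask == 1, query, jitter)

    # Weight each perturbed sample by how many features stayed in the query's bucket.
    agreement = (bucket_ids_fn(samples) == q_buckets).mean(axis=1)
    weights = np.exp(-(1 - agreement) ** 2 / 0.25)

    # Sparse linear fit of the black-box predictions in the interpretable (mask) space.
    lasso = Lasso(alpha=0.01)
    lasso.fit(mask, predict_fn(samples), sample_weight=weights)

    top = np.argsort(-np.abs(lasso.coef_))[:n_features]
    return [(feature_names[i], float(lasso.coef_[i])) for i in top]
```

In the pipeline described in the talk, the feature selector and the regressor are separate components running on Spark; the sketch above collapses them into a single weighted lasso only to show the overall shape of the computation.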
I thought I'd also add a slide on the other ways you can try explaining your models. Quantile regression is where you look at it percentile by percentile; I haven't explored all of these, but these are the other things available out there which you can go back and start looking at. There are glyph and correlation graphs, visual representations of your original data set. We used the lasso regressor; there is one variant with least squares regression and one with elastic net regression, and basically you try to calculate the feature importance for that particular query based on some simulated data. There are models like GAMs which allow you to hit a sweet spot between the accuracy of the model and the explainability of the model, so you can tune it and see how things work. And then there are dimensionality reduction techniques like t-SNE, autoencoder networks, PCA and so on, which project your data set into a lower-dimensional space so that it is easy for you to visualize and understand. These are other techniques you can go back and have a look at.

So that's it for me. Just to summarize: we are trying to build an explainer that explains the predictions of a machine learning system. This explanation is mostly for a local query, that is, with respect to each and every individual query, and it will instill trust in the end users. We are rolling it out in phases. We have a GitHub page; we have hosted it with an Apache 2 license, so your contributions are welcome and your feedback is welcome. Thank you.

We have questions; we'll start from there. Hi, thanks for this. You mentioned that the model is local and interpretable, but in the last slide you showed the churn customers, so shouldn't it be for individual churn customers if it is local? Yes, these are just snapshots for individual cases. If you look at the number of variables in the system, it is the same, but for this particular result it has chosen three variables, which are bulk inserts, campaigns sent and the percentage of non-HTTP-200 requests; for this other query it has chosen different variables, including the percentage of non-HTTP-200 requests. So is this for a single user or a group? These are two different users: one is a user who is going to renew, one is a user who is going to leave. So it's not local, right, it's the global phenomenon for the churning group and the non-churning group? No, it's not for a group, it's for that particular individual user.

Question here. This might be a very stupid question, sorry for my ignorance, but your binary vector and all of this looks like a rule engine with a bunch of nested if-elses to me. Where is the model? I think what he's trying to say is that we saw a lot of if-else there, am I correct? The model you're talking about seems to depend a lot on your ability to parameterize, and you're looking at a sample size of one; all the examples were on a sample size of one. So I think you're making a rule-based engine here with a lot of nested if-elses thrown into it; where does the machine learning come in? We're trying to fit a simple linear model around a very complex system.
That is where it comes in. For linear models the explanation is very simple, and the same holds good for a decision tree. So we're trying to build a simple linear model around that particular query; that is the whole crux. When we make it fully model agnostic, it is totally independent of the underlying model, meaning your model might say this user will churn, but the explainer could come in and give an explanation as if this user will subscribe. That is why we now borrow some data from the model itself. It is just fitting a simple linear model around a very complex non-linear model; that is the whole idea behind the explanation.

We have a question there. The question is: you said it is not model agnostic, so which models does it support? And how is it different from a decision tree or a regression model, where you get all of those decision nodes? Decision trees also explain the whole thing, and a regression model gives you the coefficients, so how is this different from that? And thirdly, is this a free, open source tool that anyone can use? Sorry, I didn't get your last question. It is free and open source, it is Apache 2 licensed; did I answer that? For the first question: for now it supports decision trees and random forests, that is the version that is on GitHub; we will soon open it up for gradient boosting and SVMs, specific to Apache Spark, since we have written this on top of Apache Spark. And how different is it from other linear systems? It is basically a linear system; around your query you try to build that linear system.

Question here. What would an explanation for, say, an image look like? When it comes to interpretability, here interpretability is merely the value of a variable, but in an image you cannot give someone a single pixel and say this is the reason, just as in text you cannot give a single word. You will have to give a patch of the image: say the prediction is whether this picture is a cat or not, you will have to give some reasonable subset of the pixels so that it shows something like a cat's eye. The same goes with text: say you're doing sentiment analysis, then "this product is so good" would be a more reasonable explanation than "this our product". So here it is just the presence or absence of a variable, but there it could be a group.

Hi, I'd like to ask: if I just extract rules from a CART model, how would this be different? That was the housing price example: your extracted rules would give a global explanation, so you could say the size in square feet contributes heavily to the result, but not for that particular query. For that particular query, the 600 square feet ultra-luxury villa, square feet is not the variable that drives the result; it's the ultra-luxury part. That is where the difference between global and local comes in: the variables you select for each individual query can be different, and that's why we use a feature selector.

One last question. One other confusion for me is that in that big picture you have the scaling and sample rates that feed into the lasso regressor. Can you explain that? Because you're taking one data point, how do the scaling and the sample rates work, and how many data points does the lasso regressor have? Do you run the regression model at query time for that data point? That would really help.
So basically we put them into discrete buckets, so now we have a toned-down data set that resides along with the model, and here you have a query as well. You're trying to place this query into one of those buckets, and the closer the bucket, the more weight it gets. Let's say you have 10 buckets, 1 to 10, 10 to 20, up to 100, and your number is 25; it is closest to the 20 to 30 bucket, and that is how you get the weights. With this weighted data you also get the class probabilities from the model, the probability of that class being the result in a classification case. Then we use these selector and regression tools to reduce the number of features so that we get only the most contributing variables. That is why we have a twofold setup: one is a chi-square selector which can select the features, and one is a regressor which, in another way, gives me the weights of those features. Combining the two gives a compelling result; we could also skip one of the two. Right now we are able to get a prediction in about 100 milliseconds, so in case there is demand for very quick responses, we would have to drop one of the two. That was also one of the reasons we made it non-model-agnostic: if it is model agnostic, then we have to look up the entire training data set for each and every query, which is not good for a production system. Thank you.

Thanks Ram. Next we have a stretching session by Lochan. I guess all of you must be tired sitting at the conference since the morning, so let's flex our muscles a little. There are some seats here for people who want to sit, and we also have the OTR session going on in room 01, just to the left of this hall; the session is about using open data in different scenarios and the challenges and opportunities around the same.

Hey guys, I'm back; most people are leaving now anyway. Alright, can we stand up again please? Check, check, check, okay, somewhat better, awesome. Alright, how are we doing on energy? Good? Can I see more thumbs up or thumbs down? Not quite there, sleepy, okay, fair enough, so let's fix that the best we can. Again I'd like you guys to breathe a little, something that we often ignore, and before I continue I'd like to talk about what good breathing is, what efficient use of our lungs really is. What I'd like you to do is take one hand and place it on your chest and place the other on your belly. Now I just want you to find out for yourself, close your eyes if you can, when you breathe, what exactly is happening, which hand is moving and which is not; just figure out what is working for you. Take a nice deep breath and feel what good breathing could be; see whether you're using your chest a lot more than your belly or the other way around. Now that you've done that, let me just tell you that it's important to take a good deep breath, to make the most use of your lungs, so that you have enough oxygen going around your body and the nutrients getting where they should go, to the cells. So then, in the same position, I'd like you to breathe in: as you breathe in, take a nice deep breath into your chest first, and once your chest is full, breathe so your belly comes out. And as you breathe out, I want you to do
the same thing the other way around: when you exhale, you first breathe out at the level of your chest, your thorax, and then down in your belly, pushing all the air out. And again: breathe in, chest first, belly out next; breathe out, chest in, belly in. Once you have this figured out, I'd like you to do this more often if you can, where you work, because if you don't have the time or the space to stretch, the least you could do is breathe well. Now let's move. I think I last stopped at the level of the shoulder with you guys, so now let's just raise one hand, your right, and get it over across so your elbow almost meets your left shoulder, like so. You're not pushing too hard, you're just seeing how far you can go, and I want you not to forget to breathe. Hold your stretch a little. Now the other side, and feel the muscles being worked; don't do it passively, but think and feel about what is being engaged. Now we move to this one: you shoot your elbow up and then, forcing as little as possible, bring your elbow down, and keep it there. Now, just to add a bit of a challenge, you can take your other hand and just hold that, if you can. And now to the other side again, slowly, excellent. Next, I'd like you to put one hand on your thigh, say your right hand, and stretch your other hand, your left, up above; as you breathe out, you go over to the other side. This is a lateral movement: you're not going forwards, you're not bending backwards, it's a very straightforward lateral bend. Try and get your hand as close to your ear as possible, and again it's about staying there for a bit, not pushing yourself too hard, only as far as you can go; if it's only until here, that's still cool. This is also called Ardha Kati Chakrasana in yoga. Now the other side as you breathe out, and you feel the stretch here and the contraction on the other side. You guys have probably seen this next movement, it's a hip roll: you hold yourself like this to support your lower back, your feet about hip width apart, and you go clockwise. Again, this is not a crazy swing; it's about keeping your head steady and just moving your hips. Try to keep your head steady, just imagine you're holding it. Now the other direction. How are we doing on time, how much time do I have? Almost done, okay. I'll take you through some exercises for your wrists and your hands so you don't suffer from injuries; don't go crazy with it. Keep your hands in front of you: raise and lower, raise and lower, raise, lower, engage it, raise, lower, do it with some feeling as well. Now make a fist, clasp it tight, and release; fist and release, fist and release. Now hold your right hand's fingers and pull them towards you. Now take your right thumb and pull it towards your upper arm, like this, pretty much; again, don't force it, you don't want to worsen things, that's the last thing you want: no fingers, no coding, right? Do the same thing with your left hand, and I'll take you through the next set of exercises. Bye.

Thanks Lochan, hope you guys are all flexed out and ready for the next talk. We have Charu Mitra Pujare from Paytm talking about how to go from a recommendation carousel to personalizing the entire app, and how they did this at Paytm. Can you guys hear me? Perfect.
Alright, hi guys, my name is Charu, I work for Paytm. I manage all the data science and machine learning teams at Paytm and build several products there. Today I'm going to talk about the personalization story at Paytm. I'm going to talk about the journey we have taken over about the last year and a half, give you the state of the union, where we are and where we are going, and also talk about the points in time where we pivoted and how we went about building whatever we have today in terms of personalization. The agenda is very simple: first, why do we care about personalization at Paytm, which seems like a very large payments company; then the evolution, describing our journey; and finally how we engineered a very large scale personalization system.

Something about our scale. Most of you have heard about us; I'm assuming most people know what Paytm is, and if you had lunch today you could actually have paid using the QR code solution that we built. We have 100 million plus products and we work in over 2,000 categories, and it's a very wide variety: we go right from marketplace products, e-commerce products like cell phone covers and fashion accessories, to travel tickets, movie tickets, a lot of offline payments, deals; name a category and we probably have it. That's why we have a very large number of categories and hence a very large product set. We have about 100,000 sellers in the marketplace alone and about 6 million plus offline merchants. The reason I'm talking about these numbers is not to advertise but to tell you about the scale: when you have a scale like this, most of the out-of-the-box recommender systems, whatever people have done in the past, kind of start failing; even standard databases, standard technologies that you could use to build these start failing. And that's when real innovation happens, because you actually have to go and build things to match your scale. We have 220 million customers, most of you have heard this, and 80 million plus monthly active users, which is a huge number. That means people keep coming back to the app and to the website, and every time they're there you have to serve them some kind of products. Think about the categories I talked about earlier: you're selling travel tickets and hotel packages on one side, which can cost a few thousand rupees, and you're selling fashion necessities or bazaar products, if some of you have been shopping at Paytm, which can cost tens of rupees or a few hundred rupees. That's a huge variety. So how do I rationalize what should go on my home page? How do I rationalize what should go on individual category pages? How do I start driving traffic to each of these pages? It is a very difficult problem to solve. So variety of offering is our number one reason. Number two: when we talk about the marketplace itself, there is a very long tail assortment of products. We have roughly 100 million plus products in our marketplace catalog alone, coming out of 100,000 merchants, which basically means that not every product is going to sell; some of the products won't even sell once in a year. This is actually a typical retail problem, it is not unique to us; the uniqueness here is in the scale.
Then, transaction is king. We are not a content website, at least not today, so it's not that I want you to come and spend half an hour every day on the website; I want you to come and do a transaction, and I want to make that experience so seamless that when you come, I know why you have come and I show that to you up front, so that you can just come, do your transaction and get out. I don't want to get in your way. And then, like I said earlier, in any digital business you always have a long tail: you have a long tail of products and you also have a long tail of properties. What that means is that you will have very few properties which are very impactful, where people come and keep coming, like your home page; that's a very attractive proposition. If you go down the tree, there will be pages which, if your home page is getting 10 million views per day, will get traffic of, say, 100 views that day. So the premium properties are limited, and it is an obligation for us to make sure that whatever we put on our premium properties is very, very useful for our customers. By the way, putting something on a premium property can make or break a business, so we have to be very cautious about that. And this is exactly why personalization was not a fashionable, nice-to-have kind of product; it is a necessity for us.

Let's talk about the journey. We started somewhere around early 2016, and that's when we were having a lot of discussions: what do you mean by personalization, what are you going to do? I come from a background where I have been trained to set metrics against every damn thing; whatever I do, I try to set a metric against it, because everything should be measurable. So we said, let's set some metrics first, and then with the help of some marketing folks we put some nice words around them, and that's what I have on the slides here. The number one objective was customer delight. The simple metric against this is whether the customers who are coming in are actually doing a transaction or not. If customers like it, they will obviously transact; if they know they don't have to spend a lot of time, and once they are on the app or the website they can see the product they need, they will obviously buy. So that's the first one: we want to make sure we are showing products preemptively to customers, and not everything needs to go through search, because search is a multi-hop process and you don't want your customers to go through that pain. Remember, like I said earlier, transaction is king; my objective is that you come in, I show you what you are here for, you do your transaction and you move on. We should show relevant products to a customer based on their prior purchase behaviour; that is more the how. Then the second one is seller delight. We have so many sellers, we are a two-sided marketplace, so we have an equal amount of respect for the sellers too. Like I said earlier, if I put a product on the home page it can make or break a business, so I need to make sure that I am being equally just to my sellers and giving them an equal opportunity to sell on our platform.
Because that's what a marketplace is for: marketplaces are not meant for one or two sellers to sell everything, and we truly believe in that. The metric we set around that was the overall spread that sellers get. For the top product row on my home page, I can literally measure at the end of every day, every week, every month how many sellers I ended up promoting, and we will come to how personalization actually made that possible. The third one is obviously our business objectives. My category managers, my business growth managers, I cannot leave them out; they are an equally important entity in this trinity. We actually call this a trinity, a triangle; this has come from our CEO. There is a Paytm triangle which is customer, seller, and then Paytm business objectives. The business objectives vary: in one month it could be that I want to drive a lot of transactions, in some other month a lot of GMV, in another a lot of click-throughs, and that is driven by what we want to achieve in that particular month. So whatever I am building, I have to build it in a way that respects that. So remember: I have to make sure the customers are happy and getting what they need, I have to make sure the sellers are happy and actually able to sell, and then the Paytm business objectives, which means that while keeping both of them happy, we still need to make sure that whatever we want to drive, we are able to drive.

Cool, let's talk about the evolution of personalization now, and then I can talk about where we are at. Again, we started this journey somewhere in 2016, and that's when I was given one top carousel on the homepage. It's popularly called EDD internally; the name of that list was Exclusive Discount Deals. I will talk about how we used to build the assortment for it, but basically this is a property where we get a lot of clicks and obviously a lot of visibility, and the way it was built was that people would go, find some of the best deals and just put them there. So we started with that single carousel. Then we said, oh, it's working out, we tested it, we measured it (we will talk about the measurement journey also), so now why don't we put it on all the other high-traffic properties too? So we picked up one high-traffic property after another, for example the category landing page: you can go to paytm.com/electronics and that will take you to the electronics homepage, and we started taking it one carousel at a time there as well. Notice the difference here, though: most e-commerce companies go by a different approach. The general approach, the Amazon way, is to go in a widget fashion, which means building two or three different types of algorithms for different widgets. On the measurement side, we took a lot of input on how scientific measurement is done and baked all of that into our measurement platform, and now, if a customer is getting one type of treatment, they will continue to get that same treatment across the board, so that they are not polluting the experiment.
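The talk doesn't say how that sticky treatment assignment is implemented; one common way to get it, shown here as a minimal hypothetical sketch, is to hash the customer ID together with the experiment name so the same customer always lands in the same variant, with the hash value also giving you the traffic split. The experiment names and split below are made up.

```python
import hashlib

def assign_variant(customer_id: str, experiment: str, variants: list[tuple[str, float]]) -> str:
    """Deterministically map a customer to a variant.

    variants: list of (name, weight) pairs whose weights sum to 1.0.
    The same (customer_id, experiment) pair always returns the same variant,
    so a customer sees a consistent treatment across pages and sessions.
    """
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if bucket <= cumulative:
            return name
    return variants[-1][0]  # guard against floating-point rounding

# Hypothetical usage: a 90/10 split between a curated list and an ML-ranked list.
print(assign_variant("cust-42", "edd_homepage_ranker", [("curated", 0.9), ("ml_ranked", 0.1)]))
```

Because the assignment is a pure function of the customer and experiment identifiers, every page and service computes the same variant without any shared state, which is one way to keep a customer's experience consistent across the whole app.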
So we built an experimentation service. Then we went on to real-time data for evaluating experiments, and we built a really cool reporting tool; I have a few folks here from our data lake team who helped us build that reporting solution, and now we can literally see all of this data in near real time.

The next step, which we took recently, is portfolio balancing. We had another guy who came from a financial-math background, and he said: in financial math, people who trade stocks wish they could rebalance their portfolio very quickly; if some strategy is not working, reduce it, and increase another one that is working, but it is very expensive there because you always pay fees to rebalance. I told him, why don't you try it here, it's free. And he did; it is some kind of multi-armed bandit. The way it works is that we run multiple experiments, I want to say about 6,000 experiments a day across different types of lists, because we run about 2,000 lists and every list has at least two to three experiments running at any given point. We compare them in real time, and every hour, if I know a particular experiment is not working on a particular list, I start decreasing the exposure of that model on that list and increasing some other experiment (a toy sketch of this hourly loop follows below).

I really like this, and I don't talk a lot about adoption challenges here, I will talk about that at the end, but this was one of the pivot points that really helped me get adoption of machine learning. People don't believe in machine learning in any organization; it is a very hard thing, and I am not saying that only about my current company. I have been doing this for ten years and have fought this battle multiple times. It is hard to convince growth managers, category managers, business managers, and actually us as well. When self-driving cars come, it will be very hard for me to digest going completely hands-off, and they think about it the same way: I am not going to go completely hands-off, this is my business, and my bonus depends on whether my category does well or not. So they don't let you go full throttle; it is the same self-driving-car problem. Portfolio balancing helps because you can tell them: you don't have to be completely off the hook. Go and curate a list based on whatever assumptions and mental model you have, we will run it in real time, and if you are winning, you obviously get more credit and more exposure. I have seen this, and not just here but in other places I have worked: once you have built that kind of rapport and given them tools like this, the confidence comes and the exposure of the models increases.
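A toy sketch of that hourly rebalancing loop for a single list, assuming hypothetical per-model impression and conversion counts; the real system presumably uses a proper bandit policy rather than this simple nudge-toward-the-winner rule:

```python
def rebalance_exposures(stats, floor=0.05, step=0.2):
    """One hourly rebalancing pass for a single list.

    stats: {model_name: {"impressions": int, "conversions": int, "exposure": float}}
    Exposures are nudged toward each model's observed conversion rate, with a
    floor so no experiment is switched off completely (it keeps getting a small
    amount of traffic in case behaviour changes later).
    """
    rates = {m: (s["conversions"] / s["impressions"]) if s["impressions"] else 0.0
             for m, s in stats.items()}
    total_rate = sum(rates.values()) or 1.0
    targets = {m: r / total_rate for m, r in rates.items()}

    new_exposures = {}
    for m, s in stats.items():
        # Move only part of the way toward the target each hour, then clamp.
        nudged = (1 - step) * s["exposure"] + step * targets[m]
        new_exposures[m] = max(nudged, floor)

    norm = sum(new_exposures.values())
    return {m: e / norm for m, e in new_exposures.items()}

hourly = {
    "manager_curated":  {"impressions": 40000, "conversions": 900,  "exposure": 0.5},
    "ml_affinity_v3":   {"impressions": 40000, "conversions": 1300, "exposure": 0.5},
}
print(rebalance_exposures(hourly))
```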
And the third one is the most interesting one. We started with notebooks, and that was my foolish idea. We started writing our entire pipeline there; we were churning through roughly 20 terabytes of data at that point: gathering all the data, building the features, building the model, then saving the model and scoring it. All of that was done in this very nice notebook environment called Databricks. Databricks was a very new product at the time; it has matured a lot since then. And then, once we had scored everything: well, if you want to put it in production you have to run it every day, so how do I do that? It came with a cron-style scheduler, we just utilized that and started writing some scripts to augment it. And then I took a flight to India; by the way, I am based out of Toronto, and we have a fairly large data science lab where we have built all of this. I was flying out to launch it, and midway, around London, I get a call: by the way, this is not working, because notebooks are not meant for this and it just cannot handle the scale. I turned back. The learning from that is that never again in my life will I underestimate the engineering aspect of something I build. It is not that I underestimated the need for solid pipelines or for understanding how things go to production, but when you have to build production systems, you have to build them properly. So we built it in notebooks, it failed in two weeks, and we converted it into Spark jobs, which was an easy thing for us. In the first version we used Elasticsearch to serve it, with a very tight integration with our catalogue as the serving layer. Then, after some more learnings from that tight integration, we built a personalization service, which I will talk about in a bit, and finally we ended up building real-time scoring.

So where are we at today? Personalization everywhere; I think I have repeated this a few times. We always reach out to the customer with a very personalized list based on customer-level assortment, and people don't have to time-share lists any more. Earlier people were time-sharing lists: the fashion-accessories person and the shoes person would work together and try to put their products into the same list, and so on. This is another learning: any time we build models, we try to infer how things are actually done in today's world and then see whether we can build intelligent systems that can do it. In the typical world, for any product list like exclusive discount deals, whoever is configuring it starts from the catalogue, selects some pool of products and pushes it there. It is a mental model without a lot of complexity, so it cannot be generalized, and you will always have a dependency on humans to reproduce it.

Now I want to talk about the personalization framework that we have. This is a very standard framework that we have tried to use across different things; the only thing that evolves is the types of models, and the framework is general enough that you can mix and match different things. It obviously starts with the data lake, because that's where you are capturing all the data.
The first step we do is product pool selection. This is a very cool forecasting algorithm that we have built, and over time I would say it has gone through four or five iterations; I may even have the gentlemen who built it here in the crowd. What we do is start with the catalogue and build a very large time series for every product: how much it sold in the last one month, two months, three months and so on; actually we do it at the day level. Then we try to predict: if I were to give X amount of stimulation to this product, how much will it sell? It is based on time-lag variables, on visibility (how much visibility I am giving it), on how many search terms there were for the product, on the current discount running on it today, on how much it sells when no discount is running, and then on how much it would sell at a 10% discount, at a 20% discount, and so on. Based on that, I go and look at every discount that is going to run tomorrow, and I can say: my approximation is that this product is actually going to sell, say, 4 units. There is one added complexity: remember I told you in the beginning that any item that goes on the homepage or on a prominent page can make or break a business, which is why the location of the item has to be a factor in this, so we take that into account as well. Based on all of that, I can select a pool of products.
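A rough sketch of that kind of lag-feature, discount-scenario forecasting. The column names, the parquet extract and the gradient-boosted regressor are assumptions for illustration, not the actual model:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical daily training frame: one row per product per day, pulled from the data lake.
df = pd.read_parquet("product_daily_sales.parquet")

features = [
    "units_sold_lag_1d", "units_sold_lag_7d", "units_sold_lag_30d",  # time-lag variables
    "search_volume_7d",                                              # demand signal
    "impressions_7d",                                                # visibility given so far
    "baseline_units_no_discount",                                    # how it sells with no discount
    "discount_pct",                                                  # the stimulation we might give
    "slot_prominence",                                               # where on the page it would sit
]
model = GradientBoostingRegressor().fit(df[features], df["units_sold_next_day"])

# Scenario scoring for tomorrow: the same products under different discount levels.
tomorrow = df[df["date"] == df["date"].max()].copy()
for pct in (0, 10, 20):
    scenario = tomorrow.copy()
    scenario["discount_pct"] = pct
    tomorrow[f"forecast_at_{pct}pct"] = model.predict(scenario[features])

# Keep products whose forecast under the planned discount clears a bar (e.g. 4 units).
pool = tomorrow[tomorrow["forecast_at_10pct"] >= 4]
```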
The next thing is that we define category affinities. I told you we have 2,000-plus categories, precisely about 2,500. We have gone through several iterations of this model, I think six or seven, the current one being an LSTM which I think we just put into production; before that it was a semi-connected network. Here we are trying to predict which category the customer is going to be most interested in, because remember, I have already picked the best products in each category that are going to sell, so that problem is already solved. And when I say category prediction, I am not talking about 10 or 15 categories, I am talking about 2,500: your shoes and your socks are different categories, in fact your sport socks and your formal socks are two different categories, and we do the prediction at that level. Once I have all of that, then I use item recommendation. I don't use item recommendation as my starting point; it is my end point, the last step in the journey. That is the main difference: most of the time when you talk about a recommender system, that is your starting point. Spark comes with an ALS model, and all it does is take your view and purchase signals, calculate the interactions, run the alternating-least-squares method they have implemented, and then try to find which products are likely to interest you given the other products you have seen. Netflix does it, Amazon does it, it is the most popular approach, and obviously people have used category affinity in the past, I am not saying we are the first ones, but they do it after the fact. What we do is before the fact: we have always focused more and more on the category model, and in the last six to eight months we have been focusing a lot on the item model as well, partly because it is a genuinely hard scaling challenge; 100 million products is a non-trivial size.

After this comes re-assortment, because remember my third objective, the Paytm business objective, which says that in this month I want to increase transactions, in that month I want to increase GMV, and so on. How do I do that? I need to give the business some levers, and that is where, once I have built the assortment I am going to show to a customer, I need to re-sort it based on the inputs I get, and that is an optimization problem. So: a regression problem on your left, a category affinity prediction on your right, a collaborative filter (a kind of matrix factorization) in the middle, and an optimization problem at the end. That is our standard framework and it is what we use in all the places. I am happy to go into more depth, but I see I have only 12 minutes left, so I will not dwell on this.

The category model is an interesting one. We started with a very simple random forest classifier; it had about 6,500 features, and when we started we were predicting only 200 categories, because 2,500 categories, no way, we could not predict that. So we went one level up, it is obviously a hierarchy, and we started predicting from that point. It worked out: with 200 categories the precision was decent. The metrics we always use to measure our models are precision at K and NDCG. Then, as I said, we wanted to be on every home page and every category page, and that's when we started building a hierarchical classifier. By the way, one thing we added here: in Spark MLlib at the time you could not persist the model, especially the random forest model, so we actually built that ourselves, so we could persist the model and reuse it if it was stable enough; I'm happy to share that code at some point. The hierarchical classifier again does not come out of the box; we built it. We said, let's take all the major categories, because fashion is quite different from electronics, and build individual models for each of them, with a joint probability distribution at the end; if you only want to predict at the leaf level for one particular branch, you can just use that model.

Then came this new model. By the way, this is a fairly state-of-the-art mechanism for doing category predictions, used by a lot of people, and it is based on a paper from Microsoft. The insight is that user purchases are a sequence, like a sequence of words, so we use word2vec to build a latent space in which we can represent the user purchases. On the left you see user purchases and views, we take some target purchase category, we run it all through word2vec, and that gives me the latent space, this piece over here. Then I take that latent space and run a logistic regression on top of it, and that is why we like to call it semi deep learning: it is not truly deep learning because the model is not fully connected. We built that, and finally we were able to get category predictions that actually work across all 2,400-odd categories.
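A minimal sketch of that word2vec-plus-logistic-regression idea, using gensim and scikit-learn on toy category sequences; the token names and data shapes are made up for illustration:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy user histories: each is a time-ordered sequence of category tokens (views + purchases).
# In reality these come from the data lake and cover a couple of thousand categories.
histories = [
    ["shoes_sport", "socks_sport", "tshirt_men", "socks_sport", "shoes_sport"],
    ["tv_led_55", "ac_split", "tv_led_55", "soundbar"],
    ["saree", "kurti", "sandals_women", "kurti"],
]

# Learn the latent space over categories, word2vec-style.
w2v = Word2Vec(histories, vector_size=32, window=5, min_count=1, workers=2)

def embed(seq):
    """Average the category vectors of a user's history into one dense vector."""
    vecs = [w2v.wv[c] for c in seq if c in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Predict the last purchased category from everything that came before it.
X = np.stack([embed(h[:-1]) for h in histories])
y = [h[-1] for h in histories]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Category affinity scores for the first user, one probability per category.
probs = clf.predict_proba(X[:1])[0]
print(dict(zip(clf.classes_, probs)))
```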
Now we have done another thing, which I don't have on the slide: we take the next generation of this model, which has been live for about six months now, and it uses time buckets. Think of a very large stacked matrix which has this user latent space stacked by 0 to 2 days, 2 to 4 days, 4 to 6 days and so on, compared against a very large target vector at the end, and on top of that you run a logistic regression. You could also run a fully connected network, an MLP, but our dataset is so huge that we just could not finish it in time, and I need to finish every model in two hours; if I don't, I am late for the right prediction and precision at K starts dropping.

In the collaborative filter, again, we use ALS, but we give it our own flavour through the strength of the interactions. We said that not every interaction is equal, so we built a custom function which exponentially decays the interaction over time. And now, this is why I mentioned the 6 million plus offline merchants in the beginning: I am introducing all the payments data in here as well, so I can literally build a function for the ALS input which says I will give X amount of weight to an interaction between a digital product and a physical purchase, Y amount of weight to a digital product and a digital purchase, Z amount of weight to a physical product and a digital purchase, and so on. I can make all those permutations and assign different weights to them, and that is basically our variant of how we define our interactions (there is a rough sketch of this weighting idea a little further down).

Here is another use case: even after doing all this, corner cases keep coming up. This one was specifically for air conditioners, TVs, high-end electronics type of categories, where we were not showing the right products. So we built a variant of the network I showed you, using price buckets and brands, and this is the t-SNE visualization of it: the red points are the ones predicted by this model, the green ones are predicted by our general model, and the blue is the customer's purchase. You can see that the red dots are actually in the path of how customers actually purchase. It is an interesting one, because somebody actually challenged us once; this graph comes from the email we sent them saying, by the way, you are buying this type of stuff, that's why we are showing you TVs which are very expensive.

So now, at every point when we generate a recommendation, we are not serving from one model; it really depends on the type of list that is being created. We start with the customer at the time of generating the recommendation and look at a few things: does this list have multiple categories, more than 10? Is it primarily a list with only products from electronics? Or is it primarily a list with products from shoes and fashion accessories, a mixed kind of list? And based on that, we have built a system where we invoke different types of models. That is another problem with vanilla ALS: you have one model and you try to treat everyone through that same model, whereas in our case we can have multiple different types of models: multiple variations of the deal pool selector, multiple variations of category affinity and multiple variations of the collaborative filter, and we combine them based on the type of prediction that we want to make.
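A sketch of the weighted-interaction idea feeding ALS, using PySpark's implicit-feedback ALS. The source labels, weights, half-life and column names here are assumptions for illustration; the actual decay function and weights are theirs and not shown:

```python
import math
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()

# Hypothetical interaction log: integer user/product ids (ALS requires numeric ids),
# an event source, and how many days ago the event happened.
events = spark.read.parquet("interactions.parquet")

# Not every interaction is equal: weight by where it came from, then decay it over time.
source_weight = (F.when(F.col("source") == "digital_purchase", 3.0)
                  .when(F.col("source") == "offline_purchase", 2.0)
                  .otherwise(1.0))                       # views and everything else
half_life_days = 30.0
decay = F.exp(F.col("days_ago") * F.lit(-math.log(2) / half_life_days))

strengths = (events
             .withColumn("strength", source_weight * decay)
             .groupBy("user_id", "product_id")
             .agg(F.sum("strength").alias("strength")))

als = ALS(userCol="user_id", itemCol="product_id", ratingCol="strength",
          implicitPrefs=True, rank=64, regParam=0.1, alpha=10.0,
          coldStartStrategy="drop")
model = als.fit(strengths)
recs = model.recommendForAllUsers(20)   # top-20 products per user
```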
Okay, I want to share some very quick results. On the left you can see some lifts in CTR, but basically 2,000-plus lists are personalized today, we serve about a billion-plus recommendations every day, we have seen about a 2x increase in GMV per user, a similar increase in the number of transacting users and in units sold per user, and we have served a lot more sellers than we would have served by doing manual assortment.

This is our overall architecture; I am going to finish in maybe two more minutes. We have a batch part where we do a lot of feature creation and so on, and then, let me go to the next one, we have these three individual microservices, going back to that framework: a deal pool selector, which is in the middle here; a category classifier; and what we call White Mage, which is basically the collaborative filter. These are three independent microservices, completely deployed with their own data sources, and I can deploy whatever model I want in each of them. Then we have an orchestrator which talks directly to something called the serving layer, which is an extension of our catalogue. Every time the catalogue wants to serve anything, it first makes a call to us. Our measurement system determines what type of model needs to be served, and then we go to the orchestrator and tell it: pull up this model for this customer for the deal pool selector, this other model for the category classifier, this other model for White Mage. All of them send their results back to the orchestrator, which then goes to another DB that tells it, for this particular list my goal is to increase transactions; so it picks up the optimizer, whose ranking objective is to increase transactions, re-sorts everything, and sends back the recommended products. And obviously we do a lot of validations: is the inventory there or not, is the merchant in good standing, has the merchant been blocked, and so on.

Great, I want to share some learnings; these are some of the things I have learned over the last ten years. Don't build stuff for yourself: people will not adopt it and that will just lead to more frustration, so always build for your customer and ask them what they need. This is an interesting one I have learned: don't build things to present at conferences, build to add value to the business. Always do that piece, because when you add value to the business you get a lot more ideas, and then you will actually be able to present at a conference. Change is hard and messy; adoption is very, very hard. Machine learning adoption used to be even harder five years ago; now, because AI is a buzzword, business managers are coming around, there is a top-down push, so people are adopting it, but it is still hard. Find a business partner, get executive sponsorship, but more importantly find a business partner, and they will help you get through this. Engineering first: models are absolutely useless, they have zero value. Sorry for interrupting you, but you have overshot your time. I am almost done: just start small and keep iterating. Thanks guys, thanks a lot. We will not be taking Q&As for this session since the time has overshot by a lot.
There is a parallel OTR session going on in room 01; move out of the hall and take a left. That session is about interesting problems to solve with data science. Also, feedback forms have been distributed to you; please rate the talks, that helps us make the conferences better. The next session is by Santosh GSK from Practo, who will be talking about adopting bandit algorithms to optimize user experiments at Practo Consult. Also note that the forms distributed right now are the evening-session forms; the forms you had earlier were the afternoon-session forms, so there are two feedback forms for the same day, according to time, and we would really like you to fill in both, though it's not compulsory. The forms need to be submitted outside; there is a basket near the registration desk. You don't need to move out of the hall right now, you can do it while going out. I hand it over to Santosh.

Hello, cool. Hi everyone, I am Santosh, a senior data scientist at Practo. How many of you have heard about Practo? Awesome. How many of you have tried online consultation at Practo? Nice. For those who don't know, Consult is an online consultation platform where a user can post a health query and we, as a platform, reply back with medical advice. In such a platform, if you want to optimize for the user, you would often assign all the questions to the set of doctors who are performing well, who are answering a lot. But over a period of time this set of doctors, who are usually a very small fraction, will get a lot of questions assigned, they won't be able to answer all of them, and they will feel burned out; at the same time the rest of the doctors will get few questions assigned and won't be engaged with the platform. So if you try to optimize for the users, you end up with a bad experience for the doctors. On the other hand, if you try to optimize for the doctors, you randomly assign the same number of questions to each doctor; but not all doctors are equally active, some don't even open the app to see the questions assigned to them, so those questions end up unanswered, at the cost of the user experience. So if you optimize for your doctors, you get a bad experience for your users. You usually come across this kind of trade-off at any marketplace; even for e-commerce sites like Flipkart or Amazon, if you try to optimize for buyers you get a bad experience for sellers, and vice versa. In this talk I present how we used bandit algorithms to solve this trade-off problem. The key takeaways from this presentation are to understand how multi-armed bandits work, how you can use them for the problems that you work on, and how we used them at Practo Consult.

So let's begin. Let's say we have a guy who just came to Bangalore and wants to have a good dining experience, but he has no idea which restaurants are good and which are bad. He asks a couple of his friends, they suggest a couple of restaurants, he goes there and he likes them. So he makes two buckets: one bucket of restaurants he liked, the other of restaurants he has never explored. The next time he wants to go out he wants to play safe, so he goes to the restaurants he liked; more often than not, say nine times out of ten, he visits those. But he gets bored and occasionally wants to explore new restaurants, so he picks one from the unexplored bucket, and if he likes it, it joins the bucket of restaurants he likes. What we are seeing here is a multi-armed bandit setting: the restaurants on the right side are your multiple arms, the reward this guy is trying to optimize is a good dining experience, and the way he does it is through an exploit-and-explore strategy: he exploits nine out of ten times and explores one out of ten times, so that he does not miss out on new restaurants. This is the simplest form of multi-armed bandit algorithm; it's called epsilon-greedy, where epsilon stands for a probability. Here epsilon is 0.1, which means that ten percent of the time he explores and ninety percent of the time he exploits.
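A minimal epsilon-greedy sketch of that restaurant story, with made-up restaurant names and ratings:

```python
import random

def epsilon_greedy_pick(restaurants, avg_rating, epsilon=0.1):
    """With probability epsilon try something new, otherwise go with the best so far.

    restaurants: all options; avg_rating: {name: mean reward observed so far}.
    """
    untried = [r for r in restaurants if r not in avg_rating]
    if untried and random.random() < epsilon:
        return random.choice(untried)               # explore (one out of ten times)
    if avg_rating:
        return max(avg_rating, key=avg_rating.get)  # exploit the current favourite
    return random.choice(restaurants)               # nothing tried yet

history = {"dosa_corner": 4.5, "filter_coffee_house": 4.1}
print(epsilon_greedy_pick(["dosa_corner", "filter_coffee_house", "new_biryani_place"], history))
```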
This guy really liked the strategy; he kept exploring restaurants, and after this experimentation he was no longer stuck with the same few places.

So, coming back: Practo has a lot of use cases for patients. You can search for doctors, clinics or hospitals, you can book an appointment, you can consult online, and more. The use case we are looking at here is online consultation. A usual consultation form looks like this: you post your problem, and you also select a problem area, the specialty in which you want this question to be asked; we only send the question to doctors from that specialty. Once the question gets posted, we have the task of assigning it to one of the doctors on our list. Let's say today we have with us doctors Shyam, Alexa, Ram and Bob. Bob looks like a surgeon and seems to be busier than the other doctors. Somehow we have figured out their timings, and if you look at the availability, Shyam is available on weekends, Alexa in the evenings, Ram in the mornings, but Bob has a lot of surgeries scheduled and is just busy for that week. Now let's say this question comes in on a Wednesday morning. Even if, hypothetically, Shyam is the best-performing doctor, we won't assign it to Shyam because he is not available on weekdays; we assign the question to Ram instead, because he is available in the mornings. This kind of use of the context is what we call contextual assignment. How do we do that? For every question we build what we call a contextual feature vector; the features can be the number of words, the time the question came in, the age and gender, etc. We multiply that contextual feature vector with the parameter vector of each doctor to get what we call a predicted reward. The predicted reward is nothing but how likely this doctor is to give you the fastest response: the higher the reward, the faster the response you can expect from that doctor. You can then just pick the doctor with the highest reward and assign the question to him; here, in our case, Ram has the highest predicted reward, so you assign the question to him. But the important question is how you estimate the parameter vectors. We look at the previous assignments made to the doctor and the true rewards we got from them, and we then apply simple ridge regression to estimate the parameter vectors.
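A small sketch of the contextual scoring just described: a per-doctor ridge regression over past assignments, then a greedy pick on the predicted reward. The feature values and histories below are random placeholders, not Practo's data:

```python
import numpy as np

# Hypothetical history per doctor: contextual feature vectors of questions previously
# assigned to them, and the true rewards observed (e.g. response speed).
doctor_history = {
    "shyam": (np.random.rand(30, 5), np.random.rand(30)),
    "alexa": (np.random.rand(50, 5), np.random.rand(50)),
    "ram":   (np.random.rand(45, 5), np.random.rand(45)),
    "bob":   (np.random.rand(9, 5),  np.random.rand(9)),
}

def ridge_params(X, y, lam=1.0):
    """Closed-form ridge regression: theta = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Contextual feature vector for the incoming question
# (e.g. word count, hour of day, day of week, age, gender).
x = np.array([0.4, 0.9, 0.1, 0.3, 1.0])

predicted = {doc: float(ridge_params(X, y) @ x) for doc, (X, y) in doctor_history.items()}
best = max(predicted, key=predicted.get)   # greedy pick: highest predicted reward
print(predicted, "->", best)
```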
So all is good, and we made a few assignments. Over a period of time we observed that Bob had got very few assignments compared to the other doctors: Shyam had got 30, Alexa 50, Ram 45, and Bob only 9, because he was not answering a lot of them. But this doesn't mean Bob won't answer questions at all; he was just busy for that week, so we should give him more chances. How do we do that? Let's say a new question comes in on a Sunday afternoon. Using the contextual feature vectors we calculate the contextual predicted rewards as before, and along with that we now calculate what we call a regret bound. The regret bound reflects how little information you have about a doctor: the less information, the higher the regret bound. In this case Bob has very few assignments, so we don't know much about Bob and his regret bound is higher. We add this regret bound to the contextual rewards we already calculated, sum them up, and pick the best; so now Bob gets a chance to answer. This is the second algorithm in multi-armed bandits, called the upper confidence bound. If you are familiar with statistics, it is very similar to giving an estimate of a parameter plus or minus sigma: the less information you have, the smaller the sample size, the wider the confidence interval, because you are not sure of the parameter value. Similarly, here you are not sure about Bob, so he has a higher regret bound (there is a small sketch of this bound a little further down).

We implemented this algorithm, and some of the results we saw were that, starting from a random initialization of the weights, we got a 60% reduction in response times within 10 days. Also, interestingly, we saw a 25% increase in the engagement of the doctors, which means we were able to tap into more of the doctors on the platform and assign more questions to them. That was for Consult; let's see some other use cases where you can apply bandit algorithms.

One common comparison people make is with A/B testing. A user comes to your platform, you have two versions of your website, and let's say you know the true conversion rates, which are 0.3 and 0.2. Let's compare A/B testing with the simplest multi-armed bandit algorithm, epsilon-greedy. Here the blue curve represents the average conversion of epsilon-greedy and the green curve the average conversion of A/B testing. For those who don't know, A/B testing is a setup where you run an experiment for X number of trials, randomly showing one version to each user, and after you finish the X trials you pick the version with the highest conversion rate. Because A/B testing was doing random assignment, its average conversion was lower than epsilon-greedy's; epsilon-greedy, the dumbest of all the multi-armed bandits, was still able to perform better than A/B testing. But after 1,000 trials A/B testing starts to perform better, because it has picked the best version, version A, and shows it to every user, whereas epsilon-greedy is still sending 10% of the traffic randomly to each version. There are two points to take from this. First, if the behaviour of the versions changes, it is good that we keep exploring, and bandit algorithms will adapt to those changes. Second, there are methods in bandit algorithms, like annealing, which make you explore less as you collect more data, so you can close this gap with A/B testing. Some other applications: if you want to send user notifications and don't know what time to send them, you can apply bandit algorithms to come up with a personalized time for each user; similarly, news recommendations. You might read mostly data science journals, so you keep seeing data science articles, but what if you are also interested in bitcoin and never come across such articles? Bandit algorithms will occasionally probe your interest in areas you have never seen before, and if you like them, you will end up exploring new areas.
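A sketch of the "regret bound" (upper-confidence-bound) term and of an annealed exploration rate like the 0.1 / log(n) mentioned in the Q&A later. This is the LinUCB-style form of the bonus, written here as an assumption; the production formula may differ:

```python
import numpy as np

def ucb_score(x, X, y, lam=1.0, alpha=1.0):
    """Predicted reward plus an exploration bonus (the 'regret bound').

    The fewer past assignments a doctor has in contexts like x, the wider the
    bonus, so rarely-assigned doctors like Bob still get picked sometimes.
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)            # design matrix from past assignments
    theta = np.linalg.solve(A, X.T @ y)       # ridge estimate, as before
    bonus = alpha * np.sqrt(x @ np.linalg.solve(A, x))
    return float(theta @ x + bonus)

def annealed_epsilon(n_samples, base=0.1):
    """Explore less as more data is collected, e.g. 0.1 / log(n)."""
    return base / max(np.log(n_samples), 1.0)

# Toy usage: Bob's short history gives a larger bonus than a well-known doctor would get.
X_bob, y_bob = np.random.rand(9, 5), np.random.rand(9)
x = np.random.rand(5)
print(ucb_score(x, X_bob, y_bob), annealed_epsilon(9), annealed_epsilon(100000))
```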
So if anyone in the audience is from Inshorts, you have a cool new project to try. These are the two references: if you want to get into bandit algorithms, "Bandit Algorithms for Website Optimization" by John Myles White is a very good starter book, and the paper "A Contextual-Bandit Approach to Personalized News Article Recommendation" is the one that we implemented.

Lastly, when to use or not use bandits. If you have a short experiment, like a Facebook or marketing campaign, and you have a couple of different versions to test, say a message like "sign up and get 30% off" versus "please sign up and get a discount of 30%", and you don't know which version will give you maximum conversions and don't have time to run an A/B test and pick the better version, then bandit algorithms will automatically optimize for the better version and redirect the traffic to it. Also, if you have continuous experiments, like in Consult where doctors keep joining and moving away from the platform or their behaviour keeps changing, as we saw with Bob, bandit algorithms continuously adapt and try to optimize the conversion. When not to use bandits is when you want to run a test and infer which version is better than the other: a bandit doesn't really care about which version it is, it only cares about the conversion, so A/B testing should be preferred there, and one example is clinical trials.

Questions? So, you mentioned the bandit algorithm and A/B testing, and in one of the slides A/B testing started improving after a certain number of trials; at what cutoff would you say I should switch from A/B to bandits, what time frame am I looking at? Well, it is usually used as an alternative to A/B testing. All an A/B test does is find the version with the highest conversion and show it to 100% of your traffic. But do we need to show that same version to every user? Maybe the other version is close enough, or maybe with more traffic the second version turns out better than the first; a bandit algorithm tries to adapt to that. So it is an alternative to implement from the start, not something you switch to after the A/B test. And yes, that cutoff was a random example; the number of samples depends on the power of the test you want and on the difference between the options. If both versions of your website are roughly the same, you need more data to conclude which one is better; a bandit would be better there because it automatically tries to maximize conversions rather than deciding which version is good, but if you want a stable website where you serve just one stable version, then you can choose A/B testing and keep that version. We will take another question; you can take this offline. Any questions? Basically, when the algorithm converges, it is more or less like A/B testing, right? I am sorry? When the algorithm converges, essentially the performance would be like A/B testing? When the algorithm converges you would still have some amount of exploration happening; in bandit algorithms there is always some percentage of traffic assigned to a random or less efficient page, just in case the behaviour changes, and you can actually set that in the bandit algorithm, to explore less as you collect more data.
More questions? My question is about the exploration probability, two points. One, how do you determine the rate at which you reduce the exploration probability over time? And second, how does the context of the problem influence the size of the exploration probability? For instance, with news, people might have more of an inclination to explore, so they will tolerate it. Sorry, I did not get your second question. The size of the exploration probability: how do you think about the influence of the problem's context on that size? If you take news, for instance, people will probably be more open to, say, one out of five articles being random; it does not hurt much and there is every chance it is something exciting. But in something like getting medical advice, if you give me a doctor who is not the best qualified to answer my problem, I am probably going to be more upset; it does more damage. So how do you think about the context influencing the exploration probability? Okay, if I understand it right: how do we define the exploration probability based on the context, and how does it shrink over time? For the shrinking part, if you want to be sure that all the arms have got enough samples, you can use a simple measure, something like log of n, or anything like entropy, which gives you a measure of how much data has already been seen; if you already have 1,000 or 2,000 samples collected, that tells you how much exploration has already been done, so you can reduce the exploration probability, for example 0.1 divided by log of n, which keeps decreasing as n increases. For the second question, it depends on the context and on how critical the conversion is; in a healthcare scenario the conversion probability might be far more critical than in other contexts, so it depends on the use case. Last question, anyone? Thank you, that was a great talk.

We are now going to take a short 30-minute break for evening tea; we will reconvene in the hall at 4:10 pm sharp and start the OTR on ethics in machine learning, which will cover fairness, accountability and transparency in machine learning. Also, there are dinner tickets available at the ASIC counter; the price is 3,000 rupees for access to the dinner and networking with the speakers, and the dinner is at the BluO bowling centre in Phoenix Marketcity, Whitefield. Remember to bring your badges for registration and check-in tomorrow morning; fresh badges will not be issued tomorrow. And please submit your conference feedback at the ASIC counters; your feedback helps us make the conference better. I also have information about a lost login-code device: if you know somebody who has lost a device that generates login codes, it is from a brand called Gemalto, you can redirect them to the ASIC staff and volunteers and we will return that device. Thank you.