their phone numbers or email addresses, IP addresses, and things like that. Once the customers are merged, we might have to put in some defaults, because a store customer will not have a home address while a website customer will. We also standardize the data: because it is coming from so many different platforms, we need some standardization. That means things like converting time-related facets to epoch, and currency facets, maybe, to a single currency.

Joining in Hive is a very costly affair, and storage is quite cheap. So we denormalize the data: we do all the joins in the backend and hand over the joined data, to make life easy for the consumers of this data. We also partition the data for the same reason, run multiple checks to make sure we are consuming the data in a correct manner, and then finally publish it.

With this in mind, here is how we could have designed the system: take the user identity, create the mappings, do a join with all the source data, and publish it in the final table. But as I said, joining is a costly affair, and joining 20 or 30 sources in a single query means you need lots and lots of resources; that query would probably run for hours, maybe even days. So a single big join was not something we could do. We added a staging table in between: the data from each source would be joined with the staging table one by one and then finally published to the final table. But even this setup ran for around 20 to 25 hours. So instead of a single staging table, we added partitions to it, and the data from each source goes to its own partition in the staging table. This parallelization helped us a lot and brought the time down from 20 hours to five hours. It also made the developers' lives easier: previously an integration test took a full day, while now it takes just four or five hours.

One thing that was still very wrong with our setup was that we were running all these tasks through crontabs, and believe me, that's not a good option. Crontabs are very difficult to debug and manage, especially when multiple people are working on the same set of tasks. So we needed a good scheduler; we looked at a bunch of very good schedulers and finally decided to go with Airflow. There are a lot of good things about Airflow and I don't have the time to discuss all of them, but I'll tell you the few we liked the most. The first thing we liked very much is the way you can generate the DAG itself. Unlike other schedulers, you can do it programmatically: we keep all the metadata for our sources in a MySQL table that Airflow fetches at runtime, iterates over, and uses to create the tasks. That means every time we add a new source, we don't have to write code for it; we just add the metadata for it to MySQL, and the DAG picks it up automatically at runtime.
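For illustration, a minimal sketch of that metadata-driven DAG generation might look like the following. This is not the production code: the `source_metadata` table, its columns, the connection IDs, and the `process_source` callable are all hypothetical.

```python
# Sketch: build Airflow tasks dynamically from source metadata kept in MySQL.
# Assumes a MySQL connection named "metadata_db" and a hypothetical
# source_metadata table with (source_name, hive_table) columns.
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

dag = DAG("customer_360_staging", start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")


def process_source(source_name, hive_table, **context):
    # Placeholder: pull this source's data into its own staging partition.
    print("loading %s from %s into staging" % (source_name, hive_table))


# One task per source; adding a new source is just a new row in MySQL.
rows = MySqlHook(mysql_conn_id="metadata_db").get_records(
    "SELECT source_name, hive_table FROM source_metadata")
for source_name, hive_table in rows:
    PythonOperator(
        task_id="stage_%s" % source_name,
        python_callable=process_source,
        op_kwargs={"source_name": source_name, "hive_table": hive_table},
        provide_context=True,
        dag=dag,
    )
```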
The other thing we liked is that Airflow is highly scalable: the scheduler is completely independent, and you can have as many workers as you want. We have a common setup for our entire team, with different workers for different kinds of workloads, so each of us has an independent setup that doesn't have to care about the rest of the tasks and can just run its own stuff. This architecture also helps with maintenance: suppose something is wrong with your scheduler and you want to redeploy it, but there's a task already running on a worker that you don't want to stop. This setup lets us deploy the scheduler or the workers independently without affecting anything else. Airflow also has a very rich UI, which helps with a lot of things like restarting tasks and looking at logs; all of that is available on the web server itself.

We also did a bunch of other optimizations. We ingest the data incrementally: if we have already processed something once, we are not going to process it again, which also gives us idempotency. We also used compression. We looked at the ORC format with a bunch of codecs, and ORC plus Zlib gave us a very good result in terms of storage space — it reduced it by a lot — and it also benefited runtime, making our jobs faster. So this was a very good optimization.

Apart from that, for data reliability, every time a task runs we compare the input data with the output data and make sure we're consuming all the data correctly. We create snapshots of our data, which means rolling back is as simple as deleting a done file. Our system is fault tolerant, meaning we take care of all the edge cases, even the ones that occur once in a blue moon. Monitoring, in my opinion, is the most important thing you need when running a large-scale application: you need to be aware of things going wrong with your system even before a user has a chance to raise a trouble ticket.

We also used VerdictDB to do some approximate query processing. In very simple words, what VerdictDB does is create an intelligent sample of your data, run the query on that, and give you a projection for the entire dataset. How does this help us? Suppose you're a marketer who wants to promote the latest iPhone X, and you want to create a user segment of people who bought an iPhone in, say, the past six months. The data comes in, you look at the segment, and the number of users is very small. So you decide: okay, get me all the users from the last five years; and if that is also too few, then maybe get me all the users who bought any Apple product in the last six months. A marketer always has to run a number of filters to get to the final segment they want, and because the dataset is so huge, each of these segment creations takes around eight to ten minutes — and waiting that long every time you change a filter is not a good idea. In comes VerdictDB: as I said, it creates a sample and runs the queries in a very short time, like five to ten seconds, and gives you an approximate count of how many users those filters will match. Once you're happy with the approximate number, you can generate the actual segment. VerdictDB claims to be 99% accurate, with just a 1% margin of error, which is a good number for an approximate count.
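Stepping back to the storage-format point for a moment: this is roughly what the ORC-plus-Zlib choice looks like when a table is created through Airflow's Hive operator. It is only a sketch — the table name, columns, and connection ID are invented for illustration.

```python
# Sketch: create a Hive table stored as ORC with ZLIB compression,
# orchestrated from Airflow. Table and column names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

ddl_dag = DAG("create_tables", start_date=datetime(2018, 1, 1),
              schedule_interval=None)

create_profile_table = HiveOperator(
    task_id="create_customer_profile_table",
    hive_cli_conn_id="hive_default",
    hql="""
        CREATE TABLE IF NOT EXISTS customer_profile (
            customer_id STRING,
            email       STRING,
            gender      STRING,
            updated_at  BIGINT  -- epoch, per the standardization described earlier
        )
        PARTITIONED BY (ds STRING)
        STORED AS ORC
        TBLPROPERTIES ("orc.compress" = "ZLIB")
    """,
    dag=ddl_dag,
)
```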
So with all those things in mind, what we finally had was this: we create the identities of the users — we generate them based on the things I explained earlier. We have a bunch of generate tasks running for each of the sources, where we get all the data from that source. We also have a bunch of tasks for collecting metrics from all the sources. We group the source data by the identities, compare those numbers with the numbers we got from the sources, and then finally publish. With this entire setup the pipeline runs in three hours, and we are consuming the data of 500 million customers who are present on the web, as well as on 400 million mobile devices, doing 1 billion daily activities on the site and 24 million daily activities on mobile. So that was it — now on to the most dreaded part of this presentation. Fair warning: I may not be able to answer all of your questions, but I'll do my best.

Thank you, Soumya. Can we have an applause please? Okay, how many of you have questions — can you please raise your hands? Just one? That's it? Okay, can we get somebody to give him the mic please.

Hello, hi. Can you talk a bit more about the reliability? You said you compare the input and output — how do you do that on that last step, because your pipeline already takes so much time?

We try to make sure that all the reliability work we do does not affect the runtime. If you see here, the metric collection from the sources happens in parallel with the generate tasks, which means that even without those tasks we would still be spending that time on the generate tasks, so it doesn't add to the total. Then, once we have the data in the final table, the comparison does take some time, but I think that is necessary, because we do want to make sure we have the correct data. Apart from that, we also have some checks while the generate tasks are running, and again we try to make sure all of those run in parallel so we're not spending a lot of extra time. As for the metrics we collect: the most basic thing we run every day is that the counts of customers in the source tables should always match the counts in the final table. Apart from that, we have weekly jobs and test suites, which we also run when deploying a new change, that compare the exact values of the facets. Suppose I'm collecting gender for a customer: we make sure we have all the customers, and that the data for all of them matches the source. But this is a heavy query, which we run only once a week when the load is low, or whenever we're making a new change, to make sure the change isn't breaking anything.

So this is regarding — when you mentioned that you have to clean the data before you put it through the scheduler, how do you automate that?

One thing we do have to do is look at the data once, when adding a new source, to make sure there isn't some new kind of value we're not already handling. Otherwise there are things like default null checks; and take gender again — there are only three or four values that should appear in that column, so if we get anything else we ignore it. Nulls and blanks are always there in all of the sources. And we do run a basic set of sanity tests before adding a source, to find out all the kinds of values that could be present in it.

Can you talk a bit more about the staging tables?
Okay, so the staging table is basically a replica of the final table. The only difference is that it has partitions based on the sources: whenever we get data from one source, we put it into that source's own partition in the staging table. Once the data from all the sources is available in the staging table, we group by the identifiers — store ID, email ID, and so on. So the structure is the same as the final table we have at the end; it just has different partitions.

Just to clarify: so basically the joining happens in two stages, right — one joining from the sources, and then the final one? Right.

Hi Soumya, thanks for the talk. You said you're partitioning by sources; now each source will have a varied amount of data, so that means you're generating skew in the system. Yes. So how do you deal with that? Mostly, since we are getting the data at a customer level, the data is not that skewed. But sometimes we have seen it — for example, when we get the data for devices, the number of mobile devices will obviously be higher than the number of actual customers. So we do have bucketed joins, for example the first time we join the identifiers and the mappings, where we try to make sure we have a good distribution. And since we are bucketing, we decide the bucketing scheme based on what data is going to come in: if it's going to be a large dataset we use more buckets, so that the number of items in a single bucket stays roughly constant.

Hi. I just wanted to know more about how you manage backups — you said you maintain versions of the data, right? In the warehouse, how do you go about keeping a backup of that whole big chunk of data and managing it in versions? It's a huge amount of data, so how do you go about doing that? So Hive, on the distributed file system, definitely helps us keep the data replicated across multiple machines, so that is one thing. Apart from that, every time we land data we create partitions based on the date, so if I'm adding some data today it will have a separate partition based on the date as well as other attributes. This also helps, as I said, with rolling back: if the data in the latest partition is wrong, we just delete that partition, or delete the done file for that partition. But yes, since we have a lot of data we do need a lot of machines to hold it, and we don't keep the replicas indefinitely — we keep maybe the last 10 or 15 months, so that if we are rolling back we have scope to go back three or four steps, but not more than that.
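As an aside, a rough sketch of the staging layout described in that first answer could be written as HiveQL strings like the ones below, which the DAG's tasks would hand to Hive. The table, partition, and column names here are invented for illustration.

```python
# Sketch of the per-source staging partition and the final group-by publish.
# Names are illustrative; '${ds}' stands for the run date substituted at runtime.

stage_web_orders = """
    INSERT OVERWRITE TABLE customer_staging PARTITION (source = 'web_orders')
    SELECT customer_id, email, store_id
    FROM   web_orders_raw
    WHERE  ds = '${ds}'
"""

publish_final = """
    INSERT OVERWRITE TABLE customer_final
    SELECT customer_id,
           max(email)    AS email,     -- one resolved value per identity
           max(store_id) AS store_id
    FROM   customer_staging
    GROUP BY customer_id
"""
```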
Hi Soumya, thanks for the talk. I have two questions: one, do you do any kind of enrichment or deduplication in your staging layer, or after it? And two, what is your choice of data structure, or data model, that makes all of this easy to do?

On the first question: we don't enrich the data ourselves, but we do have several models. For example, for a store customer we won't have a lot of self-reported data — taking the gender example again, a store customer will not be entering anywhere that they are male or female. So we have data models where we look at the purchase history, and based on that — the data scientists have built these models — we ingest that data and make it available as well. We're not doing that ourselves, but we have a bunch of data scientists from other teams working on multiple models to get more data, probabilistic data, beyond what we already have from the customer directly. As for the data structure, we have a very flat structure. Most of the data has a single value or multiple values, so it just goes in as a column. We also have some complex data — for example purchase history, which has multiple things like what was bought, which department, and the date the purchase was made — and for that we use structs. Even there we don't want too much nesting, because that would again make the queries slow, so we keep it as flat as possible so that the final queries run in good time.

Hi, I have a question. How many segments do you eventually create — that's the first question. Second, how do marketers use these segments? Okay, so for segments, right now I think we have maybe five or six segments per hour being generated. The marketers usually use these segments, as I said, when there's a new iPhone coming in: they want all the users who showed an interest in Apple products in the past — they might have bought one, they might have browsed one — and they use that segment to send out emails saying this new product is available. There is also retargeting happening: for example, if you look at something today and don't buy it, then tomorrow we might show an ad for it on Facebook or Twitter to remind you that you can go ahead and buy the product.

Do we have more questions? Having an active audience asking so many interesting questions is the best feedback, isn't it? Thank you for such an amazing first talk of the morning — can you please give her another round of applause, that was amazing. Thank you everyone. We have Saakshi up next — is she here? She's here; she's going to be talking about how you can improve your data quality. Just give her 30 seconds to get set up. Oh, just one minor announcement: we have a community space upstairs near the BoF area, where Mozilla, CIS and DataKind have booths. If you'd like to explore how data science is used in a non-profit context and in community spaces, please go talk to them — they're very interesting to talk to.

Hi, I'm Saakshi, and I'm going to give a talk about how you can improve your data quality using Apache Airflow check operators.
First of all, setting the context for those of you who are not aware of Airflow, here's a short introduction. Apache Airflow is a tool for ETL orchestration. The structure you see here — this is the Airflow UI — is called a DAG, a directed acyclic graph. The operators you see in it are what actually get the work done; the jobs run there. So while DAGs describe how you run a workflow, operators determine what actually gets done. The arrows determine the dependencies: to the right are the operators, to the left are their parent operators, and only after the parent operators are done do the child operators start.

Just to give you an idea about scale at Qubole: this is a typical DAG at Qubole, one of the most involved ones. As you can see, it has 90-plus operators of eight or nine different types, so if anything goes wrong in any of them we might have data corruption issues. The biggest problem is that it's very hard to debug what went wrong and where, so this is very error-prone. That brings me to the data quality issues we ran into. We had missing data — the data wouldn't be coming to us, or there might be a bug in the pipeline due to which data would be missing. We had data duplication issues — the query might not have been idempotent when we wrote it, so we might be processing the same data again and again. We had data corruption issues, and some system issues due to which the ETL itself did not run.

This brings me to the importance of data validation. An application's correctness depends on the correctness of its data: if your data itself is not correct, then the application is not reliable. This means we have to increase the confidence in our data by being able to quantify data quality, so that we have some quantifiable confidence in our application. Correcting existing data can be very expensive and a very cumbersome process — those of you who've done it will know this. And we want to stop critical downstream tasks if the data is invalid.

Trend monitoring was our first attempt at solving the data quality problem. The idea was that we would monitor the data and alert in case we see any anomalies — say, we are suddenly receiving a burst of data, or there is no data for a particular day. These were the kinds of trends we alerted on. But we soon came to the conclusion that this is a hard problem. First of all, it's not real time: these trends can only be monitored once the ETL has finished, once we have the data, so we cannot stop anything if something is wrong. Also, one size doesn't fit all: different ETLs manipulate data in different ways, they have different critical fields and different data types; some tables are partitioned, some are not; some add new data daily, while for others there might just be edits to some columns. So generalizing trends is difficult, which makes the whole thing difficult to maintain: whoever is writing the ETL also has to understand the entire architecture of the trend-monitoring project, keep the two in sync, and effectively work on multiple projects at the same time. It's also not foolproof — there might be a lot of false positives, or we might miss some alerts.
So we noticed that if we were to delegate the task of data validation to the ETL itself, then all three problems I just mentioned — it not being real time, the difficulty of generalizing trends, and the difficulty of maintaining multiple projects — could be solved. The validation would be specific to an ETL; it would be real time, because it runs along with the ETL itself; and it would be easy to maintain, because the ETL writer doesn't have to work on multiple projects — they just write the validation for their own ETL. This is when we came up with the idea of using assert queries for data validation: just like we have assert statements in a unit-testing framework, we can have assert queries for data validation. While this is simple, it has generated immense value for us.

While exploring the implementation we came across Airflow's check operators. Let me talk a bit about them. These operators expect a SQL query and a pass value; if the output of that SQL is not within some predefined range of the pass value, the operator fails. So this has been our approach: we extend the open-source Apache Airflow check operator for queries running on the Qubole platform, we run data validation queries, and we fail the operator if the validation fails. After that we have the flexibility of failing the entire DAG if the check is critical enough, or continuing down the path if it is not. We can tweak this using something called trigger rules in Airflow: if need be, the subsequent tasks can still begin to run even if the parent operator has failed.

Let me digress a bit to talk about Qubole Data Service, because we need to know what Qubole commands are from here on. It is a self-service platform for big data analytics: we provide Apache tools like Hadoop, Hive and Spark integrated onto a platform, so you can submit your queries to these engines. So when I talk about Qubole commands, these are essentially the queries running against these data engines.

Coming back to the check operators: this is how you create a Qubole check operator. The command type here is a Hive command, so the query is going to run against the Hive engine, and its output is going to be matched against the pass value; if it doesn't match, or isn't within some predefined range, the operator fails.

Now, when we started using these operators there were some limitations we encountered and some enhancements we had to make for our use case. For example, we wanted to compare data across multiple engines, but the problem was that the pass value needed to be defined before the ETL starts. This use case was important for us because we were importing data from our Amazon RDS instances into our data warehouse, so there was no way of knowing beforehand what the expected value in the warehouse tables would be. If you're forced to hard-code the pass value at the start of the ETL, it's not possible to fetch the value from the source engine and compare. The solution is to make the pass value an Airflow template field, so it can be configured at runtime. The pass value can then be injected through multiple mechanisms, one of them being getting it from another operator. Once it's an Airflow template field, we can do exactly that.
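A minimal sketch of that approach is below. It assumes the contrib Qubole check operator behaves as described in this talk (with a templated pass value) and uses XCom to carry the source-side count; the task IDs, table names, connection IDs, and tolerance are invented for illustration.

```python
# Sketch: compare a warehouse row count against a count fetched at runtime
# from the source (an RDS/MySQL table). Names and IDs are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

dag = DAG("orders_import_checks", start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")


def fetch_source_count(ds, **context):
    # Count rows on the source (RDS) side; the return value lands in XCom.
    row = MySqlHook(mysql_conn_id="orders_rds").get_first(
        "SELECT COUNT(*) FROM orders WHERE created_date = %s", parameters=[ds])
    return row[0]


source_count = PythonOperator(
    task_id="source_count",
    python_callable=fetch_source_count,
    provide_context=True,
    dag=dag,
)

# pass_value is treated as a templated field here, so it is resolved at
# runtime from the upstream task's XCom instead of being hard-coded.
warehouse_count_check = QuboleValueCheckOperator(
    task_id="warehouse_count_check",
    qubole_conn_id="qubole_default",
    command_type="hivecmd",
    query="SELECT COUNT(*) FROM warehouse.orders WHERE ds = '{{ ds }}'",
    pass_value="{{ task_instance.xcom_pull(task_ids='source_count') }}",
    tolerance=0.01,  # allow a small relative difference
    dag=dag,
)

warehouse_count_check.set_upstream(source_count)
```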
So the approach is: we run a query against our source table, we get the pass value, and we inject it into the check operator running against the destination table. This is how we can compare data in two tables at runtime. This is how you define an Airflow template field: the pass value you see here is a Jinja template. It's a very simple example, just for the demo — "1 - 1" here would be evaluated at runtime. What we actually do is get this value from another operator at runtime, and that's how we're able to compare across engines.

The second enhancement is validating multi-line results. The stock operator considers only a single row for comparison — the first row. For our use case we wanted to run group-by queries and compare each aggregated value against some pass value; essentially, we wanted to compare against the second column, not just the first row. So to the Qubole check operator we've added a parameter called results_parser_callable. The function it points to holds the logic that returns the list of records on which the checks will be performed. So it's up to us which row to use for the comparison, which column to use, and how to parse the result. This is how you parse the query result: here we return the values of the second column, and the check is performed against those.

Now, how we've integrated these operators with our ETLs. The first ETL is data ingestion — I've already talked a bit about this. Here we're importing data from RDS into our data warehouse for analysis purposes. It's not a plain copy of RDS that we import, because we've added some optimizations: we don't want to import entire tables daily, so we've added filtering logic, upsert logic in Hive, and things like that. Historically we faced issues like mismatches with the source data, duplication, and data missing for certain durations due to some bug. The check we've employed here is a count comparison across the two data stores, source and destination, using the approach I just described. These checks have helped us verify and rectify our upsert logic.

The second is data transformation. This ETL repartitions a day's worth of data into hourly partitions, so at the end of a day we should have 24 hourly partitions. Historically we faced issues like all the data ending up in a single partition, wrong ordering of values, or data getting corrupted. The checks we've employed now are that after a day's run we verify that the partitions created are indeed 24, and we check the value of a critical field, the source — you can see here this is repartitioning YARN data, so the source should only be 'yarn' and nothing else; if there's anything else, we know something is off. These checks have helped us rectify our repartitioning logic.

This is the most critical ETL we have at Qubole: here we're doing cost computation for our customers, measured in Qubole Compute Unit Hours. Currently we're narrowing down the granularity of the computation from daily to hourly — we want to tell customers, hour by hour, what their cost is. These checks have helped us monitor the new data and alarm in case there is any mismatch with the old data, because here the source of truth is our old data.
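As an illustration of the kind of validation queries just described for the repartitioning ETL, the checks could be expressed as queries whose outputs are handed to the check operators with the expected pass values. This is only a sketch — the table and column names are invented.

```python
# Sketches of the validation queries described above; table and column names
# are invented. Each would be handed to a check operator with the expected
# pass value (24 hourly partitions; zero rows with an unexpected source).

hourly_partition_count = """
    SELECT COUNT(DISTINCT hour_partition)
    FROM   events_repartitioned
    WHERE  ds = '{{ ds }}'
"""  # pass_value = 24

unexpected_source_rows = """
    SELECT COUNT(*)
    FROM   events_repartitioned
    WHERE  ds = '{{ ds }}'
      AND  source != 'yarn'
"""  # pass_value = 0
```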
This is the last ETL I'm going to talk about. Here we're parsing customer queries — Qubole commands — and outputting table usage information: which tables are being used frequently, which columns are being used (so we can recommend columns while customers are writing queries), which tables are used so rarely that you're probably not using them anymore, and things like that. Historically we faced issues like data missing for a certain customer account, and data loss due to different engines and different versions of those engines: our customers operate on different engines, and even for a single engine they work on multiple versions, so we faced data loss from not being able to parse such queries — from the syntax checking failing on them. The checks we've employed here: we group by account ID and check whether any of the counts is zero; if so, we raise an alert, because that means we're not getting data for that account. Second, we group by engine type and account ID, and if there's a high error percentage we raise an alert. These checks have helped us gain insight into the amount of data loss we're facing because of queries we can't parse, and they've also acted as a feedback mechanism: we know which engine's data we're missing, so we can incorporate that into our parser and make it more robust.

Features of this framework: we're able to plug in different alerting mechanisms — it can be an email operator, a Slack operator, or any other kind of alert you want. We have dependency management and failure handling: if the checks are critical enough we stop the pipeline, we stop the ETL then and there, and if not we continue down the path. We're able to parse the output of the assert query in a user-defined manner, fetch pass values at runtime for the comparison, and generate failure and success reports — so this doesn't have to be only an alerting mechanism, we can use it for reporting as well. The lessons we learned: estimating data trends is a difficult problem and needs to be tackled with precision, and delegating the task of validation to the ETL itself can solve a few issues for us. The source code has been contributed upstream. Any questions?

Just an announcement before the questions: we have our next talk by Priyanka, who's going to be talking about user response prediction, so do hang around for that. Now the questions, very quickly.

Actually, thank you for the talk, and I have a question: how do you differentiate between critical and not-so-critical validation problems? And when something is non-critical, do you use it for analysis or for health checks? Another question: when there are issues after a day's worth of running, and you then find out that something went wrong — what kind of database do you use, and is there a possibility to roll back? Because further days will be coming into your pipeline, so how do you handle old days and new days all intersecting?

Okay, I'll answer the first one. Like I mentioned, something is critical if it's something we cannot revert — for example, we compute QCUH and push it to something called Zuora, and we cannot take it back from there, so it's necessary that if anything is wrong we stop
then and there. So that is one example of a critical area. The last ETL I talked about is one of the examples where we're using it essentially for analysis purposes — for making our parser more robust — so that's a case where it's not critical enough and we use it that way. Does that answer your first question? The second one — can you repeat it? So my second question was how you revert: when you have a day's worth of run and you then find issues, the next day's data may already be in the pipeline. So we use non-critical checks for the transitioning phase: until now something may have been wrong without us knowing, but once we added check operators, we know when something is wrong. During this transitioning phase we want to analyze the results of the operator — it's possible our checks were too strict — and after that analysis, if we see that something is genuinely wrong, then we can make the check stop the ETL.

Hello, thank you for the crisp and useful talk. Do you have any recommendations on how to do reporting? If you have a lot of issues, you can't just put all of them in a Slack channel, right? So how do you do reporting — are there charts or graphs? Okay, so we have on-success and on-failure callbacks on all the operators in the ETL. You can essentially add them to a task instance, do an XCom push and pull, put the results someplace in a DB, and after that have an email operator or an alerting operator collect all those results and give them to the user. This is one thing we were planning on doing.

Hi — where have you defined your validation rules? And is it configurable — how easy or difficult is it to change those rules? As in, you define that you need to check some condition and based on that decide whether the data is right or not, so where are all these rules defined, and how easy is it to change them? These are essentially defined in the query itself — the query we run for validation. Some part has to be in the ETL as well, like the parser logic I mentioned: if we're not able to parse something, we output that information into the table itself, then we do a group-by in the query, extract all those errors, and use them for our reporting and alerting.

Do we have more questions? Is that it? Great — thank you Saakshi for a crisp and useful talk, as our audience member said. Next we have Priyanka, who's going to be talking about user response prediction. People at the back, would you like to come sit down, because this is going to be a 40-minute talk — I can see a few empty seats here on my right. Just a quick reminder, there's also an Airflow BoF happening at 11:30, in case you just joined us; that's happening upstairs.

Hello everyone, I am Priyanka from Walmart Labs. I work as a senior data scientist with the computational advertising team there, and I'll be talking about how we do user response prediction at scale. So first off, what does an advertising ecosystem look like? There is a user who will visit an advertiser's site, say walmart.com, and do different kinds of activities on our site.
The user could look at some item pages, some category pages, maybe add a few items to the cart, and then, usually, the user will eventually go away and start browsing other websites on the internet. At this point, as advertisers, we want to re-engage the user and get them back onto our site. So we take part in auctions and submit bids, and these bids are computed at the user level — we are doing user-level retargeting, because as advertisers we do not want to bombard users with ads; we only want to show relevant ads to users who are actually interested in our products.

So how do we go about doing that? This is where response prediction comes in. We want to predict: if shown an impression, will a user actually end up clicking on our ad? And if the user clicks on our ad, will they actually end up purchasing from our site? These are the click-through rate and the conversion rate, respectively. Once we have some estimate of the user's intent, we go and bid in these auctions. The two key aspects of this problem are the data, which is the real essence of the problem, and the algorithms that will be trained over this data. I'll be talking about all of these components as we go further.

So we have some idea of what response prediction is. Now, how do we formulate this problem? I have a user for whom I want to predict whether they are going to purchase from my site some time from now. The first part is defining the features. For that we have a look-back window — say 30 days — and over those 30 days I collect all the features I can about this user. These could be interaction signals: I know the kind of products the user was interested in, the kind of categories they were browsing, and maybe the add-to-cart events. User history can also be really important: if I know this is a user who has already purchased from my site, I know they are a loyal customer and might purchase again. Item signals can matter too — for example, if I know I'm price-competitive on a certain item, then I'm the best choice for the user, so that user might actually end up purchasing from my site. Even contextual signals like day of week and time of day can be really important.

Now that we have built the user features, the next part is getting the labels. Again we define a prediction interval — say I want to predict whether the user is going to purchase from my site in the next seven days. If the purchase actually happens, the label is plus one; if it does not, the label is minus one. At the end of this exercise I have a dataset with user features, say x, and labels, say y, and this is a binary classification setting where the label can be either plus one or minus one.

Now that I have a dataset, the next problem is to train models on it. This is how the pipeline looks at Walmart right now; it is a pretty standard pipeline. We have lots of inputs coming into a Spark pipeline — essentially all the site history of the users, attributes of the products the users have been browsing, and also the campaign history. All of this goes into a Spark pipeline, which first aggregates the data, does some preprocessing over it, and adds an ML classifier on top. Now this ML classifier
could be, say, XGBoost, random forest, logistic regression, and so on. Finally there is hyperparameter tuning, which essentially ensures that I learn the best model out of the data I have. Once I have the best model, I persist it. Why are we using Spark here? Two reasons. Spark is a distributed processing platform and we have lots of data, so the two work really well together and we're able to do all the data aggregation, processing, and modeling really fast. Secondly, Spark allows us to store not just the ML classifier but the whole pipeline as a model, which essentially means that when I take this model from offline to online, I do not have to repeat a lot of these steps — I do not have to repeat the aggregation and preprocessing. All of that is already taken care of: just give raw data to this pipeline, and out come the scores.

So now we have a good offline model, and the next logical step is to take it and deploy it online. The most important thing to understand here is that model scores are not equal to bids. Model scores are just probabilities which tell us how likely it is for a user to purchase from our site, but bids are actual dollar values which I will end up paying if I win an impression. So how do we convert these model scores to bids? The first step along the way is calibrating the scores, because these scores will be used to compute the bids that finally take part in the auction — there is actual money involved. We want the predicted probabilities to be as close to the true probabilities as possible; multiple models can be used here, for example isotonic regression or Platt scaling. Now that I have well-calibrated scores, the next part is scaling these scores into the final bids that take part in the auction. Multiple factors play a role here, including which inventory I am bidding for: is it mobile inventory or desktop inventory? Mobile inventory is usually much cheaper than desktop — for 30 cents you might get a click on mobile but not on desktop. The objective of the campaign is also really important. Say I've been running a revenue campaign to get as much revenue as possible, and suddenly a marketer tells me they want more footfall in a certain category — so my problem is now to get more footfall in the electronics category while the revenue campaign is still going on. That means I want more clicks in a certain category, which means I need to increase the bids in that category, but the user intent has not changed — the user scores remain the same — and I suddenly have to bid more, so the scaling function needs to take that into account. Publisher signals can also be very important: if I know certain inventory or a certain creative has worked well for me in the past, I'd like to exploit that information and maybe bid more on that sort of inventory. So this is the part where I've been able to build a model that finally gives me bids which can take part in the auction, and not just scores.
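Before moving on to the online side, here is a minimal sketch of the kind of offline Spark pipeline described above — feature assembly, a classifier, hyperparameter tuning, and persisting the whole fitted pipeline. It is written in PySpark with made-up column names and paths, and a logistic-regression stage standing in for whichever classifier is actually used.

```python
# Sketch: an offline training pipeline in the spirit described above.
# Column names, paths, and the choice of classifier are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("response_prediction").getOrCreate()
train = spark.read.parquet("/warehouse/user_features/")  # features + a 'label' column

assembler = VectorAssembler(
    inputCols=["views_30d", "add_to_cart_30d", "past_purchases", "hour_of_day"],
    outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, clf])

# Hyperparameter tuning: pick the best regularization via cross-validation.
grid = ParamGridBuilder().addGrid(clf.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(train)

# Persist the whole fitted pipeline, so online scoring can take raw columns
# and reproduce the same preprocessing plus model without extra code.
model.bestModel.write().overwrite().save("/models/response_prediction/latest")
```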
Now, what are the other challenges I might face when deploying this online? The first challenge is that we are dealing with real-time processing: as soon as a person does any activity on our site, we want to be able to target them on any other website they might visit — we want to be there to show our ads to them. So our models cannot become the bottleneck; all the processing needs to happen really fast. Also, the model we trained was a batch model — all of the data was sitting right there and we trained over it — and now, in online processing, we're throwing streaming data at this model, so we need to make sure this integration really works. We have faced issues here: when we deployed random forests for the first time we saw a lot of delays in the pipeline, and when we tried XGBoost for the first time we saw memory leaks. We deployed solutions around these and eventually things started working, but this is a potential point where things can break.

Now over to A/B testing. This is the real holy grail: a lot of models seem to promise quite a bit in offline testing, but a model only works if it proves its worth in A/B testing. You always need to A/B test — have only a segment of the population moved to the new model, and if it really works, if it really scales, then scale up the percentage of the population going to the new model. At this point I'd also like to call out a win we had last year: Walmart ran an A/B test against a really big third-party advertising player, the goal of the campaign was to acquire as many new customers as possible, and Walmart's in-house campaigns won by a significant margin. This was a really big win for Walmart.

So we've been discussing a very standard binary classification setting, and lots of classifiers already exist for it. What sets our problem apart is the kind of data and the domain we are dealing with, so I'm going to share a few nuances we've encountered. I'll start with the purchase funnel itself. Millions of users visit our website each day, and most of them just look at the home page, category pages, or search pages and then go away — these are very light, thin sessions; people don't do any other activity, they maybe visit one or two of these pages and leave. Some of them actually end up looking at an item page, which means they have more intent — they're interested in a concrete item — and even then a lot of them go away. Fewer still actually add things to the cart, which means they are looking to purchase, and even fewer finally complete the purchase. So as we go down this funnel the data gets sparser and sparser, and that is an important nuance to keep in mind while modeling the problem.

Also, how users interact across different devices can be quite different. On desktop, for example, we see that very few people add items to the cart, but a lot of those who do end up converting; whereas on mobile we see a lot of casual browsing traffic — people keep adding things to the cart, but very few of them actually end up converting. So the kind of traffic we have across these two devices, even for the same user, is very different, and that is also something to keep in mind when modeling.

The next bit is about how the labels themselves should be formed: are you trying to build a conversion model or a click model? A good conversion model might not be a good click model, and vice versa. For example, we've seen that a segment of users tend to be really good clickers — if your goal is to get clicks, just go to them and they'll give you so many clicks — but they do not tend to convert. So you need to be really sure
what it is that you're trying to model, and this is a very important design decision that needs to be taken well beforehand.

Also, the setting we've discussed so far seems very ideal for an advertiser: it sounds as if, as an advertiser, I know everything the user has done on my site. That's not really true, because in a realistic setting users have multiple touch points with an advertiser — multiple devices and multiple browsers through which they interact — and what I see as an advertiser is multiple partial views of a user. The same user on desktop might look like an avid shopper, while on mobile they might look like a casual browser. Had I known the whole story — that this is the same user across the two devices — I would have known that this user is definitely going to purchase and is just looking to add a few more items to their cart. But because I don't know these two users are the same, what ends up happening is that I have a lot of incomplete data in the system. There are also a lot of noise sources in the system: suppose a user has a slow connection — their device might not be able to send out some signals, and I lose them. Cookie churn is another really big problem: about 65 percent of cookies are deleted monthly, which means that even for the same device and user, after a certain point in time I will not be able to track them. So the data in the system is very noisy and also very incomplete. To put things into perspective, Criteo did a study claiming that about 31 percent of transactions involve two or more devices, and that a user-centric view of activity, compared to a device-centric view, shows about a 40 percent increase in conversion rates — incomplete data is a real problem. But only five percent of advertisers have a complete, consolidated view of users; the other 95 percent do not. There exist some probabilistic methods to stitch user profiles across devices, but complete consolidation remains an open problem.

This brings me to the kind of optimizations we are working on to deal with this problem of noise and data incompleteness, because the current classifiers used in user response prediction assume that the data is precisely known — which, as we just saw, is not really the case. What we propose is to characterize the uncertainty in the data, and this leads to robust classifiers that are immune to such data perturbations. We characterize this uncertainty using principles of robust optimization, and this results in two algorithms: robust factorization machines and robust field-aware factorization machines. This is a paper accepted at WWW 2018, and my co-author, Surabhi Punjabi, is also sitting here in the audience. Instead of me just telling you what the solution looks like, we'll build it together today: I'll first discuss the state of the art — what factorization machines are, what field-aware factorization machines are, what robustness really means — and then how we incorporate robustness into these highly expressive algorithms to obtain their robust variants.

So let's start with the state of the art. This is a binary classification setting where we have user features and we want to predict a label of plus one or minus one. For a family of classifiers, the
optimization function looks something like this. We are trying to minimize a loss over a vector w that we are trying to learn, and this loss has two components: the empirical error and the regularization penalty. The empirical error tries to ensure that the predictions are as close to the true labels as possible, and the regularization penalty tries to regulate the complexity of the classifier — essentially, you're trying to learn as simple a classifier as possible. The phi function here is going to be really important: it is a transformation function that defines how the features interact with each other, with the help of the vector w we are learning. How we define this phi results in many kinds of classifiers, and we'll see a lot of them today — so keep in mind that this phi function is important, and we are going to play with it throughout this talk.

Let's start with logistic regression. This is a classical, very famous algorithm, and what it says is: let's learn a vector w of length d, where d is the number of features. The phi function it defines is just a linear interaction of the features, weighted by the vector w we are learning. This is a very nice phi function: it is very scalable, because the number of parameters is just of order d, and it is very interpretable — if you want to understand how important a feature is, just look at its corresponding value in the w vector. The problem with logistic regression is that it does not capture pairwise interaction effects. What that means is: say a certain category is only browsed on mobile and never on desktop — that sort of pairwise feature interaction will never be captured by a logistic regression model.

This is where Poly2 comes in. It says: let's try to capture these pairwise interactions between features, and we'll use a matrix W for that. For any two features j and k, the importance of their interaction is captured by the (j, k)-th entry of this matrix. Now we are learning of the order of d-squared parameters, because the matrix W is of order d squared, and the phi function looks like this: for all possible pairs of features j and k we take x_j times x_k, and the importance of that interaction is given by the (j, k)-th entry of the matrix. So far so good — we are able to capture the order-two interactions and we have a new phi. But the problem here is twofold. First, we are trying to learn of the order of d-squared parameters, and d is usually of the order of millions in response prediction, especially in the advertising domain — so we'd be learning on the order of millions-squared parameters, which is definitely not feasible. On top of that, this parameter matrix W is going to be highly sparse. Let's try to understand why that happens. In advertising response prediction we have lots of categorical features, and they are called fields: publisher is a field, brand is a field, device is a field, and each of these categorical fields can take millions of values. Publisher could be CNN,
Vogue, and a million other publishers; brand could be Nike, Adidas, and millions of others; device could be desktop, mobile, iPad, and so on. But when we want to use these fields to train models, we first convert them into features by one-hot encoding them. One-hot encoding means that if, for a certain impression, the publisher is CNN, then CNN gets an entry of one while Vogue and every other publisher get an entry of zero. So we one-hot encode the fields into features, and CNN never really co-occurs with any other publisher — one feature essentially never interacts with the millions of other features from its own field — and therefore this parameter matrix is going to be really sparse.

This is where factorization machines come in. This was a seminal idea proposed by Steffen Rendle in 2010, and what he says is: let's just learn a latent vector of dimension p per feature, and this latent vector will capture every interaction that this feature can have with any other feature. So for each of the d features we have a latent vector of dimension p, which means we are essentially learning a matrix of order d-by-p. If we want to capture the interaction between any two features j and k, we take their latent vectors — say v_j is the latent vector for feature j and v_k the one for feature k — and take the dot product, and that gives us the weight of the interaction between j and k. A quick primer on the dot product: say these are the two vectors I'm taking the dot product of. I first element-wise multiply them, which gives me a new vector — 0 times 5 is 0, 1 times 2 is 2, 4 times 2 is 8, and again 0 — and then I sum up all the elements of this new vector: 0 plus 2 plus 8 plus 0 gives me 10, and that is the final weight of the interaction between features j and k.

So, just revising: we now have order d-times-p parameters to learn, and p here is much, much smaller than d — if d was of the order of a million, p is of the order of tens or maybe hundreds, that's it. So we've reduced the parameter matrix by quite a bit, and the feature-to-feature affinity is now given by the inner product of the latent vectors we are learning. The phi function now looks like this: we have the linear interactions, just like in logistic regression, and we also have the pairwise interactions, where the interaction x_j times x_k is weighted by the inner product of the corresponding latent vectors. So this is good — we have a highly expressive model, and the parameters are now just of order d times p.
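For reference, the standard FM model being described here (Rendle, 2010) can be written as — stated from the literature rather than from the slides:

```latex
\hat{y}_{\mathrm{FM}}(\mathbf{x}) \;=\; w_0 \;+\; \sum_{j=1}^{d} w_j x_j
  \;+\; \sum_{j=1}^{d}\sum_{k=j+1}^{d} \langle \mathbf{v}_j, \mathbf{v}_k \rangle\, x_j x_k,
\qquad \mathbf{v}_j \in \mathbb{R}^{p}
```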
Now let's move to an even more expressive model: the field-aware factorization machine. Recall how the fields worked: we had publisher, brand, device, and so on as fields, and we one-hot encoded them to get the features that were finally used for training the models. But while training with these features, we forget the fact that a lot of them used to belong to the same field. This is what field-aware factorization machines aim to do — they keep that information intact. How they do it is by learning a latent vector for each feature and field combination: instead of one latent vector per feature, we have a latent vector per feature-field pair. So if there are q fields and d features, we have a latent vector of dimension p for each feature-field pair, which means we are learning a parameter tensor of order d-by-q-by-p. Now if we want to capture the interaction between two features, say Nike and Vogue, we take two latent vectors and their dot product: the latent vector of Nike for the field of Vogue (Vogue's field is publisher, so Nike-with-publisher), and the latent vector of Vogue for the field of Nike (Nike is a brand, so Vogue-with-brand). The dot product gives the interaction weight between these two features. So order d-times-q-times-p parameters are being learned, and the phi function evolves accordingly: we again have linear interactions and pairwise interactions; it's just that the weight of a pairwise interaction is now given by the dot product of each feature's latent vector for the other feature's field. This is an even more powerful algorithm than factorization machines; the only problem is that the number of parameters being learned is even larger. Both FMs and FFMs have become quite popular, because they've not only won Kaggle competitions but have also done well in production settings — AdRoll has a blog post and Criteo has a paper about this that you can go through.

So now that we have some idea about FMs and FFMs, let's try to introduce robustness into these algorithms. Robustness essentially has two key ideas: the first is uncertainty — we need to define the uncertainty associated with the data points — and the second is redefining the optimization function itself. Let's first define the uncertainty. It looks something like this: say this is how my data points looked earlier, and this was the classifier I was learning. When I introduce uncertainty over the data points, it essentially means putting hyper-rectangular manifolds around them, so that each data point can now reside anywhere within its manifold. So we've defined uncertainty, and we can see what it looks like. Now let's look at what the optimization actually wants to do. Robust optimization seeks to learn a classifier that is feasible and near-optimal even under the worst-case realization of this uncertainty, and what that means shows up in how the optimization problem is framed. The optimization of a general classifier has a loss-minimization form: we reduce a loss over the vector w we are learning — we saw this form earlier. But robust optimization has a minimax form: each data point has an uncertainty associated with it, and we maximize the loss over the uncertainty and then minimize that worst-case loss over w. In our paper we use box-type, or interval, uncertainty, which essentially means that the uncertainty of each feature is independent of every other feature.
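In notation — again a reconstruction from the descriptions above and the standard FFM and robust-optimization literature, not copied from the paper's slides — the FFM pairwise term and the robust minimax objective look roughly like this, where $f(\cdot)$ maps a feature to its field, $\ell$ is the loss, $\Theta$ collects all model parameters, and $\mathcal{U}(\mathbf{x}_i)$ is the box uncertainty set around example $\mathbf{x}_i$:

```latex
\hat{y}_{\mathrm{FFM}}(\mathbf{x}) \;=\; w_0 + \sum_{j=1}^{d} w_j x_j
  + \sum_{j=1}^{d}\sum_{k=j+1}^{d}
    \langle \mathbf{v}_{j,\,f(k)},\; \mathbf{v}_{k,\,f(j)} \rangle\, x_j x_k

\min_{\Theta}\; \sum_{i=1}^{n}\;
  \max_{\tilde{\mathbf{x}}_i \,\in\, \mathcal{U}(\mathbf{x}_i)}
  \ell\big(y_i,\; \hat{y}(\tilde{\mathbf{x}}_i;\Theta)\big)
  \;+\; \lambda\,\Omega(\Theta)
```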
So now that we have some idea about FMs and FFMs, let's try to introduce robustness into these algorithms. Robustness essentially has two key ideas: the first is uncertainty, so we need to define the uncertainty associated with data points; the second is redefining the optimization function itself. Let's first define the uncertainty. Say this is how my data points looked earlier and this was the classifier I was learning. When I introduce uncertainty over the data points, it essentially means putting hyper-rectangular manifolds around them, and each data point can now reside anywhere inside its hyper-rectangle; that is how we define uncertainty. Now let's look at what the optimization actually wants to do. Robust optimization seeks to learn a classifier that is feasible and near-optimal even under the worst-case realization of this uncertainty, and that shows up in how the optimization problem is framed. The optimization for a general classifier has a loss-minimization form: we are reducing loss over a vector w that we are trying to learn, and we had seen this form earlier as well. Robust optimization instead has a minimax form: each data point has an uncertainty associated with it, and we minimize the loss with respect to w while there is also a maximization term, where the loss is maximized over the uncertainty, and that worst case is then minimized over w. In our paper we use box-type, or interval, uncertainty, which means the uncertainty of each feature is independent of every other feature: if I have some uncertainty over one feature, it doesn't impact the other features at all. Recall that FMs have linear and pairwise interactions, so we define an uncertainty mu over the linear interactions and an uncertainty sigma over the pairwise interactions, and we introduce these uncertainties into the phi. So now we have a new phi function for robust FM: in the linear interactions we've introduced the linear uncertainty mu, and in the pairwise interactions we've introduced the uncertainty sigma. A new phi essentially means a new algorithm altogether. Now to the optimization problem: we have a robust minimax formulation, because this is robust optimization, and we plug in the new phi we just defined, and that is the robust optimization formulation for FMs. But most of the solvers we have, like gradient descent and so on, solve only a pure minimization form or a pure maximization form, whereas we have a minimax form. So what we do next is reduce this minimax form to a pure minimization form, by upper-bounding the loss with respect to the uncertainty: we get a worst-case loss in terms of the uncertainty, the uncertainty terms go away, and we are left with a normal minimization form that any solver can handle. Similarly, for the robust field-aware factorization machine we again define a new phi with linear and pairwise uncertainties, again get a robust minimax formulation, reduce it to a pure minimization form, and we have the robust field-aware factorization machine.
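As a rough summary of the shapes being described, in notation of my own choosing rather than the paper's:

```latex
% Standard loss minimization versus the robust minimax form (illustrative notation).
\min_{w}\; \sum_{i} \ell\big(y_i,\ \phi(x_i; w)\big)
\qquad\longrightarrow\qquad
\min_{w}\; \max_{\delta \in \mathcal{U}}\; \sum_{i} \ell\big(y_i,\ \phi_{\delta}(x_i; w)\big)

% Box (interval) uncertainty: each perturbation is bounded independently, e.g.
% |\delta^{\mathrm{lin}}_{j}| \le \mu_{j} for the linear terms and
% |\delta^{\mathrm{pair}}_{jk}| \le \Sigma_{jk} for the pairwise terms.
% Upper-bounding the inner maximum in closed form removes \delta and leaves a
% pure minimization that ordinary solvers such as gradient descent can handle.
```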
Now over to the experiments that we ran. We used two real-world datasets, from Criteo and Avazu; the tasks were click-through rate prediction and conversion rate prediction, and we provide a Spark-Scala-based implementation; the code is open-sourced and the link is available in the paper, so you can check it out. The results were very promising: we see a significant reduction in log loss when there is noise in the data. If there is no noise in the data and you still use the robust formulations, we see a slight hit in performance, but in noisy cases it is definitely something you could use. Also, RFM and RFFM are generic predictors; they are not restricted to the computational advertising domain, and we've shown this in the paper as well: we ran them on a credit card fraud detection dataset and got similar results there too. Now over to the three learnings we've had over this period. Firstly, data is supremely important: there are so many layers to it, you just keep peeling them off, and you'll have something more to learn each day and keep improving your model. Secondly, keep your goals really high but start small; you do not want to be obsolete by the time you've finished a model, and you need to keep iterating, because you learn a lot through the process and many things will actually not work out, which is also a good learning. Also, A/B tests are the real litmus test: a model works only if it has proved its worth in an A/B test. Finally, I think innovation is extremely important, because each of us in our small way is trying to solve a new problem, and if we innovate we add not only to our own understanding but also to the understanding of the community in general. Thank you. Let me repeat the question: I had mentioned predicted probability versus true probability, so what does that really mean? A model will give out a probability; in classification, the model essentially says how likely a plus one or a minus one is, and that is the model's predicted probability. For the true probability, once we have the distribution of predicted probabilities from the model, what we end up doing is distributing the predicted probabilities into a few buckets, and for each of these buckets we look at the true probability according to the data we have: for all the samples that fall in a certain bucket, what is the true proportion of ones and zeros. That is the truth, and we check whether the predicted probability is close to that truth or not. Thank you.
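A minimal sketch of that bucketing check, with invented names (not code from the talk):

```scala
// Calibration check sketch: bucket predicted probabilities and compare each
// bucket's average prediction with the empirical (true) positive rate.
object CalibrationSketch {
  /** preds: (predicted probability, actual label 0/1) pairs; buckets: number of bins. */
  def calibrationTable(preds: Seq[(Double, Int)], buckets: Int = 10): Seq[(Double, Double, Int)] =
    preds
      .groupBy { case (p, _) => math.min((p * buckets).toInt, buckets - 1) } // bin index
      .toSeq.sortBy(_._1)
      .map { case (_, rows) =>
        val avgPredicted = rows.map(_._1).sum / rows.size          // mean predicted probability
        val trueRate     = rows.map(_._2).sum.toDouble / rows.size // empirical positive rate
        (avgPredicted, trueRate, rows.size)                        // compare the first two
      }
}
```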
Hello, my question is: you have built a model after looking at past data; how are you monitoring day to day how it is going? Sometimes predictions drift because there is noise, or new scenarios come up, maybe because users behave differently for some reason and you don't know exactly what is going on. You talked about uncertainties between variables, but new uncertainties keep coming up that the model has not seen; how are you handling that on a regular basis? So we retrain the models from time to time, exactly because of this: there are new distributions, we don't always know why they are changing, so we have to keep repeating the modelling exercise in order to learn the new distributions and the new patterns we are seeing; that is what we do right now. Follow-ups offline please, she'll be available. Hello, can you please go back to the field-aware factorization slide? Your fields are things like publisher, brand, device, and your features are things like CNN, Nike, Adidas; are you fixing the total number of features to be d, so for example publisher has d1 features, brand has d2, device has d3, and the sum of the di is d? Yes, those are fields, so there could be, say, two or three fields q, and the per-field feature counts d1, d2 and so on sum up to the total d. Hello, so you showed the result that the algorithm performs better on data which is shifted or unexpected; first of all, how did you identify that this data is noisy or unexpected, how did you differentiate between expected and unexpected data? Right, so we took this data, and for training we took a sample of it and trained both the factorization machine and the robust factorization machine on it; for testing we introduced noise, we modelled that noise, and as we increased the noise levels we looked at how the performance of the two classifiers changed. The detailed experiments are in the paper; I'm not going through them here. Hi, thanks for the presentation; I have a question: after the A/B test, how are you incorporating the results to improve the model further? So essentially you take an idea into the A/B test and you only come to know whether it worked or not, at least; maybe somebody has better ideas about that. Once you know something did not work, you need to go back and figure it out, and I think that is where intuition plays a part: you figure out what could have gone wrong and where you could have improved, and start again. Hi, so you have used factorization machines to capture your feature interactions; did you get a chance to experiment, or have you come across any study, on what happens if you replace factorization machines with embeddings or autoencoders? They also in a way try to do the same thing, except instead of linear interactions they model non-linear interactions. Can you repeat the question? You're using factorization machines just to capture the interactions among features, but we could do this using embeddings or autoencoders as well, because these are sparse one-hot vectors and you want lower-dimensional projections that capture the interactions. I think they're a bit different, in the sense that this finally stays fairly linear, even with pairwise interactions, whereas autoencoders and embeddings are not so linear; they might be more expressive in that sense. Again, please connect with her offline, she'll be available, she'll be attending as well. Hey, hi, thank you for the great talk; my question is regarding the offline training. Once you have A/B tested, you're happy with the model and you have deployed it, you retrain offline maybe once a day or twice a day, whatever frequency you decide on; in that retraining, do you tune your hyperparameters regularly, or once you have settled on a set of hyperparameters that you're happy with and that has been proven in A/B testing, do you just let that be and fetch new data and run it through the same model again? We do retrain with the hyperparameters; it's not that the hyperparameters never change. Maybe the rough boundaries we're comfortable with stay the same, but yes, we do check different hyperparameters. Hi Priyanka, thanks for the talk; I actually have a question about a topic you brought up earlier, about how you make your model run in near real time. A bunch of abstract questions: do you have pre-computed features, what sort of databases do you keep, could you tell me a little bit more about that?
Yes, we keep a lot of pre-computed features; for example, a lot of the item attributes and even the users' previous history, all of that is pre-computed and kept in the right places. Some things we just store as files: we have a Spark Streaming solution, so some of the data is stored as files on the same machines across the cluster. For some of the data we have Cassandra stores; for example, all the user information arriving in real time is stored in Cassandra, and we've defined aggregates in Cassandra itself, so as soon as any activity happens a lot of the data is aggregated before being pushed into Cassandra, which means that at serving time there are very few computations left for the model. Okay, can the people who want to ask questions raise their hands? How many more are there? Okay, there are quite a few more but we're out of time, so let's give a round of applause to Priyanka once again; she'll be available and you can connect with her offline. I have a couple of announcements before we break for beverages: a birds-of-a-feather session on Spark users will be conducted by Rohit and Raghutam in the BoF area, which is on the first floor, from 12:15 to 1:15; please join there. Thanks.

Okay, thank you. Good morning all, am I audible to everyone? Fine. So, hi, I'm Puneet and I work as VP of data engineering at Xratom. Xratom is a startup in Pune which is into big data; we do services as well as products, and today we'll be showcasing one of our products with a machine learning use case covering digital propensity. Regarding me, I have around 11 years of experience overall; I've worked with Oracle, PubMatic and Amdocs, among others, and I've been in the big data space for about seven to eight years now. Let's go over the agenda: we'll talk about digital market propensity and where exactly it is applied; then we'll cover the legacy implementation and the challenges present in the original machine learning pipeline; after that we'll talk in general about the challenges in building a unified ETL and machine learning platform; then we'll introduce Xtremes, the product we'll use to show this use case, with its simplified architecture, features and benefits; I'll show you one live demo of digital market propensity; and finally the top differentiators and the road ahead for Xtremes as a product. Okay, so most of you would know about digital propensity. It predicts a user's purchase trend not only based on the activities he is doing online but also using his demographic information; demographic information is very much needed to do propensity modelling that is more relevant than the kind that uses only the user's activity, which is the clickstream data. So in digital propensity we rely not only on the customer's activity but also on his demographic information, which could be things like his salary,
marital status, household size, the language he speaks and so on. In simple terms, digital market propensity is: if you are a customer, how likely are you to purchase a given product, for a given brand or a given price range, in that particular month. That, in short, is what digital market propensity is. This machine learning model can be, and is, used in search and browse recommendation cases to enhance the browsing experience: showing the relevant items, not showing items that are out of the customer's price range and that he cannot buy, and not showing brands he is not even interested in. We can also use it for discount optimization, where, based on his purchase propensity towards a price or a brand, we can target discounts specifically instead of giving a generalized discount that doesn't apply to many customers. It can also be used for ad monetization, for relevant targeting, and for a futuristic shopping experience for tagging customers, similar to the smart mirrors we see in stores where you go and tag yourself: it enhances the in-store customer experience, because once it can identify the customer and his propensity towards price and brand, it can guide that customer towards the relevant items to show him. Okay, so talking about the legacy implementation and its challenges. This market propensity model uses logistic regression, run on a 30-million customer base with 200-plus million activities across the clickstream data and the in-store data. We used three different datasets here: the clickstream data, the demographic information for the customer, and the product data. We used five feature components: product category (for example jewellery or clothing), product subcategory (for clothing that could be tops, bottoms, shirts), age buckets, gender, and so on, and we created all the possible combinations of these feature sets. Taking these combinations into account, for each customer there were around 12 to 3,600-plus different combinations, and the model was trained on four months of training data, so you can imagine: 3,600-plus combinations crossed with the date part takes it to 300K-plus different columns in the denormalized, pivoted table that had to be created. Given such a huge pivot of 300K-plus columns, the pivoting itself was a problem; they found it difficult to create a pivot on such a large dataset.
Talking about the process first: the demographic data bucketing was done first. By demographic data bucketing I mean that for age we divide customers into different buckets, based on salary we divide them into different buckets, for household size into different buckets, for the languages they speak, and so on; all these feature sets were bucketed. The second part of the feature engineering for this model was the clickstream data, where a "shiny" aggregation was used: if I do a particular activity in a given month, we have to create the record in reverse chronological order for all the combinations out there, so if I buy a particular product in this month, then for the last four months I have to create that record and then do all possible combinations of aggregation for that customer. The third part of the feature engineering was generating the product and demographic pivots, which were very large, and then they would vectorize and standardize the dataset. Standardization was necessary because in most cases the number of product views is much higher than the number of products actually purchased, so we have to scale them to the same level. The challenges in the legacy implementation were: only five percent of the customer data was used, because of the huge size of the pivot; only the top 20 features were used for feature engineering and ML model training; it took 18 hours to train; and we had varied scores because of the sampling, which is liable to happen. Those were the feature engineering issues.
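For a sense of what the bucketing and standardization steps look like in plain Spark ML, here is a rough sketch under my own assumptions; the column names, split points and path are invented, and this is not the legacy code being described.

```scala
import org.apache.spark.ml.feature.{Bucketizer, StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Rough sketch: bucket a continuous demographic column, then assemble and
// standardize activity counts so that high-volume events (views) do not
// dominate low-volume events (purchases).
object FeatureEngineeringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("propensity-features").getOrCreate()
    // Hypothetical customer-level table with age, views, add_to_carts, purchases.
    val customers = spark.read.parquet("/data/customer_level")

    val ageBuckets = new Bucketizer()
      .setInputCol("age")
      .setOutputCol("age_bucket")
      .setSplits(Array(Double.NegativeInfinity, 18, 25, 35, 50, 65, Double.PositiveInfinity))
    val bucketed = ageBuckets.transform(customers)

    val assembler = new VectorAssembler()
      .setInputCols(Array("views", "add_to_carts", "purchases"))
      .setOutputCol("raw_counts")
    val scaler = new StandardScaler()
      .setInputCol("raw_counts")
      .setOutputCol("scaled_counts")
      .setWithMean(false)
      .setWithStd(true)

    val assembled = assembler.transform(bucketed)
    scaler.fit(assembled).transform(assembled).show(5)
  }
}
```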
So we've talked about digital market propensity and about the legacy implementation, which involved creating a huge pivot and vectorizing it, and the pivot was the issue there. Now let's talk about the challenges in building a unified platform in general, and then we'll go over the optimizations we did for this pipeline using this application. When I say unified platform, I mean a platform that not only lets you do ETL for batch processing but also lets you develop stream processing pipelines and machine learning pipelines, and when I say machine learning pipelines, it should not only let you create the models but also score them. There are products and tools that let you do the training but lack the features to help you do the scoring easily; you have to make changes without which the scoring part becomes difficult. Then there is still a lot of low-level programming required: even with the interfaces provided by Spark, such as Spark SQL, you still have to write a lot of low-level code. Third, a lot of time is wasted in debugging: what failed at which stage, what the reason for the failure was, and you have to go looking for that in the logs. Then there are grey areas on fault tolerance: what happens when my running streaming pipeline fails? There are still issues with connecting multiple targets in one single pipeline. Security is another important feature that needs to be present. And there is no collaboration: collaboration exists only on notebooks, but when it comes to applications we don't see collaboration where, if I am a developer, I can share my code with someone else so he can reuse it. Okay, so Xtremes. In short, it is an enterprise-ready, self-serve, unified batch and streaming platform that also lets you develop machine learning pipelines on the same canvas. The benefits of using Xtremes: it lets you develop very rapidly, because of the simplified drag-and-drop interface, which is comparable to, or better than, the top tools in the market right now. Second is abstraction: the complexities are sorted out for all the cases where you would otherwise have to write custom Spark code, especially stateful aggregations or machine learning modelling and scoring. It is low cost, because you develop your pipelines very fast, and it also provides the additional features you have to take care of when you go to production, because it is not only the pipeline you need to develop: you also need things like capturing metric information that helps DevOps know whether a job is going to fail, handling what happens when a wrong record comes in so that you can skip that record without letting the job fail, and the tunings. All of these are necessary beyond the pipeline itself. And there is no vendor lock-in, because it is all open source plus the custom code we have written. Regarding the features, we have 20-plus IO connectors, tuned so they can pull data from sources and push data to sinks at scale; we have 50-plus operators for your ETL processing activities, be it extracting data, joining datasets, enriching data, unions, rollups, cubes, everything; we have 45-plus estimators, almost all the estimators provided by Spark, which can be used as a simple stage where you just configure the hyperparameters; and all the 15-plus models supported by Spark can also be easily modelled on top of Xtremes. Xtremes also lets you schedule your workflows, and it has a common marketplace with predefined pipelines that serve particular use cases; today we are talking about digital market propensity, so that pipeline is already there, ready to use: you can just download it and use it for your own use case by changing the datasets, and the logic mostly remains the same.
It has unified batch and streaming pipelines, and you can visually create ML and ETL pipelines. When I talk about visual creation of ML pipelines, it is not only the drag-and-drop interface but also understanding how the different hyperparameters change your machine learning model for different estimators, and what the impact of changing a given parameter is. There are 112-plus ETL and ML components and connectors for big data systems, a very intuitive dashboard for metrics and monitoring, which is missing in Spark right now, and then the most important part, an SDK for developers: if you have your own custom code base you can just put it into the custom plugin stage in the canvas; the input is a DataFrame and the output is a DataFrame, so any custom JAR you have written can easily be embedded in Xtremes. Even when it comes to migration from your old legacy systems to Xtremes, it is very easy to migrate. Okay, let me show you some of the features of Xtremes. This is the simplified drag-and-drop interface. I'm creating a pipeline here and choosing the type of pipeline; batch also covers machine learning pipelines. You can put in different components: here I'm using Kafka as one of the connectors to read data from Kafka; I do the configuration and provide the name of the schema, which is the extractor; there are text, JSON, XML and other file format extractors where you just define the schema, and your source is ready for reading. You can also do joins between different datasets, specify the join type and whether it is a broadcast join or one of the other supported joins, and finally you write to the target. That is how easy it is to construct a pipeline. For scheduling, you just provide the name of that pipeline and specify the time at which it needs to run. Then, if you see here, green means the pipeline is running; if it has failed it shows red. This is the monitoring dashboard, and the advantage of using it is that it shows you the metrics across all the different batches, which is very useful especially for streaming pipelines: looking at these metrics you can easily identify whether your job is going to fail, because if the input has increased and your memory footprint is also increasing, or if the input is decreasing but your memory footprint is still increasing, you can tell the job is heading for failure. Not only that, it is directly linked to the YARN cluster as well, where you can see the logs for each stage or target, and there is a sunburst metric chart which shows the different metrics for each job, stage and task. When it comes to security, it has integration with LDAP: you can create users, choose users from LDAP, and at a custom level grant each user the specific access he needs.
When I talk about access, it applies even to creating the sources: if there are different types of sources, you can restrict a particular user's access to a particular source, or to a target, or to datasets, or to the custom plugin stage. For the IO connectors you have all the different connectors for Kafka, Kinesis and other streams; for targets you have Cassandra, Hive and so on, all readily available for you to use. Once you define your sinks or sources, you just drag and drop that sink, specify the schema, do your ETL and write to the output; it is that easy. Now coming to the audit and collaboration part. Each and every activity done on top of Xtremes is captured and can be shown as an audit: which user created which component, which user deleted which pipeline, everything is audited and shown to the users. Second, the collaboration part: if you want access to a given pipeline, you can request access to that component, and once its owner approves, be it at the source level, the pipeline level or the target level, you can reuse it in your own pipeline. Here I'm granting access to one of the users who has raised a request; once it is granted, he can go and use that pipeline. Consider the case where a lot of developers need access to a particular source or target: there could be a DevOps person who grants that access to those developers based on need. And this is the audit view, where whatever changes have been made are shown to that user. Now coming to the error handling part. When you create a pipeline you can choose the sink where you want all the error records, the ones that do not satisfy your pipeline logic, to be written; it could be the console or any Kafka target, wherever you want to write. Once the pipeline runs, if there were some error records, it writes those records to that sink along with details such as the stage at which each record was rejected and the reason for rejection. So in one shot you get to know all the features or ETL mapping logic that is missing in your pipeline, and you can make all those changes in one go instead of iterating over each error you find every time you run your program. Okay, coming to the Xtremes architecture. On the left you see the different data sources you connect to; on the right are the data sinks; on top of that you see the different components related to administration and the pipeline lifecycle. In administration we have installation, which is made very easy; integration with LDAP and Kerberos for security, so you can submit jobs to Kerberized clusters; you can import and export pipelines from the marketplace for given use cases; you have the audit to know what any user has done; you get to develop your pipelines very easily using the drag-and-drop interface; it has error sinks; it has a very informative metrics and monitoring page; and it has budgeting support, scheduling and notifications
as well: for example, based on the monitoring system, if I know that a streaming job is going to fail, that notification can be published. Okay, so without spending much more time, let's start with the market propensity modelling on top of Xtremes. Here I define the machine learning pipeline, whose name is "propensity". I specify that there are no error sinks for now and that it is a batch pipeline, and you can specify the different Spark parameters there. The first dataset is the clickstream dataset needed for the market propensity model, so you drag and drop the clickstream source, connect to it, and then define the schema. For the clickstream we have used four columns: the customer ID, the product ID, the event (whether it was a purchase, a view or an add-to-cart) and the month in which that event took place. Once you define your clickstream schema it extracts only the corresponding records. After that you join your clickstream dataset with the product dataset, so you define the product dataset; it could be in Hive or it could be a text file on HDFS, in which case you just provide the path and specify the schema of that file. You can either add the columns one by one or infer the schema of the file by uploading a sample file for that dataset. For product we have used the product ID plus the product category, subcategory and brand; those are the columns needed for feature engineering later. Then there is one more dataset, the demographic dataset, which is a bucketized dataset where each record is at the customer level with different buckets: age buckets, income buckets, household, marital status and so on; all those features are bucketed. Here I just use that file, upload a sample, and it creates the schema out of it; that is how easy it is to infer the schema of a file with a lot of columns, especially for this bucketized dataset. So now we have the clickstream data, the product data and the demographic data. We then join the product data with the clickstream data, and once it is joined I project only the columns that are required and leave the rest at that stage. Next is the stage I was talking about, the custom JAR stage: if you have any code base of your own, you just specify the name of the class and the JAR path; it needs the input DataFrame that should be passed to that JAR and the output you expect once that custom JAR has run. In this particular part, what I'm doing is taking the joined product and clickstream data and
vectorizing the records at the customer-ID level, for each combination of product category, subcategory, brand, the kind of event, the event month, age and gender. That comes to a combination of 300K-plus values, so that many columns were effectively represented in this custom JAR. There is one more change that was needed: Spark is more inclined towards creating a dense vector, so it was taking a lot of time in this case, and as an optimization we specifically asked it to use only a sparse vector; that not only reduced the size but was also faster. That custom logic was embedded in this custom JAR, and we get a vector at the customer level for product and clickstream. After that we standardize that vector, because for product views there will be far higher counts than for product purchases and add-to-carts. Once the vector is standardized, we join the product vector with the demographic dataset on the customer, and the result is the demographic columns plus the product vector at the customer level. Here we again select the columns needed for the creation of vectors; when you want to create the vector for the demographic dataset you can specify which input columns to use, and the result is the demographic vector column. The next part is projecting the columns: there are a lot of demographic columns, the demographic vector and the product vector, so I project and push forward only the customer ID, the demographic vector and the product vector. After that I use a vector assembler to bring the demographic and product vectors together into one column: I specify the input columns as the feature vector and the demographic vector, and the output column as the assembled vector. That is all the feature engineering, and you can see how easy it is to do. Then I use the logistic regression model: I specify the name of the model, the train/validation split configuration, and the model configuration, including where the model artifacts will be saved and which columns are the feature column and the label column for this logistic regression model. For scoring, since it uses a train/validation split here, you can specify the regularization and elastic-net parameters as an array, and it will consider every configuration and do the evaluation for you in one shot instead of one run at a time. You can also specify the evaluator, so that once your model is trained it is evaluated for the various configurations specified in the param grid; the validator stage then takes the best model out of them, and we use that model for scoring at a later stage. Finally, you can write the scored data to whatever target you want.
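Outside of Xtremes, the equivalent stages in plain Spark ML would look roughly like the sketch below. This is my own illustrative code, not the pipeline generated by the product; the column names, path and grid values are assumptions.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession

// Assemble -> scale -> logistic regression, with a parameter grid evaluated
// in one shot via a train/validation split, keeping the best model.
object PropensityPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("propensity").getOrCreate()
    // Hypothetical customer-level table with product_vector, demo_vector and label columns.
    val data = spark.read.parquet("/data/customer_features")

    val assembler = new VectorAssembler()
      .setInputCols(Array("product_vector", "demo_vector"))
      .setOutputCol("assembled")
    val scaler = new StandardScaler()
      .setInputCol("assembled").setOutputCol("features")
    val lr = new LogisticRegression()
      .setFeaturesCol("features").setLabelCol("label")

    // Regularization / elastic-net settings tried together, as described above.
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .build()

    val tvs = new TrainValidationSplit()
      .setEstimator(new Pipeline().setStages(Array(assembler, scaler, lr)))
      .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("label"))
      .setEstimatorParamMaps(grid)
      .setTrainRatio(0.8)

    val best = tvs.fit(data) // keeps the best model found on the grid
    best.transform(data).select("label", "probability", "prediction").show(5)
  }
}
```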
So this is one single pipeline that can be used for both training and scoring. The pipeline I am showing here was originally around 5K lines of code written over six months by seven developers, and it was recreated in one day on a single pipeline page, and it scales to 30 million customers with 200 million records, multiplied by 120 days of data. Once you have created the job you can schedule it: you specify what time you want the training to start, and you specify the mode, whether you want to run this pipeline in training mode or scoring mode. Once your pipeline completes, you can see the different artifacts: in the model phase we specified the evaluators and the validators, so you can check the artifacts saved there for each model and each version. There were seven iterations of this model, and for the seventh iteration I can see the best model that was created, the estimator configurations, the evaluator, the metadata, and the sub-models, the models that were not picked as the best model, which are also saved under sub-models. For the visualization part, for model analysis, where it really helps the data scientist is that it shows the different charts for the different classes provided by Spark; here we have used multi-class metrics, so for each version you can see the scores generated on that particular training dataset, and you can easily infer what change was made to a given parameter and what the impact on the scores was, using charts that are easy for anybody to understand. It not only shows the metrics but also tells you, across the various iterations, which parameter was changed and what the impact of that change was, for any of the estimators used in that pipeline model. If you see here, I can select different versions and see what changed across versions, for example in terms of accuracy; it also shows the application ID that was run for that training, which version was used for that model, which parameter changed and what the impact of that change was on your model. Using this, you can proceed in the right direction instead of changing parameters haphazardly, which makes it very difficult to converge and can take weeks or even months to arrive at the best model for your use case. So the feature engineering optimizations for this use case were: we used a vector, specifically a sparse vector, instead of a pivot; we removed the skewness in the data by using a standard scaler for the product views; and there were unknowns in the demographic data for which buckets had not been created, so we created those buckets, which further improved the score.
During the training phase, the custom "shiny" aggregation code base was reused, in which you project one record in reverse chronological order for all the different months and then do one single aggregation for all the hierarchies that are needed, instead of doing multiple aggregations per hierarchy. The final outcome after this machine learning model was migrated to Xtremes: we got to train the model in three hours instead of 18 hours; all four months of data were used, so no sampling was required, and the sampling logic was what had been creating the differences in the scores; there was 30 percent less resource consumption; and the 5K lines of Scala code written by six or seven developers and the data scientists now live on one pipeline canvas. Now coming to the Xtremes top differentiators: it lets you create batch, continuous, micro-batch and machine learning pipelines in one canvas; it gives you the operators for data wrangling, data cleansing, feature engineering, data transformation, scheduling, metrics and monitoring, and collaboration, so you only need to focus on your logic development using the drag-and-drop interface and don't need to worry about the rest, be it monitoring or failure handling; and it gives you an easy way to do your feature engineering and understand how parameters impact the scores of your different models. Coming to kick-starts and reusability, we have a marketplace of real-world business use cases; pipelines are already created for most cases, which you can just download and reuse for your own domain and use case. It is platform agnostic: it can be used on the cloud, on premise or hybrid; it is just client code that we need to install, and you can submit to any cluster, so there is no restriction there. And you can plug in and reuse your existing code, so it is not the case that once you migrate to this platform you lose all the artifacts that were already running in production. These are the use cases we have solved: Wi-Fi analytics, consistent NPTB and offers across different channels, omni-channel recommendation, and activating real-time behaviour with micro-segmentation. Introducing the Xtremes team: Sandeep is the CEO of the company and the creator of this product; I am the VP for this product; then we have Chitral and Ankit as the lead developers, Vishal, Nana and others as the developers on the backend, and for the front end we have Kiran and Dhanashti. You can check out the product webpage at extremes.io and learn more about it. Since my 40 minutes are already over: any questions? Okay, if anybody has questions you can get in touch with me or my team, who are sitting in the second row. Thank you. Can we have a round of applause for him please?

So we're transitioning to the privacy track now. We have Jyoti speaking next, and Alok Prasanna Kumar, who is going to help take you through the next few sessions. Just a minor announcement: there are a couple of mistakes on your printed schedules, apologies for that.
The data engineering BoF is at 4:35 in the evening and not 2:35, which I think is what's printed, and Jyoti's talk is printed twice, so please do check the online schedule in case of any doubt. I'm going to hand over to Alok; have a good afternoon.

Thank you, Shreyas. Good afternoon everyone. Before I broadly introduce the topic, which is of course data privacy protection and public data, and our speakers and what they'll be talking about today, I'll take this opportunity to break a bit of news which was told to me by Srinivas Kodali: the B.N. Srikrishna committee of experts is going to be giving its report today at 4 p.m., we are told. So those of you who are expecting to work on it, your weekend has been ruined; congratulations. But anyway, on that note, let me talk a little bit about what the session is going to be about, what our speakers will be talking about, and why this is relevant. Those of you who are aware of the debate going on about privacy will have noticed that a lot of the discussion about privacy in India is happening largely on two axes. One: what is our right of privacy against the government? When I say "our" I mean citizens, individuals, persons of all sorts. To an extent a lot of that debate has gone on; we have fairly settled principles and we have understood what we can and cannot claim against the government. The second axis of that debate has been: what is our right of privacy against companies who have our data, entities whom we give our data to under some sort of contractual duty, for some sort of commercial consideration, and what are their obligations? That is still developing, still ongoing; there are various issues there as well. But there is a third axis to that debate, something that is most fascinating and very under-explored, and it has an impact on the other two axes as well, because there are principles that, you will realize, have a bearing on how everybody gets to deal with data and how everybody deals with privacy as a whole. And that is: how do we deal with each other's claims to privacy? Can I claim a right to privacy against you? Can you ask me for my data? When I say you and me, I literally mean individuals. Can you, as a group, as a society, as an organization, as the public at large, ask me for some details? Can I demand to know something about all of you, as an individual, as a researcher, as a lawyer? This is something that is going to keep coming up again and again in many contexts. We have with us two speakers who will illustrate some of the important points that will come up in these debates, through two different contexts. Jyoti Pandey is going to speak to us about this debate in the context of the right to be forgotten. Some of you may have heard of this in the context of the EU and Google, but Jyoti is going to tell you how this is a much wider debate going on around the world, and in India as well; she's going to talk about the various implications of this right, when it can be claimed and when it can't. The second speaker is Sushant Sinha; he's the founder of a wonderful website called Indian Kanoon. I say that as a lawyer because, for most of us lawyers, Sushant is like the Wizard of Oz: we all use Indian Kanoon as this amazing website which provides us information about cases which most people didn't have
access to unless they were willing to fork over very large amounts of money to a few monopolies. Sushant has broken a lot of that, and most people are shocked to find out that Indian Kanoon is run by this one guy in Bangalore; it's a real Wizard of Oz moment for a lot of us to realize that all it really took was one guy writing code in Bangalore to break what was really a duopoly in the field of law and legal reporting. But this has come with its own problems, and I will leave it to Sushant to explain in more detail the kind of issues that can arise when you put what is ostensibly public data out into the public domain, and why those are two different things; it's a fascinating experience he has had over the last few years running Indian Kanoon. So without any further ado, let me first introduce Jyoti Pandey. She's an independent tech policy researcher; she has previously worked with the Electronic Frontier Foundation and with the Centre for Internet and Society, and she will be speaking to you about the battle for privacy and the right to be forgotten. Over to you, Jyoti.

Thanks, Alok. Good afternoon, I'm Jyoti Pandey. As Alok mentioned, I'm an independent researcher working in tech policy, for not very long, just four years now. I'll be talking about the right to be forgotten, but before I start on this very interesting topic, I just want to know how many of you use Twitter; can we have a show of hands? Great, so everybody's aware of how Twitter throws up these lovely examples of how we deal and interact with technology. Let me give you a really interesting story. Two or three weeks back, a hashtag, #PlaneBae, along with "pretty plane girl", started trending on Twitter. The story goes like this: Rosie Blair was travelling with her boyfriend; they had different seats, so she wanted to sit with her boyfriend and convinced a stranger to swap seats with her. This happens to us on a fairly regular basis. What happened next is the interesting part, about how technology mediates our interactions with each other. The woman who swapped her seat for Rosie ended up sitting next to another good-looking man, and it turns out these two strangers had a lot in common: they were both into fitness, they were both cute, they got along. Rosie went on to tweet and take pictures of this interaction, and initially everybody thought it was really adorable: a chance encounter, strangers meeting, how very cute. But then Rosie took it a step further: she started reporting on their conversations; every elbow touch, every photograph being exchanged was tweeted, and of course these tweets went viral, because what do we do on Twitter? We follow these strange stories. But while this interaction was unfolding, people on Twitter started reacting in very different ways. Some people felt this was a meet-cute and that this is exactly what social media was created for; others started noticing that this could be creepy behaviour, because the girl had not consented to being recorded in this manner and having her interaction projected to the world. And of course Rosie knew there would be some privacy implications, because even in the pictures she was tweeting of these two strangers she had blurred their
faces. And it doesn't stop there: at one point these two strangers got up at the same time, and Rosie went on to speculate that they might be hooking up in the bathroom, which is of course uncalled for, and no, it was just speculation. So what happened as the tweets went viral? Everybody involved in this incident got really famous, with TV news channels trying to get them into their studios, and Rosie, her boyfriend and the man all enjoyed the attention and regularly went on the TV circuit, gathering it and loving it. The girl did not want to be a part of it, but that did not stop a bunch of other internet strangers from trying to hack her identity, to figure out who this anonymous "pretty plane girl" was. So what does this story tell us about the attention economy? It's a perfect example of how scattered moments presented without context are becoming more and more rampant as we interact on these platforms. People build detailed but very selective profiles of us, based on the angle and the interest they are coming from, and oversharing, intruding and exploiting others for content is actually rewarded in this economy. The reason I started with this story is that when we talk about a right to be forgotten, it is important to understand why there is a need for such a right in the first place, and, as this story demonstrates, there are many situations where we might not want a certain representation of ourselves online. Should we then have the power to control it, or is it fair to let others decide for us how they should think about us, just because we have consented to be on the same platform as them? So, the right to be forgotten. Look at this man: he is Mario Costeja González, and he is single-handedly responsible for the right to be forgotten. There's another interesting story here, and this is the most interesting part of my presentation; it's all numbers and words, so stay awake. In 2009 Mario googled himself and was very surprised to see the search results throw up a reference to a public debt he had owed, which had led to his house being auctioned. He is from Spain, and the laws in Spain require publications to issue a notice for such an auction in order to attract bidders. This had happened a few years earlier, and by the time he was googling himself in 2009 the debt had been settled and the auction was no longer relevant, so he felt that when you typed his name into search and this reference came up, it was hampering his reputation. He approached the publication in question, which had recently been digitized, and asked them to remove it, because it was no longer relevant. The publication said no, sorry, we can't do that: the law requires us to publish it, this is in the public interest. He then approached Google, and Google said, of course we're not going to do this: we don't meddle with the content, we just link to stuff, we don't publish. He then went to court, and the court in Spain granted him the right to seek that Google delist the search results, so that when somebody types his name those references don't come up. Google appealed, and the case went up to the European Court of Justice, which again held that yes, he did have the right to be forgotten. The ruling in this case was limited: it said that search engine operators are controllers of personal data.
In data protection you have data processors and you have data controllers, so this was really interesting, because the ECJ interpreted search engine operators to be controllers of personal data, and it granted individuals the right to approach the search engines and seek that they delist URLs about them. Personal data was the main stake here: the search results that come up when you type somebody's name, and if that name comes up in those results, those references should be removed. The argument in the case relied on the fact that at the time the publication published the auction notice it was relevant, but ten or fifteen years down the line it was no longer relevant, and yet it was tied to his reputation; so the idea of relevance was very important in this case. The court went into some detail and said that information that is outdated, inaccurate, inadequate, irrelevant, devoid of purpose and of no public interest could form the basis of these right to be forgotten requests. Now, where did the court find this right? Europe has really strong privacy protection; I hope everybody in this room knows that we do not have a privacy law, but Europe has had a codified law since 1995. Interestingly, they were already trying to update the European directive, and conversations about a right to be forgotten were taking place in those discussions to update the data protection regime, and parallel to this, somewhat out of the blue, the European Court of Justice decided that yes, there is such a right indeed. But for the court to be able to grant this right, it had to come from somewhere, and they found the right to be forgotten in a part of the data protection regime that already existed in Europe at that point: the right of erasure. The right of erasure, interestingly, applied only to proprietary databases; for example, if you signed up in the nineties for a magazine brochure and you stopped subscribing, you could write to that organization and say, I am not accessing your services anymore, please get rid of my personal details. It did not apply to search databases. The right to be forgotten extended the right of erasure and brought it onto the digital platform. The right to be forgotten of course throws up some really interesting questions. It also points to a very important philosophical divide: on one view, any information, once it is out on the internet, should not be taken back or tampered with, because it is out there in public, so long as it is lawful and correct. What the right to be forgotten does is create a mechanism to take lawful, correct, truthful information and judge it on the basis of relevance, which of course is driven by context. So you can see how this right could be open to misinterpretation, or really broad interpretation, and that is exactly what is happening. The European Court of Justice's rulings apply to all European countries, and following the Spain ruling there was a trickle effect: countries started looking at how they could bring this right to their citizens. There are many interesting examples here; I'm sorry the font on this slide is so tiny, but I wanted to give a broad view of how fragmented this one right is, even within a jurisdiction that already has a common and somewhat strong sense of privacy and data protection.
that with the passage of time newsworthiness diminishes, which was the case there because the auction order was no longer relevant. In the Netherlands, a similar case came up before the court of appeal, and it held that criminal records continue to be relevant no matter how much time has elapsed; so the time factor contributes to relevance there in a different way. And as with any right, how it is interpreted by the courts and how they define the parameters around it is how it develops. In Spain again, the 2014 judgment initially applied only to search engines; in 2018 it was expanded to apply to the search databases of newspapers and archives, so the scope of the right has already spilled over from search engines into newspapers and archives.

Italy is interesting because there is a conflict where data protection is being used in a way that stifles journalism, and I'll draw out more of this tension later in the talk. Journalists usually have an exception, because there is a public interest in news being available in the public record, and most laws that give you control over information have this balancing test built in. In the UK, for example, if somebody exercises a right to be forgotten, journalists can claim the public interest exception and say, no, this is in the public interest and we will continue publishing. In Italy this exception under the data protection regime does not apply to journalists, and therefore one of the many consequences of the legislation and the national framework is that the right to be forgotten, the right to privacy, has trumped the right to maintain public records in that jurisdiction. Germany dealt with the question of criminal records, and the courts there have been fairly clear that if you have been convicted, or there is some sort of criminal record, then it makes sense for the public to have knowledge of that record, because it relates to your reputation and how people should judge you. Equally, negative publicity was established as a relevant factor; negative publicity was also something Costeja argued, when he said that the links appearing were hampering his reputation. The way the right to be forgotten has been interpreted in Italy essentially translates into this: any kind of inconvenient review, or anything you write about me that I do not like, even if it is published in a newspaper, can be the subject of a right-to-be-forgotten request and be removed from the public record. You can imagine, in the hands of the powerful and the mighty, what a great censorship tool this would be. Another interesting thing that the Spain judgment did, and that a couple of other jurisdictions have replicated, is specify exclusion protocols: publishers should use these protocols to tell search engines which content should be excluded from their automatic indexes. In the Italy decision this was argued in court, and the judges completely ignored the fact that this technical possibility of excluding the search result existed; they went with the broader application of the right.

You will notice I haven't included the UK here, even though it is part of the EU, because I have two very interesting examples I wanted to bring up. The first is a man convicted of benefits fraud in 2012. He forged documents as proof of his innocence, went to Google and said, here are some 300 URLs, please delist them, and Google complied with the request and delisted 293 URLs. This man was so confident that he went back and sought delisting of other links, which related to a conviction for forgery, which is when somebody at Google realized: wait a minute, if he has been convicted of forgery, maybe the identity documents he showed for the first request were also forged. That is exactly what they were, and Google revoked the delisting. Why this is extremely interesting: we trust Google, a huge private platform with the resources to dedicate to this process, to create a strong verification mechanism, and yet, had this man not been greedy and sought delisting of those other links, they would never have found the first error. So the scope for fraud and for dishonest application of the right to be forgotten is very high even in a country like the UK, and we can imagine what will happen if this right is exported to India.

The other thing I wanted to point out is that there are three kinds of broad action happening on the right to be forgotten. You have the courts, which are interpreting and reading this right into existing data protection frameworks at the national level. You have some countries that are introducing or revising their laws and actually incorporating a clause that says there is a right to be forgotten; so if India comes up with a right to privacy, the chances are pretty high that it would include a mention of the right to be forgotten, if not a specific right. But the third, more interesting way this right is being negotiated is through data protection authorities, who are themselves interpreting the right and granting it to citizens. Basically, because the EU has already interpreted this right into existence, the data protection authorities are now issuing orders of their own. In the UK, a former bank clerk imprisoned for stealing money from elderly people's bank accounts was convicted, and the data protection authority actually ordered Google to remove the conviction records; remember, the standard is supposed to be that convicted criminals' records stay on the public record, but here the DPA went beyond that standard.

So what is happening in North America? This has spread; it has not been limited to the European Union. The US does not have a law; there is a bill which is currently being reviewed, for the second time. California has something called the eraser law for minors, which seals juveniles' internet records. These are both legislative developments; the courts haven't really dealt with the right to be forgotten in the US yet. Canada is a very interesting example: the data protection authority, as I mentioned, has independently come up with a report based on a consultation, so the DPA there is proactively seeking to bring in a right to be forgotten for Canada, and it has recommended
that, instead of de-indexing a link, the rank of a search result be lowered, essentially making it really hard to find. The Canadian courts have also specified that search engines should use geo-identifying technologies to actively block Canadians from accessing a right-to-be-forgotten link. More recently the Supreme Court of Canada had a ruling in Equustek, where it upheld that Google should remove search results for specific websites not just in Canada but everywhere in the world. The Equustek case has nothing to do with the right to be forgotten, but it has to do with how Google implements the removal of search results, and it is the extraterritorial application of one kind of content removal practice that Google follows, and this has implications for how the right to be forgotten is interpreted. Canada may now build on this decision and say: not only should geo-identifying technologies block Canadians from accessing that link within Canada, make sure the link is not available to any Canadian anywhere in the world, which essentially means imposing one country's laws on other countries.

In Mexico, again, the scope of the right to be forgotten has been broadened, and there is actually a legislative effort in Mexico to introduce this right. Again we see the trend that a lot of powerful people are using this right to negotiate their reputations online, and of course, when you create a tool like this, who has the time, the scope, the resources and the energy to utilize it to the maximum? We need to think about these things when we consider certain rights. There was a transportation mogul whose case had nothing to do with the right to be forgotten, but the data protection authority again went well above and beyond the scope that has been defined for this right and initiated sanctions against the search engine for "cancellation and opposition to the processing of data". So again there is this conflict: is the search engine a controller, and does it therefore have to comply with right-to-be-forgotten requests? What Mexico is doing goes further: not only controllers, but even if you merely process data you still have to comply with right-to-be-forgotten requests, and that is really tricky; we'll come to that later. There has been another DPA order to remove links in another case, so the DPA is really overreaching in Mexico, and in yet another case the court has ruled that when persistence causes injury, right-to-be-forgotten requests must be honoured; but "injury" is open to interpretation.

In Asia, what is happening? Indonesia is the first country in Asia to actually adopt a right-to-be-forgotten law, by revising its Electronic Information and Transactions Law. In Japan the courts have rejected the existence of such a right; they said it can only be allowed when the value of privacy protection clearly outweighs that of information disclosure. China, too, has concluded that there is no right to be forgotten under Chinese law; the court also specifically noted in that case that autocomplete search suggestions did not infringe the complainant's right to his name and reputation. So if I type "is Jyoti Panday a fraud" and autocomplete suggests "a fool", I can't say, hey Google, you're calling me a fool. It is really interesting how, and in which contexts, these cases are coming up in each of these jurisdictions: we have criminal cases, we have the powerful who want to protect their reputations, we have people who were involved in crimes but were not convicted, and we have people who want to rewrite public records.

In South America I've taken only three countries, but there are a lot of countries there contemplating, or actually trying to work with, this right. In one of them, the highest constitutional court refused to recognize a right to be forgotten and held that a newspaper was not required to remove a report about a woman involved in a crime but not convicted, as in the example I just gave; the court did, however, say that the publication would have to use exclusion protocols. In Peru the court has ordered newspapers to remove links to investigations on drug trafficking carried out by law enforcement. So there is another tension developing: certain information about public officials and people who hold important offices should be public precisely because they hold those offices, and the right to be forgotten is also negotiating those standards, or actually watering them down. In Brazil there is a case pending in which a woman was murdered in 1958, a major TV network made a show about it in 2004, and one of the woman's relatives has sued, saying there is a right to be forgotten and you need to remove it. The supreme court there has also held that search engines need to scrub news stories about fraud in the appointment of a woman as a state judge: the woman became a judge through fraudulent means, this was reported, and the courts in Brazil have now ruled that this judge has a right to be forgotten, though as a public official she probably shouldn't.

Right to be forgotten in India: what is happening here? Sushant, in the next talk, is going to cover a lot of this, but as Alok mentioned, we have the Srikrishna committee report coming, and it will be exciting to see what they actually say, because in the white paper they mention the right to be forgotten and they mention taking a sectoral approach to it which balances free speech and privacy. It is a very broad mention of the right; they haven't gotten into any specifics. They probably mention it because in the privacy judgment that came out, Puttaswamy versus Union of India, one of the judges, Justice Kaul, specifically references this right in a couple of really interesting contexts. He talks about young children and how they may have put up certain information that they regret, and how they should have the right to take that back. But we already have existing laws for the protection of identity, for example in the juvenile justice framework. What I'm basically trying to explain is that the justifications for this right in that opinion could be handled through other existing legislation, which could be tweaked or improved; the rationales used in the opinion do not, to me, suggest that we need to introduce a new right, but we'll see how the Srikrishna committee has interpreted it. Three cases have come up in India. The Karnataka High Court ordered the removal of a woman's personal details from a published judgment; these are typically divorce cases: somebody wants to get remarried, the other family googles their name, and a reference to their divorce proceedings comes up,
and it's a sort of embarrassment. So families approach the courts. Women's identity is a really important issue in India, and courts are usually very favourable towards handing down judgments that seem privacy-conscious where women are concerned. The Gujarat case was interesting: the person who approached the court said this judgment was classified as unreportable, and yet one of these online legal forums had gone and published it; the judge made the distinction that publishing on an online legal database is not reporting. So we see how the fine lines, the nuances and the rationales being used to negotiate this right in each jurisdiction depend very much on the context, the existing laws, and how the courts weigh privacy against the public interest.

Google has of course been at the eye of the storm, and it recently published a report on the implementation of right-to-be-forgotten requests within Europe. Delisting of 2.4 million URLs has been requested by Europeans, and Google has complied for 43 percent of them. Combined, France, Germany and the UK generated 50.6 percent of requests, and usage of the right to be forgotten varies, with Estonia filing the most requests per capita and Greece the least; so even within Europe it is a very uneven picture. Who is making these requests? The total volume of previously unseen requesters has declined year on year. The vast majority of requests come from individuals, and individuals with a sizable online presence make heavy use of the right to be forgotten. Minors, who were what drove Justice Kaul's opinion, account for only five percent of the total number of requests, and this is despite Google having made extra efforts to create awareness of this right within that age group in the EU, because it is mandated to do so. What categories of information do people want removed? There are two broad categories: personal information in web directories or social media histories, and legal history included in news archives and government directory pages. The request categories are varied, and this is not an exhaustive list; I would urge you to go through the report, it is pretty interesting and highlights some really good examples from the different countries. One broad trend over the past four years of implementing this right is that requests about personal information in directories and on social media have gone down, while requests about legal history on government pages and in news articles have gone up. Google suggests that one of the reasons could be that platforms have stepped up their privacy practices and are giving users more control over the data they publish on those platforms, so as these practices evolve across the digital economy, those requests fall. It is an interesting insight: how we shape the overall economy can reduce the need for such a right to exist in the first place. For news articles and government pages you don't really have any control once something is published, it's the publisher's policy, so those kinds of requests have remained pretty stable. On the categories of sites: social media content accounts for 13.9 percent, directory and information sites for 18.8 percent, and news sources and government pages make up much of the rest of the break-up.

So what are some of the trends? News-related content has increased, social media has generally declined, and there are broad variations between countries in the categories of sites requested. In Spain, 10.6 percent of requested URLs targeted government records, stemming from laws that mandate the government to publish public notifications: Spanish law requires informing missing individuals about government decisions that directly affect them, and publishing decisions that absolve an individual from a criminal sentence, so if you have been convicted and your sentence is changed, that has to be published. When laws require your personal details to be part of public pages, these requests are high in those jurisdictions. Requesters in France and Germany were most concerned with information exposed on social media. These variations illustrate the challenge of a one-size-fits-all privacy policy, and this is the point: when we talk about Europe's right to be forgotten, it is very important to recognize that any right to be forgotten in India would have to look at, and take into account, our own legislation, the way our bureaucracy works, and the chances of fraud. If we import the European right into India as it is, we are going to create trouble for ourselves.

What does the right to be forgotten actually do? It creates great power for private corporations. It basically gives them the right to adjudicate on your fundamental rights; it makes them the arbiter of relevance and legitimacy. Interestingly, the Spanish decision created obligations for a specific class of intermediaries, but the scope of the right has since been broadened. Hosts are different from search engines: Twitter and Facebook would be hosts, while Google is a search engine, and countries are pushing to have hosts also comply with right-to-be-forgotten requests. When private corporations are deciding on your fundamental rights, they are doing it behind closed doors; you have no control over it, and the opaque decision-making and compliance will become de facto rules, because if Google, with its scale and resources, sets the standards and the procedure, it is very likely that smaller companies will adopt them. Other interesting challenges come up: what happens to information that needs to be retained for auditing and technical reasons, and what is the territorial scope of the right to be forgotten, with Canada, for example, requiring the use of technologies to block links? Also really interesting, which I haven't mentioned here, is the scope beyond personal information: requests are targeting newspapers and archives, not just search engines. And if the definition of databases expands and becomes really broad, then technically any hosting function, any platform that has a search database, could be covered; if you don't define the parameters of the right, imagine the implications for any service that has a search function built into it.

How does the right to be forgotten impact press freedom and historical integrity? In many, many ways. Basically, it puts the right to know in direct conflict with the right to privacy. It also has a severe impact on what should be considered a public record.
Do we have the right to go back and revise public records? The right to be forgotten creates a right to alter lawful information that is already in the public domain. It also impacts the historical integrity of published work, archives and history, and the duty to preserve information in the public interest, because data about a private individual could become relevant in the future. Today Costeja might not seem important, but after this right-to-be-forgotten ruling, my god, his name is so important. Relevance and context change, and when we take information down it is no longer part of the public domain, the public no longer has access to it, and that is a form of severe censorship. Maintaining the public domain requires transparency and accountability, and when we outsource all of this work to private corporations it becomes very difficult to maintain those levels of transparency and accountability. It is also important to distinguish between the different parts of the process: are we asking the publication to delete the original newspaper article or report, are we asking the search engine to delist, or are we asking the search engine to just rank the search results lower? Depending on which course we adopt, the right will develop its own framework and evolve in a different manner.

Two interesting examples, and then I'll wrap up in five minutes. The first is identity, fraud and exclusion. After Anand's talk and all of the Aadhaar conversations we've been having, identity, fraud and exclusion have been weighing really heavily on my mind, and given the UK example it is interesting, in fact very scary, to think about such a right being brought into India without really strong measures for the verification of identity. The key criterion Google sets up to prevent fraudulent removal requests is verification of identity. Now imagine, as the right to be forgotten expands, and if in India it becomes a really broad right that applies to all sorts of search databases, how would we ensure that every database that has to comply with right-to-be-forgotten requests actually follows a strong and robust verification procedure? The lack of a fail-proof mechanism for verification can lead to dishonest applications, as we saw in the UK example. It also creates scope and incentive for individuals to game the system: tomorrow I could steal somebody's identity, commit a fraud, and then use the right to be forgotten to remove all traces of that identity. So it would be very important to limit this right to a very specific, narrow context even if we do think about bringing it into India. Then there is the threat of malicious actors filing deletion requests. In Italy and elsewhere in Europe you see a lot of cases involving bad reviews of restaurants and of people in the service industry. When you hand someone a tool like this, I could genuinely have a bad customer experience, but the corporation, or that individual, could then simply file a right-to-be-forgotten request and remove those bad reviews. In an economy where part of the value relies on our attention and part relies on the reputations we build, how would the right to be forgotten impact both of those things? And beyond search engines: if the right to be forgotten applies to all types of databases that index personal details, it would definitely hinder innovation, and a lack of records of transactions may in some cases prevent coordination.
For example, if blood banks want to coordinate and you don't want your name shared, then without those personal details coordination in public emergencies becomes harder. Financial data and fraud: removing traces of forgery for people who have been convicted of financial fraud would also create challenges. Deletion of data may also leave individuals vulnerable, because exclusion from a database could mean losing access to services later in life. And, hypothetically, say India has a right to be forgotten, but with the caveat that if you exercise it you cannot use a given service, because that service relies on your personal details to personalize itself for you; in that case the right to be forgotten would effectively be a moot right. So tailoring the right for a very specific context and a very specific class of intermediary matters; ideally this right should not exist at all, but we don't know how that will go.

Another interesting example, and I'm still thinking this through and would love to hear from people in this audience who know more about blockchain than I do; I have no idea how blockchain works, I tried reading about it and it seemed very complicated. But think about the idea that you can take back your identity or revise records: what happens to technologies like blockchain, which rely on decentralized, distributed and, more to the point, permanent, tamper-proof records? Who would be the authority that grants a right-to-be-forgotten request in this context? Who would be the authority, or the stakeholder in that chain, who actually implements it, authenticates and verifies? Transactions cannot be deleted or tampered with; that is how the technology has been designed. The right to be forgotten does not sit well with this, and it would stymie the development of such solutions. Would it also reduce the transparency of transactions? A couple of people in Europe have been exploring options such as storing personal data off-chain and storing transactional data on-chain. I don't fully understand how that works, but from my reading it seems that it increases the complexity of the system, opens the technology up to more attack vectors and makes it more vulnerable.
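To make the off-chain idea a little more concrete, here is a minimal sketch, not taken from the talk or from any real blockchain project, of the pattern being described: the personal data lives in an ordinary, deletable store, and only a salted hash of it is anchored in the append-only ledger, so honouring an erasure request means deleting the off-chain copy while the on-chain commitment stays behind as an opaque fingerprint. The store, the ledger and the field names here are assumptions made purely for illustration.

    import hashlib
    import json

    # Illustrative only: "personal data off-chain, transactional data on-chain".
    off_chain_store = {}   # record_id -> personal data; an ordinary, deletable database
    ledger = []            # append-only list standing in for the immutable chain

    def record_transaction(record_id, personal_data, salt):
        # Keep the personal details off-chain and anchor only a salted hash on-chain.
        off_chain_store[record_id] = personal_data
        payload = salt + json.dumps(personal_data, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        ledger.append({"record_id": record_id, "commitment": digest})

    def honour_erasure_request(record_id):
        # Delete the off-chain copy; the on-chain hash remains but reveals nothing by itself.
        off_chain_store.pop(record_id, None)

    record_transaction("txn-1", {"name": "A. Person", "amount": 500}, salt="per-record-random-salt")
    honour_erasure_request("txn-1")

The extra moving parts this introduces, managing salts, keeping the store and the ledger consistent, and proving integrity after a deletion, are one way to read the speaker's point that such workarounds add complexity and new attack surface.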
So what is the way forward? Resist the adoption of a right to be forgotten; explore other measures to control personal data; bring in procedural fairness through frameworks like notice and takedown, and create more individual-specific rights around such frameworks. Copyright law in India has a notice-and-takedown scheme, and the IT Act has a notice-and-takedown scheme: could we think about strengthening these existing frameworks? Have clear procedures in the event of non-compliance, including going to the courts: DPAs should not be able to step in and order Google, the courts should be adjudicating. And again there is a tension there, because if the data protection authority is ultimately the regulator for your personal data, what kind of powers are we granting the regulator with respect to a right to be forgotten that also relates to content? Clarify the role of the data protection authority, and create exceptions for companies that host. Interestingly, Europe has a new privacy framework, the General Data Protection Regulation, the GDPR, and it includes a right to be forgotten, but none of the privacy experts can actually clarify whether that right is limited to search engines or whether it also applies to hosts like Twitter and Facebook. When Europe, where the right actually evolved, is not clear on this issue, importing it into India without strong privacy frameworks of our own is going to be a problem.

So I'm leaving you with some questions, for me and for you to think over. Does the right to be forgotten solve the problem of countering untruthful or outdated information online? Does the right negatively impact press freedom, archives and the public's right to know? How can we ensure the right to be forgotten does not become a tool for censorship? What are the unintended consequences for fraud and identity theft? Does the right to be forgotten encourage a reputation economy? In the UK, France and Germany there is, interestingly, a bunch of corporations filing these requests, because online celebrities don't have the time to do it themselves and outsource it to these companies; an economy has developed around the right to be forgotten, so would we create a similar economy if we import this right here? And how does the right to be forgotten impact the development of new technologies like blockchain and other immutable ledgers? Thank you.

Thanks a lot, Jyoti. Since we are running a little short on time, what I suggest is that we take all questions after Sushant's talk, so that the audience has a full perspective on the issue. All questions for Jyoti and Sushant we will take at the end of Sushant's session; that will be just before lunch, and we can go a little over time as required. So please join me in another round of applause for Jyoti. Now I'd invite Sushant to speak about the specific issues that he faces as the founder of Indian Kanoon. Introducing Sushant Sinha a little: he is, of course, the founder of Indian Kanoon, as I mentioned; he obtained his PhD from the University of Michigan, where his thesis was on internet security, specifically titled "content-aware internet security"; and he has previously worked at Yahoo. He'll be talking about some of the issues he faces while running Indian Kanoon, given that he has made a whole range of data publicly available, and about some of the privacy issues that come up in that context, specifically around court-related and judicial data. So I'll leave it to Sushant to take this forward.

Am I audible? I am Sushant Sinha, the founder of Indian Kanoon, our website that provides a search engine for Indian law, indiankanoon.org. Today I'll be talking about the right to privacy versus the right to know: the challenges and the way forward. Let us start with when this problem started. Indian Kanoon started in 2008 with just 50,000 Supreme Court judgments and 1,000 central laws. What I did was aggregate all the laws and the court judgments onto a common platform: the courts were publishing their judgments, I fetched them with a crawler and made them searchable in one place. The search engine provided results ordered by a measure of relevance that was more useful to a legal audience. All the documents were also linked together, so if you're reading a court judgment and it cites a different court judgment, you can click through and read it, which improved document discovery as well.
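The talk does not go into how that cross-linking is implemented, so the following is only a minimal sketch of the general idea: scan the judgment text for strings that look like citations and wrap the ones already in the corpus in links. The citation pattern, the index contents and the /doc/<id>/ URL scheme are all assumptions made up for the example, not Indian Kanoon's actual code.

    import re

    # Toy citation pattern; real Indian citation formats are far more varied than this.
    CITATION = re.compile(r"AIR \d{4} SC \d+|\(\d{4}\) \d+ SCC \d+")

    # Hypothetical index from a citation string to a document id already in the corpus;
    # the entry below is made up for the demo.
    citation_index = {"AIR 1990 SC 100": 12345}

    def link_citations(judgment_text):
        # Wrap recognised citations in hyperlinks so a reader can click through to them.
        def to_link(match):
            cite = match.group(0)
            doc_id = citation_index.get(cite)
            if doc_id is None:
                return cite  # cited judgment not in the corpus yet; leave it as plain text
            return '<a href="/doc/%d/">%s</a>' % (doc_id, cite)
        return CITATION.sub(to_link, judgment_text)

    print(link_citations("As held in AIR 1990 SC 100, the appeal is allowed."))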
So, as Alok was pointing out earlier, all these documents used to sit behind paywalls or inside court websites, which are their own silos, and you couldn't really go and read them, search them, or do much with them; Indian Kanoon simplified all of that. By 2011 we had almost half a million users per month. As the court judgments came onto the website and became searchable there, they became searchable on Google as well. Legal judgments were public even earlier, as per the law, but were inaccessible to ordinary people; landing on Indian Kanoon improved their availability on Google too. Over time the website's ranking on these search engines improved, and when you searched someone's name you would find the court judgments related to them on Indian Kanoon.

So what happened is that I started getting a lot of emails from individuals saying that some court judgments needed to be removed. Initially most of these concerned matrimonial cases; some were requests, some were threats, and some talked about my long-term karma. Here are two examples that present this in a more nuanced way. The first request came from a person who had been involved with a girl; apparently it didn't work out, court cases were lodged against him, and later his family was in trouble because they were not able to arrange his sister's marriage. The second is also interesting, because this person had a court judgment that talked about his behaviour and a psychological disorder, and some companies actually put on record that they would not take him on because of that issue. If you think about these two, the cases are much more nuanced than a blanket position that everything should be available to everyone.

To deal with the requests, some policy changes were made on Indian Kanoon. In the beginning, what we received were mostly complaints about people's ability to remarry, because the decisions regarding their past marriages were all available on Indian Kanoon. So, as a policy response, I started blocking matrimonial cases from generic search engines: we were not removing any court judgment, only blocking them from Google and Yahoo using robots.txt, and we started doing that in 2012.
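As a rough illustration of that approach, here is a minimal sketch of generating a robots.txt that asks general-purpose crawlers not to fetch specific judgment pages while those pages stay readable on the site itself. The /doc/<id>/ path layout, the id values and the idea of generating the file from a list are assumptions for the example, not a description of Indian Kanoon's actual setup.

    # Hypothetical list of judgments flagged by policy, e.g. matrimonial cases.
    blocked_doc_ids = [123456, 234567]

    def build_robots_txt(doc_ids):
        # One Disallow line per blocked judgment page, applied to all crawlers.
        lines = ["User-agent: *"]
        for doc_id in sorted(doc_ids):
            lines.append("Disallow: /doc/%d/" % doc_id)
        return "\n".join(lines) + "\n"

    with open("robots.txt", "w") as fh:
        fh.write(build_robots_txt(blocked_doc_ids))

A per-page noindex robots meta tag, or an X-Robots-Tag response header, is another common way to keep individual pages out of search results without taking them down.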
The second kind of request I started seeing was mostly related to job loss: I'm not able to get hired because there is a court judgment about me that does not talk about me favourably. Also, since a lot of people in India are in business, one problem they faced was: there is a cheque bounce case against me and no one is ready to do business with me now. On Indian Kanoon, no policy changes were made to accommodate these two kinds of requests, and of course, when you say you will not comply with a request, people file cases, so things moved on to the courts. The first case was actually filed in 2009, when I was not even in India, and the court disposed of it saying it could not do much with an inter-country request; the Andhra Pradesh High Court dropped that request entirely. Many court cases have been filed for name removal: some asked for complete removal of the court judgment, some asked for name anonymization, and some were flatly denied, as in the Gujarat High Court, which I'll come to.

The primary argument in these cases was that people were facing hardship in their lives because what happened in the past is recorded in these judgments. The Gujarat High Court case, which Jyoti also mentioned, was interesting: someone filed a petition citing a very specific high court rule, arguing that non-reportable judgments cannot be taken by a third party without explicit approval from the court. It was interesting because it challenged the court based on its own rules. I had no representation of my own, so it was an ex parte case; the Gujarat High Court dismissed the request entirely, saying there was nothing in it to show that anything could be done. A more principle-based petition was filed in the Delhi High Court, based on the right to be forgotten and seeking removal of a particular name from a court judgment. The principle was based on the earlier decision of the EU court, but the EU courts did that with a privacy law behind them, and here we don't have a privacy law, so it is much harder to grant such a request on a principle like this.

Of course, the question that comes to mind is: the Supreme Court has now said you have a right to privacy, so why is it not a violation of privacy to report everyone's details in such a detailed manner? That judgment was important in that sense, but it is worth noting that the Supreme Court has treated the right to privacy as a fundamental right for the last four decades: starting from Maneka Gandhi versus Union of India in 1978, the Supreme Court has repeatedly upheld individuals' rights concerning their personal choices and their ability to control the dissemination of their personal information. But, as with all rights, this right is not absolute and is subject to a set of restrictions. So what are the exceptions to the right to privacy? In the R. Rajagopal case of 1994, the Supreme Court laid out the contours of the right to privacy while dealing with the publication of a book about a convicted criminal; the publication was challenged on the basis of the right to privacy, and the Supreme Court gave a detailed judgment on what the right to privacy is. It held that the right to privacy ceases to exist in matters of public record, including court records: if you write a book based on public records, that is, based on court judgments, no one can do much about it. The second point is that people also have a right to know on what facts and arguments a court came to its decision. You need that because appellate court decisions are precedents in future cases: if you want to argue your case, you have to produce precedents and say, you did the same thing in this case, the facts and circumstances are similar, so grant me the same relief. And lastly, courts are a public repository of facts, which I'll come to later, and that is something that limits the right to privacy.

Now let us come to the right to know. The Supreme Court has placed the right to know within the freedom of speech in S.P. Gupta versus Union of India (1981). It says that the people of this country have a right to know every public act, everything that is done in a public way by their public functionaries; they are entitled to know
the particulars of every public transaction in all its bearings. The right to know, which is derived from the concept of freedom of speech, though not absolute, is a factor which should make one wary when secrecy is claimed for transactions which can, at any rate, have no repercussions on public security. As you can see, in the S.P. Gupta case the Supreme Court laid out a pretty broad framework for the right to know: every public functionary, and of course the courts are public functionaries too, you have a right to know how they do things and how they arrive at decisions, and court judgments are an integral part of that. So people have a fundamental right to know about public functioning, and the Right to Information Act is an enabling statute that makes that right meaningful with respect to public authorities.

Here are some instances where the right to know embodied in court judgments has helped. A person was arrested by the Kerala police after a man googled his details and read a court judgment about him on Indian Kanoon: this person was apparently trying to defraud the man, who googled him, found the judgment detailing what he had done in the past, alerted the police, and he was arrested. So the identities of individuals do matter. Recently a journalist dug into the Lok Niti Foundation and how PILs were being used to further the government's agenda in the Supreme Court; so even when the petitioners are groups, it is often important to be able to uncover who they are. And of course there is the value of precedent: much earlier, Muhammad Adil Hussein, who took AMU to the Allahabad High Court when he was illegally suspended, represented himself in court, used Indian Kanoon, and succeeded. The value of precedent is available to every citizen: when something illegal happens to you, you can assert your rights, go to court and represent yourself, so the availability of court judgments matters here too.

Now, having established people's right to privacy and right to know, let us see how they cross each other. Once a personal or commercial dispute reaches a court, it becomes a matter of public record; but are people entitled to know every such dispute, or the entire details of such a dispute, just for the sake of the transparency of the courts? It appears to me that this blanket exception for public records is too broad: we need to balance citizens' rights with people's right to know, and courts should also respect people's privacy to some extent.

Now let us talk about the different challenges in modifying court orders. Court judgments are a repository of facts. They record acquittals and convictions for particular offences; they record whether the income tax authority levied fines on you and whether you contested and won; and there are many occasions when you go against the state government or the union government over the denial of your own rights. These are important repositories of facts. You will also notice that many of the things courts do, such as partition decrees and succession decrees, are important in themselves. I realized this when my father wanted to sell his ancestral home: the person taking a loan to buy that home actually had to produce the partition decree made in respect of that ancestral home.
There was a will preceding it, and whether that will had been probated by the court or not mattered. So these documents are not just records of proceedings: when we are talking about something forty years back, they are also what justifies your property rights and a lot else besides. Even personal matters like a divorce decree, an adoption deed, a maintenance amount, all of this goes through the courts, and these orders are actually enforceable. So one thing I want to make clear, and one of the things people hate me for, is that I just don't change judgments, and the primary reason we cannot simply remove people's names from them is that the names carry a lot of importance.

The second set of challenges: documents are remembered by party names. When you say S.P. Gupta versus Union of India, you know which case is meant; when you say Kedar Nath versus State of Bihar, you know which case is meant; when you say Kesavananda Bharati, you know which case you remember. If all the names are changed to something like ABC versus XYZ, or ABC-123, who will understand which court judgment is being talked about? This matters in common law systems, where court judgments are themselves law, and many cases have a rich history behind them. For example, you will have heard of the judgment in Kedar Nath versus State of Bihar, which challenged whether the sedition law is constitutional and upheld it only in a very narrow way; if you go back and read about Kedar Nath, there is a lot of rich history behind that case. Similarly, one of the most important judgments to come out of the Supreme Court is the basic structure doctrine in Kesavananda Bharati; if you go and read about the mutt which Kesavananda Bharati was running, and why the question of eminent domain arose, that is history everyone needs to know. Which case becomes a matter of historical importance tomorrow, you never know, and any mangling of the names will hurt historians and ordinary citizens trying to understand how these cases came to court.

So let us talk about the removal of sensitive information. Since court records are public records, let us do some proactive removal; this is something I advocate the courts themselves need to do. Proactive removal is important because data, once released, is almost impossible to erase from the internet. There is also a need to develop a principle of harm to exclude victims' names: currently such protection applies only to rape victims, and that too only in the trial court judgment, so if an appeal reaches the high court or the Supreme Court and the name somehow appears in their judgment, the identity is no longer protected. This should be expanded to cover victims of molestation and sexual harassment, and to all court documents, not just the trial court's. One thing a lot of people suggest is abbreviating names, that is, partial name anonymization in certain civil cases like divorce, child custody and child adoption. While I think those are fine approaches, as I said, they will affect court documents as a repository of facts: once you have removed the names, no one knows which case you are talking about.
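To make that trade-off concrete, here is a minimal sketch of what partial name anonymization of a judgment text could look like mechanically; the names, the pattern and the placeholders are invented for the example, and real redaction would need far more care than a simple string substitution (aliases, spelling variants, details that identify a person indirectly, and so on).

    import re

    def anonymize(judgment_text, replacements):
        # Replace whole-word occurrences of each listed name with its placeholder,
        # e.g. a minor's name with "the minor son of <parent>".
        for name, placeholder in replacements.items():
            judgment_text = re.sub(r"\b%s\b" % re.escape(name), placeholder, judgment_text)
        return judgment_text

    sample = "Custody of Rahul is granted to the petitioner, Anita Sharma."
    print(anonymize(sample, {"Rahul": "the minor son of the petitioner",
                             "Anita Sharma": "ABC"}))

Once the parties become "ABC" and "the minor son of the petitioner", the judgment is much harder to cite or to match against other records, which is exactly the repository-of-facts concern raised above.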
You cannot use that judgment to say: see, this is how I adopted my kid, if someone later challenges the adoption or something like that. Once you remove names like that, it also affects how people have to cite a judgment. One more thing I would like to mention: I have heard of many instances of minors being harassed in school because their names appeared in a court judgment. This is sensitive, because when a name appears in a judgment relating to, say, the divorce of someone's parents, children in class 10, 11 or 12, who are not that mature, go and harass the younger kids, and those kids are under enough pressure already and are not psychologically developed enough to deal with it. So one thing we need to adopt is that there is no need to publish the names of minors in any court document: you can always say "nth son or daughter of" whatever the names of the parents are, but the minor's name itself should be completely avoided.

There is also an issue with how court judgments are written, and I have proposed that we regulate the way judgments are written. One thing we need is for the court to put its finding, conviction or acquittal, at the top of the judgment. Many times people say, I was acquitted, the court acquitted me, but who is going to read the whole judgment? If the court records its own finding at the top, people at least don't face that problem. There is also the problem of judges copying facts verbatim from the pleadings into the judgment, which I think is a mistake, because you end up copying all the allegations people throw at each other simply because they dislike the other party; only the facts relevant for the judge to arrive at a decision should be placed in the public domain. Also, people should stop assuming that going to court means the decision will be private: once you go to court, the decision will, under the law as it stands, be published and publicly available. So don't think, this is my private decision, this is my judgment, why are you putting it up without my permission; it is the court's decision, not yours.

Now let me come to redaction after publication. The right to be forgotten, I think, is a bad principle, because it pits a private party against a private party: it asks Google to determine whether content is valid or important, and on what principles is a private party supposed to decide that? In fact, Google does no manual review of the URLs to be blocked, no check of whether a request actually meets the European Union's criteria. Many people have claimed that a particular Indian Kanoon link should not be displayed and got it delisted from Google's European Union sites; I have complained many times, and there is no review, no manual review by Google. And it makes sense from their side: why would Google invest time and money in examining whether content is relevant or not? The EU court has said remove it, so they just remove it, and that is a significant harm to people's right to know, which Jyoti talked about at length, with reputation management outfits and people trying to get URLs removed. So the right to be forgotten should not rest on this private-party-versus-private-party arrangement of people approaching Google or approaching Indian Kanoon.
It should be decided by law: the law needs to draw bright lines, and courts, in their administrative capacity, need to decide these issues rather than leaving them to each individual publisher. It should be based on a principle of harm that is demonstrable, or on correcting an error that caused certain facts to be leaked into a judgment; that's all, and I don't think redaction should go any further than that.

Which brings me to the last slide: privacy in public datasets. I think court judgments are still only the tip of the iceberg, because they are just the first public dataset to become available in electronic form. More datasets are coming. One of them is first information reports, FIRs, which are also public datasets as per a Supreme Court order. What is an FIR? People lodge complaints at a police station against known or unknown persons, with full details about the alleged crime. Some of it may be true, some may be concocted; basically these are completely unverified allegations. In a court judgment there is at least a judicial mind applied to decide what is correct and what is not; in an FIR there is no judicial mind, it is one person's claim against another or against persons unknown, and FIRs are in fact being published in Delhi and many other cities on a daily basis. Another dataset is the functioning of government itself: as the S.P. Gupta case showed, almost the entire functioning of government is public. So in future, when tenders, project details, land records, land use and land conversion all become public in a searchable, digitizable format, I wonder what happens when a search engine is built on top of that and people can search it very effectively. Thank you.

Thank you, Sushant. Jyoti, if you could join us, we will have a round of questions. You can direct your questions to Jyoti as well; I think we can have both speakers take them, and since a lot of the topics covered were common, just mention whether you want Jyoti specifically, or Sushant, or either of the speakers to offer their views. So, can we have any questions? Yes, ma'am, I think you had a question? Oh, sorry. Could you also introduce yourself?

I'm Venkatesh, and I have a question regarding the dividing line. At least in certain cases, and I may not know all the details, it seems fairly clear-cut: if it is a personal civil case, not a case brought by the state and not in the public interest, then why should we not anonymize the names? I noted your point that citing becomes difficult, XYZ 123 and so on, but I don't think that many individual personal court cases are cited all that much, and even if they are, in those cases I think anonymity should trump the inconvenience or the awkwardness of how these things get cited.

Two parts to that. One is that, of course, I think the courts, in their administrative capacity, should be doing much more to define what needs to go into a court judgment and which parties' names should or should not be anonymized. But court documents, as you know, are also a repository of facts: someone goes to a court and gets a divorce decree, and it is a matter of public record; until and unless you are ready to completely forgo that, it is something the government or the courts have to decide.

To add to that point: you are probably thinking of divorces and these personal cases, but what if it is a private party involved in corruption cases, and what if those private parties are also managing public projects? So again, as Sushant said, we should look at whether it is an individual bringing the claim forward, a group, or a corporation. I get your point that there is a need for anonymity, but in many cases the search function is really hard anyway; as a non-lawyer you really can't easily find legal documents, and these names and references become an easy marker for finding information. That aside, in some cases the fact that this action has taken place should remain on record; there should be a very high threshold for removing records that are part of the public domain, or at least a very strong, legitimate justification for why anonymity needs protection in that particular context. And we do have laws like that; Sushant will probably tell you more.

Yes, there is a specific provision in the IPC about not disclosing a first marriage when contracting a second one; not disclosing the first marriage is a crime.

That is different, right?

I'm saying, if you do not disclose your first marriage while entering into a second marriage, it is a crime.

Perfect, but that doesn't necessarily mean we should display the full details of the first marriage. We shouldn't be conflating those things: we are talking about publicly displaying personal battles, not businesses, corporations, corruption and all of that, just personal battles.

But then how would somebody find that out? Suppose I want to find out whether it really was someone's first marriage.

You have an obligation to disclose, as you pointed out; that is necessary. But should it necessarily be displayed?

How would somebody find that data if it isn't displayed? Say I get married, and this person who has come into my life has already been married before and does not reveal it to me: what is the avenue for me to gather that data using my own resources, if the record of the first divorce is not there?

Sorry, can we move on to the others as well? We can continue this discussion, since there is a BoF after this session.

Hi, is this the mic? You spoke about how blockchain is complex to understand; I think it's the other way around, it's far easier to understand and reason about than the policy, the laws and the legal framework. So let's say you want to build a blockchain solution which could be a solution to many of these privacy problems, just like what is happening now, where there is nothing anyone can do about fighting fake news because it is end-to-end encrypted; it kind of solves many of these problems, because you can just raise your hands and say, it's a public fact, a public record, there is nothing I can do about it, it's just over and I can't help
you so if you were to even build a solution of that i think how could you interpret many of these existing things in intervening courts and then changing policies changing laws and then the super court powers that exist in the system so first of two which is also a question to you which is would you consider building such kind of a blockchain solution wherein you can say that there's nothing i can do because it's a fact of the card and it will increase you know i can't intervene in the system because it will modify and the system exists like this so that's a really great premise for the question but actually i would beg to differ from your opinion that if you know if whatsapp problems had been fixed by the fact that it was using end-to-end encryption technology all of the legal notices and the government orders and the media flag that it has been getting for the lynchings and fake news would not be happening and there will be efforts to you know over-regulate that platform because we have a very censorious attitude towards speech in this country but having said that i'm no expert on blockchain what i found really interesting while prepping for this conference was that we are introducing rights and we're introducing rights in with respect to certain focus and a vision that we have about empowering citizens but at the same parallel you know in parallel to this legislative framework of empowering individuals there are technologies that are doing the exact same thing and what i found really odd was that you know the implications of the rights being developed to empower citizens and their implications on the technologies that are being developed to empower citizens are not necessarily speaking to each other and i i see no way for me to be able to answer your question till i see more you know integrated thinking and coming together of people from both those sectors to actually think through these issues because that is exactly what is happening a lot of decision-making is happening in silos where government officials are like oh my god lynching is happening let's regulate them but like are we bringing everybody in the same room and being like if we do introduce this what would be the repercussion one point to add we don't want to be in a situation where tomorrow the government says we're not going to allow any end-to-end secure encrypted messaging service because lynching is happening right i mean that's the worst possible outcome of any discussion so i think the point which if i understood jyothi correctly was to say keep in mind that these two cannot happen on parallel tracks they will have to in today today at this moment think of how this developing this technology is being developed without understanding that a parallel conversation is happening of how about this right to be forgotten that may suddenly crash into this and we may end up with the worst possible solution i think that's the sort of just the point that i want i mean broadly government has far more powers than technology right i mean it can just shut down an entire city no internet today in in rajasthan so i mean government is way more powerful bitcoin is a distributed ledger for currency management if they shut down bitcoin you can't do any transaction to bitcoin so i mean don't underestimate the power of government as i said he can they can just say drop encryption totally from all devices so i mean we want a more nuanced and people meeting up rather than extreme steps yeah sorry i'm i'm here uh this hello can you hear me yeah 
So I do understand that the right to know, the right to your public information, should be available and people should be able to mine it, but don't you think there should be some legislation, some kind of laws, on how that data can be used? For example, yesterday there was a talk on taking people's electricity bills, using them to segment people and then using that for propaganda, and that propaganda might be really harmful and may even result in riots. I love that example; it was an eye-opening example for me personally. You're right, we need really strong protections, very strong boundaries on who is collecting our data, what purpose they are collecting it for, how they are going to keep using it and how long they are going to keep it, but this is exactly what the Srikrishna committee has been working on, and Alok would actually be the best person to speak on this; I invite you to give us some views. I strictly can't, but let me just point out that the committee report will be discussing this point; they are going to submit it today in the evening, as I mentioned earlier, and whatever bill they recommend is probably going to address it. There are probably going to be more discussions about it; it is going to be an evolving framework. I will not say today that we will have one law and the law will address all possible questions; it is going to have to evolve, because people are going to respond to the law, and the law will have to respond to people, and vice versa. So I would say let's wait for that report when it comes out today. The principles, I think, are going to be more important than the exact law: you're absolutely right that we need a law for this, but more than the law, the principles on which this law is going to be made are very important, and the institutions we are going to create to enforce this law are equally important; I think we miss out that institutions part in a lot of our discussions. It's unfortunate that this discussion is happening one day before that report comes out, but that's a very valid point, and maybe a lot of the answers will come out in it. There was one more question over there. Hi, this is a larger question: I am both a radical archivist of sorts, for opening up information, and for shutting down information for privacy. The larger question right now is that transparency is required, but so is privacy, and as you mentioned in both your talks, we need a theory of harm to say when transparency should win out and when privacy should, what the larger societal needs are versus an individual's privacy requirements, what trumps what. Most of these discussions have been academic, and it is only because the government's mandatory data-collection requirements under the Digital India programme are affecting individuals that people are very agitated now. The important question is: how do we ensure these debates are held more often, and how do we evolve the theory, even in academic circles, both from a technical point of view and from a legal point of view? On the technical side, the concept of redaction, for example, is not there in the current court-record publications. I mean, we should have more
talks and more debates around this; we are having one today, and certainly more would help. This is a new, upcoming area where academics should invest time and effort into delineating the issues; I have been in touch with some of them about what needs to be done, and some of what you have seen today, the proposals about what should and should not go into a court judgment, actually comes from that. I'm going to be a bit controversial and say that we owe Nandan Nilekani and the India Stack developers a whole lot of thanks for opening our eyes to the creative, wonderful ways in which our data can be used and weaponized against us. I'm also very glad these conversations weren't happening even five or ten years back, for one selfish reason and one not-so-selfish reason: the selfish reason being that I wouldn't have been a part of it, and the not-so-selfish reason being that I don't think we as a country were ready to take on these questions. The fact that Aadhaar happened, the fact that we are contemplating and negotiating with ourselves what our digital identity should be, what the government imposing this identity means, and then linking databases and that identity to all manner of services: a lot of what this has done is open our own eyes to how we want to be seen. Do we want to give up our data or not, is this app collecting our data, is our location button on? We had a birds-of-a-feather session on AI, and it was really interesting: we started that session by asking people what measures they actually take to protect their privacy online, and I was amazed to hear the variety of answers. So I'm glad this conversation is happening now. I agree with you that we need to move a little away from the academic discussions and bring in more practical approaches, but more than that we need to open these conversations up beyond the same silo of the same ten people sitting and talking about it; I hope they are bored of hearing each other and invite us in. One last point I want to make is that the important part about law meeting technology is that many times the tech people do very interesting, exciting stuff, but then the law catches up, and the law has its own ways of punishing. A prime example is someone like Aaron Swartz, who was doing phenomenal work and had a deep understanding of society and politics, but the way the law went after him, and that he was driven to suicide, is just shocking, and it is something the tech people need to understand about the law. There should be more meeting of the minds of law and technology. So, just very quickly, lunch has begun, so if you'd like to go ahead and start lunch you're most welcome, but we will continue this discussion because I see there are quite a few questions, so we will continue to take them. I'll just take one there. Hi Jyoti and Sushant, it was a very nice talk and kind of an eye-opener for me that these things are available in the public domain today. The fact that a site like Indian Kanoon, which publishes all the judgments and laws, exists today is a very good thing. My question is the reverse of what we are discussing here, the right to be forgotten; I'm trying to
ask the question from the other side. Say something has happened, something has been reported about a person (and I'm only talking about public records right now, nothing private): if some person has done something and that is not coming up in his Google search, then I should have an issue with that. So it's a reverse question; am I able to convey it? Yes, and this is exactly the debate we were having with Alok: people's right to know is a fundamental right and we need statutory protection around that too. The only one we see right now is the Right to Information Act; perhaps we need something broader. To be fair, I've argued for a while now that the right to information itself needs to be a fundamental right. Right now it has been interpreted as part of your freedom of speech and expression; I have said that the right to information and the right to know should be a fundamental right in their own right, of course subject to privacy and some other considerations. That is something that needs to be proactively put out there. It's a very valid point, but there are issues about how people find out: what do people have a right to know, what publicly available information do they have a right to, and who do we put the obligation on? I would say the obligation should be on the government, not just on other private entities. As Sushant pointed out, there is the Right to Information Act, and everybody talks about Section 8, the exceptions; to me the most important provision, the one everybody forgets, is Section 4, which says the government has a duty to proactively put out information, a duty it fails to discharge. To me that should be the most important part of the Act, and we should think of ways of making it more robust; that is really something we need to work on. Just to follow up on this: the right to be forgotten is something like having another piece of legislation on top of the existing one to nullify something; so if you have both the original record and the correction in the public domain, is that sufficient to say the right to be forgotten has been exercised? I'll allow Jyoti to answer that, but I think it's basically very similar to ledger technologies: something has happened, and then the corrective action is also publicized on the same portal. Yes, and this concern is actually coming up under the GDPR, where they are saying that if you take blockchain, for example, and start implementing a right-to-be-forgotten request on a blockchain, the request itself becomes part of the transaction history in some way or other, which defeats the purpose of the right. So again it will depend very much on how that right has been interpreted and what parameters and boundaries have been drawn around it. Just to add to the point about RTI: if you follow the RTI debates, you would be very interested to know about a bill. The Srikrishna committee is coming up with a set of recommendations; usually these recommendations are principles, as Alok pointed out, but part of working through the idea of coming up with these principles, or
codifying those principles into a kind of template for how they would apply, is a bill, which is usually annexed as a sample, as something to be considered and mulled over as part of the recommendations. One version of this bill, and we don't know whether it is the final version or not, has been leaked, and The Caravan has been reporting on it, and it is absolutely shocking to see that there are provisions in it seeking to water down the existing RTI Act. So not only is the government not fulfilling its obligation to provide data publicly, but even for the provisions we do have, there is, somewhere within our government machinery, some kind of coordinated action or thinking about watering them down. I think as citizens our hackles need to be up; we need to be really aware of and checking what is going on with our right to information, so I would urge you to follow these discussions really closely. Divij at the back. Hi, firstly thanks for your great presentations, they were really informative; I learned a lot more than what I knew about this. My question is to both of you, with a little bit of a comment: I was wondering if it makes sense to conceptually frame the right to be forgotten in two different ways. One is a general right to deletion, which you've seen evolve in privacy principles: anybody who holds information about you, as long as it's your information, you should have a right to access it, control it and delete it. The other, and this is where the right-to-be-forgotten framing came up in recent history, is about the ease with which this information is accessible, searchable and indexable, which is why in Europe the context is the right to be de-indexed. But in your presentation you gave a lot of other examples, like a right to be forgotten asserted against a specific newspaper, which is definitely not the context in which it evolved in Europe. So if you see it only from that second context, the right to be de-indexed, the right to not be so easily searchable, to raise the cost of search, does it make sense to start talking about how we regulate the ways in which these entities enforce the right to be forgotten, if we are going to place the responsibility upon them? Because the technologies we use for searching are platforms: Facebook, Twitter, Google, whatever search engine; and right now it's done in an absolutely ad hoc manner. No, actually it's not done in an ad hoc manner, and there are a couple of issues with your statement which I want to step in and correct. Facebook and Twitter are not search engines; Google and Yahoo are; the former are hosts, and the kinds of obligations differ. They are all platforms, all letting you access information, but you are interacting and engaging with that information in a different way depending on the platform you access it from, so the obligations and liabilities for these platforms have been constructed around the role they play in allowing you access to that information. The search engine becomes a really important factor in accessing information because it is your first point: you want to look for anything, you type it there, and that is
why, if my reputation is connected to a search URL, it is very significant for me, because anybody who types my name will see it, whereas on these other platforms like Twitter and Facebook there are other mechanisms you can tweak to create more personalized control over your data, which a search engine doesn't afford you. So fundamentally these platforms are very different and we should be very clear about how we talk about them. Your question was about just raising the cost of search, and a lot of this thinking is actually happening in the EU; you see the distinctions even in the Canadian judgment, which I find very interesting. At one extreme end the judgment talks about using geo-identification techniques, and I found that a tricky bit, but apart from that I think it gives, or at least moves towards, an elegant solution: it says you don't remove the data, you rank it further down based on the request, so the content does not escape the public domain, but the effort required to access it increases. I think that will happen more and more as judges and courts begin to think about these challenges and issues; it's only a four-year-old right, after all, and to be fair Google did do a very robust consultation, they invited academics. But the procedure: everything is done algorithmically, I don't think they are manually looking at these requests, and as with any kind of algorithmic decision-making it's opaque and it requires some sort of oversight; they are not looking at the content, whether it meets the irrelevance and other criteria for delisting; they are only verifying the identity of the person who has requested it and then delisting. Sorry to interrupt, but I think this private-versus-private debate is a very bad debate and we shouldn't be having it, because the most common thing I hear is "take money and remove my court judgment", and this can happen with any platform. Why should I not take the money; I would be much richer; I should just collect money; but this kind of extortion shouldn't happen in the first place. And once you give a right to Google to delist, and it starts delisting, almost the entire content gets delisted; I wonder how you deal with that kind of thing. So I personally disagree with this private-versus-private framing: if you want to frame a law on what should and should not go into the public record, there should be a law for redaction, and only a court, in its judicial capacity, should exercise that; private parties should not. And that's actually what my question was: with content moderation at scale, if Google received two million requests last year, that's a massive scale even for algorithmic deliberation, which would have its own problems. As far as I know, actually, from a conversation three years ago with someone at Google, they had something like 500 people who look at these requests manually; it has possibly changed, I don't know, but at some point it was a manual deliberation. Which brings me to my actual question. Don't stop, sorry, I mean continue after lunch, because the same people will be there. But I agree with what Sushant said: basically we are giving too much power to these
platforms to make these public determinations, which I think is something we need to take into consideration. Thank you everyone; I'm afraid we'll have to end the discussion now, but we'll continue after lunch. And there is one announcement to be made, just a quick heads-up: there's one BOF on Elastic, so if you're a fan of Elastic please head upstairs; there's another one on driving data-driven thinking in your organization, also happening upstairs; and the BOF on data privacy is going to happen in room 225, so I hope to see all of you there.

Financial portfolios. Are you ready? Yeah, I'm good. All right, good luck; can you give him a round of applause to set the stage for him? Hello everyone. I think people are still moving in, but we'll start. My name is Anand Gupta; I've been working with Morgan Stanley for quite some time now, it's been years. Today's talk is about the world of finance and how we can apply deep-learning methodologies to it. Finance has always been a very fascinating topic. If you go a decade back, everybody graduating wanted to join some fancy hedge fund on millions of dollars; that was the buzzword back then, though sadly not now. One more group of people got fascinated by that world, and that was academia, and the reason is that the world of finance has been full of equations, formulas and statistics right from the word go, even decades ago. Even today, if you ask a layman, the perception is: oh, you are a financial analyst, you must be working with the Black-Scholes model and things like that. But underneath all of that, the fundamental principle is very simple: it is based on the concept of demand and supply. Let's say you have a product, and there is a buyer for it and a seller for it; the buyer will have a bid price and the seller will have an ask price; these move along, and as soon as there is a match, the trade is locked. This is what is happening in all the exchanges around the world: when you see the prices, that is something a buyer and a seller have agreed upon. And what defines a product? A product is anything, as long as you can have a buyer and a seller for it; nowadays we have derivatives on whether it will rain tomorrow or not, so there are products like that too, but we won't go to that extent. Today we will discuss the simplest product we have, the common stock. So let us take a look at some stocks. The first stock is Iluka Resources; this is a mining company, and it specializes in some rare-earth elements. The second stock is Nvidia; everybody is aware of it; it manufactures GPUs and some systems-on-chip. The third stock is Electronic Arts; we all remember our FIFA and cricket days, and this company is one of the pioneers and largest video-gaming companies in the world. And the fourth is Bitcoin; I don't need to tell anybody about Bitcoin; the insanely high prices have made sure everybody knows about it. Why did I display these four products? There must be some relationship. If I look at the relationship as a layman analyst, I can simply say that Iluka is a mining and metals company; it provides raw materials to Nvidia to
manufacture its GPUs, and those GPUs are used in Electronic Arts' gaming consoles and also for Bitcoin mining. So can I derive a hypothesis that if the stock price of Iluka increases, the stock price of Nvidia will increase, and similarly for Nvidia and Electronic Arts, and for Nvidia and Bitcoin? That is just a hypothesis; we need to verify it. So let us take a look. If I look at the returns of Electronic Arts and Nvidia, we can see some kind of relationship: they are not exactly the same, but they go hand in hand with each other; keep this visual in mind. On the other hand, Iluka and Nvidia do not go hand in hand as well as Electronic Arts did, but they move broadly in the same direction. Now if I take Bitcoin, something is wrong: Bitcoin is definitely not in the same league as Nvidia; it has a lot of variation and Nvidia doesn't. So our hypothesis that if Iluka increases, Nvidia increases: correct; if Nvidia increases, Electronic Arts increases: correct; but the one with Bitcoin: incorrect. Now, there are many companies, products and stocks around the world, around a hundred thousand stocks, and it is difficult for an analyst to try to understand all of that. There are some knowledge-graph approaches trying to solve this problem, but they are not completely up to the mark yet, so we are still relying on statistical methodologies for now. We need something else. So for today's topic, I'll quickly cover the basics of finance, which will include stock returns, covariance, correlation, the portfolio and the risk-return graph, and then I'll move over to the deep methodologies. Stock returns: a stock return is nothing but the amount of profitability you can get from a particular stock. If I have a stock whose price yesterday was 10 and today is 11, the profitability is (11 minus 10) divided by 10, so 10 percent. If I do this calculation for 10 days, what I get is the one-day return rolled out over 10 days; that is what we get as rolling returns. Now, that is for one stock, but we would like to relate stocks to each other, and that is where the covariance matrix comes into the picture. You all know about variance: given a vector we can find the variance, which is the squared deviation from the mean. The same thing can be applied across multiple stocks. For example, if you take a matrix with stocks as columns and prices as rows and compute the covariance, what we get is a three-by-three matrix. What it is saying is that stock one's relationship to stock two is 0, and stock one's relationship to stock three is minus 2.4, the covariance factor; the diagonal entries are the variance of each stock with itself, and you will see repeated values because the matrix is symmetric. One problem we see here: if you look at these two values, minus 2.4 and minus 5.6, so for stock three the relationship with stock one is minus 2.4 and with stock two is minus 5.6, can we say this is roughly a two-times stronger relationship? No, we can't, because we don't know the upper and lower limits. That is why the covariance matrix alone is not a good idea, so something else came up, and that was correlation analysis, in which we scale the covariance by the standard deviations themselves.
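As a concrete sketch of these basics, this is roughly how the rolling returns, the covariance matrix and the correlation matrix could be computed with pandas; the tickers and prices here are made up purely for illustration, not the speaker's actual data:

```python
import pandas as pd

# Illustrative daily closing prices; column names and values are made up.
prices = pd.DataFrame({
    "ILU":  [10.0, 11.0, 10.5, 11.2, 11.8],
    "NVDA": [200., 204., 201., 207., 210.],
    "EA":   [120., 121., 119., 122., 124.],
})

# One-day rolling returns: (today - yesterday) / yesterday.
returns = prices.pct_change().dropna()

# Covariance is unbounded, so magnitudes are hard to compare across pairs.
cov = returns.cov()

# Correlation rescales covariance by the standard deviations, so every
# entry lies between -1 and 1 and pairs become directly comparable.
corr = returns.corr()

print(cov, corr, sep="\n\n")
```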
What we then have is a neat little matrix where all the values range from minus one to one. So now, if you look at the same values, minus 0.49 and minus 0.65, we immediately know the relationship: we can say one pair is about 50 percent negatively related and the other about 65 percent negatively related. Now let us bring back the same charts; this will be a little puzzle for you all. For Electronic Arts and Nvidia the correlation came out to 0.115. If I ask you what the correlation value for Bitcoin and Nvidia will be, compared to 0.115, less or more? Let us take a look at the value: it is actually 0.2. So you immediately see there is some problem with correlation analysis, because it is statistical in nature; what we see visually, our mind, with all its neural networks in place, is able to gather, but the numbers look only at points, not at directions. Again, to wrap up the covariance and correlation part: as I mentioned earlier, a covariance of 0.06 by itself does not make sense, while the same relationship expressed as a correlation reads quite differently. So, some takeaways: correlations can sometimes be deceiving, as we already saw, and correlations are still better than covariance for making choice decisions. Now, we have spoken about stocks and how we can analyze them, but what next? What we want is to earn money out of it, and for earning money we need to buy stuff, and we don't buy a single stock, we buy multiple stocks; those multiple stocks constitute a portfolio. The word portfolio is self-explanatory, but we would like to represent it mathematically. So let's say we have 100 rupees and we invest 30 rupees in stock one, 50 rupees in stock two and 20 rupees in stock three; the corresponding weight vector will be 0.3, 0.5, 0.2. This is one of the vectors we will be using for the rest of our analysis. So the vectors defining a portfolio are the weights vector; the returns, which come from the market for the stocks we have chosen; and the covariance matrix, which is based on those returns. Is everybody clear up to this point? I think it should be okay. Now, as with any problem we are trying to solve, we need metrics: how do we evaluate whether the portfolio I have is good or bad? We have two simple metrics. One is return: the portfolio should give me profits. The second is a new concept that I'll discuss now: risk. In financial parlance, risk is nothing but the deviation from what we expect, and the expectation is nothing but the mean. So if there is a stock that deviates a lot from its mean, we don't want that stock, because we don't know how far away from the expectation it will end up; we want a stock that remains stable, like, say, government entities. That is the definition of risk, and we have very simple formulas for calculating the risk as well as the return at the portfolio level. One thing I want to ask you: from the portfolio perspective, out of the three vectors I showed you, which one is under our control? The weights, because everything else is fixed once we have decided which stocks to buy.
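The "very simple formulas" for portfolio-level return and risk are, in the usual textbook form, the weighted mean return and the square root of the quadratic form of the covariance matrix; a minimal NumPy sketch, with illustrative numbers rather than real market data:

```python
import numpy as np

w     = np.array([0.3, 0.5, 0.2])          # weights: the only vector we control
mu    = np.array([0.010, 0.015, 0.008])    # mean returns per stock (illustrative)
sigma = np.array([[0.040, 0.006, 0.002],
                  [0.006, 0.090, 0.010],
                  [0.002, 0.010, 0.030]])  # covariance of returns (illustrative)

portfolio_return = w @ mu                  # weighted mean return
portfolio_risk   = np.sqrt(w @ sigma @ w)  # deviation of the portfolio from its mean

print(portfolio_return, portfolio_risk)
```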
Now this brings up an interesting concept: if I choose different weights, how will my risk and return change? I would like to have minimum risk and maximum return; that is what I want. So what we do is a simple simulation on the stocks we have, with different weight vectors, and what we get is a graph of a particular shape. The reason is that the world is cruel: you cannot have returns without taking risk, so the graph bends that way, and as soon as you try to increase your return, the risk will definitely increase. Let us look at some practical examples. This is the risk-return plot for the three stocks we discussed earlier. The area marked with the red ellipse is the sweet spot we want to reach, because this is the spot where we have a decent return and low risk; as soon as we try to increase the return beyond it, the return increases only a little while the risk we take on increases more, so this region is considered the optimum. Now let me add one more stock. I added Apple, and we see that the risk-return graph became richer; we have more points in that sweet spot. So this looks simple, right? I'll keep on adding more stocks and keep on simulating. Let me add Bitcoin. Whoa, something very weird has happened. The reason is that Bitcoin had a lot of variance, and because of that, the covariance matrix it produced creates all these problems. So it is not as simple as just simulating over all the stocks; we definitely need some other solution. The two takeaways are that for constructing a portfolio we need two important things: which stocks to select, and how much of those stocks to allocate. These are the two determining problem statements we need to solve. I'll quickly brush through current portfolio-construction approaches. One is the efficient frontier; finance is a very old field, these things have been there for decades if not longer, written around the 1960s and 70s if not used back then. We create the same risk-return graph, and given the area of interest you want, you draw a tangent touching this frontier, and where the tangent touches it, that is your optimum portfolio. So that is the efficient frontier and the capital market line.
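Looping back to the random-weight simulation described a moment ago: a minimal Monte Carlo sketch of that risk-return cloud could look like the following, again with made-up means and covariances rather than real market data:

```python
import numpy as np

rng   = np.random.default_rng(0)
mu    = np.array([0.010, 0.015, 0.008])                    # illustrative mean returns
sigma = np.array([[0.040, 0.006, 0.002],
                  [0.006, 0.090, 0.010],
                  [0.002, 0.010, 0.030]])                  # illustrative covariance

risks, rets = [], []
for _ in range(5000):
    w = rng.random(len(mu))
    w /= w.sum()                                           # random weights summing to 1
    rets.append(w @ mu)
    risks.append(np.sqrt(w @ sigma @ w))

# Plotting risks on the x-axis and rets on the y-axis gives the bullet-shaped
# cloud whose upper-left edge is the efficient frontier (the "sweet spot").
```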
Okay, now we'll move on to some more interesting topics. I'll just browse through what we'll be covering: why deep portfolio, what autoencoders are, deep-portfolio stock selection, deep-portfolio latent features, and deep-portfolio rebalancing. Why deep portfolio? We have already seen that the relationships between stocks are not entirely linear in nature; we need something that can dig deeper into the data. We have a lot of data points, but we don't simply want to fit a line through them; we want to use deep learning to come up with a good representation of the market. Remember that chap who was facing problems when he had to work on a lot of stocks; he needs a condensed representation of the entire market to be able to analyze it better. We cannot say that we have 100,000 stocks and we will take the last 10 years of data for all of them; the matrix becomes too large, so we want a condensed representation. Now, for solving this problem I could have chosen other neural networks as well, but I chose autoencoders, and the reason is that they are simple and powerful; I feel autoencoders are the most efficient way to get patterns out in an unsupervised manner. For those of you who have attended the sessions of the past three days, this must have come up frequently, but I'll go through the explanation. Let's say you have some input data, and your output is that same input data, which we are trying to recreate: we pass it through layer one, layer two, layer three, we check whether it has been recreated well, and with the backpropagation algorithm, which everybody is aware of, we try to minimize the error between the input and the reconstruction. What is interesting is the middle part: once you have trained your autoencoder, if you chop off the decoder section, what you are left with is the encoder and the bottleneck. Using just the bottleneck's data you are able to recreate the input, which means all the intelligence of the input is stored there; otherwise it would not have been able to recreate it. So the bottleneck is a condensed representation of the intelligence in the input data set itself, and that is exactly what we are trying to leverage. Again: we have the input data, the latent features, and the recreated data with small errors here and there, but we are fine with that. Now, stock selection. As I mentioned, we have a lot of stocks at hand and we want to understand which stocks to choose for our portfolio construction. For representational purposes I have chosen stocks from the S&P 500, so 500 stocks, and I've taken the last 10 years of stock prices. What I do is simply create a matrix with stocks as columns and returns as rows, so each column represents a particular stock, for example Tata Steel: that column holds all of Tata Steel's values for the last 10 years. We run it through the autoencoder and try to minimize the reconstruction error. After the autoencoder has been trained, we look at two things: which stocks have been recreated nearly perfectly, and which have not been recreated well at all, the two ends of the spectrum. The intuition behind that is that the stocks which were recreated well are the ones that move the market; they are like the market makers, they represent the market better, and such stocks, in financial parlance, are the large-cap stocks, the ones that are not easy to push around with a few trading agencies. And then we have the bottom 50 stocks with the highest RMSE, the ones that were recreated least faithfully. So we are trying to test our hypothesis: whether our output of top 50 and bottom 50 really pulls that pattern out of the market data. What I've shown here is the risk-return graph for all the stocks before any selection.
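The stock-selection step described above could be sketched roughly as follows; the network sizes, epoch counts and random placeholder data are assumptions standing in for the speaker's setup, which isn't specified beyond "around 20 to 30 lines of TensorFlow":

```python
import numpy as np
import tensorflow as tf

# Daily returns with stocks as columns: one row per trading day (illustrative shapes/data).
num_days, num_stocks = 2500, 500
returns = (np.random.randn(num_days, num_stocks) * 0.01).astype("float32")

inp = tf.keras.Input(shape=(num_stocks,))
h   = tf.keras.layers.Dense(128, activation="relu")(inp)
z   = tf.keras.layers.Dense(32,  activation="relu")(h)      # bottleneck
h2  = tf.keras.layers.Dense(128, activation="relu")(z)
out = tf.keras.layers.Dense(num_stocks)(h2)                  # reconstructed day of returns

autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(returns, returns, epochs=50, batch_size=64, verbose=0)

# Reconstruction error per stock (per column): low RMSE ~ "market-like" large caps,
# high RMSE ~ idiosyncratic names the shared market structure cannot explain.
recon = autoencoder.predict(returns, verbose=0)
rmse  = np.sqrt(((returns - recon) ** 2).mean(axis=0))
order = np.argsort(rmse)
best_50, worst_50 = order[:50], order[-50:]
```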
An interesting thing to note here is that the risk starts from 0.5 and the return goes up to 0.06; just keep that in mind, this is the initial state. Now, what have we got out of the autoencoder? These are the stocks that had the least RMSE, which means they were recreated nearly perfectly, and something interesting has happened: if you look, the risk has decreased drastically, risk now starts around 0.1, but the return has also drastically decreased, which is exactly the kind of pattern we see with large-cap stocks, the stocks which represent the market better. Now let us look at the bottom 50 stocks. The interesting part is that, again, you cannot increase the return without taking on risk: you have to increase the risk to get any return out of these particular stocks. These are what are known as high-beta stocks, the small-cap and mid-cap names that see a 10 or 20 percent fall every now and then. So, using a simple autoencoder, around 20 to 30 lines of TensorFlow code and open-source data, we have been able to segregate these, and without any financial knowledge; as I said, this was just a hypothesis that could be tested afterwards. There is a paper that says that for constructing a portfolio you need both of these: the ideal way is to take a mix of high-performance stocks like these and a combination of stable stocks. So what we have done, using neural networks, is segregate the stocks we want to use for our portfolio construction. The second application I would like to highlight is the latent features. Here we have made a slight twist: in the previous problem statement we had taken stocks as columns and returns as rows; we now just reverse it, so we have returns in the columns and stocks in the rows, and then we run it through the same autoencoder, nothing fancy. The latent features we get here are something very unique, and I'll tell you the reason. The number of rows here is equal to the number of stocks (I think everybody agrees; is anybody having any issues there?), but the time-series data has been compressed to a smaller dimension, which means that all that daily movement of prices for 10 years, all that intelligence, has been compressed into a vector space of, say, 50, a much smaller space, yet one that is still able to explain those 10 years' worth of daily data. That is the power of autoencoders: they can find the patterns, the latent features, in the data. Now we can do some interesting things with this. Of course, we need to test it out: I can claim anything, but we need results. And no, this is not similar to principal components as such; in a way it might approximate them, but they are not principal components, and I would not be able to verify that without running PCA and checking whether the vectors are similar. We are trying to arrive at the same kind of thing, but PCA is linear in nature, so it would have the same pitfalls as the correlation analysis. So now let us give ourselves a problem statement, because we need one to work with: what are the stocks that are closest to Nvidia, that are like Nvidia? This is a common problem faced by people outside the financial industry as well: tell me all the users that are similar to this user, all the products similar to this product, and so on.
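A sketch of this transposed setup and of the "closest to Nvidia" lookup that follows; the shapes, layer sizes, placeholder data and the choice of Pearson correlation on the latent vectors are illustrative assumptions rather than the speaker's exact code:

```python
import numpy as np
import tensorflow as tf

# Transposed matrix: one row per stock, one column per trading day, so the
# encoder compresses each stock's whole price history into a small latent vector.
num_days, num_stocks, latent_dim = 2500, 500, 50
stock_series = (np.random.randn(num_stocks, num_days) * 0.01).astype("float32")

inp = tf.keras.Input(shape=(num_days,))
h   = tf.keras.layers.Dense(256, activation="relu")(inp)
z   = tf.keras.layers.Dense(latent_dim, activation="relu")(h)   # 50-d code per stock
h2  = tf.keras.layers.Dense(256, activation="relu")(z)
out = tf.keras.layers.Dense(num_days)(h2)

autoencoder = tf.keras.Model(inp, out)
encoder     = tf.keras.Model(inp, z)                # keep only the compressing half
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(stock_series, stock_series, epochs=50, batch_size=32, verbose=0)

embeddings = encoder.predict(stock_series, verbose=0)            # shape (500, 50)

def most_similar(target_idx, embeddings, k=5):
    """Rank stocks by Pearson correlation of their latent vectors with the target's."""
    sims = np.array([np.corrcoef(embeddings[target_idx], e)[0, 1] for e in embeddings])
    order = np.argsort(-sims)
    return [i for i in order if i != target_idx][:k]

# Usage (hypothetical): look up the index of NVDA in your ticker list first.
# print(most_similar(tickers.index("NVDA"), embeddings))
```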
So what we'll do is try to find the most similar stocks, first with our original time-series data, which was not compressed. For Nvidia, the least correlated stocks we found were Expedia, Walt Disney, Universal Health Services and so on; that is fine, these seem to have no relation to Nvidia, so we can accept them as least correlated. But when we find the most correlated stocks to Nvidia, we again get names like Xcel Energy and Stryker, which don't seem to be related at all; there is something wrong with this. Now let us do the same exercise with the latent-feature matrix. The least correlated: DuPont, Verizon, Coca-Cola, Brighthouse; again, we are good there. The most correlated: Facebook, Broadcom, Advanced Micro Devices, Freeport-McMoRan, Netflix; I don't need to tell you that these do seem related to Nvidia. So you can imagine the power: we have not written very fancy code, we have used open-source data, and we are able to pull these patterns out. There is one more puzzle left to be resolved: remember the correlation example, where you expected Bitcoin to be more correlated but we found the value was 0.2? Let us check that too. With the raw data, the correlation between Nvidia and Electronic Arts came out to 0.118, but with the deep latent features the correlation came out to 0.99; I used the latent-feature matrix for finding that correlation. So if you go back to the 0.115 we saw earlier: although visually the two looked very similar, the raw correlation suggested otherwise, and when we did this latent-feature analysis that particular puzzle was also solved. The point is that this use of neural networks and autoencoders is powerful because we don't require a lot of domain knowledge; the autoencoder itself comes up with hidden latent features that can be used. Now, you might say: this is all fine, but I still have stocks in my portfolio which are tanking while these are hitting all-time highs; what do I do now? I can't just bear the loss and buy up all the new stocks; that is not possible. So we have to do portfolio shuffling, portfolio optimization, and that is a very challenging topic, because let's say you have three stocks with you and I tell you that, based on the previous analysis, these are the three new stocks you should be invested in, regardless of what you hold now. It is not possible for you to absorb all the loss and buy all the new stocks, and it's possible the new stocks themselves will incur losses; you don't know. So you have to convert this into an optimization problem which can be broken down into factors. The first factor we want to touch is fewer transactions: if you have three stocks, you sold all of them and bought the three new stocks, the total number of transactions is six, and we would like to lessen that.
The second factor is a smaller number of total stocks: another option would be to keep the three stocks you have and also buy the three new ones, but that would increase the total number of stocks to six, and we want fewer total stocks. And then, needless to say, maximize returns and minimize risk. So what do we do? I've written a bit of TensorFlow code in which we use a loss function built from exactly these terms: the portfolio return has to be maximized, so it enters with a negative sign; the deviation, the risk, has to be minimized; the number of transactions has to be minimized; and the number of holdings has to be minimized. That becomes the loss function, and you can use any optimizer. You take the returns, which include the returns of the original stocks plus the new stocks, initialize the weight vector, pass it to the optimizer, and you get a nice updated weights vector that has taken all the different factors into account. You will see that the application of neural networks to the world of finance is more on the creative side: it is not that you just use an LSTM or an RNN and the issue is solved; it has to be more creative in nature, based on your requirements you yourself come up with the loss functions you need, and you can add more loss terms if you want. Say I don't want people to invest more in the technology sector: that can be a constraint you add here. As for further advancements, one is the market map, as I mentioned: we want to condense all that information into a smaller data set; this is again a very challenging problem, we are writing a paper on it and will be publishing it soon. The second is the metric to be used for portfolio construction: currently we are using return and risk, but can we use something else, some other regulatory parameters? These are the things we are currently working on for this particular presentation. Thank you.
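The rebalancing objective described above could be sketched along these lines; the penalty weights, the softmax re-parameterisation and the soft proxy for "number of holdings" are illustrative choices of mine (a hard count of holdings is not differentiable), not the speaker's exact formulation:

```python
import tensorflow as tf

# 'returns' covers the union of current holdings and candidate new stocks;
# 'w_old' is the current allocation (three old names held, three new ones not).
returns = tf.random.normal([250, 6]) * 0.01            # (days x stocks), illustrative
w_old   = tf.constant([1/3, 1/3, 1/3, 0.0, 0.0, 0.0])
logits  = tf.Variable(tf.zeros(6))                      # free parameters to learn

lam_risk, lam_turnover, lam_count = 1.0, 0.1, 0.05      # illustrative trade-off knobs
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

for step in range(500):
    with tf.GradientTape() as tape:
        w    = tf.nn.softmax(logits)                    # weights stay positive, sum to 1
        port = returns @ w[:, None]                     # daily portfolio returns
        ret  = tf.reduce_mean(port)                     # maximize -> enters with minus sign
        risk = tf.math.reduce_std(port)                 # deviation from the mean, minimize
        turnover = tf.reduce_sum(tf.abs(w - w_old))     # proxy for "fewer transactions"
        holdings = tf.reduce_sum(tf.tanh(50.0 * w))     # soft proxy for "fewer total stocks"
        loss = -ret + lam_risk * risk + lam_turnover * turnover + lam_count * holdings
    grads = tape.gradient(loss, [logits])
    opt.apply_gradients(zip(grads, [logits]))

new_weights = tf.nn.softmax(logits).numpy()             # rebalanced allocation
```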
Questions? Hello, yeah. When we did the correlation analysis between Bitcoin and Nvidia there was a positive correlation, whereas the latent feature clearly pointed out that it is much more negative; how can we put that in business language, and what is being captured here? I have one hypothesis; maybe you can tell me whether it is right or wrong. That is partly why I'm attending this conference: the explainability part. To be very frank, this is a problem we all grapple with, at least in financial institutions: we need to explain every single prediction, and that is why these black-box models are not liked. One way you could think of it is that the correlation only counts how often the values move together at the same time; when the similar thing happens later, with a lag, it does not take that into account; it is more like a time-series effect, this rises and then that rises, which is not necessarily captured in the correlation matrix; maybe that is what is captured in the deep learning. Exactly, and that is why autoencoders, or rather neural networks, are powerful: the nonlinear structure is also captured. And note that I did not come to this conference for predicting stock prices; if somebody wants to go into finance, that is not a problem statement to be solved; it cannot be solved, it is stochastic in nature. The problem statement we should solve is how we can help analysts, or, if you want to become a stock trader, what will help you take decisions. Those are the problems to solve. What I would want to know is: if, say, Tata has risen by 20 percent, can I write some algorithm which will tell me, with some probability, which other stocks will follow suit? That is a better problem statement to solve than predicting what will happen to the Tata stock price itself; that cannot be solved, you would have to add up all the intelligence of the world. I had a question regarding the optimizer used at the end: it could just as easily have said kill all the previous three stocks and buy the three new stocks. Okay, so I'll help answer your question. What are the four factors we are playing with here? Fewer transactions, fewer total stocks, maximize return, minimize risk. Although at the onset it might seem that it would just remove the three old stocks and add the three new ones, I also want to maximize the return, so it is possible that you have a holding with a good risk-return profile and the optimizer will force you to keep it. But it cannot ensure fewer transactions; it might still tell me to sell those three stocks and buy the three new ones. Yes, and that is why we have four parameters; these are the complications, the combinations of what we can do, and we want the optimizer itself to resolve them. Hi, my name is Nandu. Have you, or Morgan Stanley, done any work on momentum trading or sentiment trading, or anything of that sort? I'll not be able to comment on that side, but momentum is a very old concept: after the Black-Scholes-Merton era there was the Fama-French three-factor model, and when the fourth factor came up, that was the momentum factor. So momentum is very old, and firms will definitely be doing that. The sentiment side is something that everybody is trying to do, to be very frank, and we don't have direct access to traders, so I'll not comment on that as such, but it is definitely a problem everybody in the financial world is trying to solve, and it is not easy, by the way. Hi, maybe it's too early to ask, but have you made any money using this so far? Define money! I have joined the organization, so I'm making money that way, but from direct trading, no; we are not allowed to trade. So you have not deployed this model? No, this has not been deployed in our systems. As you can see, we cannot deploy it as it is; that is why we have these next-to-do actions. Unless we arrive at the most efficient model we cannot deploy it in production. This was a very simple intuition for how we can approach the problem; it requires a lot of work, a lot of research, a lot of backtesting, at least six months to a year, because the cycle of finance is very long: things that happen in February will only happen again the next February, and so on. Are you aware of any other organization that has actually deployed something like this using neural
networks? Not that I recall, from my friends in other organizations. As this gentleman asked, black-box models are not favored; we need to explain why we should trade on something, so unless that particular aspect is solved I don't think this will see the light of day very soon. Hello Anand, mostly on the reconstruction side: did you check, for the autoencoder, how well these are reconstructed, firstly, and secondly, what kind of autoencoder did you use for the entire process? On testing the autoencoders: we did not have any labelled data, so I can only visualize the results and see whether they make sense, and that is why I checked the risk-return graph; the risk-return graph is proof enough that the results I got through the autoencoder are sensible. Does that answer your question? That applies to the last part, getting the frontier and so on, but before that, did you do anything on the reconstruction side, how good the reconstructions are, or is it just capturing some of the important information? We are not relying on the reconstructions themselves, because we don't want anything to come out of the reconstruction; in the first problem statement we just want to understand which stocks got reconstructed well and which did not, so we let the model itself learn that it is easier to reconstruct this stock than that one. He'll be available offline, you can just catch him offline. Great response to your talk; let's have a round of applause for Anand. We have the evening beverage break now, so those of you who want to step out can step out, and we will be continuing the Q&A because we have time. Please join us at 4:35 for the BOF which will be happening over here. Thanks. You don't want to continue Q&A? We'll just do it offline then; those of you who have questions, just come to the front.