I think many of you will already have heard about Anand, but I thought I would tell you a little bit about his background. Anand did his B.Tech at IIT Madras and then his PhD at Stanford with Jeff Ullman. During his PhD, Anand did some fantastic research. You will have heard of Anand as someone who starts companies and so forth. What you probably haven't heard is that he did such excellent work during his PhD, in a short span, that three of his papers have won test-of-time awards. That is, ten years after a conference, people look at the papers published in that conference and ask which of them have had the most impact on the field. Three of his papers have won that award, at three different conferences: VLDB, SIGMOD and [unclear].
All this he was doing in his spare time, because most of his time went into starting a company called Junglee. The younger ones here may not have heard of it, but the older ones would have. He sold it to Amazon and then worked at Amazon. Then he ran a venture capital firm. And then he started a company called Kosmix, which is probably one of the coolest companies you've ever heard of, because they pioneered things which you see appearing on Google today. On Google, when you search, you see the usual links, but you also see these fact panels on the right, which are increasingly popular. Kosmix had that about 11 or 12 years ago, more than 10 years back. Kosmix pioneered this. They went on to do other work on product search, and they sold the company to Walmart. Now Anand is a full-time venture capitalist, and he has some amazing insights into all of the stuff which has happened and is going to happen. So it's a pleasure to welcome him back. And by the way, for those of you who are PhD students: Anand dropped out of his PhD but went back and finished it later. So it's quite a nice comeback to research. But before Anand takes over, I wanted to say thank you.
Sorry to take your time, but I just couldn't resist the temptation to say something about Anand. So, as was mentioned, about dropping out of the PhD: a fairly good number of people, at Stanford particularly, have established a new terminology called ABT. It stands for "all but thesis". They do a substantial amount of the PhD but drop out because of the other lures of the Bay Area, one of which is of course starting up a company. That is the higher kind of thing. The others are good offers from major companies and so on. So there are many such people, including some of our own gold medalists by the way, who never finished their PhD. Anand is made of sterner stuff. He did finish.
I recall the days when I set up the IT business incubator, where a whole lot of youngsters attempted to start their companies. Several succeeded, some did not. But there was a big debate at that point in time: that to become an entrepreneur you need completely different kinds of skills, and technical knowledge is not necessarily an important element of what you do. That has been proven to be nonsense many times over. In fact, when Rukh Swaaz Patil had come here, he mentioned how he was not only a great researcher but a teacher, and when he set up his company an extraordinary amount of cutting-edge technology was required in whatever he was doing. The same has been proven by Anand, time and again. He has been using cutting-edge technology in whatever he does, and therefore cutting-edge technology, high-end research and entrepreneurship are not at all mutually exclusive. In fact they are completely inclusive, and that is the message that I wanted to give. So when you interact with Anand, bug him not only on the research problems, but bug him on how he has successfully translated technology innovations into business. That is one thing.
The second important point, which I learned from the interaction that I had at VLDB, is that the field of research is actually fragmented by the silos that we have built around ourselves by giving different nomenclatures such as databases, networking, artificial intelligence and whatever. In real life, when you have challenging problems to solve, all these silos need to merge, with competencies across these silos necessarily required in actually coming up with an innovative solution. This is another thing that he has proven himself, and I think he articulates it well when he describes the data-driven revolution that is happening. So please listen to him very carefully and bug him as much as you can before he runs away, because his time is important. I hope that we will all benefit tremendously by getting a different view of life when he talks about the data-driven revolution. I will not steal his thunder anymore. Unfortunately, I have to rush back to some urgent work as usual, but I wanted to say these few lines. Thank you so much, and thank you very much, Anand, for sparing the time to come.
Thanks, Sudarshan and Professor Phatak, for the kind words of introduction. I thought this would be an interesting segue: when I arrived here, I drove in from Bandra, and I texted Sudarshan, who kindly invited me, saying I would be here at 11:15 at the campus. And lo and behold, I was here exactly at 11:15. Sudarshan commented on this and said, how come you could predict exactly when you would arrive, given Bombay traffic and so on? The secret is simple: I have learned to stop worrying and trust Google. Google Maps, as we know, uses a lot of data to make predictions, knowing traffic conditions well, and often does much better than a local in telling you exactly when you are going to reach. And that is actually a microcosm of the theme of this talk.
Sometimes data is better than humans, and humans and data need to learn to work together and love each other. So that is kind of the broader theme of this talk. This is a talk that I put together for a keynote at the VLDB conference in Delhi, and it is directed at an audience of researchers. I haven't had the time to actually modify the talk, since it just happened a few days ago, so apologies for that. I will try to make this more relevant to students as I go along, whenever possible; otherwise please bear with me.
So the title of the talk is Data-Driven Disruption. What do I mean by data-driven disruption? There are lots of examples of this phenomenon all around us. By disruption I mean a sudden or abrupt change in the status quo of something, usually for the better. There are plenty of interesting examples, like online recommendations, or AlphaGo beating the Go champion, or self-driving cars, or bots, or intelligent assistants. Things of this nature are sudden, abrupt changes in the status quo, usually for the better, and behind all these innovations that have happened in the last 10 years or so there is data. Data is what is driving all these massive disruptions that are happening around us, and we see this happening quite regularly these days.
When you see something of this nature, you want to step back and ask why it is happening. Of course, the reason is that there is a lot more data in the world now. The amount of data in the world is growing at some insane rate. It is growing 50x over a decade, and that is probably an underestimate; it is probably growing faster. These numbers are kind of hard to comprehend. What is 40 zettabytes? It is hard for us to imagine what that means, so I will use a little factoid that might help
bring this to light. The factoid is that, for the first time, people who work with data have more stuff to study than astronomers. There are more bits of data in the digital universe now than there are stars in the physical universe. So if you want to study lots of things, be a data scientist rather than an astronomer.
So there is tons of data, and where is this data coming from? It is coming from many different places. A lot of it is coming from our camera phones and the various sensors that are embedded in our mobile devices. A lot of it is streaming video, things like Amazon Prime and Netflix. A lot of it is coming from satellites in the sky, lots of digital imagery. And tons of it is coming from things like the Internet of Things, embedded in machinery and so forth. Overall we are creating 1.7 megabytes of data per minute per person. That is a lot of data, and it is orders of magnitude more than what we were creating just a decade ago. So there is lots of it, and it is growing very, very fast.
When you have tons of data, what do we do? We use this data to do interesting things. We have created a new class of applications over the last 10 or 15 years that I will call data-driven applications. A data-driven application is just an application that uses data to do something interesting; it creates value using data.
So here is the outline of this talk. I am going to talk about the evolution of these data-driven applications over the last two decades or so, and we will see that there have been five generations of them. Then we will get to the main part of the talk, where I will talk about lessons and opportunities. As Sudarshan mentioned, my perspective is not the usual perspective of most people, because I straddle this thing where I am not good at any one particular thing, so I am sort of at the intersection of
startups, of venture capital and of academia. So I am trying to bring a slightly different perspective that encompasses all these different points of view. The main people that I am trying to address are those of you who want to go and start companies. I hope many of you will become entrepreneurs, and hopefully there is something in here that gives you some ideas or inspires you to go do that. The other kind of person I am addressing is those among you who want to do interesting and relevant research that is motivated by the real-life problems we face in Silicon Valley these days, at the cutting edge of startups. These are the two groups of people that I am addressing here, and I will try to mention which one is which, but usually these things overlap. The key theme that we will see as we go through the talk is this idea of disruption versus optimization, so just stay tuned for that.
Now, without further ado, let's get to the first part of the talk, which is the evolution of data-driven applications. If you think about how data-driven applications have evolved over the last 15 or 20 years, they have evolved with the kinds of data that have been available. You look at the most valuable data that is available and then build applications to take advantage of it.
Let's start with the first generation. Back about 20 to 25 years ago, companies started building database systems, and they used them to automate business functions like sales and inventory and payroll. This is using classic SQL queries, which I am sure many of you have learned, and which I saw some examples of earlier today as well. That was the beginning of companies collecting data in databases just to automate routine business functions. But as this data accumulated in these databases, some smart people realized that you can use this data
not just to automate sales or inventory or payroll, but actually to create competitive advantages for the business itself. That's the origin of the earliest kind of data-driven apps, which are things like market basket analysis. The famous apocryphal example of diapers and beer, which I am sure some of you have heard of, originated around this time. If you don't know what that is, ask me at the end of this talk. There are lots of examples like that, but I'd say the most interesting example from the first generation, where companies were using data that was gathered for operational reasons to build interesting apps, is recommendation systems. Amazon launched this in the mid-90s, and most of us use this or something like it. Amazon uses data that they've gathered from transactions, for entirely other reasons, to recommend interesting things for people to buy. That's perhaps the most impactful application of the first generation of data-driven applications.
So that's the first generation. Now, as time passed, the World Wide Web was growing. We are now in the mid to late 90s, and as the World Wide Web started growing, there was more and more data on the web, in places like Wikipedia and so on. So we can actually build interesting apps using data that's on the web, as opposed to just data that's in private databases, and many interesting apps were built using just the public data that's available on the web. My first company, Junglee, was an example of this: we built a comparison shopping engine using publicly available data on the websites of shopping companies. But I'd say perhaps the most impactful second-generation app... any guesses, anyone? No? Using entirely public web data, who built the most interesting app? Any ideas? Google, that's right. So Google search is probably the most
exciting, or the most impactful, example of using public data to build a data-driven app. This is something we all use every day, so it's hugely impactful. They've harnessed all the world's public data and made it useful.
Now, as time passed, we had social media: Facebook and Twitter and Pinterest and Instagram and so on. So now we have a new class of data, which is social data. This is neither private data nor public data; it's semi-public data. It's accessible like public data, but it has usage restrictions and so on. This is a new kind of data that became available in the first decade of this century, and once you had this kind of data, people started building apps using social data as well. The most interesting apps were built by the social media companies themselves: things like friend recommendations, or Twitter's Moments, which builds a newspaper-style experience using the latest tweets, or Facebook's feed, or advertising. These are all examples of apps that are built using social data. There have also been some examples of third-party companies that use this kind of data. For example, brands use a company called Crimson Hexagon to track what people are saying about their products on social media, and they can use that to react and so on. So that's the third generation.
The fourth generation basically combines all these things together. People were building apps using public data, using private data and using semi-public data, and then some people started combining different kinds of data to build even more interesting apps. Here's an example: this is a company called Paysa. Many of the examples I'm going to use going forward are companies from Silicon Valley, in many cases companies where I'm involved in some way. And the interesting thing is that many of these companies have some
kind of IIT connection. Many of them have IITians as founders; some of them have IIT Bombay people as founders, and some are from other IITs. Paysa is one example, and I believe one of the founders is from IIT Madras in this case. I'll tell you when there's an IIT Bombay founder as well.
This company addresses the problem of "am I being paid fairly?" I am at this company, I have all these qualifications; are they paying me what I should be getting, or should I be making more? That is what Paysa does. It builds this nice distribution that tells you what people with your qualifications are making. Maybe some of you can use this when you're doing your campus placements, to evaluate your offer and see whether it makes sense and is in the right range. They'll tell you the base salary, bonus, equity and signing bonus for people with certain kinds of qualifications and skills working at certain jobs. They do this by combining lots of data: data about salaries, about people, about companies and about jobs. And where does this data come from? It comes from both private and public sources. There are public sources like the web, social media, and local and national government databases, but there is also private data that comes from the companies, from recruiters and from partnerships. They combine public and private data and can then do these salary predictions.
Now, the fifth generation is the one that's happening right now. It's the latest generation of apps, and it's driven by a new class of data that's becoming more and more widespread: training data. All of you have heard about AI, and about using AI to accomplish all kinds of things. Behind all this, of course, is machine learning, and machine learning requires vast amounts of training data, especially if you're using things like deep learning to train models. Because of this, there are more and more companies that are generating large amounts of
training data just to train models, and this is a new class of proprietary data used in the next generation of data-driven apps. Much of this training data is actually obtained from humans who tag examples and so on, and a lot of companies use Amazon Mechanical Turk or other crowdsourcing services to do this. But there are also other ways of creating training data. For example, you could just drive a car around, and then use that data to train self-driving cars. So you can collect a lot of training data that way as well.
Just to summarize, the fifth generation consists of all the data sources we saw in the previous four generations, plus training data. That gives you the fifth-generation app, which is the generation that we're currently in. What are some examples? ImageNet was built by asking humans to tag images using Amazon Mechanical Turk, and ImageNet was hugely influential and powerful, because deep learning as we know it originated using the data that was available on ImageNet. So we owe deep learning to ImageNet. Then there are the self-driving cars, which we've already seen. And this one is Atari Breakout: people trained a machine learning model to play the Atari Breakout game using a technique called deep reinforcement learning, and there the data was obtained by actually simulating the game many, many times. So that's another way of collecting training data: through simulation. These are all different ways of collecting lots of training data.
So that was the end of my part of the talk on the evolution of data-driven applications, and the summary is that data-driven applications evolve by following the most interesting data. We started with private data, then we had public data, then social data, then combinations of these, and finally we have training data. So we've had five generations so far. Now, I'm sure somebody will ask me to predict what the sixth generation is going to be, and I'll tell you up front
I have no clue, because that depends on the next kind of data that's going to become available. When I give this talk three years later, I'll be able to tell you what the next generation of data-driven apps is going to be. So that was the end of that piece of the talk.
Now we'll move on to the section on lessons and opportunities. These come from my perspective in Silicon Valley, and hopefully they are relevant to people in the audience who want to start companies. I've organized this into five themes. I don't know whether I'll have time to go into all of them, but I'll make the slides available so you can see them later. I'll at least do the first couple. The first theme is that I wanted to look at the startup and investment landscape around data-driven companies, just to see where entrepreneurs are starting companies and where investment dollars are flowing. That's not necessarily the best gauge, and I'm not saying that you should start companies where other people are starting companies, but at least it's a market signal, and it's helpful to know. Next I'm going to talk about the theme of disruption versus optimization, which is going to be the main thing. And then perhaps we'll look at human-machine collaboration and the rise of the cyborg, depending on time.
So let's start with the startup and investment landscape. This is the set of major big data or data-driven companies that have raised venture capital in Silicon Valley. You can see there are plenty of them, so many that they don't fit on the slide. The point here is that this is a very active area. There are lots of people starting companies in the big data space, and there are a lot of venture capitalists funding them, so this is a very good area to be starting a company in if you want to. But if you zoom out of this, the important thing to realize is that there are three broad categories, and those
three broad categories are infrastructure, analytics and intelligent applications. Those are the three classes to remember.
Infrastructure consists of things that are accessed primarily by developers. Things like Hadoop and Spark and Storm and Cassandra and Amazon Web Services are all infrastructure. The primary takeaway on infrastructure is that we have good infrastructure right now to build interesting applications. This has evolved over the last 10 years or so into a good infrastructure stack. A lot of people are migrating from Hadoop to Spark right now, or to Storm, and there are enough key-value stores out there. So I think there's good infrastructure to create interesting data-driven apps, and this is not where the primary area of interest is for either entrepreneurs or investors right now. Because we have good infrastructure, the entrepreneurial ecosystem has moved beyond infrastructure into applications at this point. So if you're going to start a company, don't do infrastructure, do applications.
If you're a researcher, by all means you should do infrastructure. There are always hardware trends that are changing, and when hardware trends change, it's possible that you can build new layers of infrastructure. This is what Ion Stoica, who built Spark, was saying at VLDB as well: because of hardware trend changes, you can do new things. So it's possible that there's a new layer of infrastructure that can come around. As a researcher, I think infrastructure does make sense, but not so much as an entrepreneur, not right now. These things go in waves. Sometimes the interesting thing to do is infrastructure, sometimes the interesting thing to do is applications. Right now, as an entrepreneur, all the action seems to be around applications.
The second area is analytics. Analytics basically is data exploration, either for data scientists or for end business
users. Here are a few examples of analytics companies, which I've grouped into two groups. On the left-hand side of the vertical bar are companies like Tableau and SAS and H2O. These are what I call horizontal analytics companies: companies that build horizontal analytics platforms that are useful for pretty much every industry vertical and pretty much every business function. SAS is another example of that kind, and so on. On the right-hand side of the vertical bar are companies like Palantir and Ayasdi and Cuberon. These build analytics platforms also, but they are extremely vertically focused. They are focused on specific verticals or specific business functions: Palantir is focused on defense and security, Ayasdi on pharma, Cuberon on retail and e-commerce. The interesting trend that we are seeing right now in the VC and startup space is more and more companies on the right-hand side, companies that are focused on building analytics platforms for specific industry verticals and specific business functions. That is largely because there are enough of the broadly applicable companies right now, and people are still digesting all the analytics tools that are available in the broad segment.
I'll give you an example of what I mean when I say vertical analytics. This is Cuberon, which once again has an IIT founder. A lot of companies use things like Google Analytics to track how their website or their mobile app is performing and how people are interacting with it, and then you get nice pretty charts like this. This chart tracks something called the e-commerce conversion rate, which is a very important metric for an e-commerce company. It measures, of all the people who come to the website, how many of them actually end up buying something. You can imagine that is a very important metric for an e-commerce company, and you can easily track it using things like
Google Analytics or Omniture or Flurry. But what you can't do, when you see a spike like that, is know why it happened. Why did my conversion rate go up, or worse, why did it suddenly go down? That is something you can't really tell just by tracking. In fact, a lot of time at e-commerce companies is spent trying to understand why changes happen. This is what's called root cause analysis, and there are questions like "why are sign-ups changing?" or "why did this test perform or not perform?" Product managers at most end-user-facing companies spend most of their careers just analyzing questions like this.
That's the kind of thing that Cuberon addresses. It takes data from Google Analytics, which is private data, and combines it with public data: demographics from the US Census, weather data, maps and so on. Then it builds very high-dimensional data cubes. I'm sure some of you have studied data cubes in your classes. So what Cuberon does is build these very high-dimensional data cubes, 30- or 40-dimensional data cubes, and then scan them to find anomalous sub-cubes within the data cubes. It turns out that the anomalous sub-cubes often have great explanatory power, and they can often explain why certain things happen. Why did the conversion rate go down? For example, it may be that conversion rates went down because today there was suddenly a lot of traffic from a certain part of the world that you don't usually see, or maybe today there was some kind of bug on the website. Whenever there's an anomalous sub-cube, there's an anomaly, something new that happened, and these anomalous sub-cubes often identify those problems. The interesting thing is that it's very hard to do this in a general-purpose way, because finding anomalous sub-cubes in high-dimensional cubes is a hard problem, because these cubes, as you know, are very
large and you can't exhaustively scan them. What Cuberon can do is use knowledge about the domain to spot the anomalous sub-cubes very quickly. So that was an example from the vertical analytics space.
The next category is what I call intelligent applications, and this is where most of the action is right now. Here are just some of the examples. There are intelligent applications in pretty much everything you can think of: sales and marketing, customer service, HR, security, government, finance, life sciences. In pretty much every industry or every business function you can think of, there is a data-driven intelligent app being built. I'll just give you one example. This is a company called Descartes Labs, and they use satellite data, which we saw earlier. They use high-resolution satellite data, they combine that with weather data, and what they do is predict crop yields. In this case it's corn yields: they're predicting the total corn yield of the United States in the upcoming year. The interesting thing is that they can do it not just nationally but on a county-by-county basis. They can tell you what the corn yield for a given county is going to be, using satellite imagery and weather data, and they've done some analysis that shows they can do this prediction better than the U.S.
government can. The government does it by surveying farmers, and nobody else has the county-level forecast which they have. So this is a very specific app that's tuned for a very specific industry vertical, using certain kinds of data, and you see a lot of these kinds of apps being built.
So what are the overall trends? The overall trends are, first, that infrastructure is available and solid, and there's a major transition from Hadoop to Spark. Second, there's a lot of investment focus and a lot of entrepreneurial activity in the vertical analytics space, where you pick a particular industry vertical or business segment and build an analytics platform tuned just for that space. The third is the idea of intelligent apps, where once again you pick a vertical or industry segment and then you create an intelligent app, like Descartes Labs. There is lots of major opportunity, and lots of investment dollars are flowing here, so if you want to start companies, these are the areas to focus on.
Trooly is IITB. If I have time, I actually have them as an example. Trooly actually has two IITB founders, I think, Anish Das Sarma and Nilesh Dalvi. So yes. And DocsApp is IIT Madras. So that was the end of the investment landscape. The next part I'm going to talk about is this idea of disruption versus optimization. Let's first look at optimization and then look at disruption.
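The anomalous sub-cube idea from the root cause analysis example earlier can be made concrete with a toy sketch. This is an illustrative assumption, not the method of any company named in the talk: the session records, the dimension names (`country`, `device`), the thresholds and the helper names `subcube_stats`, `anomalous_subcubes` and `make_sessions` are all hypothetical. It also scans every sub-cube exhaustively, which is only feasible here because there are two dimensions; the talk's point is that real cubes with 30 or 40 dimensions cannot be scanned this way and need domain knowledge to prune the search.

```python
from collections import defaultdict
from itertools import combinations


def subcube_stats(rows, dims, max_order=2):
    """Count visits and conversions for every sub-cube fixing up to max_order dimensions."""
    stats = defaultdict(lambda: [0, 0])  # key -> [visits, conversions]
    for row in rows:
        for order in range(1, max_order + 1):
            for combo in combinations(dims, order):
                key = tuple((dim, row[dim]) for dim in combo)
                stats[key][0] += 1
                stats[key][1] += row["converted"]
    return stats


def anomalous_subcubes(baseline, today, dims, min_visits=20, threshold=0.15):
    """Return sub-cubes whose conversion rate moved by at least `threshold` vs. the baseline."""
    base = subcube_stats(baseline, dims)
    current = subcube_stats(today, dims)
    findings = []
    for key, (visits, conversions) in current.items():
        base_visits, base_conversions = base.get(key, (0, 0))
        if visits < min_visits or base_visits < min_visits:
            continue  # too little traffic to compare rates meaningfully
        delta = conversions / visits - base_conversions / base_visits
        if abs(delta) >= threshold:
            findings.append((key, delta, visits))
    # Largest absolute change first
    return sorted(findings, key=lambda f: abs(f[1]), reverse=True)


def make_sessions(broken=None):
    """Hypothetical data: 100 sessions per (country, device) cell, 30% converting,
    except the `broken` cell, where only 5% convert (say, a bug on that page)."""
    rows = []
    for country in ("IN", "US"):
        for device in ("mobile", "desktop"):
            buyers = 5 if broken == (country, device) else 30
            for i in range(100):
                rows.append({"country": country, "device": device,
                             "converted": 1 if i < buyers else 0})
    return rows


baseline = make_sessions()
today = make_sessions(broken=("IN", "mobile"))
for key, delta, visits in anomalous_subcubes(baseline, today, ("country", "device")):
    print(key, round(delta, 3), visits)
```

On this toy data the overall conversion rate and the one-dimensional slices (all of India, all of mobile) move by less than the threshold, and only the two-dimensional India-and-mobile sub-cube is flagged. That is the sense in which an anomalous sub-cube can explain a change that the top-line chart only shows.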
Now, the thing that's very hot at lots of companies these days is this idea of data lakes. The data lake is just the next-generation data warehouse. Companies gather all the data that they have inside the company and put it into this data lake so that they can do interesting analysis on the data. They hire data scientists, smart people like you, to analyze this data and build very interesting models. Maybe they use a company like Cuberon or Descartes Labs to build these models. Once they build these models, they can actually optimize the business. When I say optimize the business, I mean they can lower costs, reduce risk, improve customer satisfaction or improve quality. All of these are things you can do by analyzing the data about the business and then changing how the business functions. This is something that happens on a routine basis, and hopefully when it happens everybody's happy and the company's stock price keeps going up. That is the overall hope of many companies that are in this optimization space, and to some extent it works, and there are results.
But often what happens is that, instead of this optimization, a different kind of thing happens, and that is disruption. Amazon comes along and disrupts Walmart. This is despite Walmart actually investing in building a data lake and hiring the analysts and doing all the optimization. It's not that they are not doing it; they are doing it, but they still get disrupted by an Amazon. Netflix comes along and disrupts cable, this is in the US, or Uber comes along and disrupts taxis. So everybody is trying to optimize, but suddenly they get disrupted. Why does this kind of disruption happen? It's hard to answer the question in full generality, but I can give you
some ideas around it. One of the key things is to think about established companies and established industries and see how they think. If you take any big, bureaucratic, established company and there is an important decision to be made, what they do is gather all the important stakeholders into a room and spend a lot of time analyzing the situation, and then they use a very scientific decision method called the HiPPO. Do you know what the HiPPO is? HiPPO stands for highest paid person's opinion. This is the decision method used by most large organizations, not just companies, and it is one of the reasons companies are sometimes blind to change: they're ignoring the data and going by the highest paid person's opinion. Now of course there's lots of data available, so companies can get better; they should not just listen to opinion but look at data in making decisions. But there's a little gotcha there. Clearly these companies are not stupid, so they are hiring data scientists. So what happens when these large bureaucratic companies hire data scientists? They hire them as advisors rather than as decision makers, and so the highest paid person just tells the data scientist to go make a PowerPoint presentation justifying the decision using data. That's really what happens, so you get data-driven HiPPO.
So put it in a historical context, where a lot of people trust the data, right, so you think it's garbage in, garbage out. So what's the change today? Do you think you can trust data more?
I think there are many cases where I can trust data. I mean, the example that I gave right at the beginning of this talk, about coming here on time because I trusted Google, is a good example. About two or three years ago I would
actually, when I went to a new city, I wouldn't trust Google; I'd ask a local person for directions, especially in India. I didn't think Google had good data in India. And then of course they would tell me all kinds of things that would be completely wrong, and I'd always get to meetings incredibly late. Then I started trusting Google and that stopped happening. So I think data is getting better than human opinion in many cases, especially in a world that's changing very rapidly. It's hard for humans to keep up with everything, whereas data can. I think that's the difference. So that's one: the first reason is that many companies hire data scientists as advisors and not as decision makers, and they seem to think that domain expertise is more important than data. You have to be very careful when thinking like this. Sometimes, when you have enough data, you have to trust the data and ignore your prior domain expertise. It's something that people have to learn to do over time. It's not always the case, but you have to know when to do it. The second is that the data-driven approach sometimes enables a completely new business model that was not possible before. When a startup comes along and truly embraces the data-driven approach, it can use a new business model: à la carte streaming versus a fixed number of channels in the case of Netflix, or infinite inventory instead of a fixed number of stores in the case of Amazon. And the third is a fear of making mistakes. Large companies and organizations have a fear of making mistakes, and the problem with a data-driven approach, and this goes back to your point again, is that when you deploy an algorithm to do something, algorithms can make mistakes. Humans can make mistakes too, but algorithms can make even more spectacular mistakes. For example, let's say a company uses an algorithm to decide how to price
their product; let's say a retailer uses this. It may be that the algorithm usually suggests reasonable prices that work out well, but on a particularly busy shopping day it sets all the prices to some bad values and the company loses a lot of money. This could happen, and when it happens the company says, I'm not going to trust the data ever again, I'll never do that again. But the problem is that algorithms learn by making mistakes. You have to deploy the algorithm and let it make mistakes, learn and get better over time. And algorithms get better much faster than human beings can; they can learn much faster with data than humans can. So in large organizations there is often a fear of making mistakes that prevents them from adopting data-driven methods, which enables startups to sometimes come in and succeed where big organizations cannot. You may have heard of this book called The Innovator's Dilemma. It explains why disruption happens in a very general sense, written by this MIT economist Clayton Christensen, or maybe HBS. I highly recommend that those of you who have not read the book but are interested in entrepreneurship, or even in research, go read it; it's an excellent book. But the one thing the book doesn't mention, which I think is new in this world, is this idea of data network effects. The idea is that somebody who uses data to make interesting decisions initially starts with a little bit of data, then uses the data to make some decisions, then gets more data, the decisions get better, and over time they improve much faster. This is what I call a data network effect, and these data network effects are sometimes quite rapid and enable startups to very quickly become dominant companies before the incumbents realize it. So the example I'm going to use for disruption is venture capital, and venture capital
has been an established industry for a long time now. It started in Silicon Valley and this is where it's primarily based; most of the top VC firms are in Silicon Valley. Now there are many VC firms in India as well: many of the Silicon Valley VC firms have offices in India and there are homegrown VC firms too. The interesting thing about the VC industry is that its process has not changed that much since the early days in the 1960s and 70s. The idea is that if you're a VC firm you have a nice office, you expect entrepreneurs to come and pitch you interesting ideas periodically, then you decide which of those ideas you like and you end up funding those companies. This is pretty much the way VC firms work. Sometimes they use data: they hire data scientists to crunch through data from the companies that come and pitch them and try to validate them and so on. But those data scientists are in an advisory role, which is a red flag as we saw; they're not making decisions. So this is my firm, Rocketship.vc, where we are actually trying to disrupt venture capital using data. The basic idea we're leveraging is that there are many more startups all over the world. As I mentioned, Silicon Valley venture capital firms are largely focused on Silicon Valley and a few other geographies, now including India, but now there are startups all over the world, and many more of them. The costs of launching a startup are really low now, which means there are way more startups. Because of smartphones and app stores, anybody anywhere in the world can launch an app and find a market all over the world, which was not possible until a few years ago. And there are talent pools in maybe a few hundred places in the world of people who can start interesting companies. Even in India, many of us think about places like Bombay and Bangalore, but there are people starting
interesting companies in Chandigarh. So there are interesting places where people can start companies, and there are emerging market opportunities, places like India, China, Brazil, Indonesia and so on, where there are interesting companies that you can start. So the number of interesting companies is now beyond human scale. We've actually built a database at Rocketship of all the companies in the world, and we use some methods to identify which of them are startups that we might be interested in funding. There are 2.1 million of them. No VC can humanly sift through 2.1 million startups. And it turns out that over 100,000 of those companies need funding at any point in time. That's a huge number. 90% of these companies are not in Silicon Valley. And interestingly, if you look at the number of companies that are worth over a billion dollars, recently the companies that were started outside Silicon Valley overtook the companies that were started in Silicon Valley for the first time. So more billion-dollar companies have now been started outside Silicon Valley than in Silicon Valley. It's a good point to remember for those of you who want to start the next billion-dollar company. So this is what we do at Rocketship. There's lots of data available as well. There's now data about companies in places like the app stores, on the web and so on. If it is an app-driven company, we can see how many downloads an app is getting and how people are reviewing and rating it; if it's another kind of company, there are other kinds of data. You can also see direct customer feedback on companies from places like Facebook: what are customers actually saying about the company? And we can look at LinkedIn to see who the founders of the company are, who the senior team members are, and what we think about them,
and there are places like Crunchbase and AngelList and many others that have information on what new companies are being started at any point in time. If you combine all these things, you can build models for the things that VCs care about, which are things like market, team, traction, competition, customer feedback and so on, and then we can build a company model of what leads a company to succeed given the capital. Using that company model we make investments at Rocketship. So this changes the way a typical VC works: we use a data-driven model to decide which companies to invest in. Now that's interesting, but by itself it is not enough to disrupt the venture capital industry. To disrupt the venture capital industry, as we saw earlier, you need not just a technological innovation but a business model innovation, and the way we do that is by flipping the VC business model on its head. Remember what I told you earlier about most VC firms sitting in their nice offices waiting for entrepreneurs to come in and pitch them. We change that. Since we can analyze all these 2.1 million companies in parallel, we figure out which are the interesting companies, wherever they are in the world, and ask them whether they would like us to be an investor. So instead of the entrepreneur pitching the venture capitalist, the VC pitches the entrepreneur, prior to them even reaching out to the VC. This flips the whole VC business model on its head. This pie chart shows you the geographies of the companies that we reached out to in the last quarter, the last three months or so, and you can see that it's very widely distributed. Because our database has data on companies all over the world, we reach out and make investments in companies all over the world. US-SF, which is Silicon Valley, has only 11%, which is in line with Silicon Valley's share of the total number of startups. US-other, for example, includes places like Rhode Island
where we actually have an investment, where most VCs wouldn't even bother looking. In India we actually do have an investment in a startup based out of Chandigarh, where most VCs are not looking. Things like that. In Europe we have investments in companies based out of Copenhagen and Barcelona. So these are all interesting companies that most VCs are not even looking at, but the data is telling us they are interesting, so we end up going and making investments in them. So in Europe, for investments. So it's a two-step process. The first step is that we use all the data that we have available to find interesting companies. We reach out to these companies and then we ask them for their internal metrics, usually that's Google Analytics or Flurry or whatever it is that they are using internally to track. We then look at that and validate it with our own data, and then we make the investment decision. An interesting thing is that we make many of our investment decisions without ever meeting the entrepreneur in person, because they are spread all over the world; usually it's over a Skype call or something like that. So it's a very different business model. To summarize, the two key themes that we've seen here are optimize versus disrupt. Optimize takes an existing business or business model, usually an established business, and uses data to make it better. Disrupt takes an established industry or business model and finds a new way of doing it, usually for the better. So these are the two key ideas that we've seen, and the key question that every entrepreneur has to think about, and every researcher too sometimes, is: should I be optimizing or should I be disrupting? Should I build a product that helps an existing group of businesses optimize their business, or should I be starting a new class of businesses that blows away the existing class of businesses? This is a very important question that you have to think about, and it often determines whether
you succeed or fail. This is kind of a hard question to answer in general, but you can look for some disruption cues. Is it an established and fragmented industry? Well then, maybe you might want to disrupt it. Are they slow to adopt the latest technology trends? Are there asset-heavy models that you can replace with asset-light models? For example, the existing business models use very expensive assets like retail stores or taxi medallions or things of this nature; can you replace them with an asset-light model using data? That's worth thinking about. And the final thing to think about is this idea of the risk-to-reward trade-off. Disrupting an established industry is a very risky thing to do; many try but few succeed. Whereas optimizing is a much lower-risk proposition: you can probably make a few sales into an established industry. On the other hand, when you do disrupt, the rewards are much, much higher. Think of the valuation of an Uber or a Netflix or an Amazon versus many companies that are trying to optimize. So the rewards are much higher; the risks are also much higher. At the end of the day it might even be a matter of personal preference, based on your own risk-to-reward profile: do you want to disrupt or do you want to optimize? So it's a very important question to think about, and I do hope that many of you, when you start companies, end up starting disruptive companies and not just companies that optimize existing businesses, because that's where a lot of... And even when you do research, sometimes it's interesting to think about research ideas that open entire new fields of work or question established wisdom in some way, as opposed to just taking a paper that somebody published and tweaking it a little bit. So that's an example of optimization versus disruption in the case of research. Even in the case of research, I hope some of you will think about doing disruptive research rather than research that
optimizes. An example of disruptive research is something like self-driving cars, which disrupted rather than purely optimizing something that came previously. Let's see, human-machine collaboration. Okay, I've skipped one section and we're moving to this section, this theme that I call the rise of the cyborg. Cyborgs are always exciting, so I wanted to talk about them. On a daily basis we interact with a lot of data-driven software. A typical user would interact with Facebook, with Amazon, with Twitter, with Google, with Netflix, and all these companies are data-driven companies. They build data-driven models of their users and interact with their users: they make recommendations, or they reorder search results, or they show a news feed in a specific order depending on the models they have of their users. We're all used to this, so we don't even give it a second thought, but perhaps we should, because there's something called the agency problem that we should all be aware of. If you think about all these companies, whether it's Facebook or Netflix or Amazon or Google, they build models of users, and we often think that these models are optimized for our benefit, but that's not the case. Each model is optimized for the benefit of the company that created the model. So Google's models are optimized for Google, Netflix's for Netflix and so on. They have certain business objectives that they are trying to optimize by building models of you and me. Very often our goals and the goals of the company that built the model are aligned, but sometimes they are not. Most of the time they are aligned, so we don't need to worry about it, but sometimes our goals and the goals of the company are not necessarily aligned, so it's worth thinking about this agency problem sometimes. So what are the problems that arise because of this agency problem? The first is privacy: clearly everybody has your data and
they're building models whether you like it or not. The second is this idea of a pricing and discovery disadvantage. You can only select from among the choices that are shown to you. If a service that you interact with uses a data-driven model to decide these are the items I am going to show you, in this order, you can only decide among the first N items they show you, because you are never going to go beyond that. So at some level they are performing a service for you by trying to guess what you might like and ordering things, but at some level you are also captive to the choices they have made in terms of what they are going to show you. There is also a pricing issue, because clearly they can build models of how price sensitive you are and take advantage of that by showing you products at certain price levels and not at others. And the other interesting thing is that most of these companies are building population models and are optimizing them for their own ends, and your ends may not be the same as their ends. So it may be worth thinking about. And the research community, remember this talk is about the research community, has actually helped create this problem by creating fantastic algorithms like matrix factorization and multi-armed bandits and neural networks and deep learning and so on for companies, but we created zilch for users. We created all kinds of interesting algorithms for companies to use, but nothing so exciting for users to use. And I think therein lies an opportunity. It may be interesting to create algorithms for people who currently don't have algorithms, as opposed to for people who already have a ton of algorithms. That could be a very interesting opportunity, whether you are in the research space or whether you want to start a company. So for example, in this fight between companies and people, the companies are actually
armed with these guns and steel, whereas humans are armed with wooden weapons, and when those two meet the ending is always kind of predictable. So how do you solve this problem? Well, you can solve this problem using this idea of a cyborg. Cyborgs are very cool, and basically the idea is that you replace the user with a cyborg. And what's a cyborg? A cyborg is just a layer that separates the user from the data-driven services that they interact with. So let's just add an extra layer, which is a personalized model of the user that's built entirely on behalf of the user using the user's entire data, and have that layer interact with the data-driven services, as opposed to having the user directly interact with them. So what kind of services could the cyborg layer provide? Well, it could provide privacy protection. There is a very interesting idea called differential privacy that is going around; read about it if you don't know about it already. The idea is that the cyborg can choose to reveal certain information to Facebook or Netflix and choose not to reveal certain other information. Or maybe it can obfuscate information slightly, maybe make up a few searches that you never made but pretend they came from you, just to fool the other service into thinking that you have certain preferences that you don't have. You can strategically spread your interactions across services: for example, you can buy certain products from Flipkart and some from Amazon, thereby making sure that neither of them builds a full model of you. Things of that nature. The other interesting thing you can do is around discovery and pricing, because the cyborg can actually look at a much larger selection than you can, and it can pick the items that are appropriate for you, acting strictly as your own agent, not the company's agent. It need not pick the more expensive items, for example, if you are in the mood for cheaper items, or it might
know certain things about you that the company does not know. For example, the cyborg might know that you have an upcoming trip, which Amazon doesn't know, and therefore can decide that it needs to buy certain products for you. Things of this nature. And this is the key idea: the cyborg can combine a personal model that has been built for you using all your data with the population model that the company has. Here is a graphic that helps understand this. Here is a user. The user interacts with Amazon, and Amazon builds a model of the user. Amazon has information from millions of users, not just this one user, and therefore if you had to figure out, given that this user bought a certain book, what other books they are likely to buy, Amazon will do a great job, because they have information from millions of users, whereas this poor user only has information about themselves. So Amazon is probably going to do much better than this user. But the interesting thing is that Amazon only knows this user's purchase information, and this user is doing many other things in their life other than purchasing products. They probably have a calendar, they have email, they have Facebook, they have Twitter; they have many other things going on in their lives. And the cyborg knows all these things, not just the products. Therefore the cyborg can take the things that Amazon knows well and combine them with all the other information the cyborg knows well, and overall make a better recommendation for the user. So that's what the cyborg can do: it can combine information that only the user has with information that only Amazon has to make even better decisions, combining personal and population models. So I think this is a very interesting disruptive research and business opportunity. If any of you want to do research on this, I am happy to be an advisor from afar if you can find a local
advisor, because I am personally interested in this topic.
This is easier on a platform like your browser, but once you have these things encapsulated inside apps, isn't it pretty hard for you to create an agent which can actually dig deeper than just the surface of the app?
I think that's a very hard question. I think you might want to start with websites; pretty much every service that has an app also has a website, so I don't think it's going to be that much of a challenge. But I think over time, when these agents become popular and become the established way of doing business, companies will expose APIs for such cyborgs and not just for humans. Maybe it will happen in 5 years, maybe it will happen in 10, but I think that's what's going to happen. So I think as researchers we have to anticipate the future. Sometimes we will be wrong and sometimes we will be right, but it's worth having the fun. So that's pretty much the... wait, let me just do the conclusion. Oh, well, maybe just for fun, maybe I'll talk about Trooly without having any context whatsoever. So here's an example of a very interesting data-driven company. Where is it? It's not even coming up. So I'm using this company as an example of an interesting data-driven company because it has two IIT Bombay founders, Anish Das Sarma and Nilesh Dalvi. It is motivated by this idea that trust has grown dramatically. Whereas in 2005 we were buying things online, now in 2015 we are sharing experiences with strangers. We might be renting a house from somebody on Airbnb or renting a house to someone on Airbnb, taking a ride on Uber, or interacting with someone on Tinder or something of this nature. These experiences are somewhat personal, and so the question that Trooly is motivated by is: would you rent your house to a stranger, to this stranger? It's kind of scary when you think about it. This
guy can come and he might trash your house or do something really bad in your house. So how do you know whether you trust this person enough to rent your house to them? This is the problem that Trooly addresses. They do this by looking at a lot of data about users based on their public data footprints. They look at what people are saying on social media, what they've been reviewing on Yelp, what information you can find about them on Google, what information you can find about them on LinkedIn, and criminal records; they go do a search for any criminal records that they can find about this person. They combine all this to come up with what they call a trust score for a person, and you can use that trust score to decide whether somebody is trustworthy enough for you to rent your house to. It's a very interesting data-driven application. In a way, here data is removing friction. Normally we don't like to do certain things with strangers, but if the data tells you that you can trust this person, you're more likely to do it. So in this case data is in some way removing friction from the world. It's a very interesting example of a company. So let's go to the conclusion; I'll just jump ahead to the conclusion, so we were waiting for that. To conclude, we looked at these 5 generations of data-driven applications. We looked at some lessons and opportunities: this idea of intelligent apps being the place to be right now, and this idea of disruption versus optimization, and why disruption is sometimes much more exciting to do than optimization. We looked at cyborgs and saw that as a very interesting disruptive opportunity that some of you can hopefully pursue. And the last piece of this relates to what Professor Phatak was talking about, this idea that we have silos in academia. This talk was given to the data management community at VLDB. There's data management, there's information retrieval, there's AI, there's data mining, there's systems; there are all these silos in
the academic research world, and all these silos have this problem that's called marketing myopia. This idea was coined by this guy called Theodore Levitt back in 1960 in a Harvard Business School case study, and the example that he used comes from the early part of the 20th century in America, when railroads were the dominant form of transportation. This was before airlines. Railroads were the dominant form of transportation in the early part of the century, and the railroad companies were the kings of the world and the stock market and so on. Suddenly airlines started appearing. It became safer to fly after the Second World War, more and more airlines started appearing, and more and more people started flying rather than taking trains. The railroad companies were looking at this and asked, should we invest in or buy these airline companies, and should we get into the airline business ourselves? And the railroad companies said no, no, no, we are in the railroad business, we are not in the airline business. And then what happened? Now railroads are somewhat irrelevant and airlines have become the dominant form of transportation, at least in the U.S. 
and it's happening in India even now. So the interesting thing is that the railroad companies were looking at the world through the lens of their product, which was a railroad. They should have instead been looking at the world through the lens of their customers and said, look, we are not in the railroad business, we are in the transportation business, and we should be in the most efficient form of transportation that's available to customers, as opposed to saying we are in the railroad business and not in the airline business. A lot of academic disciplines have this problem of looking at the world through the lens of their product, whether it's data management or artificial intelligence or systems or computer architecture or whatever it is. They think there are certain boundaries to their world, but that's not the way the world works. Solving any particular problem requires looking at the problem through the lens of the customers and the users of the product, and that often requires combining ideas from multiple disciplines, as Professor Phatak was saying, and coming up with solutions. And the key way, I think, to look at the world right now is that we live in a data world. Data impacts every aspect of human endeavor right now. Think of entertainment or education or science or security or manufacturing or government or commerce or transportation: everything is impacted by data and data-driven approaches right now. So there's plenty of opportunity to apply data to do interesting things, if we can ignore these silos about systems and AI and data management and architecture and so on, and just think about how to take data and impact every aspect of human endeavor. And for many of you, I think, many of these things, whether it's government or manufacturing or security or the sciences, are often stuck in some kind of local maximum where they've been doing business as usual without
using data, and now when suddenly data comes along there's a chance to shock them out of their local maximum into a much better space, just like Amazon came and shocked shopping out of a local maximum. As data scientists, or people who work with data, we have the opportunity to be the change agents, to go there, use data and transform each of these fields. So we should think of ourselves as data plus X. All of us should understand data-driven methods, but we should also understand how some of these other fields work, whether it's security or government or commerce, and come up with ways of applying data in a disruptive fashion to many of these fields, not just to optimize them. I think there's a huge opportunity to do so. So with that I'll end my talk right there. We have time for some questions about the talk, and after that any other questions that you want to ask.
So what I've been wondering is, as we mature along this curve of the digital world, there are a lot of concerns that people are expressing. Privacy is some of the stuff that you touched on, but more so the digital footprint that we leave behind. Maybe many of us don't want that. Many really smart, intelligent people that I know are now withdrawing from the digital world and saying, I don't want to have this footprint, it's becoming too intrusive in my life. The question is, will regulation soon come that will limit the availability of this data in the way that we're seeing it today? It's fairly open; people are monetizing data in ways that we never thought they would. But will regulation drive some of that going forward? 
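The cyborg layer described earlier in the talk, an agent on the user's side that chooses what to reveal to each service and may mix in a few made-up searches, might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the `CyborgLayer` class, the `share_policy` format, the `decoy_pool` and the `decoy_rate` parameter are all hypothetical names that do not come from the talk or from any real product. It is also not an implementation of differential privacy; it only illustrates selective disclosure and simple obfuscation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class CyborgLayer:
    """Hypothetical user-side agent sitting between a user and online services."""

    # Assumed policy format: service name -> set of topics the user will share.
    share_policy: dict
    # Made-up queries that can be mixed in as decoys (illustrative only).
    decoy_pool: list = field(default_factory=list)
    # Probability of adding one decoy after each real query that is sent.
    decoy_rate: float = 0.25
    # Seeded generator so that runs are repeatable.
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def outgoing(self, service, events):
        """Return the queries actually sent to `service`.

        `events` is a list of (topic, query) pairs produced by the user.
        Events on topics the policy does not allow are withheld; after each
        allowed event, a decoy query may be appended.
        """
        allowed = self.share_policy.get(service, set())
        sent = []
        for topic, query in events:
            if topic not in allowed:
                continue  # withheld: the service never sees this event
            sent.append(query)
            if self.decoy_pool and self.rng.random() < self.decoy_rate:
                sent.append(self.rng.choice(self.decoy_pool))
        return sent
```

With `decoy_rate=0.0` the layer only filters what each service sees; raising it mixes decoy queries into the stream. A service with no entry in `share_policy` receives nothing.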
So the answer is in two parts. I think yes, there will be regulation. For example, Europe has much stronger data privacy law than the US. So there will be some regulation that comes along. But I tend to believe that this is a problem that has been created by technology and to some extent can be addressed through technology as well, using this idea that I spoke about. We have armed companies with these methods, and that is causing the privacy problem. Perhaps if we arm ourselves, ordinary citizens, with the same data-driven methods, then we can be on an equal footing with the companies and decide what to share with the companies and what not to share, and make intelligent decisions about that. So I think part of the solution is going to come from regulation, and part of me hopes that a large part of the solution is going to come from technology itself. I talked about the cyborg model, where for the first time something would be on the side of customers instead of on the side of revenue generators.

But taking past examples, such as even your product Jungly, which was supposed to be something that compared prices from different websites: that's not something which promotes any particular firm. But after all, a startup's job comes down to generating revenue, and the revenue comes from the big firms. So do you think something like that is a viable idea? For example, in cases like Zomato or rating services in general, we've been hearing of a lot of cases where the promotion of some particular firm happens due to backend funds or something. So how do we prevent those kinds of cases? Or do you think some firm which does not promote any already established firm, and does its job from the side of consumers, has enough of a market to exist on its own?

The question, I think, is very apt; for example, Zomato and even Yelp have had similar accusations in the past. So I think once again the agency problem is best, I
think, solved using the appropriate business model. For example, one way the cyborg idea can be implemented in a way that works is through open source. Just as Linux was created in open source and is available to everybody, hopefully somebody can create a cyborg OS in open source and it can be available to everybody, in which case it's not commercial. So that's one way of doing it. The other way of doing it, in a commercial manner, is for a company to have a business model that's aligned with its end users. So for example, if the company's business model is that you pay a monthly subscription fee, and I'm not going to show you ads, I'm not going to sell you products, then I'm entirely on your side. If the business model of the company is entirely aligned with the end users, as opposed to being dependent on something like commerce, that's another way to build something like a cyborg. So the only revenue should come directly from the users, and it should be a fixed revenue.

Most of the time in such presentations or data lectures, we get to look at the side which is probabilistically more likely, as in, data tends to be accurate in predicting more often than not. But sir, as a VC yourself, you must have had times when data did not work out, as in you predicted something and it did not happen. How do you deal with those cases? Because we never get to see all that stuff in such presentations. Can you share some experience of yours where some prediction of yours failed, and how you dealt with that?

That's an excellent question. It's very hard for me, at least at Rocketship, to talk about predictions that failed, because we've only been around for a short time. We made these investments about a year and a half ago and all these companies are still around. When one of these companies fails spectacularly, I'll know that I made a mistake, but that hasn't happened yet because it's been
a short time. I'm sure some of them will fail, so it's hard to tell. But I think it is true that data is only probabilistically correct. So I think the philosophy when dealing with a data-driven approach is to use a statistical approach. If you're an investor and you're using data to make investments, you should not make one investment; you should make 15 investments, because if you make 15 investments and all of them have a certain probability of being right, you're more likely to be closer to the mean of the predicted success. So whenever you use data, the philosophy is always to use a statistical method. I think using data to make one-off decisions is a dangerous path to go down. Data should only be used to make statistical decisions rather than one-off decisions. If you're making one-off decisions, I think sometimes intuition is a better path than data, but perhaps not.

Yeah, I don't know if people have done that too, but I think that's probably the wrong approach. I think a lot of people use a statistical method using dating services like Tinder. That is a statistical approach because you're sampling a large population. So in that case we can convert a one-off decision into a statistical decision.

Question: I knew Derek specifically in the IoT space, you just touched upon it. There are a lot of issues with the standards. There is no common standard and everybody is fighting together, the NKM, ITU, and there are a lot of power centers coming in. So would you advise, say, hold on for a while till standards are set up, just wait till the chaos is over and then get into that space? Or do you see some sort of an indication of one standard coming up, like IP version 4 or 6, which standardized the whole communication? What's your take on that? The other question is on your fun easy disruption idea. There are a lot of startups coming up in this: some sort of a stock exchange for startups, listing the potential startups who
can be listed, and then they can be catered some sort of a, you know, these platforms reach the startup than reaching the reason. So are you going to take this concept and just put that, like for example Grex? These are startups from India who are on the same platform, Grex.

So on the Internet of Things and standards: I'm not hugely familiar with the standards in the IoT space, so I don't know exactly what's going on there. IoT has been one of these spaces, for me personally, where I think there's a lot of promise but not very much has actually happened in terms of end results, except I'm seeing some early signs: some of the bigger companies like GE are investing a lot in IoT kinds of things. So if I were a startup right now, what I would do is go work with some early-adopter type companies like GE and try to solve a real problem, and then by the time the standards evolve you'd probably have acquired a bunch of domain knowledge already. So that's probably what I would do if I were in this space right now: just go solve the problem and not worry about the standards for now.

On the other thing, about the stock exchange type model for private companies: there are things like SecondMarket and so on, and these tend to focus on later-stage companies. Typically these are very late-stage startups that have raised a few hundred million dollars already, or a hundred, and then they are sort of semi-public, semi-private companies, right? They are not yet public, but they are big enough that they potentially could be, and they are all listed on these kinds of exchanges. I think that's an interesting idea, but there are not enough startups of that nature. If you look at the millions of startups that we look at, maybe there are a few hundred that are in that category. So that's a very small sliver of the startup world.

Question.

We reached out to, not
necessarily our investments. I didn't have time to get the actual investment thing, so this is a reach-out. But in terms of investments, actually we have more in India now than in Latin America. I think one of the reasons why, frankly, until recently we had not made that many investments in India is that the valuations were out of control in India. I think there was a little bit of a bubble, and valuations have been unrealistically high in India, driven by some unrealistic valuations in the e-commerce and ride-sharing, Ola-type spaces. That was driving up all kinds of startup valuations. Now I think valuations are settling down to a more reasonable level in India, and so we have been making more investments in India.

Valuations were unreasonable in India. They were too high until last year. Companies were overvalued based on companies in the e-commerce space such as Flipkart and Snapdeal. These were all overvalued, and so was the overall space, and that means that we didn't want to invest at those high valuations. We were waiting for the valuations to come down. So now we are investing in India.

There are probably seven more questions, but what I would suggest is that those of you who need to leave can leave now, and those who want to ask a few more questions can come up and ask them. So let's again thank Anand for a wonderful talk.