Thanks a lot. I'm really excited to be here, and whenever it comes to speaking about ML and AI, it is always very exciting. Machine learning and artificial intelligence have in the past always been in the shadows of universities and research institutes. In the future, particularly in the next decade, they are going to be seen in all the consumer products that we live with. Product managers like me and you are going to play a critical role in bringing this vision to life. So today, I'm here to speak about the role that product managers play in machine learning products, and what we will be playing in the future. I did a similar talk at the Santa Clara Center, and what made it really engaging was people asking questions. So I hope this is very interactive; please feel free to stop me, ask questions, and drop insights whenever you can add to the conversation. Awesome. So what will we do today? We'll begin with a very rough refresher of what machine learning is about, particularly the things which are relevant in the rest of the talk. Then we will speak about the landscape of machine learning products, how it has changed over the past two decades, and what it means for the future. Then we will talk about how machine learning products are built, all the steps involved, and the focus of the talk, which is the role that product managers play in this new world. If we have time, we will have an open interactive section on how to prepare for machine learning PM roles, for those who are looking to interview for them. Just to set some expectations: I will gloss over some of the related problems in these areas, but I will not dive too deeply into them. Again, this is not a lecture. This is my experience, my perspective, and I want it to be highly interactive. Also, machine learning is a vast field, so this cannot be exhaustive.
These are just some examples here and there, to motivate how people solve different problems. Also, I assume that most of you here are not machine learning experts, and I've tailored this talk to people who are probably not machine learning engineers or machine learning PMs, but who want to be. So let me test my assumptions before we start the day, sorry, start the evening. How many of you are actually machine learning engineers? Good. Please feel free to correct me once in a while. But glad to know. How many of you are product managers working on machine learning? Okay, cool. Again, same thing applies. In general, engineers here? Good. PMs? Awesome. Start-up founders? And the rest of you, which fields do you come from? UX. Awesome. Sounds good. So please feel free to stop me if some things are not clear, and if I see that things are super obvious, I might try to accelerate as well. Let me introduce myself. My name is Dev Dutta Gangal. I studied computer science at IIT Bombay; for those of you who do not know, it is one of the very good engineering schools in India. Then I joined Capital One. Capital One is built on machine learning. There are a couple of things that they do. The first is figuring out, if they send you a marketing brochure for a new credit card, what response rate they can predict for you to actually accept and sign up for a card. Then once you sign up for a card, they also predict the risk rate, which is the probability of you actually defaulting and charging off. The company was founded about 20 years ago, and machine learning is the core of that company. I then moved on to Yahoo, where I was manager for product insights for Yahoo Mail and Yahoo Messenger. I helped a lot of the product marketing teams make their targeting much more efficient, and I also figured out the value of a lot of the Messenger users in the Messenger graph.
I then moved on to Zynga and Tainiko, where I was managing revenues for some of their large games. Then I spent three years at Groupon, where I was managing mobile apps, iOS, Android, the location platform, as well as the notification platform. I was also on the relevance team, and there we used machine learning to rank the deals so that once you open up the Groupon app, you get the best deal that is suitable for you, and also the best notification once per day, tailor-made and personalized for you. Currently, I'm at Uber, on the Maps team, the Sensing Inference Research team. Now, a quick refresher on machine learning. Again, this is going to be super simplistic, and I'm going to throw out some very introductory concepts, just because I will use them later on in the talk as well. For some of you it might be repetitive, but please bear with me. What is machine learning? Machine learning is nothing but finding patterns in data. I'm just going to use some terms. You have a training set, which is a set of labeled data where you know what the input is as well as the observed output. Then you use machine learning algorithms to train on that data, and you form a formula or an equation like h(x), which takes an input x and predicts the output y. That basically is machine learning. This image is taken from Professor Andrew Ng's course. I'm also going to speak about three different types of machine learning techniques, to represent the three generations of how machine learning techniques have evolved, how they have improved, and how they have become more and more powerful. Let me first start with linear regression, or even linear classification. There are large sets of problems where the data is what we call linearly separable. Let me explain what that means with an example. The example here is you have a lot of data, data for homes.
In this simplistic case, you have data around, let's say, the average price per square foot in that given location, the square footage of that home, and let's say the number of rooms in that home. You have this data, and now you have to predict the price of that home. Just predicting the price is called regression; classifying and saying, hey, whether it is above 10 million dollars or below 10 million, is called classification. In this particular case, you can actually form a linear formula to determine the price. Or you can have a plane that separates the entire 3D space into two buckets, with all homes below 10 million on one side and all homes above 10 million on the other. This is also a very common starter example that most professors use in the first or second lecture. And this is certainly a product that is out in the market: Redfin and Zillow and all these companies certainly use a lot of data, not just three input variables but a lot of variables, to actually predict the value of a home. Yes, absolutely. Yes, so in this case, there are three dimensions here, right? A linear model is basically x, y, z, and if you can figure out a model as a linear function of x, y, and z, that's a linear model. Just as an icebreaker here, any guesses whose home that is? It has got nothing to do with machine learning, but okay, the address is 2101 Waverly Street, Palo Alto. I wish it was mine. I also wish I was that person, but unfortunately not. No, this is Steve Jobs' home, and I love that home in terms of architecture, et cetera, so that's why, just to wake up the audience, I put this example in. But yeah, Redfin and Zillow certainly use linear regression, or maybe some more complex model, to estimate the value of any given home. Now let us move on to the next, let's say, generation of machine learning problems and techniques.
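To make this concrete, here is a minimal sketch, in Python with NumPy, of fitting a linear model h(x) for home prices by least squares. All the numbers are made up purely for illustration; real products like Redfin or Zillow use far more variables and more sophisticated models.

```python
import numpy as np

# Toy training set: [price per sq ft in the area, square footage, rooms]
# and the observed sale prices. Every number here is invented for illustration.
X = np.array([
    [800.0, 1500.0, 3.0],
    [950.0, 2200.0, 4.0],
    [700.0, 1200.0, 2.0],
    [1200.0, 3000.0, 5.0],
])
y = np.array([1.30e6, 2.20e6, 0.90e6, 3.80e6])

# Add an intercept column and fit the linear model by least squares:
# price ~ w0 + w1*ppsf + w2*sqft + w3*rooms
A = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_price(ppsf, sqft, rooms):
    """h(x): the learned linear formula applied to a new home."""
    return w @ [1.0, ppsf, sqft, rooms]
```

With only four points and four parameters the fit is exact on the training data, which is precisely the "formula or equation h(x)" idea: train on labeled examples, then apply the learned weights to any new input.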
In many cases, your data is not actually linearly separable. Let me start with a very simple example. In this case, you need to separate the green dots from all the red dots. Basically, all these green dots are, let's say, within a particular circle. You cannot actually represent that circle in any linear format, right? This is of course a very simple example, but there are lots of examples where you cannot cut the space with a hyperplane, or you cannot have a linear formula. So what engineers and machine learning scientists have done is create a technique called feature engineering. What they do, in this very simple case, is create a new feature, something like (x-1)^2 + (y-1)^2. What that does is create a space with more dimensions, but in that space you can actually have a plane that will separate the green points from the red points, as an example. Any questions so far, or can I move forward? Yeah. So how do you take this? Absolutely. Yes, so certainly artificial intelligence and deep neural nets are a part of machine learning, but I would say even a simple linear regression is machine learning 101, right? Most professors will actually start from linear models, then move on to techniques like, say, SVMs, where some feature engineering or some tricks are required to take your data, add more variables or more dimensions, and then the problem space becomes linearly separable. A lot of the products that are actually out in the world, like web search, for example, or even ad optimization, et cetera, use tricks like SVMs.
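The circle example can be sketched in a few lines. The lifted feature z = (x-1)^2 + (y-1)^2 is exactly the hand-crafted feature described above; the sample data, the thresholds, and the accuracy comparison are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 3.0, size=(1000, 2))
# Green = inside the unit circle centred at (1, 1); red = everything else.
green = (pts[:, 0] - 1.0) ** 2 + (pts[:, 1] - 1.0) ** 2 < 1.0

# No single straight cut does well: the best axis-aligned threshold on x
# alone tops out well below perfect accuracy.
best_line_acc = max(
    np.mean((pts[:, 0] < t) == green) for t in np.linspace(-1, 3, 81)
)

# Feature engineering: add z = (x-1)^2 + (y-1)^2 as a new dimension.
# In the lifted space, the flat plane z = 1 separates the classes perfectly.
z = (pts[:, 0] - 1.0) ** 2 + (pts[:, 1] - 1.0) ** 2
lifted_acc = np.mean((z < 1.0) == green)
```

The point is not the specific feature but the pattern: a problem that is hopeless for a linear model in the original coordinates becomes linearly separable once the right hand-crafted dimension is added.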
They also use what we call decision trees, which are also of course non-linear, where the logic follows a tree-like model, right? Before we move to deep learning, I just want to stress one point, which is that in any of these non-linear models, what is heavily required of your engineers is feature engineering. So really understanding the data with which they are playing is critical. And then of course doing feature engineering, and sometimes it works, sometimes it doesn't, but that iterative process is very costly for anyone to get through, yeah? So for features? Yes, that is precisely what they do. It is a combination of, again, picking which variables, right? In any real data, you have hundreds of variables; figuring out which of them are important, and how you use variables in conjunction with each other, is critical. So in, let's say, the second generation of ML problems, your engineers spent a lot of time doing feature engineering. That's the point I want to convey. But now we have moved to another technique, which is called deep learning. Let me explain what deep learning is before talking about advantages and disadvantages. What deep learning does, basically, is stack a lot of linear models behind each other to solve complex problems, like, say, is that a dog or is that a car? You cannot just, I mean, feature engineering is going to be limited; it is not scalable. What deep learning does, again, as I said, is a sequence of a lot of linear models stacked behind each other, with non-linear connections in between, which has proven to solve a lot of complicated problems, like image recognition, for example. Now, these techniques have worked. They are called neural networks because they derived inspiration from what scientists think is the way our brain works. Now, the advantage here, again, is that no feature engineering is required.
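The idea of "linear models stacked behind each other with non-linear connections" can be seen in a tiny sketch. This is not a trained network; the weights are hand-picked purely to illustrate that two stacked linear layers with a ReLU in between can compute XOR, a function no single linear model can represent.

```python
import numpy as np

def relu(v):
    # The non-linear connection between the stacked linear layers.
    return np.maximum(v, 0.0)

# Hand-picked weights (an illustrative assumption, not learned by training).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def forward(x):
    """Two linear models stacked, with a ReLU in between."""
    h = relu(W1 @ x + b1)   # first linear layer + non-linearity
    return W2 @ h + b2      # second linear layer
```

Run on the four XOR inputs, forward([0,0]) and forward([1,1]) give 0 while forward([1,0]) and forward([0,1]) give 1; remove the ReLU and the whole stack collapses back into one linear model, which is why the non-linearity is essential.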
We have come to a point where engineering architecture is enabling us to actually leverage something like this. But the clear disadvantage, certainly, is that deep neural networks are hungry for data. Until you have a lot of data, they just do not work. This is my opinion, by the way: some people say that, hey, DNNs mimic the brain, but I haven't actually seen or read any paper that conclusively says that our brain is a form of deep neural network. What I mean to say is that our brain might have some structure which is completely different from DNNs, but so far DNNs have been able to solve a lot of problems. There is a professor who is now at Stanford called Fei-Fei Li, and she gives a good example of the limitations of DNNs. She says that her daughter is able to identify things using just a few examples. And I also tested it out myself. I told my daughter just two times that, hey, this is a unicorn and this is a horse. This is a unicorn and this is a horse. And on the third try, my daughter, who is two years of age (sorry, I forgot to mention that), was able, with just two examples, to identify what is a unicorn and what is a horse. I didn't explain to her that there is a horn and stuff like that. DNNs will not be able to do that as of today. So those are the limitations that DNNs have. Before moving forward, just to recap: these three represent the generations of how machine learning has evolved, with their pros and cons. Any questions before I move on? Cool. Awesome. Now, we spoke a lot about having training data, but you don't always have data. So let me again throw in three terms. Whenever you have data, whenever you have the training set to build your models on, it is called a supervised learning technique. There are instances where you do not have labeled data.
So for example, in this case, let's say all of these points are actually gray, but they are separated in some space. Then you can use techniques like clustering to say these yellow points are actually different from these blue points, and then you can color them yellow, blue, and green. There is a third technique called semi-supervised learning, in which you do get a lot of data which is somewhat separable, but you also have some labeled data which you can use as seeds to accelerate your learning. You can also employ human beings to just look at some sample sets and say, yes, this maybe looks like a fraudulent case, this maybe looks like there is something funny happening here, and then all other data points close to that point you can start thinking of as either fraudulent or non-fraudulent, for example. So YouTube has a workforce that is reviewing these videos? Correct, yes. Shall I move on? Awesome. Now let me speak to the landscape of machine learning products and how things have evolved over the last three decades. I have again compartmentalized it into three. First, machine learning products which have existed for the last couple of decades and which have made businesses much more efficient. Then, over the last five years or so, we have seen a lot of products which deliver consumer delight and also assist consumers in some cases. And then we will talk about the future, where AI is going to take over the world. So let me start with some of these examples.
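As a sketch of the clustering idea, here is a plain k-means implementation using nothing beyond NumPy. The blob locations and the farthest-point initialization are my own illustrative choices, not a production recipe; the point is that the algorithm recovers the groups with no labels at all.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unlabelled ("gray") points that actually come from three separated blobs.
data = np.vstack([
    rng.normal((0, 0), 0.3, (50, 2)),
    rng.normal((6, 0), 0.3, (50, 2)),
    rng.normal((0, 6), 0.3, (50, 2)),
])

def kmeans(points, k, iters=25):
    """Plain k-means: only distances between points, no labels needed."""
    # Farthest-point initialisation: start from point 0, then repeatedly
    # add the point farthest from all centres chosen so far.
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each point to its nearest centre...
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # ...then move each centre to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

labels = kmeans(data, 3)
```

After the fact you can "color" each discovered cluster, which is exactly the yellow/blue/green story above: the structure was in the data all along, and clustering surfaces it without any labeled training set.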
In many of these examples, starting with computational marketing, whether it is online ads or product recommendations as I used to do at Groupon, or credit risk, which is determining whether someone is going to default or not, or high-frequency trading, or anything that has got to do with fraud detection, whether it is spam, payment fraud, or identity fraud, machine learning, and particularly some of the early techniques of machine learning, has been in production for the last two to three decades. These products are just not seen by everyone else, but they have been in the background for quite some time. Then we moved to products where consumers started touching machine learning directly. Web search, for example, or automated voice systems: when you call in, something recognizes what you're saying and then determines whether to send your call to one department or another. Over the last maybe two to five years, we have seen a lot of products, typically around personalized assistance. Google Now is a good example, right? Google Now takes in a lot of inputs, knows where I am, gives me a lot of hyper-local recommendations, knows what is in my calendar, and predicts whether that meeting is important enough to be shown to me at that point or not. Lots of chatbots: WeChat in China in particular runs a lot of chats, and a lot of people have deployed chatbots to automatically talk to human beings. Same with a lot of image and face recognition. These days you can go to, Facebook had this Moments app, or even Google Photos, and you can search, hey, I'm looking for photos of this person, and it just shows you the entire list of those photos. So these are the products in the consumer-assist form which we have seen so far. And lastly, there is a lot of research happening in the medical device or medical assist area.
I haven't particularly seen an example which is productionized at scale, but there is a lot of promising work going on there, plus in agriculture. So correct, we will speak about the accuracy of models later on, but think about, let's say, personal assistants. If they make a wrong prediction, it is not going to hurt you as much. You're just going to say, hey, this product doesn't work well yet, and you're going to forget it. There are some instances, like web search, et cetera, where if you get the wrong results, you're not even going to realize that you're getting the wrong results. Plus there are some cases where, if you make some wrong predictions, it's actually going to cost a lot. So we will speak about accuracy. I think personal assistants and chatbots fall into a region where, if they make mistakes right now, people are forgiving. And then in the future, again, plenty of exciting stuff is happening. AI is actually going to take over the world, whether it is self-driving cars, smart homes, or smart bodies, which includes wearables and the way you interact with your surroundings. It is time. It is time for machine learning and AI to get out of the science fiction books, out of universities and research labs, and into consumer products. There are four strong trends which are going to make it happen. Starting, firstly, with the commoditization of deep learning architecture. What that means is that fresh grads, led by a senior machine learning person, can actually leverage things like TensorFlow and train on, let's say, AWS to make a lot of magic happen. So that's one. Secondly, the chips are getting more and more powerful. GPUs are becoming more powerful, and things are becoming smaller. That just means that whether it's your phone or the devices all around you, it is going to be easier for that distributed computing to happen right at the source.
And then there are other things which are going to enable the ecosystem, starting with cheap and excellent sensors. The cameras on your phones are getting better. Everything around them, touch sensors, IMU, GPS, et cetera, is improving and getting cheaper. Data plan rates are falling. In India, for example, there is a plan that gets you 10 GB for just a dollar. And as you can expect, there will be a lot of talking that happens between your end sensor or end computing device and maybe the cloud, so cheap data plans are critical for us to accelerate any kind of development. And the last thing: if there are many things that you need to distribute out, you need to take care of battery and power, and there is enough innovation happening in that field as well. So with all of these trends converging, we are going to be in a state similar to where the App Store and the Play Store were 10 years ago, when there were millions of teams all over the world, of like five to ten undergrads and people with enough experience, solving localized problems. Just the way the App Store and Play Store boomed, I think artificial intelligence is going to have a very similar revolution. So it's not just going to be the top four to ten companies and maybe a hundred companies in Silicon Valley, but people all over the world who are going to access the power of artificial intelligence. Now, what does it mean for product managers? What it means for product managers is that it is going to be your job to determine what problems to solve, right? You have the technology, and there are many problems in the world; figuring out which problems are key to be solved, and figuring out whether they can be solved, all these decisions need to be taken, and these experiences need to come to life. PMs are going to play a critical role in that revolution.
So with that, let's move on to how ML products are actually built, and as we go through the process, we will talk about where PMs play a role. Firstly, building machine learning products is a multi-step iterative process. There are lots of trade-offs to be made, so lots of decisions to be made, and I will talk about all the steps; there are also multiple stakeholders. Certainly starting from your consumer; then the government and regulators, who are increasingly interested in AI, primarily because of data privacy and data ownership; then of course your business stakeholders, whether it's you as a PM or your CEO; and then your data scientists and engineers, who are actually going to build the machine learning models, and engineers who are going to scale it out. You as a PM are certainly at the center of all of it, and it's your job to keep everyone connected. So communicating things, defining the policy, defining how things work, defining what to work on, et cetera, is a critical job that any PM plays. So now let's go over each and every one of these steps. Or actually, let me take a quick break for questions. Any questions so far? Goals that they're looking to optimize for? So I see part of that is in the problem. So you're talking about this? Yes, so what I've covered is how you communicate the decisions you have made to your consumers. Government privacy is something which I have not particularly covered here, but we can discuss it over Q and A. It is very critical. Governments are playing a larger role in defining what internet privacy means, what data privacy means, who owns the data, who owns the training data, and if you have a trained model, who owns it. A lot of activity is happening in that area. So let's cover each of them, and we will stop for questions as and when necessary. The first thing that a PM needs to do is identify what problems to solve, right?
So think of it this way: scientists and universities typically take a different approach. They are like, hey, a lot of data is available, I have this new source, how do I find something interesting? How do I build something more interesting to get more efficiency and whatever? Now, as a PM, your job is not to fall into that trap, and not to let your team fall into that trap, but to ask basic questions, which are: What is the problem? Is this problem big enough? If ML solves it, is it actually going to change the consumer experience and the business? And will I be able to measure the impact? PM 101, but as a PM, you certainly need to do that. The next step is answering the question of whether machine learning is actually required to solve this particular problem. Again, teams, particularly a lot of engineers, get tempted to throw machine learning at every single problem, but you as a PM need to ask some basic questions. One, can some heuristic-based approach solve your problem? Secondly, why are you actually using machine learning? Can machine learning actually outperform human beings? One example is credit risk. The smartest of human beings can look at someone's credit profile and probably not determine whether this person is going to charge off or not. Can machines do a better job? And they have proven to have done a better job there. The third thing is: are machine learning results explainable to the consumers? I have a slide where we can discuss that. And the other thing is to understand where machine learning adds value. In some cases, machine learning helps you understand what happened in the past, but in other cases, machine learning helps you predict the future. So figuring out where exactly your machine learning solution sits is very important. And then there are some cases where machine learning helps you make humans more efficient.
For example, if there is a security agency looking at hundreds of homes where burglary could possibly be happening, you could use machine learning to pick out the videos where burglary is actually happening or someone is entering, and then the human can make a decision, right? So understanding where ML plays a role is very important. Also, these are the things that you need to think about, starting with accuracy. I think someone asked a question about this, right? Suppose you show someone a wrong advertisement; as long as it is not offensive, it is low cost. So you can afford to be not that highly accurate. For all these new experiences around face recognition, et cetera, Google and Facebook, when they introduced them, could afford to be not highly accurate. But imagine doing something similar in, let's say, medicine. You cannot afford to be inaccurate. If you're doing a self-driving car, high accuracy is a must. The same thing, put a different way: figure out what the cost of making mistakes is. The cost of making a mistake in a self-driving car means human lives are at stake. The cost of making a bad mistake in medicine means you're freaking out someone whom you should not freak out. So that's very critical. The other thing to note is the latency of making decisions. Let's say in the case of fraud detection, sometimes if you make a decision in two seconds, you're already too late. So how do you make decisions in the five to ten milliseconds that you have before you stop someone committing fraud? Thinking about latency then gets to questions about where exactly you deploy models: do you deploy models at the source, or do you deploy models in the back end, et cetera. That is something you will have to think about. The other aspect is the cost of collecting training data. Training data is very expensive. You need to determine how much you are ready to invest to actually collect that data.
We'll go over a section where there are different techniques of collecting data, and they of course cost differently. But cost is not just dollars and human hours. Cost also could be in terms of policy, how you communicate it to government agencies, your brand value, and stuff like that. So you have to think about what the cost of doing that collection is. Yeah. Before we move on, where do we ask the question: is this ethical? Not just can we, but should we? So we can discuss it right now; he has the same question. It is an important determination that every company will have to make. Let me put it this way. Last year I must have spent 25 to 30% of my time discussing with our legal staff, policy people, and communications people, because we just have to make sure that we are doing something which is not just legal but something that you can defend to your customers, right? At Capital One there used to be this term, the court of public opinion. Companies need to be in a position to defend what they do in the court of public opinion as well. So I'm pretty sure a lot of these big companies do spend a lot of time thinking about what the policy should be. I cannot get too specifically into examples that I have experienced, but I think it is very critically important, because it is about all of our lives, not just our data right now; in the future our lives, let's say, are going to be in the hands of machines, right? So determining that policy is critical. As they say in self-driving, one of the classic examples that people talk about is: if there is a car with one person inside that is about to hit four people on the street, what is the policy? Does the software company make a policy that says, hey, reduce the number of deaths?
In effect killing the rider or the passenger? I do not know, but these are important policy questions that companies will have to think about. Mm-hmm. Yeah, agreed, for sure, correct. So, should you be solving this problem? It's an important question to ask. Yeah, good point. I will make sure I edit my slides accordingly. Moving on to the next topic, which is gathering data, and in particular, labeled data. Everyone says data is the new oil, so you should handle it with care. The question is how you do that, right? Firstly, again, what is valuable is labeled data: data where you know your x's and your y's, so that you can then use it for training. And the big question is how you collect this data cheaply, legally, and in a way you can defend, right? There are multiple techniques; I am going to list out a few. The first is to leverage your products themselves. Your products will generate a lot of data. How do you leverage it correctly? You can instrument your own products for immediate feedback, which means, let's say, ad clicks, whether someone is clicking on your app and actually installing something, or finishing the tutorial, et cetera. There are many places where you get immediate feedback. So again, PM 101: you should be instrumenting your entire flow. There are instances where you do not get feedback immediately; credit card default is an example. In such cases, you have to figure out what proxies you will use so that you get that feedback sooner. Then there are other examples where it is going to be impossible to ever figure out whether, let's say, fraud happened or not. So there will be instances where you will never get your true labeled training data.
At that time, you could use humans in the loop to actually manually look at the data and say, yes, this looks like a fraudulent case. And then there is a fourth example, where some companies have done it really effectively: they use a set of labeled data to build models to show an experience, and then they rely on consumer feedback to build better experiences, right? The last one, I mean, Google search, for example: the more you search, the more they give you better results, and the more you click on the better results, the better and better they keep getting. Step two: account for bias. So absolutely, I'm not covering too much of the engineering aspects in terms of data cleanup, removal of bias, et cetera, or ensuring that you have enough samples, but that certainly is an important aspect of this data preparation phase, correct? Biasing the results of the machine learning model? Correct, yes. Correct, agreed. Sorry, I think I misunderstood you. If you're talking about that same bias, you do need to watch out for self-fulfilling cycles in some form. Machine learning practice, though, where you always keep validation data separate from your training data, is a way for you to avoid that problem. There is also this concept of generalization, or rather overfitting, where if you build your model repeatedly over the same set of data, it becomes perfectly tuned for that set of data, but if you add new data into the mix, your model will fail to work. There is a lot of literature, and there are safeguards against falling into that trap. One example is a technique called dropout, where when you train your model, for, let's say, one iteration, you just randomly drop out, say, 50% of the network's units. And in any case, most machine learning people will always keep 20% of the data out; once your model is completely trained, you then validate it on that set of data.
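The value of that held-out 20% can be sketched with a deliberately pathological model: one that just memorizes its training set. The nearest-neighbour "model" and the pure-noise labels are my own illustrative setup; training accuracy looks perfect, while the held-out set reveals the model learned nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = rng.integers(0, 2, size=200)   # labels are pure noise, on purpose

# Standard practice: hold out ~20% of the data and never train on it.
train, val = np.arange(160), np.arange(160, 200)

def nearest_neighbour_predict(query):
    """A model with no safeguards: it simply memorises the training set."""
    d = np.linalg.norm(X[train] - query, axis=1)
    return y[train][d.argmin()]

train_acc = np.mean([nearest_neighbour_predict(X[i]) == y[i] for i in train])
val_acc = np.mean([nearest_neighbour_predict(X[i]) == y[i] for i in val])
```

On the training set every point is its own nearest neighbour, so the accuracy is a perfect 1.0; on the held-out set the accuracy collapses to roughly coin-flip level, which is exactly the overfitting signal the 20% holdout exists to catch.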
Awesome. So the initial example was how you capture labeled data, or feedback data, from your own products. But there are other interesting ways to collect labeled data. The second technique is to get your consumers to actually label data for you. A couple of examples come to mind. The first is, again, image detection. Each of these products — Google Photos, Facebook, Flickr, et cetera — has over the years asked you to tag yourself, to tag your wife and your kids and so on, and also to type in the name. Today, they are leveraging that data to power search for you: when you just type in, let's say, your name, or your wife's or your kid's name, you will see a lot of photos belonging to that particular person. So they made us do the work of labeling the data that they use for training, and then gave it back to us. Another example is location. Foursquare back in the day, or even Google Maps or Waze — they actively let consumers type in where they are with check-ins, et cetera. So you have data on where someone checked in, you have a lat-long, and you just match the two against each other. With many users doing it, you can then create a massive corpus of labeled data. That's a smart way of collecting data. Google Maps, for example: when I say that I'm going from point A to point B and I actually have my app open all the way, then they know the destination I planned to go to, and they know the last GPS point around it. So that is basically me labeling the data and telling them that, hey, that particular lat-long is 715 Jackson Street, where I came in today. So that's an example of getting your consumers to work for you.
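The check-in idea boils down to a simple join: pair each check-in (a place name typed by a user) with the GPS fix closest in time, and ordinary product usage becomes a corpus of (place → lat/lng) labels. A minimal sketch, with purely illustrative data:

```python
# Sketch: turning check-ins plus GPS fixes into labeled location data.
# Timestamps and coordinates below are made up for illustration.

def nearest_fix(checkin_time, gps_fixes):
    """gps_fixes: list of (timestamp, lat, lng); pick the one closest in time."""
    return min(gps_fixes, key=lambda f: abs(f[0] - checkin_time))


def label_checkins(checkins, gps_fixes):
    """checkins: list of (timestamp, place_name) -> (place, (lat, lng)) labels."""
    return [(place, nearest_fix(t, gps_fixes)[1:]) for t, place in checkins]


fixes = [(0, 37.79, -122.40), (60, 37.80, -122.41)]
labels = label_checkins([(55, "715 Jackson St")], fixes)
```

At scale, a real pipeline would also filter out noisy fixes and require many users to agree before trusting a label.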
I have heard about this but haven't seen any written evidence: companies like Netflix and YouTube have a corpus of video and audio data with translations. What they potentially do, from what I've heard, is speech- or audio-to-text translation — and not just in that language but in multiple languages, because they already have a corpus of training data. Before I move forward, I want to ask a question. On this slide, I have one example of an incredibly smart idea to generate labeled data. Before I press the next button, I just want to ask whether someone can guess what it is. Perfect, yeah. Who was it? Chris, awesome. So how many of you know what CAPTCHA is? Okay, almost everyone, sounds good. So basically this is CAPTCHA — in this case, reCAPTCHA, which is owned by Google. The smartest thing about reCAPTCHA is that they are generating this labeled data not only from their own consumers, but from consumers on a different website altogether. So how does this work? Suppose you go to some random website — say you want to buy tickets for the next Giants game — and it's the first time you're visiting that website. They will check whether you are human or not. They will show you something like this or something like that, and ask you to type in what the letters are. They will also probably ask you to identify whether this is a tree or a road sign, and so on. Now, they have done some partitioning of that data. For example — I'm just using an example here — maybe in this case they know the word is actually "Harold." So they let humans type in "Harold." But the next word, they probably do not know with high confidence what it is. If I start typing in A, R, A, N, R, I, B, then I, as a consumer on a different website, am actually labeling data for reCAPTCHA.
So that, in my mind, is the smartest idea ever for getting labeled data. I typically always work in incognito windows, so whenever I go to any website, particularly e-commerce, something like this shows up, and I'm always helping label — giving labeled data back to Google. Yes. And they apply the same thing for other languages. Yes, yeah. Any other questions, any other comments? This is a fascinating product example for me, so if anyone wants to add more insights. Is the original purpose of CAPTCHA still there, or has someone — I do not know the original purpose for certain. I presume it is to detect whether you are a human or not, but then they leveraged it in such a beautiful way. One possibility is that the first word checks whether you are a human, and the second word actually lets you, as a human, label that particular image. That is fascinating. That is correct. So yes, fair point. I think that's why they moved to more complex CAPTCHAs with images and so on, but you are right. Cool. And now, if nothing else works, there have been examples in product history where companies put lots and lots of humans in the loop to actually label data. Some examples come to mind. One is Google Books — this is not about labeled data, just about capturing data. Marissa Mayer and Larry Page, back in the early days of Google, had this idea of digitizing the entire library of books and making it available online. So they went back to simple basics and said, hey, if we just want to digitize one single book, let's figure out how much time it takes. I think one of them was actually flipping the pages while the other one was taking pictures.
And they did a simple calculation and said, hey, if we automate this process and have X number of human beings plus X number of machines doing this, we can digitize the entire set of the world's books in some X number of years. Of course, only massive companies can do that kind of data capture. But maybe in your localized context, you could capture some data which is very specific to you, which has some privacy barriers, et cetera. So just keep that in mind too: if you cannot think of any smart idea, simply putting humans in the loop can help you generate a lot of data. Another example is Google Maps. You can go online and read about their operations: they employ many, many human beings to look at images and start labeling things. You can then use those labels to train your machine learning algorithms to do that job, and then use the humans just for validation. So you can play around with it, but having humans in the loop can kickstart a lot of your labeled data processes. And this one is half a joke. I know of a professor who, asked what his cheapest labeling resource was, said: college grads. He said, whenever I want to validate something and need labeled data, I have many grad students I can throw at a project, and they go and get me a lot of data while finishing up their project. So yes, you can figure out ways to get other people to do it for free or for cheap. That brings us to the question of, again, whose data is it anyway? There are certainly other policy questions around how machine learning models should be used, but one thing I want to cover is data ownership. We are at an interesting time, and there are important questions to be asked. Gmail: certainly Gmail's machines read my email, but do Gmail's employees read my email?
If machines read my email, what rights do machines have, and how pervasive can they be? These are important questions to be answered. Alexa: Amazon claims that Alexa is only listening for the word "Alexa." But where does that stop? Is there a mic in my home which is constantly listening to me? I do not know. DNA sequencing, and everything around medical data: who owns that data? If there is a big company processing your DNA sequences, do they own the data? Do you own the data? Very important questions to be answered. Face recognition: Facebook probably knows exactly how each one of us looks and can probably make amazing predictions there. But suppose I commit a crime tomorrow and Facebook is able to capture some video feed — do they have the right to say that it was me who committed that crime? These are, again, important policy-related questions that we need to answer. And last but not least is permission. Certainly a lot of these companies throw a lot of permissions into their terms of service, but it's not just about getting the legal permission. It is also about ensuring that consumers actually understand the permissions they are signing up for. So thinking about all of these legal and policy questions is critical, and the PM is going to be the person driving these discussions. Once it's out there in the wild, is there a way to develop a response for it? Is there a way to protect privacy if you are collecting all this image data? Even if someone is doing the wrong thing, is there no way to put things back in the box? So, I mean, yes and no. Europe is taking a very strong stance — drawing a very strict line about who owns the data, who can make decisions with it, where you can employ it, and so on.
And certainly big companies are clearly held accountable for the policies that they're building. But yes, data is going to be pervasive, and a lot of small companies will own data too. I do not have a clear answer to how things are going to evolve. Yeah, correct. So yes, data retention is one thing, but what about trained models? If you've trained a model based on the last 10 years of data, do you need to change your models if that data goes away? Very, very important questions, and certainly there are many PMs, along with legal and policy people of course, resolving this as we speak. No — governments and regulators will have to figure out where to draw that line. But so far I haven't heard anything saying that, let's say, you need to retrain your models because of old data. Yeah, so basically there are plenty of questions that you as a PM will have to really come to a resolution on with the legal and policy folks, engineers, et cetera. An open question here: one important thing people write about is that "big tech" is now a term. Many politicians are afraid of big tech because they own so much data. So the question is, will data ownership always be the purview of these massive companies, or how is it going to shape up for smaller companies? I just wanted to take a pause here and throw the question back at you: if some of you are startup founders or working at smaller companies, how do you manage data? If anyone wants to share, that would be great. Sources of data that are accessible to companies, versus proprietary data sets, are how companies are going to get competitive advantages. There's not really any advantage for them in opening up — any more than the proliferation of social networks five years ago had an advantage in interoperating with each other.
So, examples which I have read about are in legal, right? If there is a company doing operations on figuring out what legal documents mean, and coming up with strategies to, let's say, fight a legal case, all of that data is going to stay within that company. That is an example where a big company might never have access to the data. I mean, I don't have an answer to that question, but it is an important thing that many companies are thinking about. There have been instances, of course, where large companies have pulled their entire business out of China in particular for the same reason, right? There are a lot of articles around not just individual privacy and individual content, but whether that data should sit in China, whether the data can actually leave Chinese borders — or Russian borders — and sit, let's say, here in the US. If the data is here in the US, but belongs to Chinese or Russian citizens, who has the right to subpoena it? Is it only the US government, or — these are all open questions that people are figuring out, yep. You have a question. What responses are there that companies can take, beyond leaving the country, to protect themselves — so they do not have to be incorporated into something that may not be ethical within the confines of their own country, but is a concern for where they want to go directionally in China? Yeah, so different companies have taken different stances. Google certainly — there has been an instance where they pulled out. There are other companies who complied, who put their data centers in China, and whenever the Chinese government orders any data — I don't even know whether "subpoena" is the right word there — they just provide it. So different companies have taken different approaches here, and it remains an important question.
Sorry, do you mind speaking up? Got it. So can you speak to precise examples of what sort of data they had and how they generated it? So, along the same lines: it doesn't matter if we don't have all the data, if we have commodified tools that plug into the systems we already use. Yeah — the other way to think about it is, machine learning certainly has better output with a lot of data, but even with a small set of data you can get something. You can at least validate some of your hypotheses, and that can be a starting point, right? As long as you are validating and moving forward as a company, then getting more data to reach the next level of accuracy can maybe wait. Which acquisition? That one. Okay, yes. Is there a symbiotic relationship — a benefit, perhaps an opportunity, in packaging and selling a larger company's data and opening it up? A lot of times these larger companies don't have the nimbleness to manipulate that data and utilize it, whereas a small company could, so it looks like there's opportunity between the marketplace and the data. Yeah, makes sense. Sounds good — great discussion. I wanted to learn as well, and so I stopped there. We can probably move faster, but I also wanted to add a few industries to generate insights. If any of you works in any of these industries, give us some insights: insurance, medical, driving safety, smart homes, agriculture — very critical industries where machine learning is about to revolutionize a lot of the experience as well as the business. I'm not a product manager on this, so I don't know that answer — and even if I knew, I could not respond. But I think there is a blog on how the user score is created; you should check it out.
You can also check out later whether they have any future plans to improve it or not. As far as my understanding goes — you're talking about Uber, right? Oh, sorry, which user score were you talking about? Yeah, the user score. Yes, okay. But I don't work at Uber. If you're talking about Uber's user score, it is predominantly driven by the drivers, because it's a marketplace: riders rate drivers, drivers rate riders. So I do not know whether they have any plans to change it. To add an industry to that: I'm in American publishing, and we do a lot of direct sales at our company — I think something like 40% of our books. If we even had insight into how people are reading them, that could completely revolutionize our internal process. We also sell audiobooks, but we don't have any data or insights available there either. Good stuff. Sorry, I cannot hear you. On smart homes, which we mentioned: there's a lot going on with listening to electricity, and which things are powered on and using energy. Interesting. Also, I just heard they're inferring things like what's always left open, and so on. Nice. Yep, absolutely. Yep. Back to the original question about owning data: I'm wondering if there is any push from the consumer side — consumers asserting it's their data — since that could help them a lot. So, I don't remember any consumer movements saying that, hey, as a consumer, I own my data. But there are certainly a lot of regulators, and a lot of industry thinkers, who are pushing for that — pushing on the question of who actually owns the data anyway, right?
And again, Germany and Europe are a little ahead of the game in terms of who actually owns consumer data. Sounds good. So moving on to the next — yeah. On the medical side: a huge share of the American population has insurance, and obviously because of compliance you sign paperwork, but you can't access that data. Then you also have tons of records — for example vaccination records, immunizations — all over the place. There are private HIEs, health information exchanges, being established now, so you can actually have clinical data coming in. Nice, okay. So there's a ton of data outside the private data that exists within companies — data you can get from insurance companies, or from the wearables that people are using. Interesting. So basically business development can get you that data, which means it can get any small company that data as well. Moving over to the next stages — I've combined three stages here, but it is important for the PM to understand what your engineers do. The first stage is preparation of data. That includes things like removing bias and any feature engineering that is required, and it takes your engineers a lot of time. Once you do that, there is certainly a step around figuring out what machine learning approach to use; and even once you've chosen one, training the model and figuring out what the results mean is a critical step. And the third step after that is the architecture and engineering around how to scale it up, right? There are questions about scaling to, say, hundreds of millions of users, and also making it run in real time. What you can run in a Python notebook at high accuracy might not actually scale. So there is a lot of work that your engineering team has to do in order to make the product a reality.
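The three stages above can be sketched as three functions. Everything here is illustrative — a toy threshold "model" standing in for a real one — but the shape is the point: preparation, training, and a serving path that must mirror the offline logic when it is scaled up.

```python
# Hedged sketch of the three stages; all names and data are illustrative.

def prepare(raw_rows):
    """Stage 1: clean the data and engineer features (often ~70% of the work)."""
    return [{"x": r["value"] / 100.0, "y": r["label"]}
            for r in raw_rows if r.get("value") is not None]


def train(rows):
    """Stage 2: fit a trivial threshold 'model' on the prepared data."""
    positives = [r["x"] for r in rows if r["y"] == 1]
    return {"threshold": min(positives)} if positives else {"threshold": 1.0}


def serve(model, x):
    """Stage 3: the scaled, real-time path must reproduce the offline logic."""
    return int(x >= model["threshold"])


raw = [{"value": 80, "label": 1}, {"value": 20, "label": 0},
       {"value": None, "label": 1}]  # the None row is dropped in prep
model = train(prepare(raw))
```

In practice stage 3 is where the notebook version gets rewritten for latency and scale, which is why the offline and serving feature code drifting apart is a classic failure mode.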
What do you as a product manager need to do to support them? First: at least know what your team is talking about. Second: empathize with the fact that they spend a lot of time — almost 70% of it, I think — in that first step, the data preparation and feature engineering step. Please recognize that fact. Third: certainly ask questions, but ask them in a respectful way, because machine learning experts generally know what they're doing. You should still feel free to ask questions just to make sure the work matches the product vision you're trying to build. The fourth thing: always encourage your team to start small and iterate along the way. If you do not really need 10 million data samples, or do not need to build your product for 10 million users yet, it's okay — build it on a small set, see whether it works, deploy it in real life, get feedback, and iterate on the process. Do not boil the ocean. And the last thing is to know when to stop. I have had experiences where we were trying to use machine learning to solve some problem, but it just wasn't working, because the data was sparse, or the labeled events were so rare that it never led to high precision, no matter what we did. Understanding that, and then stopping the team from burning further cycles, is going to be your job, because you have to be the person who brings the bad news. Now to the next step. Once all your decisions are made, how do you build a user experience around it? Let's take a few examples of what typically can happen. Let me first talk about what precision is and what recall is. Precision, by definition, is: let's say you claim that these 10 data samples are blue, for example — let me make it simple —
but if, out of those 10, only eight are actually blue, then your precision is 80%. Recall is how many of the true samples you are actually covering: if your data has, let's say, 20 blues and you are able to detect only 12, then your recall is 60%. With that in mind, when you start building any machine learning product or deploying any machine learning algorithm, you hope that your curve looks like this, where you get high precision as well as high recall. But frequently, at least in the first couple of tries, that is not what will happen. You will either have something like this, where you get high precision but very low recall, or you are able to filter out the right set of samples but your precision is very low and there are a lot of false positives in the mix. So the question is, what do you do? You have to figure out whether you can build different consumer experiences to tackle the problem at either end. Let me give you some examples. If your precision is high — meaning that whenever you say something is true, it is true with high probability — say with something like marketing, then you can start small. You know that your precision is high, so you go down that road and build products around it, knowing full well that you are probably not reaching all of the people you need to reach. On the other hand, you could go the other way around: you could collect and filter a lot of data, knowing that most of the people you're interested in are in that set, and then deploy humans to actually pick out the people or the events you are interested in. Does this make sense? It makes sense? Okay. Let me give you some more examples.
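The two definitions fit in a few lines. This sketch reproduces exactly the numbers from the talk: flag 10 items with 8 truly blue for 80% precision, and find 12 of 20 blues for 60% recall (the sets themselves are made up).

```python
def precision_recall(flagged, actual):
    """flagged: items the model calls positive; actual: the true positives.
    Precision: of what we flagged, how much was right.
    Recall: of what was truly positive, how much we found."""
    tp = len(flagged & actual)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


# Precision example from the talk: flag 10, 8 truly blue -> 80%.
p1, _ = precision_recall(set(range(10)), set(range(8)) | set(range(100, 112)))

# Recall example from the talk: 20 blues exist, we find 12 -> 60%.
_, r2 = precision_recall(set(range(12)), set(range(20)))
```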
So let's take stop signs as an example. If I want to find all the stop signs in, say, these 10 segments of street, I could do two things. I could have a model that is high recall but low precision, which would give me a lot of pictures of anything that is remotely red in color, even if it is not a stop sign. What I can do then is take maybe 1,000 of those images and send them to operators, and the operators look at those 1,000 images and say, yes, these 300 are actually stop signs. What that does is filter the millions of images down to 1,000 which contain, say, 90% of your stop signs. You just have to put humans in the mix, because not all of the 1,000 are actually stop signs. That is one example. The other example, on the high-precision side: say there are 600 customers you need to reach out to, but you don't want to be calling random customers who are not in a position to be helped. Suppose you are able to filter out, say, 100 examples, 80 of which are actually customers in need. You say, okay, I cannot reach all 600 people, but of these 100, I'm going to call every one. Twenty people will say, no, no, you got the wrong person, but 80 of them are actually people you need to reach, right? So it's starting small with high precision, and the data you get from that can help you improve your model further. That's basically it. Let me now try to rush, because it's already eight. There are other things that you need to think about.
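The two strategies above are really the same model run at different operating points on its precision-recall curve. A minimal sketch, assuming the model emits a confidence score per item: sweep the decision threshold from strict to loose, and at each point report what precision and recall you would get — a strict threshold gives the "call only the 100 you trust" experience, a loose one gives the "send 1,000 candidates to operators" experience.

```python
def operating_points(scores, labels):
    """Sweep the decision threshold from strictest to loosest; at each point
    report (threshold, precision, recall) for the policy 'flag everything
    scoring at or above this threshold'. labels are 1/0 ground truth."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp, points = 0, []
    for n, i in enumerate(order, start=1):
        tp += labels[i]  # n items flagged so far, tp of them correct
        points.append((scores[i], tp / n, tp / total_pos))
    return points


# Toy scores/labels for illustration only.
points = operating_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
```

Reading the output: early rows have high precision and low recall (the marketing-style play), later rows have full recall but lower precision (the humans-in-the-loop play).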
Once you build a model, the first thing you need to ask is: this is what my model does, but how impactful is it in the larger scheme of things, in terms of the larger consumer experience or the business? There are instances where what you build, or the experience you improve, actually cannibalizes some other metric. Let me give an example from Groupon. There was one instance where we improved the CTR on a particular type of deal, but what happened is people were just opening those deals and not making a purchase. So the CTR went up, but the conversion rate fell. Net net, it really didn't make any impact, right? So understanding what exactly is happening is very important. And the third thing is that very frequently you will be in a position where you have to make trade-offs. Sometimes a particular feature gets you more growth and high engagement but low revenue; some other feature gets you high engagement and high revenue but low growth. You need to decide, as a company, what your top metric is, and then prioritize accordingly. Naturally, different companies at different life stages have different top priorities. No, not necessarily — it is something you need to decide as a product team well in advance, but this is the point where that decision comes into play. So PM involvement is sort of at either end of this line, and there's this large area in the middle around capturing and labeling data, prepping machine learning models, et cetera, which is very engineering-heavy — I would imagine it takes up a very large part of the actual development cycle as well. So, yes and no. Sometimes data collection can take a long time, because if data collection happens out in the real world, it takes a long time, right? So in terms of calendar days, this particular phase can take a long, long time.
In terms of actual human hours spent coding, certainly what the engineers do consumes a lot of time. What do you find yourself, as a product manager, doing on a day-to-day basis? So for most PMs, there are always multiple projects in the mix. Also, that step is iterative, right? Even if you have a long-term vision, at any given time you can say, hey, let me start by collecting a small set of data. So this entire phase need not be a six-month thing — it can be smaller; it can be this entire sequence of events done, say, six times in six months. So you typically don't find yourself out of work, because there are multiple projects, and multiple passes through this process. You had a question? It looks like this means you're getting smaller and smaller intervals, but it still appears to be a waterfall process. Are there ways you can act in an agile fashion when you're dealing with machine learning as a necessary component of your project? Good question. Certainly, the way I have presented it, it looks like a waterfall. But there is a lot of iteration that happens all throughout, right? Again, how fast you iterate depends on how small you make the steps and how frequently you deploy them. It also depends on which company you work for — at large companies, actually figuring this out, and working through things like legal and policy issues, can take a long time, right? And if you do not have a clear answer there, then there is no point in building, or spending your engineering team's resources. So it is a mix; you need to play it by ear. Then, I think this is one of the last slides: if you make a decision, you also need to figure out a way to inform the consumer of why that decision was made. Why wasn't my credit card approved? Why can I not log in — because maybe you detected me as spam or a bot, but I'm not?
Or, in some critical cases, if my earnings are blocked, that is something you really have to explain to the consumer. So bear that in mind always. If your machine learning model or product is in the experimentation phase, explain that clearly to the consumer — call it out as a beta, right? None of us complained when, say, Facebook photos or Google Photos got our initial tagging wrong, because we all kind of knew it was experimental stuff. Alexa does a lot of funny things sometimes, and none of us complain, right? Having said that, the first time a self-driving car is actually responsible for an accident, people are going to freak out. So it's a question of consumer expectations: sometimes you have to set them, and sometimes you're not in a position to set them correctly. And then the last point: if you're not at a point where you can explain to the consumer what is going on, or how you arrived at a decision, and the decision is critical, maybe it is time not to launch the product. You should always keep that in mind. Moving on to the last section: interview tips. If any one of you is looking to interview as a PM on machine learning products, certainly everything around PM 101 is something you have to do. Besides that, make sure you know your basic statistics. Sometimes basic stats questions are asked and people fumble — not because they do not know, but because it has been 10 years since they last did stats. Third, get your hands dirty. There are plenty of courses available on Coursera, Udacity, et cetera, where you can actually build your DNNs, train on data, and see how it looks. These platforms have created a lot of sandboxes where you can play, and anybody who hasn't had much experience can go and play too. So do it — you should do it as a PM.
And last but not least, hopefully the framework I have put together — which you can of course modify yourself — will help you think through any problem case study or product design case study asked of you in an interview. So that's all. Thanks a lot; I had a lot of fun being here. Yep. Do you think it's overly ambitious or overconfident for a product manager to just start getting their hands dirty? In a hiring situation, a data engineer who's trying to get into product is obviously going to have technical acumen that blows the other person out of the water, so it's going to take a while before you can be considered equitably for that role. For the product role — I would say, based on how we interview, and based on what I've heard, about 20% of it is technical, right? Most of it is still regular PM 101, is what I would say. But yes, people do get technical; people do ask questions about stats. And if you have built and done things yourself, it just gives you an edge in an interview. How does something like a generative adversarial network fit into that model you showed us? Is it filtering data, or generating data, or is it more about selecting data? So, I have never been in a situation where we used that, but they are going to be critical for, say, self-driving. Most of the data generated for those purposes is actually produced in-house, on one of these test tracks, so people do not typically shop around for that data — or, I do not know the answer to the question, but I know for sure that companies must be training this way. You should read the Atlantic article about how Waymo does its stuff — it's a public article. They explain how exactly they create those labeled-data-like events.
So it is, again, as I said, more or less the same, right? Figuring out what problem to solve, whether it is important or not, whether you can measure it — all of that is regular PM 101. For machine-learned products in particular, the things I spoke to are what you also need to keep in mind, because how you solve the problem becomes important in this case, right? With building a website, the "how" part isn't that critical today, because many people have done it in the past. That is basically the difference. Someone asked about bias earlier, and I think you answered it from a statistical-bias perspective. But especially because a lot of loan data is used — how do we, as product managers, make sure we're not just out there perpetuating racist data, and really scrutinize the data behind what we build? Yeah, yeah, correct. So this is certainly a very critical problem that Facebook, for example, is looking to solve, right? If you look at the material that Mark Zuckerberg and Chris Cox and their leaders have put together, it looks like it is one of their top problems for this year. Because yes, we can certainly get into a scenario where a racist comment gets likes and positive reinforcement when it should not. So how do you figure that out? It's a critical question. One way to do it is to employ humans in the process, where at least some part of the decision-making is reviewed by humans. And if the human moderators at that point say, hey, something is wrong here, then that gets corrected back into your training models. So I cannot think of any other way of doing it besides employing humans and moderators, at least at the moment.