Good evening. Thank you everyone. We're delighted to be here tonight, Wednesday evening in Bangalore. This is actually being broadcast live, so a lot of people are also joining us from many places around the world. Lucky for you, I will be moderating tonight's discussion, but my colleagues will be doing most of the speaking. So I'm just going to kick us off. Big data. We hear this term all the time. You hear phrases like data is the new oil, where we need more data. And so tonight we're just going to take a pause and ask why? Why are we collecting all of this data? What are the implications of that, and how can we do it better? So to start off, I'm going to actually give you a non-data example. Plastic. Plastic has been around for years and years. It is undoubtedly extremely valuable. We use it all the time. But today we're having a cultural moment where society is asking what is happening to all of that plastic? Its value has expired and it's causing pollution. It's causing damage, and we are now seeing the risks materialize. And so we want to ask the same questions about data. It's not just about collection, collection, collection. Similar to plastic, it's sitting somewhere, and collected data is going to be data at risk at some point in time. So like I promised, I'm going to hand it over to my colleague, Dr. Rebecca Weiss. She is Mozilla's Director of Data Science and she's come here from Mountain View, California. Please join me in welcoming Rebecca. Can everyone hear me? Yes? Turn it down. So I'm going to talk from the perspective of somebody who is interested in data science, as someone who builds data science teams, and as someone who is trying to help Mozilla see the value in data science. And if you are a person who is interested in data science or has gone to a data science talk or if you've been working with data scientists, you've probably seen this triangle before. It's very famous.
Monica Rogati, who was a data scientist at LinkedIn, is the one who coined this. And it's supposed to be treated as a roadmap, sort of an illustration for people who are interested in building a data science function in their company. And the idea is that almost everyone thinks about the top where you have the AI and ML products, but they don't think about all the steps that they need to take to get there. And the idea is that most companies are not thinking about how they have to invest in these areas, and they have to hire the specialists that know how to do these various tasks in order to get to the top of that pyramid. The problem is that most data scientists don't really think about privacy or transparency until much, much, much later. And the path is really about getting to that AI as fast as possible. And you don't really typically think about where user privacy fits into this model. And that means that usually people think of data scientists as people who are interested in getting the data as fast as they possibly can and then figuring out what to do with it, and then privacy comes later. So I would like to start this talk with the notion that if you start with a privacy-first culture in your company, you will end up with better quality data in the end. And the idea is that if you're thinking about data products as your company strategy, in the long run it's the quality of your data that will be the competitive advantage, because of where we think the market is going. And so I'm going to elaborate on this in three points in this talk, and then I will give you examples of how we've been doing this at Firefox. So my first claim: if you focus on better data, meaning lean data practices, you will spend less time arguing about bad decisions. And I think that the real point about this is that most data science is really about enabling decision makers in your company. Most companies want to become data-driven decision makers.
And as they focus on this kind of endeavor, that means that they really can't afford to have bad data, because bad data leads to a lot of meetings where you argue about who's right. And I kind of am corny, so I like to put cartoons into my talks. So everyone who works in an office can probably relate to Dilbert. So the main thing that I'd like to point out about this is that nobody really thinks about data quality until they see what it looks like at the end, when there's something obviously wrong about it. And the thing to take away from this is that if you are at the point where a decision maker is saying, there's something wrong with this data, it's probably a little too late. And now you're going to have to spend time as a data science unit trying to figure out where in the data you went wrong. And that means that you're not thinking about how you're going to be building these products, you're not thinking about that path to AI. Instead, you're thinking about diagnostics and trying to figure out what's wrong in the pipeline. And this isn't just a joke, even though I love cartoons. There's also a lot of evidence from a lot of business journals that this is a real problem in a lot of different companies, and that this is something that we're not really seeing a lot of progress in. And a lot of this has something to do with the way in which you manage data and the practices that you employ. So here's some stats to consider. 26% of companies globally feel that their data wasn't accurate in some way. And only 35% of them have a centralized data structure, which means that you're seeing a lot of these hybrid solutions where people are just putting all their data into data lakes. And when people ask why this is a problem and how we got to this place, there are two areas that seem to stand out a lot.
One of them is that everybody wants to become more data driven. It seems that the only way you can become competitive now as a company is if you have some kind of data strategy, which means that there's more effort and more focus on the data that you collect and the way you use it to make decisions. Which means, of course, that more people are going to find things that are wrong with it. The other problem is that the way in which we manage data is not really keeping up with demand. Again, 35% have the centralized structure. So that means that as people are racing to create more data to answer more questions, this introduces the possibility of human error. So labeling, processing, ingesting, collecting data. And human error is considered one of the leading causes of data inconsistencies. And this is also where you see things like breaches, because it's usually just a source of human error. And so I can't really read this quote because it's very long and it's very small on my screen. The main thing to take away is that this particular quote refers to how 50% of people's time seems to end up getting spent on trying to correct misconceptions about the data. When someone sees something that's wrong at the executive level, then they start asking all of the other employees to figure out what's going wrong, which means that you have this kind of cascading effect in your company about where people are spending their time. And this means that you're not spending time thinking about progress. You're thinking about how to regain trust in the data for your decision makers. And that means that people are no longer trusting the decisions that they're making, because now they're starting to question the data those decisions are based on. So that's the first point: if you start focusing on better quality data, ideally you'll spend less time in meetings.
Another point that I wanted to make, which is kind of near and dear to my heart as a manager of the data science team, is that if you have better quality data, you're going to get improved productivity, particularly from your data scientists, because you really need to think about what exactly data scientists are spending all their time doing. I think a lot of people think that data scientists spend their time fitting models and making AI products, but that might not actually be entirely true. So the first thing I want to talk about is that it's really expensive to build a data science function in your company, for a variety of reasons. First of all, they're expensive just in terms of salary. Data scientists tend to have a lot of education. They tend to spend a lot of time in very rarefied skill areas, but they also take about 11% more time to hire, according to a lot of these industry surveys. And additionally, they churn faster, meaning that they're two and a half times more likely to leave their role and go to another company. That means it's really, really expensive for every company to build a data science function. You really have to invest in building this kind of team, which means that you should be thinking about how do I make sure that I'm getting as much value out of each individual data scientist as I can, which means you really should look closely at what they're spending their time doing, because a lot of data scientists are actually spending their time on a couple of different areas. They're not really just spending it on machine learning. This is a capability analysis chart from PricewaterhouseCoopers, and it's meant as guidance: if you're a company that's looking to become data driven, what do you need to invest in? The columns run from decision makers on the left to data scientists on the right, and each row represents a critical skill area.
So you might see domain knowledge, meaning people have an understanding of business. You might see engineering, people who know how to write code. And what I really want to focus on are two areas specifically, because, as you can notice, data scientists are all the way on the right, and data scientists have to be really good at everything. And really I want to focus on how much time data scientists spend dealing with dirty or messy data, because again, if you're thinking about this from the perspective of somebody who has to run a business or think about a data science unit, this is going to cost you a lot of money. It's going to cost you a lot of opportunity. And it's not the AI/ML work that people think of. So this was a survey that asked data scientists, what do they spend most of their time doing? And what you can see is that cleaning and preparing data is really what a lot of data scientists spend their time doing. 60% of data scientists' time is spent on cleaning data. And that's because, again, these data management and data collection processes are sort of opportunistic. They're not really thinking about getting to the end result of the better quality data that you need to build better models. And so what I'd like for people to think about is how, if you think about governance, you can actually get better quality data that will make your data scientists more productive. So when I say governance, what does that mean? Because that means a lot of different things. For me, that means thinking about how you will manage the data you collect and how you are dealing with it inside of the company. And how are you thinking about how people will use the data, specifically groups like data scientists? How will they be using it to answer questions? And this is a thing that data scientists really should be thinking about because it also makes their lives better.
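As a toy illustration (with invented data) of the kind of cleaning work that survey describes: reconciling the same value recorded inconsistently by different pipelines, through a hand-maintained lookup table, which is exactly the sort of human-maintained artifact where errors creep in.

```python
# The same country recorded several different ways by different pipelines;
# before any model fitting, someone has to reconcile values like these.
raw = ["India", "india ", "IN", "IND", "United States", "US", "usa"]

# A hand-built mapping to canonical codes. Tables like this are maintained
# by humans, so they are a common source of the inconsistencies and errors
# described above.
CANONICAL = {
    "india": "IN", "in": "IN", "ind": "IN",
    "united states": "US", "us": "US", "usa": "US",
}

def clean(values):
    # Normalize whitespace and case, then map through the lookup table.
    return [CANONICAL[v.strip().lower()] for v in values]

print(clean(raw))  # ['IN', 'IN', 'IN', 'IN', 'US', 'US', 'US']
```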
Because you should be thinking about another thing when you think about what data scientists spend their time on. This is the same survey, where now the question is, what do data scientists like the least about their job? And it's the same thing that they spend the most time doing, which means that they're going to cut corners. And this is just pure human nature. You do not want to spend time doing the thing you like the least. So remember that first point that I made about how poor quality data leads to people making bad decisions. If you have data scientists contributing to your governance practices, you will end up with better quality data for decisions. And the data scientists should be spending less time trying to deal with messy data that they don't really understand, which means that you will get better quality data products out of them. And that leads to the third point that I'd like to make, which is ultimately that when it comes down to it, pretty much every person that works in stats or CS or machine learning of any kind knows that garbage in, garbage out. If you want to build better models, you need better data. And specifically I'd like to talk about how this is really becoming an issue when you think about fairness, accountability, and transparency in machine learning, where a lot of people point out that you can't really make bad data better just through investing solely in math. You really need to start thinking about under what context the data was collected and what the intent was behind it. So again, I'm corny, so I like cartoons. And this is sort of a classic XKCD joke about how, once you're in the data science landscape, if you're not thinking about the data that goes into the models you build, you're really only going to be stirring the data up and transforming it until the output looks the way you want.
And what a lot of data scientists focus on is just what is coming out of the model: how accurate is it against a particular standard that they're interested in improving against. And that means that it's really about, does this output look right? And they're not really thinking about what has gone into the data. And one of the things that I want to highlight is to think a little bit into the future, where everybody is building a data science unit in their company and they're using cloud services to do all of their data collection and ingestion and processing. A lot of these companies like Salesforce or Google, if you look at where they're acquiring companies (think back to that triangle), they're acquiring companies that hit every single one of the levels of that triangle, which enables people to get to the world where they can fit models faster and more easily. What that means is that it's not just going to be easy to fit models, it's also going to be easy to make bad models, which means that again you really need to focus on what kind of data you are putting into this system and how you are collecting that data. Another comment on this is that you don't have to take only my word for it. This is a slide from Mary Meeker's Internet Trends Report. I don't know if you know who Mary Meeker is, but she's big in the venture capital world. Pretty much every year she puts out one of these huge slide decks where she talks about trends in the industry and where a bunch of companies are going. And all of the VCs read this very closely, because this is where they end up deciding to put their money. And even she is starting to highlight that thinking about data governance, thinking about how you collect data, thinking about that kind of tool, that's a competitive advantage.
And she really points out these three areas as places where you're going to see a lot of explosive growth in data. And the bottom two are really related to those types of acquisitions that I was talking about: these companies are looking at ways to manage different data sources together, so if you're thinking about an all-in-one cloud solution, that's going to be less of a problem for you as a company that's looking to start this kind of function. And if you're thinking about optimization, a lot of these companies in the data science and analytics space are trying to make it easier for you to optimize if you have some sort of target that you're looking for. But data collection still remains something that is a bit novel for a lot of companies. In particular, as these companies start to invest more and more in these types of technologies that give you an all-in-one solution, your ability to have a high-functioning AI unit in your company is less going to be where you compete; it's more about what kind of data you have and what kind of problems you're thinking about applying these solutions towards. So that means that the differentiator will be your data management practices. Okay, so I brought up these three points about how privacy-first culture leads to better quality data, meaning you'll spend less time wasted on useless arguments that are really about your data quality in the beginning. You'll get more productivity out of your data science and analytics professionals, and you'll end up seeing better models, better data products that they produce as a result. But I can talk about this not just in three bullet points. I can also talk about this from my experience running the data science team at Mozilla. So how many people know, I know we just talked a little bit about Mozilla, but how many people know how big Mozilla is? We're not, okay, you do. All right, I'll teach you all now.
Okay, so Mozilla is really only about a thousand employees. This is not something that I think a lot of people know, and we are owned by a nonprofit, which means that we are not going to have the kind of millions of dollars that you can see other companies invest into all kinds of ideas. This means we have to be a bit thrifty about the way we think about things. Also, we have a manifesto where we care about transparency and we care about privacy. So when I first started in this space: how do you do this? How do you think about privacy and transparency but also think about being a data scientist, especially since the way in which we thought about data collection was one of wariness? We were not really a fan of it. And now we actually do collect a lot of data, and we're very transparent about this as well. If you are using Firefox, which I hope you all are, if you go into the address bar and you enter about:telemetry, you will see all of the data that we collect and you will see links to all of this public documentation, because almost everything that we do is collected in one of these resources. And I'm going to talk about how we started from this space of not collecting any data at all and got to this world where we have these resources, because it all sort of started from this privacy-first mentality. And I have to highlight that we started from this place of just complete fear, because if you think about it, the browser is something that sees a lot. It's involved in a lot of transactions. Browsing history is very valuable. It's also a thing that everybody uses now. People use browsers every day, and you use it for hours every day. So a lot of your life is online through a browser. And browsers are very privileged software, because they have a lot of access to parts of the machine that a lot of other software doesn't. So how did we go through this process of thinking about what data is safe for us to collect and what is not?
And I think the main story is really that very early on we invested in governance and we invested it specifically around data collection. So in the beginning, like I said, we were just not interested in data collection whatsoever. I can show you all the bug threads where people were starting to say, like, well, we can't really answer this question unless we collect this data and everybody's saying, no, that's a violation of privacy. We didn't have unique client IDs. We didn't have any form of unique identifier. We weren't really collecting anything about performance or stability of the browser. We were just sort of building it and shipping it. And this led to a world where we kind of didn't know anything about how Firefox was performing. We couldn't tell if it was slow because our friends were telling us if it was slow or if it was slow because it actually was slow. We couldn't tell if it was really unstable because it crashed on our machine or if it was really unstable and it was crashing on everybody's machine. And this meant that not only from the perspective of, like, we didn't know if these problems were real, we also didn't know how often it happened at large and we didn't know under what conditions it was happening. So the sad resolution of this story is, and I'm sure many of you can probably tell, our market share started to drop and we didn't know why. So then we decided that if we're going to collect data, let's do it. Let's just collect data, but let's make sure that it's only for the purpose of building a better product. And that means that we need to make sure that we're only collecting data if we believe that it provides this direct user benefit. And I've highlighted this. I've underlined it because this was sort of the test that we decided that any time we tried to collect any data, we were going to check to make sure it provided direct user benefit. But the problem was that what does that actually mean? How do you check for that? 
It's sort of a smell test. And basically it just became a question of, like, does this smell bad to you? Does this smell bad to you? And it ended up becoming sort of a place where there were lots of fights inside of the company. And I bring up the example of performance data because performance data seems really innocuous. Like, let's take a look at rendering time. That seems pretty fine. How long does it take for pixels to paint to the screen? Let's measure it. But the thing is, those measurements don't actually provide direct user benefit on their own. And then you start getting into this world where people are actually going to start asking you, well, what analysis do you intend to do with it? And how do you expect that analysis to provide direct user benefit? And this turns into sort of a value judgment about whether or not this question is good enough for somebody to ask, and whether this data is the right kind of data to collect to answer that type of question. And ultimately what this does is it just creates delays in shipping software, which means that people have an incentive to not do it and not collect data at all, because they don't want to deal with this ambiguous direct user benefit test. And this is not a good practice. And so really I want to go back again to thinking about those three points, this idea of less time wasted, more productivity, getting better models. This is really about cost, and how you really need to try to reduce cost as much as possible. Ambiguity is cost, delays are cost, and bad data is cost. And so for us, as a company that wanted to preserve privacy, we couldn't add overhead cost to product development as well. So we had to find some kind of balance between collecting data for the purposes of making the product better, but also preserving user privacy and not creating this sort of ambiguous delay in the way we thought about data collection.
So our solution was that we started thinking about reviews, and code review in particular was a really useful model. We thought that, well, when it comes down to it, data collection is in the end code. You are shipping code. And if code gets code review, then code that's about collecting data could get data review. And the idea was that if we make data collection go through the same form of mandatory review as code review, but make it as lightweight as possible, then it becomes a fixed cost. And that means that it becomes easy to think of it as something that is predictable as part of product development. So that meant that before anybody collects data in Firefox, we wanted to make sure that all data collection followed three principles. The first one was that we should write down our motivations for data collection in a way that users can see and understand. And this goes back to the idea of fairness, accountability, and transparency in machine learning: you want to make sure that people know what the data sets were, what context they were collected under, and what the intent behind them was. This was a way for us to think of it in that way. The second: make sure that users have control over the data collection mechanism, meaning that if you can't find a way for a user to opt out of this data collection, it's probably not something that should pass review. And the third was that we should make sure that there's publicly accessible documentation about what we're collecting and why. And that means that if you can't find publicly accessible documentation for a collection, then we're probably doing something that isn't right by the user.
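A minimal sketch of how these three principles could be enforced as a lightweight, mandatory gate. Every name here is invented for illustration; this is not Mozilla's actual review tooling.

```python
from dataclasses import dataclass

# Hypothetical data-collection request, one field per principle from the talk:
# a user-visible motivation, public documentation, and a user opt-out.
@dataclass
class CollectionRequest:
    what_is_collected: str   # e.g. "page render time in milliseconds"
    motivation: str          # written so users can see and understand it
    public_docs_url: str     # publicly accessible documentation
    user_can_opt_out: bool   # users control the collection mechanism

def review(req: CollectionRequest) -> list[str]:
    """Return the list of problems; an empty list means the request passes."""
    problems = []
    if not req.motivation.strip():
        problems.append("no user-visible motivation written down")
    if not req.public_docs_url.strip():
        problems.append("no publicly accessible documentation")
    if not req.user_can_opt_out:
        problems.append("users cannot turn this collection off")
    return problems

# A collection with no documentation and no opt-out fails before it ships.
req = CollectionRequest(
    what_is_collected="page render time in milliseconds",
    motivation="diagnose reports of slow page loads",
    public_docs_url="",
    user_can_opt_out=False,
)
print(review(req))  # two problems: missing docs, no opt-out
```

Because the same checks run on every request, the review is a predictable fixed cost rather than a case-by-case value judgment.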
And so what we discovered is that when we started to employ these practices and these principles in our day-to-day, we started to see that assets would get generated by other teams, and that made it easier for people like data scientists to do their job. So again, we patterned this off of code review, and you can see our data review practices online. There are links that we can share. And the idea is that, again, you're shipping code, so practice this type of review as religiously as you would practice code review. And if you ask the same standard questions for every form of data collection, you can ensure that you are regularly checking for privacy landmines, which is something that you should do when you collect data, not after you start to use it. And so we ask the same questions of pretty much every single form of data collection in the browser. We ask, what are you collecting? We ask, why do you need it? We ask if you considered alternatives. We ask whether there's public documentation, and we ask whether users can turn it off. And these should be seen as very simple questions. They're very innocuous and should be really easy, but it can actually be kind of hard to answer some of them. And if you can't really answer these questions, then you probably are in a bad place, and you probably should go back and reconsider how you've defined the data collection. So I'm going to walk through how some of these assets came to be as a result of us employing these practices. Since all data collection started to have these standard questions that we asked, we started seeing standard documentation. So again, think back to that Datasheets for Datasets example. We're being more transparent about how a data set was created and how it can lead to fair AI outcomes. We started doing things like this, what we call the probe dictionary.
So every data collection effort in the browser, you can find it in this thing that we call a probe dictionary, which is public, and every single probe has a series of collection details about it, like why it was collected, how it was collected, and which versions it's collected in. And that means that you can see the reasoning behind these measurements and how we use them to make product decisions. And again, if you want more detail, not just about the individual data collection effort, but about how data scientists are actually using it in the company, it starts to build up from this kind of asset. So you can start to look more deeply into each of these probes. Which one of these dashboards is it in? Which data set does it belong to? And each of these things, again, is public. And so you can go, as a consumer, a user who's interested to know more about how we're doing these things, you can go to the documentation straight from this type of probe. So this starts to create what we kind of think of as a discovery loop, if someone is interested in learning more about how we think about data and how we're using it. And this documentation is something that we use internally. It's public, but we actually use it regularly internally. And all data scientists and engineers, when they work with the data sets, if they learn things, if they understand more about what's going on with a data set, they contribute it back to this documentation. And that means that they're becoming more productive the more they learn and the more they contribute back to this process, because they're not reinventing knowledge. They're actually trying to make sure that it's clear and easy for other people to understand.
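As a rough sketch of the kind of per-probe metadata and discovery loop described above. The probe name, field names, and URL here are all invented for illustration; the real collection details live in the public probe dictionary.

```python
# A hypothetical probe dictionary entry: each probe carries its own
# collection details, so anyone can trace a measurement back to its
# reasoning, its documentation, and the data sets that include it.
probe_dictionary = {
    "example.render_time_ms": {
        "description": "Time from navigation start to first paint, in ms.",
        "why_collected": "Track rendering performance across releases.",
        "collected_in_versions": "67 and later",
        "opt_out": "data collection toggle in browser preferences",
        "documentation": "https://example.org/docs/render_time_ms",
        "datasets": ["main_summary"],  # derived data sets including this probe
    },
}

def discover(probe_name: str) -> dict:
    """The discovery loop: from a probe name to its reasoning and docs."""
    entry = probe_dictionary[probe_name]
    return {
        "why": entry["why_collected"],
        "docs": entry["documentation"],
        "datasets": entry["datasets"],
    }

print(discover("example.render_time_ms")["why"])
```

The design point is that the metadata travels with the probe itself, so data scientists and outside users read the same record instead of reinventing that knowledge.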
And so again, if you are interested as a user, you can actually follow this documentation to the actual code that is producing the data sets that we are using to make decisions, to build ideas, and to produce these types of data products that enable decision makers to make decisions. And the point is that you don't have to take our word for it: our data management practices are transparent. And you can choose to trust us with the way we think about data because of the way that we are transparent about these practices. And this was possible without adding extra cost or going into a very heavy-handed way of thinking about data governance, because we started by following structured processes that were privacy first. And this is why we think that this sort of data governance model should be part of your company's strategy, especially if you're thinking about building a data science function, because this type of good governance helps build momentum in your company. And I have one more detail, which is that almost all of this happened before GDPR was a thing. We were doing this because we were driven by our manifesto and our principles. And so when GDPR came out, we were like, eh, it didn't really change much for us. We didn't have to change a lot, which meant that it didn't cost us a lot of money. And if you think about a lot of companies that have now had to adapt to this world that has GDPR and all of these other regulation efforts, it's going to cost them a lot of money, because they're going to have to go back to the way they thought about their data collection processes, back to the way they thought about their data management. They're going to have to spend money and lose time and opportunity in the market to get into a compliance state with these regulations. So with that, I will say, Mika can now come back on stage and we can talk a little bit more about how other companies have had to deal with this process with Stanford.