I'll spend half my time talking about trust issues, and I'll tell you why I'm motivated by that. The second half is about an open source tool I built, which is Git for data. I'll give you a flavor of how I'm thinking about it, what the interface looks like, and so on.

The first question: why do I care about trust? My previous role was as head of analytics for a political consulting firm, which was running Nandan Nilekani's election campaign, among other things. On a daily basis, purely on the basis of my data, I would move a hundred people left as opposed to right, based on my assessment of where the opportunities were, who was likely to be convinced, and so on. Later on, we went on to consult with very large firms, and invariably any data science output I was generating would trigger a lot of organizational change. Standard operating procedures were getting modified, product strategy was getting defined; there was just a lot of turmoil in those organizations.

So I've been thinking about this question for almost a year. We are running all of these fancy algorithms and crunching a lot of data; everything we did, both for the campaign and later for our business clients, ran into tens of GBs. The problem is, I would make all of these recommendations, but I always had a nagging question: do I really believe it? How do I demonstrate to myself, forget other people, that this is actually meaningful? Because I know the real-world consequences of everything that I'm doing. So I ended up imposing a lot of discipline on the processes, on the people, and on myself, and the tool is essentially that same discipline embedded into a piece of software.

The main thing I was worried about was somebody getting fired because I had produced the wrong model. Does that sound familiar? If that is not a fear you have, I would be concerned. I did have a bunch of embarrassments over a period of time, which made me acutely conscious of it. I come to the table with a fair amount of discipline; I was trained as an academic. But given the pace at which we are doing business, the number of datasets we are dealing with, the number of different methodologies and business questions, in most organizations we are looking at very iterative, laborious, ad hoc, and chaotic environments.

I thought that maybe this was my unique experience, maybe it was because I was new to this, coming from a systems background and dealing with a lot of the stuff the previous talk was discussing. So in the last six months I got on the road and started talking to data science teams, anybody I could talk to, and said: these are some disasters I have had, what is the truth? You tell me. These are some of the stories I heard. A data scientist at a large resort and hospitality company told me: I built all of these models and made my recommendations, and some time later I realized that the semantics of a column had changed in the middle; half the data had one meaning and the other half had a different one. Familiar? To the last person, nobody has told me that mistakes have not happened and that models do not go wrong. This talk, and the tool, and everything else, is about imposing a certain discipline so that we can be confident that we are producing output that is correct and, beyond that, valuable.
Some of the stuff I have seen: at a large investment bank, you are running all of these models, scoring every single individual, and suddenly one day you realize you have been working on negative balances. I used to think this was small stuff until I saw a Forbes article, which made me believe that people are making modeling mistakes on the order of "I'm putting the power plant in the wrong place." That is the scale at which these things are happening. We therefore need to, as a community, take responsibility for the collateral damage that we do. This is not about demonstrating our intellectual prowess; it is about our maturity in handling all of this data and its consequences in the real world.

Now, why does this happen? It is not that the individuals doing this lack integrity. It is that the environment and the processes we are dealing with are not yet there. In the last four years, I have seen a sea change in decision makers' attitude towards data. Earlier, they did not believe it; now they want data to be the basis for every question. We are swinging the other way, and we are not ready as a community for that.

There are many reasons for this; let's drill into a very simple, fundamental one. Suppose I give you a CSV, or somebody else gives you a CSV, a JSON, whatever it may be, and says, this is the model. What is the first problem? The first problem is that it does not have any memory. There are lots of CSVs lying around in my S3, on my laptop, here and there. I don't know where this one has come from, and I don't know how it is related to every other CSV in our system. This is especially problematic in mid-sized teams: any time you have more than about three people in the data science team who have to coordinate with each other, I have seen a fair amount of chaos.

The second problem is that there is a lot of code generating and manipulating all of these CSVs. We don't know on which machine it ran, from which commit, who ran it, for what reason, and what changes it actually made to your data. Many times the data is coming out of MCMC models and the like, and we don't even know what parameters were chosen to run those commands. All of these capture important information about the choices you are making, about how you are thinking about the problem and what trade-offs you are making.

The third problem is context. A lot of data comes to you through email and the like, and in the email there are additional bits of information: something about the underlying source, what the dependencies are, whether it can be trusted, and so on. All that context that should have travelled along with the CSVs is not there.

Essentially, what we are talking about is moving to a somewhat more tool-driven approach, a tool mostly to manage the complexity of the process. I'm not talking about pandas, I'm not talking about R; it's about the end-to-end process. My thinking is that we have been through this before, and it looks like a lot of software engineering, with all the terms you are familiar with, versions and so on. So the question is, what is appropriate in this domain? That is where I spent the bulk of my time thinking. While I was working, I couldn't get enough time to actually build the tools and embed my understanding.
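To make the gap concrete, here is a minimal sketch of the kind of provenance record that would have to travel with a CSV to answer those three questions: lineage, the code and parameters that produced it, and the context that normally lives only in an email. The field names, paths, and values are illustrative assumptions of mine, not the schema of the tool described later.

```python
# Illustrative only: a hypothetical provenance record that could travel
# alongside a CSV, covering the three gaps described above.
import json
import hashlib
import subprocess
import datetime

def provenance_record(csv_path, source_note):
    with open(csv_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file": csv_path,
        "sha256": digest,                       # which exact bytes we mean
        "derived_from": ["raw/companies.csv"],  # lineage: memory of its parents
        "generated_by": {
            "script": "clean_companies.py",
            "git_commit": subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True).strip(),
            "parameters": {"mcmc_iterations": 10000, "seed": 42},
        },
        "context": source_note,                 # what the email would have said
        "created_at": datetime.datetime.utcnow().isoformat() + "Z",
    }

print(json.dumps(
    provenance_record("data/companies_clean.csv",
                      "Monthly dump from the source portal"),
    indent=2))
```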
But the moment I left, the first thing I did was build an open-source tool. My goals were three things.

The first is that it needs some notion of versioning. I need to know that this CSV is derived from that other CSV, I need to be able to look through the logs, and so on, with the Git experience you are already familiar with. But that wasn't enough. I also need a metadata management service: an ability to track things, to embed more information into my repository saying this is where I got my data from, this is what I actually did to it, and so on. And I need to be able to link explicitly. I want to know precisely which commit of which code touched this data. Accessed, even, forget modified, because there is a lot of sensitivity around even accessing this information, especially if you are following some of the privacy and other regulations in progress in Europe and elsewhere.

The other important thing is extensibility. Everybody has a slightly different notion of what process is appropriate for their domain: e-commerce, pharma, investment banking. You don't want to impose a certain way of doing things. This is not the first time this has been tried; there have been frameworks that people have proposed, KNIME and some others. The problem is that the moment you impose a certain way of doing things, nobody wants it, because there is always something that doesn't fit.

There is one more requirement that wasn't obvious initially. Once I started using the tool, I realized that in order to keep my data repository up to date, I have to run lots and lots of commands, with lots of parameters, all the time. So the ability to implement a workflow, something that does not add to my complexity, the usability of the tool, became very important to me.

So I said, let me not imagine what the tool looks like; let me discover it along the way. What I did was say: Git is all that I need for now. Everybody has Git and we know exactly what it looks like. I started using it on a daily basis and started extending Git with whatever features I felt were missing. That is how we get to digit. It's an open-source, MIT-licensed Python package, available on GitHub, Python 3 only for now. It was a conscious choice that I did not want to add a new interface or implement a new experience; it should be something you are already familiar with and do as a matter of routine on a daily basis. It's a wrapper around Git, and everything in it beyond Git is what is required to bridge Git and our data science needs.

The architecture is fairly simple and non-opinionated. Any time I felt that your choice would be different from my choice, I turned that into a module of some sort. There are two modules that were particularly important to me. One of the big things that kept happening during all my data work was that CSVs have a way of losing records, with random things happening to the rows, and I really wanted to be sure that at every point in time my data is internally consistent. So I wanted validation as a core capability of the tool: I just say digit validate and it tells me whether everything is okay, and I can specify what I consider the dimensions of validation to be.
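To make the idea concrete, here is a minimal sketch of the kind of check that digit validate is meant to encapsulate: row shape and required key fields. The rule set, column names, and file path are my own illustrative assumptions, not digit's actual validation API.

```python
# A sketch of CSV validation under assumed rules; this illustrates the
# idea behind "digit validate", not its actual interface.
import csv

def validate(path, expected_columns, required=("company_id",)):
    """Check that the header matches and no required field is empty."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != list(expected_columns):
            problems.append(f"columns changed: {reader.fieldnames}")
        for lineno, row in enumerate(reader, start=2):  # header is line 1
            for col in required:
                if not row.get(col):
                    problems.append(f"line {lineno}: missing {col}")
    return problems

issues = validate("data/companies.csv",
                  ["company_id", "name", "state", "paid_up_capital"])
print("everything is okay" if not issues else "\n".join(issues))
```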
The second thing I wanted was transformation. Here the core rationale is that we need to standardize. If everybody has their own way of transforming the data, by the time there are ten people, you are looking at ten different implementations. For me to even audit what I have done end to end, I need a common language within my own team, within my own organization, to describe what I have done with my data. So this is a set of capabilities: you say something like digit transform and it runs the data through a bunch of steps, including anonymization, encryption, whatever you specify.

There are some other modules too. The versioning backend itself is a module; if you don't like it, you can replace it with your favorite versioning system. I'm looking at things like Instabase, a versioning system built from the ground up for data. Those kinds of things will happen very soon, so I'm completely expecting this part to be replaced by the Instabases of the world. The S3 backend came about because I realized very quickly that there is a much higher degree of sensitivity around data than around code. Whereas you are now comfortable putting your code on GitHub, you are not comfortable putting your data on GitHub or any such repository, and everybody has S3 access by default. So when you say digit push, it goes to S3 by default. There is also a bunch of instrumentation. The metadata piece came about because after a point I wanted to visualize my data: how many repositories we have, what changes they have gone through. I built my own little personal GitHub for data. It's not open source, but I can open source it if people are interested, and I'll show you a snippet of how it looks.

Otherwise the command is actually fairly simple. When you say digit, you get all the standard Git commands you are already familiar with, plus a bunch of new ones. You can find all of this information online, but I want you to focus on one command, which is the auto mode of operation. This was one change I made to the Git experience, and it is inspired by Firebase. As I was saying, it was becoming too complicated for me to manage the command lines and go through them 10 or 50 times a day. So I added this auto mode, in which digit intelligently figures out what it needs to do.

Let me walk you through a simple example. Let's say you downloaded the company database from the MCA and put it in a directory. You just say digit auto. It wakes up, realizes that it doesn't know which repository this particular CSV should be part of, and asks you some questions: what should the username be, the repo name, where should it be stored, and so on. It then creates a preferences file called digit.json. This is directly inspired by Firebase, and I love that whole experience. From the next time onwards, whenever you say digit auto, it picks up all of these preferences from digit.json; you don't have to specify them again. You can also embed a whole set of rules into digit.json. So, coming back: what digit auto does is first realize that there is no repository, create the repository, create a metadata tracking file, stuff it with all the information, like the commit information I was talking about, and come back and say, I'm done.
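As a rough illustration of that preferences flow, here is a sketch of what a digit.json might hold, based on the questions described above (username, repo name, where to store the data). The key names and values are assumptions of mine for illustration; the actual schema may differ.

```python
# Illustrative only: writing a hypothetical digit.json preferences file.
import json

preferences = {
    "username": "alice",                             # assumed answers to the first-run prompts
    "reponame": "mca-companies",
    "remote": "s3://example-bucket/digit/mca-companies",
    "include": ["*.csv", "*.json"],                  # assumed rules for what auto should track
}

with open("digit.json", "w") as f:
    json.dump(preferences, f, indent=2)

# With a file like this in place, subsequent runs of `digit auto` would not
# need to ask the same questions again.
```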
It is actually a standard Git repository; you can go into the directory and see what it is storing. All of this is completely transparent. The more interesting thing happens when you go through your modeling and change a file somehow. You just say digit auto again. It scans your entire directory recursively, sees what has changed, goes back to your repository, and tells you which things have changed. It also computes the differences: certain columns have changed, certain rows have changed, keys in a JSON have changed, those kinds of things (there is a small sketch of this idea at the end). And then it reports that it is done.

There are a bunch of other capabilities in digit itself, and I have implemented some plugins; you can go through all of this. The main reason I took the effort to come here and talk to you is that I want us to have a conversation on the tool and its usability, and to collaborate on developing a more structured approach to managing the end-to-end process. All of these are outstanding requirements, if you will. Pretty much all of the documentation is available on GitHub; you can go through it. It's Python 3 only for now.

That is all I had. I'll give you a 10-second plug for what we do. We are a data science automation company. Our first product is called Ask Scribble. The idea is very simple: you punch in a bunch of keywords, and what you get back is SQL. Let me leave it at that, somewhat mysterious, and feel free to talk to me later.

Hi, a quick note about Q&A: we've run out of the official time, so if people want to leave, that's okay. But if you want to stay for questions, go ahead and stay.
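Here is the sketch referred to above: a minimal illustration of the kind of difference computation described, comparing two snapshots of a CSV and reporting changed columns and rows. The key column and the file paths are assumptions for illustration; this is not digit's actual diff output.

```python
# A sketch of computing column- and row-level differences between two
# snapshots of a CSV, keyed on an assumed "company_id" column.
import csv

def load(path, key="company_id"):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    return {row[key]: row for row in rows}, columns

def diff(old_path, new_path):
    old, old_cols = load(old_path)
    new, new_cols = load(new_path)
    return {
        "columns_added":   sorted(set(new_cols) - set(old_cols)),
        "columns_removed": sorted(set(old_cols) - set(new_cols)),
        "rows_added":      sorted(set(new) - set(old)),
        "rows_removed":    sorted(set(old) - set(new)),
        "rows_changed":    sorted(k for k in set(old) & set(new)
                                  if old[k] != new[k]),
    }

print(diff("snapshots/companies_v1.csv", "snapshots/companies_v2.csv"))
```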