Hopefully I'll continue in the same grounded flavor as the talks so far. I'll talk a little bit about our experience doing feature engineering and what has helped us do it. A little bit about us: unlike the previous companies, we are very small. We are a company called Scribble Data; we are an ML engineering platform, and our users are data scientists. We do a lot of training dataset generation for them. I'm summarizing the lessons we have learned doing feature engineering at large scale — not at Moonfrog or Dream11 scale, but large for most organizations. Our customer base is not the new-age technology companies but the old-school retail companies; everybody and their brother is building up their data teams. This is experience from deploying our tooling and learning the process of feature engineering.

The key message is that productionization of ML is actually very expensive. But if you deconstruct it and look at it very closely, we know the elements that are driving these costs, and therefore we can go about it very systematically and address it. We have personally seen a 10x productivity improvement, and I would like to share the few things we have done to get to that level.

To give you a sense of how expensive it is: today, the ML engineering team's work starts after the data engineering work ends. That means the data is delivered through Kafka into your lake — Hive, HBase, whatever it may be. From there it goes through a series of transformations, and you generate matrices that are input to the models. And then there is a separate step of actually serving the model. Just for my understanding, how many of you are data scientists here? Data engineers? Good mix — so we are talking to the right audience. This happens in stages because the input data volume itself runs into terabytes and terabytes, and you have to bring it to a point where you can feed your scikit-learn or your Spark ML or whatever it may be. And you need to do this reliably.

The expectations on data scientists have also been changing in the last two or three years. The expectation today is that you are able to build these models, put them into production, and actually improve the KPIs, like Govind was talking about. The expectation is not that you will impress me with your wonderful new algorithm. The meter has to move forward.

Now, this is a picture from Uber's blog as well as a recent presentation from the Uber team at Fifth Elephant in August. Just to give you a sense, it took 20-plus people over a period of multiple years to build this system and get it right, and what it does is essentially drive 5,000 models. What we have consistently seen over a period of time is that the moment you put in the first model, then the second, then the third, there is a proliferation of models, a proliferation of use cases, a proliferation of pipelines, and so on. That is what starts building up a lot of this complexity.

And why do you need so many pieces? There was a very interesting paper from Google which I totally love — "Hidden Technical Debt in Machine Learning Systems", from the NeurIPS 2015 conference — where they talked about where the effort goes at Google.
And if you look at this picture, the ML code is the small black box in the middle. Everything else is about making sure that box is being fed properly, that it is computing the right things, and that it is doing so consistently, day after day after day. Uber's entire 20-person team is essentially making sure that all of these pieces are running all the time.

If you deconstruct it further, what are all of these pieces achieving? The first thing is speed. Every day we can imagine new use cases and new models, and we need to be able to put them into production. Data scientists are under increasing pressure to productionize their models because it translates into direct business impact. The second purpose all of these boxes serve is correctness. Because of the statistical nature of the computation, and because the world is very complex with all of its nuances, it is very hard to get an ML model right continuously over a period of time, or for all corner cases. So a lot of this work goes towards making sure the system is working as per expectations and, when there are deviations, keeps correcting itself. The third one is evolution. The assumption is that data scientists know which model to build ahead of time, but there is actually a huge learning process, because you have to discover a lot of tacit knowledge, constraints, and assumptions about the world, about the process, about your own system deployment, and so on. So this machinery enables you to deploy many models, and many versions of those models, really fast — and finally, to operate, to drive a lot of data through the entire system.

I'm going to focus on one particular problem in this: feature extraction. The rest of the pieces are for a future talk, and something we are thinking a lot about. Let me zoom into that step. This is essentially what it's doing: billions and billions of rows of detailed events are coming in, and they have to be translated into a large matrix. The mapping could be as simple as an arithmetic operation, or it could be running a full-fledged model itself to derive one of the columns — for example, a predicted spend or an affluence category, whatever you can infer. There could be a model driving each of those columns. Typically this is a 1:1000 compression between the raw events and the matrix.

Because of the complexity of this step and the large volume of data we are dealing with, it never happens in one go. At one of our customers, for example, this is a 12-hour process — and that's not even a very big dataset; we are not talking Google scale. It has to run every day, because there are new customers every day and they do new transactions, so this feature matrix has to be updated every single day. And it doesn't happen in one step; it happens in three steps, roughly mapping to the scale of the data. There is one computation that consumes the last year's worth of data — probably terabytes — which runs in batch mode every week or so. Then there is something near-time: the last 24 hours, or the last 7 days. And then there is the last few minutes. The feature engineering — this mapping of the actual transactions to the matrix — happens at all three of these levels.
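To make those three levels concrete, here is a minimal sketch in pandas of the batch and near-time pieces. The file name, column names, and windows are hypothetical stand-ins, not from any actual deployment; the point is only that the same feature family gets computed over different trailing windows and merged into one matrix.

```python
import pandas as pd

def spend_features(events: pd.DataFrame, window: str, suffix: str) -> pd.DataFrame:
    """Aggregate raw transaction events into per-customer features
    over a trailing time window such as '365D' or '7D'."""
    cutoff = events["timestamp"].max() - pd.Timedelta(window)
    recent = events[events["timestamp"] >= cutoff]
    agg = recent.groupby("customer_id")["amount"].agg(["sum", "count", "mean"])
    return agg.add_suffix(f"_{suffix}")

# Hypothetical raw events: one row per transaction
events = pd.read_parquet("transactions.parquet")  # customer_id, timestamp, amount

# Batch level (last year, recomputed weekly) and near-time level (last 7 days);
# a streaming job would cover the last few minutes in the same shape.
batch = spend_features(events, "365D", "1y")
near = spend_features(events, "7D", "7d")

# One row per customer: this is the matrix fed to scikit-learn / Spark ML
feature_matrix = batch.join(near, how="left").fillna(0)
```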
And once the matrix is available, you feed it either offline or online, depending on your modeling strategy, and the output of the model is what is actually served through an API to the end application. What we have observed is that once you set up the first model, this quickly proliferates, because we are not limited by ideas or by use cases — the most friction, the biggest slowdown, was in putting together this messy process.

Traditionally this is done in-house at big firms like Flipkart; they have an entire ML engineering team doing all of these things. What has happened in the last year, because of the growing importance of this, is that all of the companies who built homegrown systems have come out and talked a lot about what they have done; new offerings have come from the big data engineering names; and some newer players are entering who are focused on just this particular computation. I would also strongly recommend looking at some of the conceptual work that has been done — there is a great presentation by Lowen from the last three weeks or so that you may want to look at. Lowen is an ex-Airflow committer, as is Bisham.

So, the question is: what is making all of these things expensive? The first problem is that you need to have confidence. Today you cannot put a model into production just like that; the business will immediately ask, why should I believe this model? On what basis should I have confidence? And trust is an end-to-end property — it is not a property of only your model. If the model is being fed garbage, it doesn't matter how great the model or its statistics are. Establishing that trust is a very expensive operation. In the past, what I have found is that when there are mismatches or errors in the computation, chasing them through the entire pipeline, or going all the way back to the raw data, burns up time like nobody's business.

The second problem is that this is a continuously evolving system. It will never stabilize, it will never settle down, unless you have a very established, very focused use case. In most organizations there is a proliferation of use cases, and of course new algorithms are coming in every day. And there are questions even about the stability and evolution of the input data sources. A lot of the data we consume comes from point-of-sale (POS) systems, and they keep changing — changing schemas, changing semantics, continuously — and you have to cope with it (a sketch of the kind of ingestion check that helps here follows after this list of problems). These are not the highly controlled, well-thought-out systems that were discussed in some of the earlier talks.

The third problem is the cost of development: the time and effort it takes for data scientists to define the features they need. This is not to be underestimated, because the features run into the hundreds and thousands. The last one is operations. This thing has to run every single day, so invariably things break down, and you have to keep doing a bunch of activities to make sure the current feature matrix is not only accurate but also available.
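On the schema-drift point above: here is a minimal sketch of the kind of check that can run at the ingestion end, before anything reaches feature computation. The expected columns, dtypes, and bounds are hypothetical; a real contract would be defined per source feed.

```python
import pandas as pd

# Hypothetical contract for one incoming POS feed: expected columns,
# dtypes, and simple sanity bounds. One of these per source.
EXPECTED = {
    "store_id": "int64",
    "customer_id": "int64",
    "timestamp": "datetime64[ns]",
    "amount": "float64",
}

def check_ingested(df: pd.DataFrame, source: str) -> list:
    """Return a list of problems found in a freshly ingested batch."""
    problems = []
    missing = set(EXPECTED) - set(df.columns)
    if missing:
        problems.append(f"{source}: missing columns {sorted(missing)}")
    for col, dtype in EXPECTED.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            problems.append(f"{source}: {col} is {df[col].dtype}, expected {dtype}")
    if "amount" in df.columns and pd.api.types.is_numeric_dtype(df["amount"]):
        if (df["amount"] < 0).any():
            problems.append(f"{source}: negative transaction amounts present")
    return problems

# Fail loudly at the ingestion end, not twelve hours into the pipeline
issues = check_ingested(pd.read_parquet("pos_feed.parquet"), "pos_feed")
assert not issues, issues
```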
If you do it correctly — and there was a natural experiment set up here, in the sense that we did something similar for a tier-one food and beverage company in India about three or four years back — this is where my concern and my interest in this whole thing emerged. We took a lot of the lessons, and the embarrassments, from that time and embedded them into the current approach. This slide is a comparison of what we gained and where we gained it. The key thing to remember is that given this is a continuous activity, an intensive activity, an evolutionary activity, you need to think the whole process through. There are three or four dimensions along which to think it through, and if you do, you are looking at very significant productivity improvements. It is a good use of your data.

The first dimension is trust. If — every day, or even infrequently — you are having questions about whether you believe the numbers coming out of this whole pipeline, which runs for anywhere from three hours to 12 hours to several days, if you ever have questions about correctness, you should stop there and start focusing on this, because every one of those investigations is going to be very, very expensive. Building in visibility and auditability — knowing where any dataset in your entire system has come from — will give you enormous benefit over a long period of time, in my opinion.

If you ask what auditability requires, what trust requires, it involves things like metadata standardization. What this means is that for any dataset ever generated in this very long compute process, I should be able to know who generated it, when it was generated, why it was generated, and what its dependencies are on other bits of information in the system — and this should be readily available for investigation at any point in time. We do things like linking every dataset directly to the git commit of the code that generated it, so I actually know the commit behind any output. This has saved us a lot, because we have several deployments — test deployments, production deployments — all over the place, and in order to know whether something went wrong, linking data with code has helped us enormously. It is part of the metadata that we collect. (A small sketch of such a record follows below.)

The other thing is that datasets proliferate — lots of files keep getting generated all over the place — and you have to be able to surface that information. Having a simple search interface where you can type in any name, and having visibility into all of these processes and the files they generate, is critical. This has saved us a lot of time, and it acts as an early warning system. By the way, we routinely see data quality issues at the ingestion end, because these POS systems are made by third parties and there are not many controls over the incoming data. So very aggressively watching the ingested data, and very extensively building quality checks of the kind sketched earlier, is very important. By the time you have computed the features and fed the model, it is too late — bad data ideally should not even reach the model.

The last dimension, like I was saying, is complexity. There is a proliferation of pipelines, datasets, models, versions, and runs with various parameters, and you have to have a way to cope with all of them and clearly identify each one. We incorporate things like namespaces, versions, and the linking of code and data, and we clearly isolate the outputs of various runs. This has helped us a lot in coping with the volume of output — not in terms of the number of bytes, but in terms of the number of different files and datasets generated.
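To make the metadata idea concrete, here is a minimal sketch of what such a record might look like. The fields and file layout are illustrative, not Scribble Data's actual schema; the essentials are who, when, why, the git commit of the generating code, and the upstream dependencies, written alongside every dataset so a search interface can index it.

```python
import getpass
import json
import subprocess
from datetime import datetime, timezone

def dataset_metadata(name, purpose, depends_on):
    """Build a metadata record for a freshly generated dataset,
    linking it to the exact code version that produced it.
    Assumes it runs inside the git checkout of the pipeline code."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    return {
        "name": name,
        "generated_by": getpass.getuser(),                       # who
        "generated_at": datetime.now(timezone.utc).isoformat(),  # when
        "purpose": purpose,                                      # why
        "code_commit": commit,                                   # exact code version
        "depends_on": depends_on,                                # lineage
    }

meta = dataset_metadata(
    name="features/customer_spend/v3",           # namespaced, versioned output
    purpose="weekly batch feature build",
    depends_on=["raw/pos_transactions/2019-08"],
)
with open("customer_spend_v3.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```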
So the other thing that has helped us a lot is to start looking for abstractions that give a natural interface to the data scientists. Remember, I was telling you how these features run into the hundreds and thousands. Imagine a data scientist writing that much code — more code is more errors, in my mind. So the question is: what is the most compact way for them to express what they are looking for? Create a higher-level language for them. We have introduced our own DSL for our customers, and we have now started working on an open-source version of it — a reusable, platform-independent feature specification mechanism, if you will. (There is a small illustrative sketch of the idea at the end of this section.) Another lever is reuse: pipelines are not that different from one another, and we should be able to reuse a lot of the development that has been done before. Today the code of a typical data scientist looks like one long stretch of pandas or Spark code; imposing a structure on it to provide reusability is going to be critical. And one thing that comes out of this reusability, and the framework associated with it, is the ability to test. What we have observed is that even data scientists sometimes lack discipline when it comes to testing their own ML code. Having an easy way for them to express tests, and to make sure those tests run all the time, gives them confidence in their own code. So some of this structure is developing.

A lot of this looks like importing what we have learned in the last 30 years of software engineering — none of it should be a surprise. But the opportunity, at some level, is to build the right abstractions, with the starting point being the lessons learned from the past. I was talking about versions and metadata: metadata was not a big thing in the past, but now it becomes critical, because we routinely deal with millions and millions of files all over the place, and this is only going to increase in the future.

The last one is that, because of the resource-intensive nature of this computation, you have to keep a watch on what is happening. If there is a 12-hour process and it fails at hour seven, it is very hard to recover; you have lost a day, with all the pain associated with that. So one of the simplest requirements is a default integration with a set of tools that are already available and that give you visibility into the performance aspects of your system, so that you can understand its behavior as well as debug it. Some of the things we like: netdata, for example, which gives you memory and CPU consumption — for us memory is a big deal, because pandas blows up very quickly the moment you cross about 100,000 records. Simple scheduling: I want, for example, my data quality monitor to run every hour on my data lake, or I need some background processes managed with Supervisor, and so on. All of these tools are available; out-of-the-box integration with a bunch of them helps data scientists understand the behavioral aspects too, so we can fix problems earlier. And this has sped up our ability to deliver this feature engineering almost 10x: we are doing 10x the volume of data, with 10x the number of features, in about half the time that we took in the past.
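Coming back to the DSL idea, here is an illustrative sketch of what a compact feature specification might look like. This is not Scribble Data's actual DSL, just one way the idea can be expressed: features are declared as data rather than written as free-form pandas code, and a small engine compiles the declarations into the aggregation.

```python
import pandas as pd

# Declarative feature specs: each entry replaces a hand-written pandas block.
FEATURE_SPECS = [
    {"name": "spend_sum_7d",  "column": "amount", "agg": "sum",   "window": "7D"},
    {"name": "txn_count_7d",  "column": "amount", "agg": "count", "window": "7D"},
    {"name": "spend_mean_1y", "column": "amount", "agg": "mean",  "window": "365D"},
]

def build_features(events: pd.DataFrame, specs) -> pd.DataFrame:
    """Compile declarative specs into one feature matrix keyed by customer."""
    out = pd.DataFrame(
        index=pd.Index(events["customer_id"].unique(), name="customer_id")
    )
    now = events["timestamp"].max()
    for spec in specs:
        windowed = events[events["timestamp"] >= now - pd.Timedelta(spec["window"])]
        out[spec["name"]] = windowed.groupby("customer_id")[spec["column"]].agg(spec["agg"])
    return out.fillna(0)
```

Because the specs are plain data, they can be versioned, validated, and unit-tested against a handful of hand-built rows in a uniform way — which is where the testing discipline mentioned above becomes cheap to enforce.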
Of course, that number says something about my not having thought it through the previous time — that is one possibility — but I think these are all good ideas that have been demonstrated to have value in other places, at other times as well.

So with that, let me leave you. This is a long topic, and I expect it will be a thread that comes up again in the future; there are more elements of this pipeline that we need to discuss as a community. I am a strong believer in end-to-end discipline — discipline at all levels of this data product, if you will — and I look forward to that conversation. Any questions? Okay, if there are no questions, we will continue the conversation during the...

[Audience] So, Venkat, you brought up an important point about metadata. What is your view on where the industry is in terms of data standardization? Your platform seems to be looking into streamlining what comes in before it gets consumed, so maybe you can share your thoughts on what you have seen in the industry — what is the overall maturity from a metadata perspective?

We looked around at metadata and asked what standards exist out there. Today the standards are very limited, in the sense that there is Data Package, which comes out of the journalism community, and there is DVC, and all of these have different flavors. So, two or three things that I see. First, I see a need for open-source tooling for metadata — something standardized and used by everybody, whatever their computation model: Julia, Spark, whatever it may be. The second thing is that we have to agree upon interoperability: the standard has to be shared across people, so part of the conversation that needs to happen is about what goes into the metadata — what would cover the bulk of the discovery and auditing needs of customers. The third thing, in my opinion, is that none of this can save you if the data scientists and the data engineers don't believe in the need for trust in the offering — because why do you need auditing? Because you want to believe the output that you are generating. So the third one, which is the bigger thing in my mind, is an appreciation by the community as a whole of the need to build trust in the end-to-end service. Other questions? Thanks. We are open-sourcing a tool, partly based on our experience, partly to drive the conversation forward; we would like to see metadata standards emerge.

Thanks, Venkat.