How many of you are data scientists, or have a title like data scientist? How many of you are ML engineers, as the official title? How many of you have a model in production, as in something that runs every single day? Lots of you, so hopefully this talk will be relevant. I will tell you a bunch of stories along the way to make it a little more interesting. How many of you have one to five models in production? More than five? I have run out of questions; I think there is lots of interesting stuff ahead.

I am going to talk a little bit about trust, auditability, words like that. The big shift that has happened in the last two or three years is that we have gone from "can we build some model, any model?" to "we need a model running, and we need to believe that it is actually running correctly." That changes the economics and the requirements quite a bit, because the business now depends on the models. Look at some of our customers: there is a loan scoring model that can deny people real loans. There are people who can be denied a geo connection if the model is wrong. These models have real-world consequences, and therefore the business is beginning to ask questions like: do I believe this model? Should this model even be running? Is it running correctly? How do I know? How do I know that the people who were denied were denied on the right basis? Will people with machetes show up outside your data team's office, as happened in one case? That really happened, by the way. These are all newly emerging questions, we will see more of them going forward, and this talk is somewhat forward-looking from that point of view.

Yes. The point being made is that very soon, if you are going to put any kind of model or AI anywhere in your organization, you will have to explain it to your customers, to your management, and to the regulatory authorities. You are not going to be able to put a statistical model into production just like that.

So let us get started. We have a bunch of customers, and here is a fictitious one; think of it as a blend of the experiences of several different customers. They are looking at what Uber is saying, what Google is saying. Google has hundreds of these engineers and massive, complicated systems, while ACME is a mid-sized customer. The customer is asking: what are 400 people doing at Uber? Why does it take several years and hundreds of millions of dollars to build the system? What exactly is the system doing? And can I realistically compete with the Amazons of the world? These are the questions floating around, and hopefully by the end of this presentation you will be able to answer some of them. It is not an accident that the core ML systems the Ubers of the world have built are at the core of their competitive advantage; we will see more of that in the future.

So I will talk a little bit about productionization of ML, why it is complicated, and what our last few years of experience suggest, and then zoom into one particular part of it: feature engineering. Most of the talk will be about feature engineering, and I will give you some guidance on how you should think about it and how you could go about building it.
The first thing: this picture is the architecture of a system called Michelangelo, from Uber. Michelangelo runs about 5,000 models, and the Michelangelo team is about 20 to 25 people, based on my last conversation. It took several years to build all the boxes that you are seeing. They have been pretty forthcoming and have discussed a lot of the internal details of their system. In the last year, company after company, whether it is LinkedIn, Stripe, or Airbnb, has come out and talked about their systems and what the structure actually looks like. One thing you can take away is that they are all making massive investments in their ML platforms.

Now, where does all the time and effort go? I belong to a company called Scribble, and what we do is customized ML platforms, so we are very interested in these platforms. What exactly are they building? Is it appropriate for the customers that we have, mid-sized customers? We have been studying these systems and trying to deconstruct them. Luckily for us, Google did some of the heavy lifting. This is a picture from a paper at the NeurIPS conference in 2015 ("Hidden Technical Debt in Machine Learning Systems"), where Google talked about where their time and effort goes. The key takeaway is that scikit-learn and all the heavy-duty conversation around the statistical models is actually a small part of the overall effort. We initially thought this was just our experience; maybe we are not very smart and have some more distance to go. But in conversations, company after company, data science team after data science team, the same thing comes up. It is a very consistent pattern.

The reason is very simple. A model is a probabilistic piece of code. Its behavior keeps changing over time, there are many corner cases, and it behaves in unpredictable ways in different situations. So the whole system, everything around it, exists essentially to make this small piece of probabilistic code work.

The question is: what dimensions is the systems infrastructure around the ML trying to cope with? The first one is speed. Nobody builds one model; people always build many, many models. If you can think of one model, you can be certain there are ten more in the pipeline that you are going to build down the line. If each model takes a lot of time and effort, you are not going to be able to realistically build anything. There was a customer of ours who spent a year building and productionizing their first model. When they came to us, their ask of us was one model every month: twelve models by the end of twelve months, running forever, covering every aspect of the business, the logistics aspects, the people aspects, the supply chain aspects, and so on. The market is driving this. Even investors are asking: how many different models did you deploy? What are they doing? How much improvement did they deliver? So one of the big objectives of all that complicated infrastructure is to deliver speed, to increase your ability to put more and more models into production.

But it is not enough to put these models into production, because you need to have confidence in them. There is a customer of ours doing loan scoring: risk or not risk.
If it is scoring a contract wrongly, you lose real revenue and a real opportunity to serve the market. So it is not just that I want so many models put into production; I need some confidence that the thing actually works. Correctness is a very significant concern. A lot of monitoring, quality checking, and data verification is about gaining trust in the model that you deploy.

The third dimension has to do with an emerging understanding: a model is never done. As we put the model into production, as our understanding of the problem space increases, as our understanding of the different methods improves, we will keep changing these models. So this infrastructure should allow you to evolve gracefully over time: deploy more models, and more versions of every model.

The last one is scale, which is the one we talk about the most, but it is not the driving factor in many cases. Are we really consuming petabytes of data? The data set sizes in most organizations are actually a few gigabytes or terabytes, nowhere near Google's. So there is one class of problems, which is Google's problem, and then there is everybody else.

The architectures themselves, if you look at what Stripe has discussed, what other folks have discussed, are all over the place. They look very different because they are hyper-specific to the particular organizational context, skill levels, and problems. The question is: is there a general pattern emerging? We think there is. Luckily for us, Gojek articulated it at a recent conference. This is a slide from them, with my annotations. You will see that the ML platforms these organizations are building have three big pieces. The piece in the center is what we traditionally focus on: training newer and newer models, loss functions, versions of each model, and so on. There is the deployment piece, which is about taking the ML models that have been trained and putting them into Kubernetes, containers, and so on, to make sure each model is actually running every day and serving predictions. The third piece, which has long been known as a problem but only recently emerged as a separate category of activity, is feature engineering: what is it that is feeding these models? Typically our work starts from "I am uploading a CSV and running my random forest," or whatever the wonderful algorithm is. The question is: where did the CSV come from? If you ever have to answer a question about model quality, you will invariably have to chase down the data set that was used to train it as well. So I will focus on this piece.

A lot of discussion has already happened around the training piece, which maps to the MLflows of the world, and the deployment piece maps to Kubernetes, Kubeflow, and those examples. Feature engineering is one of the places where a lot of effort goes, but it has not been tackled as a separate activity. One reason is that while we could build generic tools like MLflow that work across organizations, feature engineering is very, very specific to the data set, the organization, and the models. So this is one of the last pieces that is now being organized and standardized. You should see this conversation as part of that larger journey: first training got standardized, then deployment emerged as a category last year, and in the next two or three years feature engineering will get standardized as well.
In about three or four years from now, you should be able to have standardized end-to-end stacks.

Now let us look at this: what is feature engineering? Our ML models essentially understand matrices; they do not understand raw data. This is a sample from a customer of mine: somebody bought Thai dragon fruit. The ML model does not understand "Thai" or "dragon fruit" in any of those details. You have to translate it: is this a premium product? What proportion of the spend is on premium products? What proportion is spent on imported products? Then the statistics will take care of understanding the distribution, summarizing it in various models, and so on.

The big change in the last few years is that this used to be code written by data scientists, generating a small number of these attributes, running on a laptop. Two things have happened. First, you are now generating hundreds and thousands of these attributes every day. There is a retail customer of ours where just the health-related attributes for each customer numbered about 500, because "healthy" is itself a very complex definition with many, many dimensions. Second, this is getting computed every single day: automated, semi-automated, incremental, and so on. Because these hundreds of features are being computed on very large data sets, this cannot happen on a laptop; it has to be a very, very managed process.

The last thing that is happening, again a new dimension that I think we are going to see in the future, is that there are situations where you have to create hundreds and thousands of these models. There is an automotive customer of ours: for every make-model-year combination of a car, they have a different fuel efficiency model. You are straight away talking about hundreds of models, and we are not even considering the geographical context, usage, or any of those dimensions. There is no way you can engineer the features manually at that point; you are talking about new methods for automatic feature engineering at very large scale.

So at some level, what we are talking about is this: the activity of taking raw data and turning it into model-consumable input is growing in scale and complexity, new dimensions are being added, and new questions are being asked. As the model algorithms standardize (for example, if you go to Slack, you get the choice of five algorithms; as an organization they have settled on five methods, for a variety of considerations including maintainability of the code), more and more of the model behavior is determined by the data set. So you have to know: where did you get this from, and why? Why did you use this and not something else? This is also going to determine a lot of the explainability and some of those other dimensions. If you are asking about bias in models, where is the bias coming from? Increasingly, from bias in the data sets. So this is going to be a very significant activity going forward.
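Going back to the Thai dragon fruit example: here is a minimal pandas sketch of that raw-data-to-features translation. The column names, products, and values are hypothetical illustrations, not the actual customer data:

```python
import pandas as pd

# Hypothetical raw transactions: one row per purchased item.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product":     ["thai dragon fruit", "rice", "apples"],
    "amount":      [120.0, 40.0, 30.0],
    "is_premium":  [True, False, False],
    "is_imported": [True, False, False],
})

# Translate raw rows into per-customer features a model can consume.
features = tx.groupby("customer_id").apply(
    lambda g: pd.Series({
        "total_spend":    g["amount"].sum(),
        "premium_share":  g.loc[g["is_premium"], "amount"].sum() / g["amount"].sum(),
        "imported_share": g.loc[g["is_imported"], "amount"].sum() / g["amount"].sum(),
    })
)
print(features)
```

The model never sees "thai dragon fruit"; it sees the resulting matrix of numbers.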
Now, when we look at the feature engineering systems of all these companies that have talked publicly, how should we understand them? Think of this as a taxonomy; it is our own. One axis is the amount of data you deal with. This is very logical: the way you think about a petabyte-scale operation and the way you think about terabytes is going to be very different. The second axis is how closely the model development process is integrated with feature generation, the pre-processing of the data. A lot of deep learning combines feature engineering and model development at the same time, so those solutions look very different. Further, we noticed that the methods really depend on whether your data set looks like a billion rows with five columns or a million rows with seventy columns. In the tall case, the cost of scanning is very high, but if you put the whole thing in Google BigQuery like Gojek does, the feature engineering looks very different: it looks like SQL. For some of our other customers, with 50 or 60 columns, scanning the data set once per feature when you have 500 features is going to be very expensive; see the sketch after this section. So you should design the system based on the nature of the data set you are dealing with. And this is just the beginning; there will be more dimensions, whether the computation happens on the edge or in the cloud, whether it is automatic or not, and so on. You will see a lot more activity here in the next two or three years. About three months back, for the first time, we saw open source in this space: Gojek collaborated with Google to open-source the first feature engineering system. We will see a lot more of this in the future.

We have built these feature engineering systems for various customers over the past couple of years. We started with something very simple, and with every engagement our understanding of what is required, what kinds of systems are being built, and why, has changed. This reflects a summary of our understanding over the last two years across many, many customers. There are three big pieces to think about in feature engineering. The one we tend to focus on is feature richness: all the health features we were talking about. Can we make them rich? Can we make them as accurate as possible and compute them as fast as possible? That is where we started out. But with every passing experience, we realized that a new set of building blocks is required, because we do not automatically trust all of this computation. For example, we would like to understand the distribution of the features themselves; that is where your understanding of drift and biases comes from. The second set of building blocks is about whether you even have confidence in the data set you are ingesting into this very complicated process. Remember that these feature engineering runs are very long-running processes; 24 or 48 hours is not uncommon. We have a customer where we process three terabytes of data just like that, and that is just the beginning: by the end of the year, we are looking at 30 terabytes or something like that. The point is, do you even believe that your input is complete and has enough integrity to be worth putting through this very complicated process? We started with feature richness, but these other questions became more important with every passing day.
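On that wide-data-set point: with 500 features, one scan per feature means 500 passes over the data. A hypothetical pandas sketch of the alternative, computing many features in a single pass:

```python
import numpy as np
import pandas as pd

# Hypothetical wide-ish data set: many derived features per customer.
df = pd.DataFrame(np.random.rand(1_000_000, 3), columns=["a", "b", "c"])
df["customer_id"] = np.random.randint(0, 10_000, size=len(df))

# Naive approach (avoid): loop over 500 feature definitions, each one
# triggering its own full groupby scan of the data set.

# Single-pass approach: one groupby computes many features together.
features = df.groupby("customer_id").agg(
    a_mean=("a", "mean"),
    a_max=("a", "max"),
    b_sum=("b", "sum"),
    c_std=("c", "std"),
)
```

In a tall, narrow event table pushed into BigQuery, the same trade-off is handled for you by the SQL engine, which is part of why those architectures look so different.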
The other thing to remember about ACME-like companies: most organizations that we see are highly leveraged. What I mean by that is that the data team is constructing very complex systems, taking on challenges that are way beyond the capacity of the individual teams. Here I am talking about just the sheer amount of time available, not even the skill levels. We are talking about creating this kind of infrastructure, we know it is a fairly laborious activity, and we also know these teams are very short-staffed. So you have to think carefully about how to put the whole system together.

Let me zoom into computing the features reliably. The problem here is that you may have computed some number of features, 10, 20, 50, 100, whatever, and you keep evolving them over time, because your understanding of what is actually interesting and important keeps changing. There is a proliferation of pipelines, each pipeline generates a bunch of data sets, and there are many versions of each of these data sets. It is not unusual to end up with tens of thousands of data sets lying all over the place. The question is: do you know what was generated, when, and why? What we need is strong metadata collection, strong isolation between the various data sets, and a very structured way to organize the data.

We looked at a bunch of open source options for building robust feature engineering. What you ultimately need is, first, a computational framework: pandas, PySpark, Julia, whatever it may be. You need the ability to break up this very complex computation into many steps and stitch them together, so we are talking about a DAG-like framework; my current favorite is Prefect. Then you are talking about state management. What is missing in pandas is that you have different snippets of pandas code all over the place; you need to move the data frames around, and you need to be able to track this information. To give you an example, there was an IoT customer of ours where the pandas code was 6,000 lines. Imagine the amount of heavy lifting that each line of pandas does for you, and 6,000 of those commands: how do you even know that the thing actually works? The way you cope with it is to break it up into small pieces, test them in isolation, and then stitch them together with a combination of a DAG-like structure and some state management, with audit trails all over the place (see the sketch below). The problem for us is that no single system exists out there that has all of these things; your current option is to combine them.
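As an illustration of breaking a monolithic pandas script into small, individually testable steps stitched into a DAG, here is a minimal sketch using Prefect's 1.x-era flow API; the task names and logic are hypothetical:

```python
import pandas as pd
from prefect import task, Flow

@task
def load_transactions(path: str) -> pd.DataFrame:
    # Small, isolated step: easy to unit-test against a fixture file.
    return pd.read_csv(path)

@task
def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=["customer_id", "amount"])

@task
def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("customer_id").agg(total_spend=("amount", "sum"))

@task
def save(features: pd.DataFrame, out: str) -> None:
    features.to_parquet(out)

with Flow("customer-features") as flow:
    raw = load_transactions("transactions.csv")
    features = compute_features(clean(raw))
    save(features, "features.parquet")

flow.run()  # Prefect tracks per-task state, so failures are isolated per step.
```

Each task can be tested in isolation, and the framework, rather than 6,000 lines of inline pandas, carries the dependency structure.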
The second problem you will run into as time progresses is that more pandas code is, for all practical purposes, more bugs. The question is: can you move from a very detailed code-like structure to a specification-based approach? Anybody who has reached a certain threshold has moved to specification. There is some form of specification at Gojek, Uber, all of these companies; everybody has a slightly different form of it. We do too: our ML platform also supports a specification format. But this really depends on the individual context and the kinds of features you are computing. For example, if there is a very complex computation, then expressing it in a specification language means you are effectively writing the same complex computation in the specification, so it does not add that much value. It is a judgment call you make on the situation. The important point is that complexity builds up very quickly in these systems, and you want to create the right abstractions to simplify and speed up the whole process.

Here is something that came as a surprise to us. When we talk to individual data science teams, most of the focus is on the particular data sets: the transaction data set, the IoT sensor data set, and so on. But there is an interesting conversation happening between the data science team and the business teams. In one of our customer cases, as I was mentioning, they actually had an anthropology group who would follow individual customers. When somebody bought Bojia, they would literally look over the shoulder and ask: what kind of people are these who buy Bojia? Interestingly, they have mental models about what the purchase of a product means, what it tells you about preferences, travel experience, willingness to experiment, and a bunch of other things. Very quickly we realized, looking across all of our customers, that in every situation the business has a lot of tacit knowledge about interesting features that are not explicitly captured in your transactions, your database, or your data set, but that the organization knows about and uses on a continuous basis. So in every one of these situations, you should look at augmenting your current data set with new information coming from the business team. If you cross a certain threshold, there are labeling services available, but the main challenge we saw was that these are again two different systems that need to be integrated; you have to connect to Labelbox or whatever. So one of the things we ended up doing was building a lightweight labeling system out of the box as part of the feature engineering system. Any data set that you want to label, you can immediately create a little labeling activity out of it and go through that process, and the result can be pulled back by the feature engineering system. A sketch of how small such a loop can be follows below.

The takeaways I would suggest are the ones that did not match your intuition. I could not have foreseen the importance, or the frequency, with which this labeling question comes up, and in my opinion it is a tremendous opportunity. If you are building a feature engineering system, please think through the augmentation of the features themselves and getting information from your business teams.
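To show how lightweight such a labeling activity can be, here is a hypothetical sketch; the function, prompt, and file layout are assumptions, not the actual system:

```python
import pandas as pd

def label_dataset(df: pd.DataFrame, text_col: str, out_path: str) -> pd.DataFrame:
    """Minimal interactive labeling pass over a data set."""
    labels = []
    for _, row in df.iterrows():
        print(f"\n{row[text_col]}")
        labels.append(input("label (e.g. premium/standard)? ").strip())
    labeled = df.assign(label=labels)
    labeled.to_csv(out_path, index=False)  # later pulled in by the feature pipeline
    return labeled

# Usage: label_dataset(products_sample, "product", "labels/products.csv")
```

The point is not the interface; it is that the labels land somewhere the feature engineering system can pull from automatically.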
This is my other favorite. The life of a data scientist seems to be to move from model to model to model. In about two weeks, the person is already on to the next model and has forgotten what the previous model was. If it has gone to production, imagine somebody coming up after three or six months and saying: there is a certain model that you built, something wrong is happening there, can you look at it? You have to literally start from scratch at that point. And data scientists seem to be under growing pressure with every passing day.

So one of the things we learned and created is an auditing system, so that you know exactly where your data set has come from. It takes several different forms. One form is that wherever there is a data set, there is a fair amount of metadata associated with it: who generated it, and what was used to generate it, down to the Git commit. And once it is generated, I want a simple way back to it. Imagine, three months from now, a question about the correctness of the model: you look up which data set was used to train it. Imagine having a search interface where you can type a few keywords and get to the source of that data set and the process that generated it.

When I was a data scientist doing all of these computations, almost 20 to 30 percent of my time went just to chasing numbers. As long as nobody asks questions about correctness, everything is fine. The moment they say, "you said the churn rate was 2.5 and last week it was 3; what has changed? This number does not agree with my internal number," that is enough to set you on a path where you spend the next many days reconciling data. After going through a few of these occurrences, one of the first things we did was to insist that every data set has to come with a whole lot of extra contextual information, which allows me to quickly find out the why and the what about any given data set, and, secondly, to give me a way to cope with the complexity: a simple way to get to any of the tens of thousands of files generated over a period of time. A sketch of what that per-data-set metadata can look like follows below.

Again, a lot of the activity we will see in the next few years is around pandas, but not at the core of pandas; the core functionality is very stable and most of us understand it. What goes into pandas, what comes out of pandas, and what you did with it is where a lot of the action will be in the next few years, in my opinion.
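Here is a minimal sketch of a data set carrying its own provenance; the helper name and metadata fields are hypothetical, but capturing the Git commit is exactly the kind of detail mentioned above:

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

def save_with_metadata(df: pd.DataFrame, path: str,
                       generated_by: str, inputs: list) -> None:
    """Write a data set plus a sidecar file recording its provenance."""
    df.to_parquet(path)
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    meta = {
        "path": path,
        "generated_by": generated_by,  # pipeline or script name
        "git_commit": commit,          # exact version of the generating code
        "inputs": inputs,              # upstream data sets used
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "rows": len(df),
        "columns": list(df.columns),
    }
    Path(path + ".meta.json").write_text(json.dumps(meta, indent=2))

# A search interface over these sidecar files can be as simple as grep,
# yet it answers "which code, inputs, and day produced this data set?"
```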
Every customer of mine has a blank checkbook as far as hiring data scientists is concerned, five or ten people, and they keep asking me for recommendations; I exhausted my network some two years back. The important point is that models are being built, and the models tend to be very closely related. The entity that is optimizing seat allocation or the location of employees is also the one looking at the wellness of the employees. So you have many models that are closely linked to each other and draw upon the same data sets. A new mechanism is required for you to figure out, first of all, everything your ML platform is computing, how to get to it, and whether you can reuse any of that information. All the big firms have some form of a marketplace to enable this reuse. Once you cross about two or three models, you will need this, because a new data scientist will walk in, and the first question that person will ask is: what data sets do I have to work with? What is being generated every day? There has to be a place you can come to that says: these are all the things being generated, and this, by the way, is the data distribution of each one, so that people can judge whether it is relevant to their particular model or activity. This is almost certainly coming once you cross a certain scale, whether in the number of models or the number of people. There is a variety of ways of coping with it: you can build a separate application, or you can have a very disciplined way of tracking all of this information.

I will talk about one thing that is really important, which again came as a surprise to us: there is no company whose data sources are stable and trustworthy. Forget the old systems generating transactions and so on, where we can believe the code is buggy and bad things keep happening; I used to think that the newer devices generating logs should at least be very stable. It turns out that is not the case. We had an IoT customer with lots of sensors observing individuals, but there was a business process whereby ops people walk around and turn off these sensors whenever they feel like it. The device is stable, but the process around it is not. As a result, we used to face massive gaps in the data. So, as a general rule, what I recommend to every data science team is: please do not trust any data source until you validate it and stabilize it. Only once you believe that your input data has integrity and completeness is the rest worth doing. Your model is not worth anything if you do not trust the data.

Every organization has a different notion of data quality, and the notion itself keeps changing over time. What you need is an extensible framework where, every day, you can think of a new measure of quality and embed it. There are heavy-duty solutions out there, but they cost gobs of money and need full-time effort to use. So one of the things I foresee is lightweight health and quality checks around the data. There are simple libraries like Pandera, but it needs to be more sophisticated than that. The important thing is that it checks every day, for every source. It is not a one-off activity, and it is not an optional activity.
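For those lightweight daily checks, a minimal Pandera sketch might look like the following; the schema, value ranges, and completeness rule are hypothetical:

```python
import pandas as pd
import pandera as pa

# Hypothetical expectations for a daily sensor extract.
schema = pa.DataFrameSchema({
    "sensor_id": pa.Column(str, nullable=False),
    "reading":   pa.Column(float, checks=pa.Check.in_range(-50, 150)),
    "ts":        pa.Column("datetime64[ns]", nullable=False),
})

def check_daily_extract(df: pd.DataFrame) -> pd.DataFrame:
    # Schema-level checks: types, nulls, value ranges.
    validated = schema.validate(df)
    # Completeness check: ops turning sensors off shows up as missing hours.
    hours_seen = validated["ts"].dt.hour.nunique()
    if hours_seen < 24:
        raise ValueError(f"gap detected: only {hours_seen}/24 hours present")
    return validated

# Run this for every source, every day, before the feature pipeline consumes it.
```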
So let me be clear: this is a cost center. It is going to cost you a lot of time, effort, and money, and you want to think carefully about it. Pretty much anybody who is going to build serious models today will have this kind of system; there is no other way it can work. And there is a bunch of questions you should be asking when you build these systems. If a data scientist wants a particular feature to be inferred from the raw data, you have to push back and ask whether the model is creating enough value to be worth the cost. All of the systems we are talking about require full-time employees to manage, and it is worth asking whether the incremental value of a new feature is worth it. Remember that you are paying the cost every day, because the feature is computed on the data source every single day. There is also a compounding effect: this has to do with the structure of the underlying systems, whether pandas, PySpark, and so on. It gets complex fairly quickly, all kinds of computational behaviors start surfacing, and then there is a bunch of management costs associated with these features. So when you are thinking about model input, remember that it is an expensive activity. Ask whether it is really required, and let the data scientist, or whoever is consuming that bit of information, justify it. Make the ROI case for it. Ask lots of questions. Do you need it for more than one model? Do you need it now? Will it change over the next so much time? How accurately do you need to compute the feature? There are times when you have to trade off accuracy for speed. How available does it need to be: is computing it once a day okay, or does it have to be every five seconds? How do you deliver it, and how big is the data set? You have to ask a lot of these questions, because the economics of these features is very tough.

Today, if you have to build a feature engineering system, you basically have three options. Gojek has open-sourced a wonderful tool, but it is very GCP-specific and is for tall data sets; if you have large event data sets in BigQuery, then it makes sense. They have a fantastic talk on it from the last ODSC; you want to check it out. The second option is to stitch a system together from the different components and building blocks that are available. My problem with this: when job descriptions demand Airflow experience, you know that each component is fairly complex on its own. Now imagine stitching multiple of these together into a coherent system. You have to think very carefully and choose the building blocks; it can get fairly painful very quickly. The third option is that there are wonderful people like us at Scribble who are happy to help you; we are always looking for more business, so you can come to us and we will make the whole thing easy for you.

Among the various options, as I was saying, one challenge I found is that some of these tools are Java when the rest of your data science stack is probably all Python, so the language issues will be quite significant. And the main challenge with doing it in-house is skill levels: these are fairly complex systems, and running them at scale is itself a challenge, so you need a fair amount of skill. There is a lot of open source conversation happening. The thing to remember is that all of these companies are way, way larger than what you would need at your scale, so you cannot directly use what they have articulated. They have articulated a design and an approach appropriate to their scale of use, but they have also told us what concerns drove them. That is probably where I would start, and then think about what is appropriate in your context.

I am almost done; one last thing. This whole thing is an exercise in end-to-end discipline. If you say today that you have confidence in your model, but you do not have confidence in your feature engine, which is essentially generating the data sets driving your model, that will not be a defensible argument. You will not be able to stand in front of anybody and defend your model. Trust in a data science system is an end-to-end property; you cannot say "I have black boxes." Invariably, as the expectations of model quality and accuracy increase, you will naturally think about the full life cycle, all the way to the raw data and beyond. A lot of our conversations with customers are with the developers of the end applications, because the choices they make impact the rest of the pipeline as well.
So you should expect the flavor of these conversations to evolve very rapidly in the next three to four years, in my opinion. With that, I am happy to take questions.

Hi, I had a couple of questions. I am Tejas from Swiggy. We already have a production feature store in place, and we are building a training feature store, and we had a couple of queries. One: how do you plan to do feature backfilling? And two: suppose data scientist A creates a feature, say the number of orders in the past five minutes, and data scientist B creates the number of orders in the past ten minutes. How do you generalize a common feature across both of them, and maybe derive different features out of it?

Okay. What was the first question? Feature backfilling. Okay. In all cases, backfilling is almost a requirement. Let's say you have a new feature: the five-minute column was there, and I want to add a ten-minute column. You have to go back through the entire history of your data and recompute that information. Backfilling has to be there by design. There is nothing in the architecture that says you require it, but for all practical purposes you should build it; it is in the way you design the pipeline. In our case, what happens is that we say something like: make sure that we have the features for the past one month. The system then looks at every single day and makes sure that that particular day's features are actually computed, and so on (a sketch of this loop follows below). So the architecture does not explicitly talk about backfilling, but it is something everybody has to do. That is the first thing.

The second thing you asked about is interesting: different data scientists require slightly varying features, and the features keep evolving, so how do you contain the cost? All of these are potentially duplicate computations. Typically, all of these feature engineering systems also have a policy layer around them. There is usually a team of senior data scientists, or somebody, who has to determine how to resolve the conflict: either everything is boiled down to 5 minutes or 5 seconds, or it stays at 10 minutes, or they come down to 1 minute. But this is a policy decision that the feature engineering management team makes. I have talked about the technical and architectural aspects of this platform; I did not talk about the policy and management aspects, but those are also going to be a very significant activity.
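A hypothetical sketch of that backfill-by-design loop, where the pipeline walks every day in the window and computes only what is missing (the paths and feature function are assumptions):

```python
from datetime import date, timedelta
from pathlib import Path

def ensure_features(days_back: int, feature_fn, out_dir: str = "features") -> None:
    """Make sure per-day feature files exist for the past `days_back` days."""
    today = date.today()
    for offset in range(days_back):
        day = today - timedelta(days=offset)
        out = Path(out_dir) / f"{day.isoformat()}.parquet"
        if out.exists():
            continue  # already computed; backfill fills only the gaps
        features = feature_fn(day)  # recompute that day's features from raw history
        out.parent.mkdir(parents=True, exist_ok=True)
        features.to_parquet(out)

# Adding a new ten-minute-window feature means versioning (or removing) the old
# files and letting this same loop recompute the entire history by design.
```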
In this regard, do you do a star schema, or do you flatten the data? What the question is asking is that typically there are many sources; even a single database has many tables, and there are computationally efficient ways of doing the joins in the underlying compute infrastructure, in this case Hive or whatever it may be. Typically this is a conversation among the ML engineers, Scribble, and the data scientists, to see what the most computationally efficient way is to express a certain join or a more complicated expression. In most cases we have found that the data is flat, sitting in S3, so effectively the joins are all happening in the pandas layer. But in the case of Feast, for example, all the joins are happening in BigQuery.

Yeah, one question here, and then we will come to you. Okay, thank you. First of all, thanks for this talk; I think it was really informative. On the feature engineering side, is this applicable to something like image data as well, or not?

For a long time I used to think that deep learning and some of those applications do not require feature engineering, so we generally did not talk to those teams. Of late, we have been finding that deep learning engineers are very interested in talking to us. I asked: why do you care? What they told us was that all of them, even with image data sets, require preprocessing, and they all require auditability and a bunch of the things we are talking about, in a form appropriate to them. So I think some of this is applicable even in that context, and it is coming, in my opinion.

Okay, and one touch point you mentioned is metadata, which in my terms I would call a data catalog. At which stage should it come in? Is it when I am putting in the raw data and recording information about it, or once the features are extracted? The latter seems to make more sense, because those are the sets the data scientists will use.

Metadata and catalogs are slightly different topics. A catalog is really about understanding what sources you have and what is inside those sources. Metadata is about context, the context of the computation, and you can put pretty much whatever you want into it that will help you answer questions later. For example, take the case where, in the process of generating all these features, I dropped a bunch of outliers. That is an attribute of the process and of that particular data set, and it may not go into any catalog at all. Eventually, what we see is two or three things. First, a metadata standardization process: every system that is touching, accessing, or storing data will have metadata standards, and there will be standard ways to represent and visualize them, because the lineage questions are becoming very, very important. And lineage also applies to the underlying data set itself, which is where the metadata layer will go to the catalog and pull some of this information. So eventually we will have a metadata layer that combines a bunch of today's standalone metadata sources and processing techniques.

And I think the last question, on the trustworthiness of the data: you said that even when the features are as expected, you need to test them again. From my perspective, that is duplicate work, because if data engineering, or whatever team is giving you the features, is responsible for the correctness of the data as well, I will assume they have tested it. So this would be redundant effort on my side.

Our learning is that it pays to be paranoid across the board, both before you consume the data and during the consumption. For example, in our case, when we are manipulating a whole lot of data, at every step in the DAG we actually do a pre-check and a post-check. Ultimately, and I love saying this, all of this code is math written in Python. If you are not careful about the code, the math is unforgiving: make any mistake anywhere in the pipeline, and it will show up eventually in the model. So it pays to have a lot of guard rails along the way, and if that means additional, duplicate computation, if it gives us more confidence, then so be it.
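That per-step pre-check and post-check pattern can be wrapped around every DAG step. A minimal hypothetical sketch:

```python
import functools

import pandas as pd

def checked(pre=None, post=None):
    """Wrap a DAG step with input (pre) and output (post) assertions."""
    def decorator(step):
        @functools.wraps(step)
        def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
            if pre is not None:
                assert pre(df), f"pre-check failed before {step.__name__}"
            out = step(df, *args, **kwargs)
            if post is not None:
                assert post(out), f"post-check failed after {step.__name__}"
            return out
        return wrapper
    return decorator

@checked(pre=lambda df: len(df) > 0,
         post=lambda df: df["total_spend"].ge(0).all())
def aggregate_spend(df: pd.DataFrame) -> pd.DataFrame:
    # The guarded step itself stays small and testable in isolation.
    return df.groupby("customer_id", as_index=False).agg(
        total_spend=("amount", "sum")
    )
```

The duplicate computation is the price of catching a bad intermediate result at the step that produced it, rather than in the model weeks later.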
So we will need to take the rest of this conversation offline. Okay, let's stop there.