But this was specific to this work. Although we did not use it later, the insect data was one of the pieces where we had to apply a controlled vocabulary and an ontology. We have tools specific to our needs coming from Smartlogic; CMA4 helps us integrate the ontology into the system, so with future data we will not have such problems, or at least we will be able to minimize them. And as I said, some normalization of the data was done in the background: the insect activity values were sometimes in the thousands and sometimes in decimal fractions, so we normalized them based on the overall impact of each protein on the different classes of insects, assigning classes from zero to five. As I explained during the talk, we have also done ML work on other problems where we had a lot of features. In this case there was only one feature: the sequence itself. I explained how we overcame that — but once you do, you are generating too many features, which is another challenge. So in addition to looking at, say, variable importance, which various methods give you, we apply domain knowledge: looking at the nature of the amino acids within the sequence and deciding whether a given physicochemical property needs to be considered. And one of the important factors in model management is how you integrate different programming languages. Our group extensively uses R programming, and this toolkit is in Python.
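The normalization step described above — raw potency values on very different scales mapped to ordinal classes 0 to 5 — might look something like this sketch. The values, the log scaling, and the equal-width binning are all assumptions for illustration; the talk does not specify the exact binning rule:

```python
import numpy as np

def to_activity_class(values, n_classes=6):
    """Map raw activity values (any positive scale) to ordinal classes 0..n_classes-1."""
    # Log-scale to compress values that range from small decimals to thousands.
    v = np.log10(np.asarray(values, dtype=float) + 1e-9)
    # Equal-width bins across the observed range; a real pipeline might use
    # quantiles or domain-defined potency thresholds instead.
    edges = np.linspace(v.min(), v.max(), n_classes + 1)
    return np.clip(np.digitize(v, edges[1:-1]), 0, n_classes - 1)

# Hypothetical assay readings spanning several orders of magnitude.
raw = [0.003, 0.5, 12.0, 850.0, 4200.0, 9000.0]
print(to_activity_class(raw))
```

Higher raw readings map to higher classes, so the ordinal structure of "overall impact" is preserved even when assays report on incompatible scales.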
So what we did was extract all the features using Python, while training, validation, building dashboards, and even data cleaning were done in R. Then there is the heterogeneous skill level of end users. I don't know whether it is the same in all domains, but it is definitely true in biology and in biotechnology companies. Of course you are going to build this model, but how are you going to give the product to the end users? You cannot ask them to run it on the cloud; it has to be a single click, or just some drop-down selections. Many of us are doing this — you might be doing this piece as well — so we built a dashboard which helps them visualize the results and make advancement decisions: yes, protein A goes to the next phase; protein B, no. And then model management. Many of you are probably coming from IT companies and different domains, maybe from the data science domain, so you might be early adopters of some of these advanced technologies. Being an agriculture company, we may not be keeping up with the latest technologies, but we have understood the value of the data. We moved into AWS in the last five years, and after that we started looking at how to run these workflows and how to build the models on the cloud. That is when we started using Pakistan; now we also have SageMaker available, so in future we will be using those things as well, and I already explained why we used it. So, some of the learnings: the data component. We had to do a lot of data cleaning — if you do not, you pay for it.
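For context on the "extract all the features using Python" step: the simplest descriptor family that sequence toolkits like iFeature compute is amino acid composition (AAC) — the fraction of each of the 20 standard residues. A minimal sketch of the idea (the example sequence is made up, not from the talk):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Return a 20-dimensional amino acid composition vector for a protein sequence."""
    seq = sequence.upper()
    n = len(seq)
    # Fraction of the sequence made up by each residue, in a fixed order.
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

features = aac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(features))  # one column per residue type
```

This is how a single "feature" (the sequence) turns into a numeric vector a model can consume; richer descriptors extend the same idea to residue pairs, physicochemical groupings, and positional patterns.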
As I said, earlier we used only one feature and got almost nothing beyond the traditional method's outcome. In this case, after cleaning the data, we had a good basis for comparison and could identify patterns in the data — that is one of the learnings. Another learning is that when you generate a lot of features, there are different ways you can select among them. That is an important point. What we want to do next is explore more descriptors: this toolkit has a lot of other features we want to extract. One problem we have there is that it sometimes generates columns with headers like just ABCD, and we fail to identify what the feature is about; in other cases it gives headers such as polarity or hydrophobicity, which we are very familiar with. So we are looking into the detailed scientific information behind those descriptors. The second thing: just a few days ago iFeature was updated, and there is now a version called iLearn on GitHub, which can also be extended to DNA sequences — in fact, to any other kind of sequence in general: wherever something is continuously repeated, you can probably feed it in and generate features. And then, based on the customer input, we did not consider the insect column. But as I mentioned earlier, different proteins are effective against different insects. So can we now include the insect — or the class or order of insect — as a feature, so that we can identify with more confidence the proteins which can kill at least those groups of insects? That is the next thing we are going to do.
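The feature-selection learning above — rank the many generated descriptor columns by a model's variable importance and keep a top slice, then filter by domain knowledge — can be sketched with scikit-learn. The data here is a random placeholder (the real descriptors and labels are proprietary); columns 3 and 17 are synthetic signal features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 proteins x 50 descriptor columns
y = (X[:, 3] + X[:, 17] > 0).astype(int)  # toy label driven by two columns

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Rank descriptor columns by variable importance, highest first.
ranked = np.argsort(model.feature_importances_)[::-1]
top10 = ranked[:10]
print(top10)
```

In practice the ranked list is then cross-checked against domain knowledge — e.g. keeping a lower-ranked physicochemical descriptor a biologist knows matters — rather than cutting purely by the importance score.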
Now I'd like to leave you with where else this method can be used. Definitely wherever you are working with or studying proteins — in the pharmaceutical and health domains, for example. We all know that cancers, diabetes, and age-related diseases like Parkinson's and Alzheimer's — where, as you will have seen, older people develop tremors or tend to forget things after a certain age — are all controlled by proteins. In older people, proteins aggregate; that process is called protein aggregation, and it leads to these diseases. Scientists are studying different sequences which can probably minimize these effects, and this kind of study can be extensively used there, as well as in diabetes and cancer research. And then agriculture: I talked about developing insect-resistant plants — you have identified candidates, you suggest them to the scientists, and they take it forward — but you also want to develop plants resistant to other stresses, like drought, excess water, or shade; we have all those products. So can we apply this to other studies as well? We are thinking about that in-house, and anyone working in plant science should be able to use it. Many IT companies also do a lot of bioinformatics, so it is definitely useful there too: if you are working in a health domain and have to come up with a medicine, you want to suggest what the target should be — you are developing a drug, what should the sequence target be? That's the place. So that's what I wanted to share. If you have any questions.
[Moderator] Okay, thank you for this very interesting presentation. Can we have a round of applause for the speaker, please? Questions? It's the last talk of the day, but any questions? Okay, there's one question here and then one question there.

[Audience] Given the diverse nature of biology, you are recommending a set of proteins to the scientists, but do we have any real true positives which can actually be taken forward? Many more features can be incorporated in your method, as you mentioned — there are many different classes, and further descriptors can be included. On what basis do you judge that? There must be some benchmark. Do we have real true positives, validated by experiments?

[Speaker] Yes, a very good question indeed. We do have a lot of controls included in this — both positive controls and negative controls. We already have labeled data: we know which proteins are performing highly efficiently, we have the biological proof in-house, and we have been using them as products — we have those products in the pipeline and in the market as commercial products. Those we have included. We also have a lot of proteins which are not performing at all; we have taken care of that side as well.

[Audience] In yesterday's session, Chris mentioned selection bias. I think this is somewhat similar to the problem he was trying to address, in the sense that whenever we keep making selections according to some prediction, and we select only those which agree with that prediction, then candidates which are predicted to be negative but are actually positive eventually get ignored. Is there anything that compensates for that in your model?
[Speaker] Okay, so I think it's a question related to feature engineering. You have the features and you have built a model; based on the performance of the model, you look at your confusion matrix, and it all starts from the data you have. One thing I did not mention is the data bias, which we took care of using different methods. As I said, we have both components: proteins that are supposed to be active but are predicted inactive, and vice versa. We can fine-tune that depending on the features or the data you feed in, but it also depends on what the customers want. On the feature-engineering side, we can act once we realize that some of these features are probably not contributing at all; as I said, there are many more features, and only by exploring them will we know. Does that answer your question?

[Audience] The question was more about the proteins predicted to be inactive but which are actually effective. As we go on sampling and running experiments based on our predicted results, we effectively ignore those candidates — we don't run experiments on the ones the model does not select.

[Speaker] Maybe you want to answer?

[Co-speaker] If I understood the question right, you're asking: what if the model predicts active but the protein is actually inactive, and we keep pushing it into the training data and training again on the same data? To handle that, we are running the models alongside the lab work — we are continuing the biological process as well.
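One common way to compensate for the selection bias the questioner raises — predicted-inactive proteins never reaching the assay, so false negatives are never discovered — is to reserve a fraction of the assay budget for randomly chosen predicted-inactive candidates. This is a hypothetical sketch, not something the speakers describe; `pick_for_assay`, the budget, and the 10% exploration fraction are all assumptions:

```python
import random

def pick_for_assay(candidates, predictions, budget, explore_frac=0.1, seed=0):
    """Fill most of the assay budget with predicted-active candidates, but
    spend explore_frac of it on randomly sampled predicted-inactive ones,
    so false negatives have a chance of being caught in the lab."""
    rnd = random.Random(seed)
    active = [c for c, p in zip(candidates, predictions) if p == 1]
    inactive = [c for c, p in zip(candidates, predictions) if p == 0]
    n_explore = min(int(budget * explore_frac), len(inactive))
    chosen = active[: budget - n_explore]          # exploit the model
    chosen += rnd.sample(inactive, n_explore)      # explore the rejected set
    return chosen

# 100 hypothetical proteins; every third one predicted active.
batch = pick_for_assay([f"P{i}" for i in range(100)],
                       [1 if i % 3 == 0 else 0 for i in range(100)],
                       budget=20)
print(len(batch))
```

Any predicted-inactive protein that turns out active in the lab then re-enters the training data with its corrected label, which is the feedback loop the co-speaker describes with parallel biological validation.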
[Co-speaker] That's how we are validating: we validate the model's correctness before pushing it to production. Before we push to production, we make sure the model is highly accurate with respect to the true positives and the false negatives; only then do we push it. That's the only way we can improve it — before production, we do biological testing as well as model testing.

[Speaker] Yes, and as I indirectly said, it depends on how you need it. There will definitely be bias when you start tuning toward the values you are looking at; that is expected to happen.

[Moderator] Any other questions? Okay, if not, thanks a lot. And this brings us to a close of this edition of The Fifth Elephant. Thank you, everybody, for staying with us through the two days of the conference.