Okay, we have about one more minute, so we'll go ahead and get started. If there are questions from the room during the talk, I'll ask you to repeat them back, because some people on Zoom won't be able to hear them otherwise; the microphone does do some kind of job of picking things up. It's really great to have our speaker here from Simulations Plus. From what I understand, he gave a really great talk at the CompChem GRC, the computational drug discovery GRC, this summer on working with chemical data, which is something that I've done a lot of, probably not as much as him, and I've suffered a lot in doing it, so I'm really excited about learning from him and hearing about this. It's also very relevant to work the Open Force Field Initiative is doing, and I know it's relevant to work we're doing in the lab as well. This is also an Open Force Field webinar, so we have an audience on Zoom. If you're on Zoom and you want to ask a question, you can certainly just shout it out, or we'll call on you at the end; you're also welcome to drop things in the chat window, and we'll try to keep an eye on that. Likewise, if there are problems with the audio, feel free to chime in. We are recording this, so hopefully it will be available to others later. So, please go ahead.

Great. Thanks so much, David, and the Open Force Field Initiative, especially John and others from Open Force Field who thought that this could be an important talk for people working in cheminformatics, especially those doing model building. What I'll be discussing here is the process of data curation, and I specifically call it the forgotten practice in the era of AI, because nowadays everybody is talking about artificial intelligence and thinks that all we have to do is get the data and build the models.
So, QSAR modeling, or model building, is not just a push-button approach; it needs a lot of preprocessing of the data. What I mean to say is, when you take any activity data, it is necessary to validate that data prior to using it in your model building. These are two stories which actually appeared in the news: one was about the Loch Ness monster, and the other a Bigfoot in California. People believed these stories for decades, and until someone shows that a story is fake, you won't find out. I'll get back to these two photographs later in my talk, but this is the mindset we have to bring when dealing with activity data as well: we have to find out whether there is anything wrong in the data, whether there is any fake data. Before I start, I just want to mention that I do not intend to evaluate or criticize any public or commercial database here. These databases are definitely useful, and we cannot simply declare them good data or bad data. My intentions are very straightforward: I just want to suggest how we can utilize these existing databases efficiently, and to advocate a strategy for data curation; that is what I want to discuss here. Just before I get to my outline, I'd like to show why data curation is necessary. This is an email my colleague received a few months back from one of our customers, and I'd like to highlight where it says: "we ran some test data through ADMET Predictor and we get rather poor correlations." ADMET Predictor is software that Simulations Plus distributes, which covers more than 140 ADMET properties.
So, they were evaluating it, they found that the predictor was not giving good predictions, and they shared a few data sets with us. One of the data sets was for the human liver microsomal (HLM) clearance model. They collected the data from ChEMBL, predicted HLM clearance with ADMET Predictor, compared prediction versus experimental value, and this is the correlation they found. Looking at it, they concluded there was no good correlation. When they shared this data with us, we carried out some data curation and unit conversions, and what we see is a very good correlation. So what did these investigators miss? First, if they had looked at the ADMET Predictor manual, we specifically mention there that the HLM clearance data is corrected for microsomal binding; they missed that part. So the first thing they missed was converting the bound clearance to the unbound clearance. If you do that conversion, you get the correlation which appears on the right-hand side of the screen. Secondly, you cannot use censored data points. For example, there were 84 compounds reported as greater than 150, and more than 250 compounds reported as less than 3; you cannot use those data points when you are comparing predictions. If you carry out the appropriate conversions and remove these compounds which are not relevant, then yes, we do see a good correlation. This is what is happening: everybody thinks that with AI, all you have to do is get the data and run the predictions, or get the data and build the models. That is the notion we want to change, and what kind of preprocessing is required, and what we faced during our preprocessing, is what I will be discussing here.
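The two fixes the investigators missed can be sketched roughly as follows: drop the censored (qualified) values, and correct the remaining clearances for microsomal binding via CLu = CL / fu,mic. The record layout and the fu,mic values here are illustrative assumptions, not the actual ChEMBL extract or the ADMET Predictor implementation.

```python
def parse_clearance(raw):
    """Return a float only for exact values; None for censored ones
    like '>150' or '<3', which cannot be compared to predictions."""
    raw = raw.strip()
    if raw.startswith((">", "<")):
        return None  # censored: keep out of the comparison
    return float(raw)

def unbound_clearance(cl_int, fu_mic):
    """Correct intrinsic clearance for microsomal binding:
    CLu = CL / fu_mic, where fu_mic is the fraction unbound
    in the microsomal incubation."""
    return cl_int / fu_mic

# (compound, reported clearance, fu_mic) -- made-up example records
records = [("cpd_A", "42.0", 0.7), ("cpd_B", ">150", 0.5), ("cpd_C", "8.0", 0.8)]

curated = []
for name, raw, fu_mic in records:
    cl = parse_clearance(raw)
    if cl is not None:
        curated.append((name, unbound_clearance(cl, fu_mic)))

print(curated)  # cpd_B is dropped; the others are binding-corrected
```

Only after both steps does a prediction-versus-experiment comparison become meaningful.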
So, this is the outline of my talk, only four points: what is data validation, where are these errors coming from, how do we find these errors, and finally, why should we care about them. I already discussed one example showing that we do need to care; I'll definitely show more. So let's start with data validation. If you look at the dictionary meaning of validation, it is the action of checking or proving the validity or accuracy of something. If we apply the same definition to data validation or data curation, it is the process of ensuring that data have undergone data cleansing, so that the data quality is both correct and useful. That is where data curation, or data cleaning, comes into the picture. In drug discovery, validation is an absolute necessity. You cannot simply believe the data, especially if you are extracting it from big databases. On average, it has been shown that there are at least two errors in each medicinal chemistry paper, and that is just the average; there could be more, though there are also papers which are absolutely perfect in terms of chemistry and biological data. The overall error rate can be as high as 8%.
These are not the systematic or random errors of the assays themselves; that is a different topic. I'm talking about errors in the databases which can be introduced during data digitization or data extraction from the papers. If we want to build accurate and predictive models, clean and accurate data is absolutely necessary. If we look at the databases available for bioactivity, the chemical databases or bioactivity databases, there are tens of them. ChEMBL, PubChem, and BindingDB are the free ones, but there are commercial ones as well, such as the GOSTAR database or the Liceptor database. The usefulness of these public data sources is questionable because of the quality control which is lacking when these databases are built. So once we have extracted the data, we cannot use it as-is: we have to wash the data, we have to rinse it, and sometimes we have to scrub it. This is necessary before model building, before using the data for any purpose. There are a few papers which have discussed data curation, so what I'm presenting here is my own experience after standing on the shoulders of giants. Who are these giants? There are a few papers in the literature, like "Trust, But Verify" from Professor Tropsha's group, and a couple from Antony Williams, which talk about chemical structure curation. If you read them, most of these papers say: you have to standardize the structures, carry out neutralization, use appropriate tautomers, do desalting, check whether the compounds are mixtures, and take care of the stereochemistry. So most of these papers talk about chemical structure curation or standardization, but only a few talk about the bioactivity data apart from the chemical structures. Those more recent papers mostly discuss the biological data, but they do not mention what the troubles in data curation are or how to find the errors, and that is what I will be talking about in this talk. So where do these errors come from? There are various sources of errors, and I'll start with simple data extraction. As I mentioned, I do not want to badmouth any database or author, so I'm showing compound identifiers simply as compound X or compound Y, without showing the actual database ID or compound names. If we look at this data from one database, the same compound has two different IC50 values: one is 56,000 nanomolar and one is 56 nanomolar. If you go back to the references of these entries, these are just ChEMBL references, this compound is terfenadine, and one author mentions that the terfenadine hERG IC50 is 56 nanomolar, and the other author also mentions that the terfenadine IC50 is 56 nanomolar. So whoever extracted this data made a mistake: instead of entering 56 nanomolar, they entered 56 micromolar, and that makes a huge difference. If you take this data and predict hERG IC50 using any commercial model, you will see that one record is predicted very well and the other is off by three log units, and then people start comparing: oh, that model is not good, the commercial model is bad. So we have to be very careful when extracting the data.
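A check that catches discrepancies like the terfenadine one can be sketched in a few lines: group repeated measurements by compound and flag any compound whose values span two or more log units, which usually signals a unit mix-up rather than real assay variability. The threshold, field layout, and data below are illustrative assumptions.

```python
import math
from collections import defaultdict

# (compound, IC50 in nM) -- made-up records mimicking the example above
records = [
    ("terfenadine", 56.0),     # correct entry, nM
    ("terfenadine", 56000.0),  # uM value mistyped as nM
    ("cpd_X", 120.0),
    ("cpd_X", 180.0),          # ordinary assay-to-assay scatter
]

by_compound = defaultdict(list)
for name, ic50_nM in records:
    by_compound[name].append(ic50_nM)

suspicious = []
for name, values in by_compound.items():
    logs = [math.log10(v) for v in values]
    if max(logs) - min(logs) >= 2.0:  # values >= 2 log units apart
        suspicious.append(name)

print(suspicious)  # ['terfenadine']
```

Every flagged compound then gets a manual look at the original references before it enters a training set.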
The second source of error is, again, data extraction. What I'm showing here is an example from a commercial database. The bioavailability of this drug is listed as 0.15: the record says the activity is 0.15, labeled as bioavailability. That looks quite realistic, because fractional bioavailability ranges from 0 to 1, so if I tell you that bioavailability is 0.15, people can believe it. But if you look at the assay description, it says it is the fraction of compound unbound in human plasma. It is not bioavailability; what they are reporting is plasma protein binding, and unfortunately, or fortunately, the range of fraction unbound is 0 to 1 as well. So until you look into these records and dig into what exactly the values represent, you should not trust the numbers as-is. This is a classic example where the 0.15 could be fraction unbound or bioavailability, and only when you look at the assay do you find out it is protein binding, not bioavailability. These kinds of errors need to be looked at carefully when you are examining the data. Then there are errors whose source is the original research article. What I'm showing here is the same authors (Singh et al.) publishing an identical compound, AM-8191, in two different papers, one in 2014 and one in 2015. In one paper they report a hERG IC50 of 18 micromolar, and in the second paper they report a hERG IC50 of 18 nanomolar. In this case the database extractors did not make any mistake; they pulled out the numbers exactly as they appear in the papers. But when you see these kinds of differences, specifically three or six log-unit differences where the significant figures are identical, it should immediately click that there is something wrong with these numbers; all that differs is the micromolar or nanomolar unit. So we have to be very careful when we see 18 nanomolar next to 18 micromolar. When we built our hERG model, it suggested that the hERG IC50 of this compound should be 18 nanomolar, not 18 micromolar. Another source of error is us, meaning the database users. Many programs are designed so that when they read SMILES or SD files in which extra salts or counterions are part of a single SMILES string for a single compound, they split out the extra salt or solvent. They are designed to help the users take care of the counterions and salts, but at the same time the user needs to be very careful when looking at the resulting duplicates. For example, by desalted chemical structure these two records are duplicates, and their LD50 values are 810 mg/kg and 150 mg/kg. What we miss is that in one case it is a citrate salt and in the other it is the pure compound. One cannot necessarily compare the activities or toxicities of different salt forms; otherwise you'll conclude that there are duplicates with different numbers, so the data must be wrong, when in fact you are ignoring the salts or solvents. Sometimes the numbers are quite similar: for these compounds there is hardly any difference, 290 versus 350. But when you compare a phosphate salt with the parent compound, there can be an almost nine- or tenfold difference. So you have to be very careful when you are comparing duplicates.
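The salt-duplicate pitfall lends itself to a small screen: strip counterions, group records by parent structure, and flag groups that mix salt forms before comparing or averaging their values. This is only a sketch; the SMILES, the LD50 values, and the longest-fragment desalting heuristic are illustrative (real pipelines use a cheminformatics toolkit, and the heuristic itself can fail for large counterions like citrate).

```python
def strip_salt(smiles):
    """Keep the largest '.'-separated fragment as the parent structure.
    Crude heuristic: a big counterion (e.g. citrate) can be longer than
    the parent, which is one more reason to use a real toolkit."""
    return max(smiles.split("."), key=len)

# (SMILES, LD50 in mg/kg) -- illustrative records: free base vs HCl salt
records = [
    ("CCN(CC)CCOC(=O)c1ccccc1", 150.0),     # free base
    ("CCN(CC)CCOC(=O)c1ccccc1.Cl", 810.0),  # hydrochloride salt
]

grouped = {}
for smiles, ld50 in records:
    grouped.setdefault(strip_salt(smiles), []).append((smiles, ld50))

for parent, entries in grouped.items():
    salt_forms = [s for s, _ in entries if s != parent]
    if len(entries) > 1 and salt_forms:
        # same parent structure measured on different salt forms:
        # the values are not directly comparable
        print(parent, "->", [ld for _, ld in entries])
```

Rather than discarding such "inconsistent duplicates," the screen tells you they were never comparable in the first place.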
You cannot simply conclude that identical parent compounds with different numbers mean the data is wrong. We are equally responsible; we cannot just blame the data extractors or the authors. Last but not least is what I call ABCD: automated and blind compilation of data. This is something we are facing a lot nowadays, because it is very tempting to automate the compilation itself; there are programs which just read a PDF file, extract the compounds, and store them as SMILES or SD files. Here is a very good example of what we found. Compound 21 has an exocyclic double bond in the original research article. This data was extracted for a rat plasma protein binding model, and when we built the model, this compound always jumped out as an outlier: even if we included it in the training set, the final prediction was always far off. So we started thinking about what was going on. Most of the time, if you talk to QSAR people, they say, oh, this compound is always an outlier, we have to throw it out. But instead, if you go back to the original paper and see what is going on, you find the compound reported in the paper with the double bond here, as you can see, while the lower, extracted chemical structure is missing the carbonyl oxygen. When we corrected the chemical structure and added it back to the database, voila, the compound was predicted perfectly fine. This is what we call ABCD. There are advantages to automation, no doubt about it; it helps speed up the process a lot. But be careful when looking at the data, because with automation there is a possibility you are missing something, and you may throw out a compound just because it looks like a wrong compound. Here is another example of ABCD. It is not really clear on the slide, but the compound we are looking at, number 15, has R3 = F: the R3 group is here, this substitution is F, and R1/R2 is the chain with the fluorine. If you look at compounds 15 and 16, what the extractor has done is combine the R1/R2 groups of 16 with the R3 fluorine of 15. These kinds of mistakes are very common with automated extraction programs, so we need to be very careful about this kind of ABCD. Now let's see how we can actually find these errors; I have a few approaches we use when we extract data. First of all, just look for activity cliffs or toxicity cliffs. For example, this is LD50 data in mg/kg, available from the EPA dashboard. This specific compound stood out as very unusual, because all the other similar compounds have very high toxicity (a low number means high toxicity) and this compound doesn't. The reference is very old, a 1984 paper, and until now nobody had gone back to it; we didn't even know whether the paper was available in PDF format, but we somehow managed to get it, and compound 89, shown here, corresponds to 0.322 mg/kg. That means the extractor missed the decimal point right at the front of the number, and that is why you see a huge toxicity cliff when you look at this data. A similar example comes from another set of compounds: you can see an LD50 of almost 7,000 mg/kg, seven grams, but if you look at the 11 most similar structures, they are all single-digit numbers.
In this case we could not get the original reference, but when we built the model we removed this compound and put it in the test set, just to understand what was going on. We built the LD50 model, and the predicted toxicity of this compound comes out in the range of 6 to 12; that means here, too, a decimal point was most likely dropped. So if you have a data set, just look at it in terms of activity cliffs or toxicity cliffs. Many times you can immediately point at a compound and say there is something wrong with its reported biological activity or toxicity; that is how you can easily find things that were wrongly reported or wrongly extracted. You can also run matched molecular pair analysis. For example, in our rat plasma protein binding data, most of the matched pairs, most of the transformations here, are nitrogen-to-carbon replacements or substitutions, and the difference in rat plasma protein binding in those cases is not very significant. But if you look at the first pair, the difference between the rat fu,p values of the two compounds is very significant, and that triggers the idea: okay, we have to go back to the original source and look at this one. So we can use various cheminformatics tools, such as activity cliff detection, clustering, or matched molecular pair analysis, to find out whether the data is consistent or whether there is an inconsistency. Here is another approach we use a lot nowadays: if we have hundreds or thousands of compounds, we leave one group completely out, build a model, and see how that group is predicted by the temporary or intermediate model. In this example, the units were reported in the database as microliters per minute per milligram, and when we built a model without this group, we found that these compounds were not predicted well. Going back to the original paper, we found that the actual units in the paper are milliliters per minute per gram of liver. Anyone who works on clearance model building knows that converting between these two units requires specific scaling factors, and those conversions were missing when the extractor pulled the data. It is very easy to catch this when you build models with leave-group-out, rather than leave-one-out, cross-validation. Similarly, in this case we noticed a few sets of compounds with unit discrepancies: our model predicted that most of the compounds were over here, but a few sets of compounds were predicted very low. When we looked at the paper and corrected the numbers, all of the compounds moved into the cluster. What we have found is that if, using leave-group-out, you see differences of three or six log units, the data extractor has almost certainly messed up the units, for example recording a micromolar value as nanomolar or vice versa. That kind of error is very common in databases. Here is another example. When building a QSAR model, everybody wants a wide range of IC50 or pIC50 values. What we see here is that the data extractor just recorded a value as pEC50, but a pEC50 requires a specific conversion, the negative log10 of the molar concentration, and that conversion was missing, because a pEC50 cannot be 0.078. Most pIC50 or pEC50 values are in the range of two to ten, or three to ten.
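Both unit pitfalls in this section can be guarded against with a few lines of code. The microsomal scaling factor below (about 45 mg of microsomal protein per gram of liver) is a commonly used literature value that I am assuming here, not a number from the talk's data set, and the pXC50 range limits are rough rules of thumb.

```python
import math

MG_PROTEIN_PER_G_LIVER = 45.0  # assumed literature scaling factor

def clint_to_ml_min_g_liver(clint_ul_min_mg):
    """uL/min/mg microsomal protein -> mL/min/g liver."""
    return clint_ul_min_mg * MG_PROTEIN_PER_G_LIVER / 1000.0

def pxc50_from_molar(xc50_molar):
    """pIC50/pEC50 is the negative log10 of the molar concentration."""
    return -math.log10(xc50_molar)

def plausible_pxc50(value, lo=2.0, hi=12.0):
    """Flag values like 0.078 or 36 that cannot be real pXC50s."""
    return lo <= value <= hi

print(clint_to_ml_min_g_liver(20.0))                   # 0.9 mL/min/g liver
print(round(pxc50_from_molar(56e-9), 2))               # 56 nM -> 7.25
print(plausible_pxc50(0.078), plausible_pxc50(36.0))   # False False
```

Running every incoming "pXC50" column through a range check like `plausible_pxc50` immediately surfaces values that were never converted from raw concentrations.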
A value of less than one should immediately ring a bell for whoever is building the model: 0.078 does not look like a pEC50, because it would correspond to a concentration close to one molar. Similarly, a pEC50 of 36 would imply a concentration that is not a real number either; in that record the raw value was never converted at all. Basically, we have to be careful with data points that are far outside the normal ranges, so when you are building a model, you should know what the normal ranges of these properties are. As I said, we are equally responsible for data curation; we cannot just blame the data extractors. Here is another example. I was building a model for the blood-to-plasma ratio (RBP), where the normal range runs from about 0.55 up to 1.5 or 2. I had almost 300 data points, and when I plotted them against our predicted human RBP values, we noticed that 35 compounds had an experimental RBP of exactly 1, and almost 22 compounds had an experimental value of exactly 0.55. If you look at the chemical structures, naloxone and raloxifene, for instance, are pretty different, yet both have an RBP of 1. Similarly, indomethacin and the others over here are diverse chemotypes, but again you see 0.55. That doesn't sound right, so we needed to see what exactly was going on. I went back to the original papers, and what we found is actually disturbing, because many authors simply state, "no data available, value of 1 assumed." And the people who built models on this data in the past ignored that; they just saw an RBP value of 1 and used it to build a new model. Same thing over here: it says 1 was assumed for basic drugs and 0.55 for acidic drugs. Something even funnier: this author says "assumed to be 0.55" and cites another paper; because somebody else had used the assumption, he used it too. That is disturbing, and more importantly, I really feel bad that people building QSAR models on these kinds of values do not even care to look at the footnotes. When a specific number rings a bell, look at the footnotes in those papers; you will find something. In the end we just removed those compounds from our analysis, built the model, and it was good. The last example is about user experience: when somebody is building a model, their experience counts a lot. This example was provided by my colleague Bob Clark. He was working on a classification model for metabolism, specifically ester hydrolysis, and he looked at some compounds from the WDI, the World Drug Index, where this drug is also available as prodrugs. When he carried out the ester hydrolysis of these prodrugs, he found that in one case the product had an even-numbered carbon chain and in the other an odd-numbered carbon chain, which is very unusual, because mammals have trouble metabolizing odd-carbon long chains; most prodrugs are built with even-numbered carbon chains. When he saw the odd-numbered carbon chain, he went back to the chemical structures. If you look at this structure, the ester should be C(=O)O, but it is drawn as OC(=O); it is a reversed ester. So even the World Drug Index had messed up these structures when they entered the data. If I had been building these models, I would never have picked that up, because I did not know about the problems with metabolizing odd- versus even-carbon chains. So yes, user experience counts a lot when doing data curation.
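The "assumed value" problem above is easy to screen for mechanically: if one exact value accounts for a suspiciously large share of a structurally diverse data set, it is probably a footnote default rather than a measurement. A minimal sketch, with made-up data and an arbitrary 10% threshold:

```python
from collections import Counter

# Made-up RBP data mimicking the situation above: many compounds share
# the exact assumed defaults 1.0 and 0.55, plus a few real measurements.
rbp = [1.0] * 35 + [0.55] * 22 + [0.8, 1.3, 0.67, 1.9, 0.71, 1.05]

counts = Counter(rbp)
n = len(rbp)

# Flag any exact value carried by more than 10% of the records.
suspect = [value for value, c in counts.items() if c / n > 0.10]
print(sorted(suspect))  # [0.55, 1.0]
```

Each flagged value then sends you back to the source papers and their footnotes before those records reach a training set.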
Lastly, I want to discuss why we should care about all this data curation. I'm going back to the two pictures I showed you at the start, the two famous news stories that everybody believed for decades. The fact is, if a lie is repeated a thousand times, it becomes a truth. In 1993 it was admitted that the original Loch Ness monster photo had been staged: the monster was made out of some plastic, some clockwork, and some toys. That means the picture which appeared in 1934 was a lie, a fake photo. Similarly, the Bigfoot story had been around since the 1940s or '50s, until the original hoaxer's family revealed the secret that the whole thing was a hoax: Wallace had made the oversized footprints with a set of carved wooden feet. If these people had not revealed it, we would never have found out that these were fakes. But in the meantime, the stories had become so much of a "truth" that, as these 2017 news items show, people are still going out there: it was a record year for sightings of the Loch Ness monster, and people still believe there is a monster there, or that there is a Bigfoot out there. Once something has been in the news for so many decades, people don't believe it's a lie; it just becomes true. And this is exactly what we fear in chemistry as well. Let's look at an example: amiodarone and amiodarone hydrochloride. The solubility of amiodarone is reported to be 700 mg/L; these are the snapshots of the data sections, and the reference is the Merck Index, which we consider an official record. If we look at its salt, the solubility of the hydrochloride is 700 mg/L as well, with the same reference page. But everybody who works with chemistry knows that the solubility of the plain molecule and of its hydrochloride salt cannot be identical. When my colleague Bob Clark was building solubility models using ADMET Modeler, the program consistently found this compound to be an outlier: its nominal solubility of 0.7 mg/mL, or 700 mg/L, is predicted to be lower than what the record says, because we built the model only on plain molecules, not hydrochloride salts. Unless we build the model and find the outlier, we would never discover that there is something wrong with these two entries, and these are records in the Merck Index, an official reference. Another example, which my colleague Marvin Waldman found, concerns pentazine. Many of you may know, or may not know, that this is a non-existent compound; it does not exist. But it got into the literature through computational work trying to rationalize the fact that it doesn't exist: they just drew the compound, carried out some computational studies, and showed that it does not exist. Even so, it has a CAS number, and surprisingly it is available on PubChem; if you go to the PubChem entry, there are a few chemical vendors ready to sell this non-existing compound. Once that happens, it becomes a virtually real compound. Somebody will ask, how is this going to matter in terms of drug design? Here is how, if you just follow bioisosteric replacements: diazepam to bromazepam is a simple bioisosteric replacement. And how does this play out with AI? People just skip looking at the chemical structures.
structures and just go on building the models designing the virtual library of compounds hundreds of thousands of compound library so i won't be surprised down the line someone designs compound wherein they have replaced the panel ring with this pentazine ring because this exists in pubca but this is you can buy it so why not design it so we have to be very careful actually when we are looking at these kind of data and whoever is using this actually i'm afraid down the line people will start using these even if it is non-existing compound they will start designing the compounds using these fragments and i did not mention actually the pubca website itself mentions that it is a hypothetical compound and another example actually which is my favorite this is another example from pubca so this is a chemical compound which is part of the pubcum database it has a molecular weight molecular formula and the reference reference is United States patent this compound is extracted from the patent and the patent reads lip polyamines and preparations and uses there so if you look at lip polyamine structures these are the structures which are lip polyamine structures so there is nothing wrong with these structures they are all the so this patent includes all the lipopolyamines and the substances or the raw materials that are used to synthesize them so where does this big compound come from the only graphics that looks very similar to this existing compound is a bar chart depicting comparison of transfection efficiencies now what happens is how did i come how did i why do i think that it is the same compound and let's look at it it has 19 carbon it is 19 by 11 grid and we have it as a 19 by 11 grid and they also have these little spikes pointed as carbons or metals so and there is no other graphic in the entire patent which will tell you that this compound is exist in this compound is from that pattern i this is i'm saying it is an assumption but i'm pretty sure this is the only 
And how does a structure like that get extracted? I already mentioned the ABCDs: Automated and Blind Curation of Data. There are programs that simply scan PDF files, and whenever they see a graphic they translate it into a chemical structure, without checking whether it is actually a compound. These are the dangers of automated and blind curation of data.

People asked me a lot after my last presentation at the GRC: can we automate this process? Yes, we can, but we have to be very careful. Everything I have shown you so far is fake or wrong data, but black swans do exist. Some surprising news is real: the fossil fish story, for example, was real news. You have to be careful when differentiating between fake and real. Here is a black swan in chemistry. I showed you activity cliffs before that were fake, but activity cliffs are not always fake. These two compounds have a hERG pIC50 of about four, while this very similar compound has a much higher pIC50, and that is genuinely possible: the acidic group in the one compound is responsible for the change in its hERG IC50. So this kind of black swan does exist in chemistry, and you have to be careful about it.

Many people asked me at the Gordon Conference, years back, about an automated curation procedure. I don't want to show the names of the authors, but you can go back and find out what the authors missed. In one case, compounds lost their aromaticity when automated curation was carried out over thousands of compounds; the authors neglected to look at the structures at the end and never compared the initial and final structures. In another case, the initial structure was an iodide salt, but the curation process made a covalent bond between the sulfur and the iodine. And these are not the only examples: my colleague Marvin looked into these data and found many compounds that had lost their aromaticity, and many with covalent sulfur-iodine bonds. So I understand automation is necessary, but you have to be very careful when using it: compare the structures, or at least do some sampling when comparing the final structures to the initial ones.

As I always say, when you are building a model, it is up to you whether you want an extraordinary model or just an extra, ordinary one. It all depends on how much you curate the data. Whatever effort you put into preprocessing and data curation makes the difference between an extraordinary model and an ordinary model.

So here are my conclusions. Watch out for the ABCDs; I showed you a few examples of what automated and blind curation can do. Be vigilant when using bioactivity databases and compilations: you cannot simply pull the data and build models; it doesn't work that way. Many people think QSAR is just a push-button approach, but it is not; a lot goes on behind building a good model. And automation is necessary, but it can be dangerous.
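The advice to compare initial and final structures, at least on a random sample, can be sketched in a few lines. This is a toy check on raw SMILES text (the function names are my own, and a production version would parse both structures with a toolkit and compare canonical forms); it is meant only to show how cheaply the failure modes above, such as benzene rings silently becoming cyclohexanes, can be flagged:

```python
# Minimal post-curation sanity check in the spirit of the talk:
# compare each structure before and after an automated cleaning step
# and flag records whose heavy-atom or aromatic-atom counts changed.
# This is a sketch on raw SMILES text; a production check would parse
# with a toolkit and compare canonical SMILES instead.

import random

ORGANIC_AROMATIC = set("bcnops")  # lowercase aromatic atoms in SMILES

def rough_profile(smiles: str) -> tuple[int, int]:
    """(alphabetic character count, aromatic-atom character count)."""
    letters = [ch for ch in smiles if ch.isalpha()]
    aromatic = [ch for ch in letters if ch in ORGANIC_AROMATIC]
    return len(letters), len(aromatic)

def flag_suspects(before_after: list[tuple[str, str]], sample=None, seed=0):
    """Return input/output pairs whose rough profiles disagree.

    `sample` optionally limits the check to a random subset, matching
    the "at least do a random sampling" advice.
    """
    pairs = before_after
    if sample is not None and sample < len(pairs):
        rng = random.Random(seed)
        pairs = rng.sample(pairs, sample)
    return [(b, a) for b, a in pairs if rough_profile(b) != rough_profile(a)]

# Benzene that "lost" its aromaticity to cyclohexane gets flagged,
# while an unchanged structure passes silently.
print(flag_suspects([("c1ccccc1", "C1CCCCC1"), ("CCO", "CCO")]))
```

Even a crude comparison like this, run over a random sample of a curated set, would have caught both failure modes described above before the model was ever trained.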
As I have shown you with an example, you have to be careful: at least do a random sampling of the structures after automated curation. And finally, I have been telling everybody: if you see something, say something. Maybe not now, but down the line. Take the fantasy-compound example: if we don't mention it to others, there is a possibility that in a few years somebody will actually come up with those kinds of compounds. So it is necessary to at least spread the word, and if we can contact the database people, of course we should tell them this is wrong and ask them to take it out, or at least add a note. PubChem has made a note that it is a hypothetical compound, but people ignore it, so we have to be very careful. As they say, we don't plant pears for ourselves, we plant pears for our heirs. So we have to be vigilant and speak up about it.

What I spoke about today is not only my own work; the entire Simulations Plus cheminformatics team has been involved in these data operations, and I am speaking on their behalf. Bob Clark, Marvin, Mike Bolger, Robert, David, and Michael, along with Eric and John, have always been supportive and have worked on the data curation with me. And of course I'd like to thank the Open Force Field Initiative team for giving me the chance to speak on this topic for this webinar. Thank you so much. I should also mention that we are hiring for the same kind of work I discussed today, data curation and model building; we have a position open for a postdoc.

Thank you. I have a question. I very much think what you're doing is great, but a lot of it relies on noticing things that are obviously crazy, which works really well for things that are obviously crazy. But I wonder about the things that are not obviously crazy that are probably also wrong, or at least some of them are. It's not obviously crazy if I claim a sighting that sometimes really happens, even if I didn't actually see it. The same goes for chemistry: a structure paired with bad data creates huge outliers, but what about things that are wrong but not wildly wrong? How do we catch those? Is there any hope of catching errors that are not as bad, like a wrong compound that doesn't create a huge outlier?

Yes, I completely agree that the solutions I mentioned won't be able to catch those, specifically a compound that is not very bad and is not strongly impacting the model. If you are building a model and doing some clustering, most of the time that gives you an indication that something is off. We do meet that kind of compound, but it is a very rare case; at least in the data we look at, we do not come across many of those. (I think we muted ourselves for a minute, apologies.) Okay, that's helpful.

Yes, so I'm kind of new to medicinal chemistry, but I've heard a lot about activity cliffs. At first I thought they were all data errors, and later I thought, wow, these are real and people just don't believe them. Now I'm hearing again that maybe several of them are data errors. Since you've probably looked at a lot of these, can you give some insight into what fraction of the activity cliffs you found are real, and what fraction are errors? So the question is: what fraction of activity cliffs are data errors
versus real. What I have realized is that most of the time, if you are looking at data from the databases, you have to be careful about that. I cannot give you a fraction for activity cliffs, because, as I just showed you, some of them are actually real. A few groups in Europe have been working a lot on activity cliffs. I would say it is roughly fifty-fifty: there is a good chance that some of the activity cliffs are real, and we did come across those. But if they are affecting your model a lot, then you have to look at them seriously: go back to the original paper and find out whether they are right. Then again, you never know whether the author made a mistake, as we saw in the example on one of the earlier slides. If your model is definitely telling you that something is an outlier, or that some data is suspicious, you can always reach out to the author of the paper to clarify your doubts.

Are there questions from Zoom? I don't think so. Are there questions from the room? Yes, please.

We do contact the authors. For example, for the automation paper I showed at the end, we did contact the authors, and sometimes the authors or the database administrators take it positively; they accept that yes, it was a mistake. But when one of my colleagues approached the database about the big grid molecule I showed you, those people did not accept it; they said they still wanted to keep it, and they did not want to remove it. So sometimes we get positive responses, and sometimes people say, okay, I know they are wrong, but we want to keep them. We have to take these databases with a pinch of salt. Sorry, I should have repeated the question: it was whether the errors I have shown on the slides have been taken care of by the database people; I shouldn't have mentioned that before.

We have a last chance for people on Zoom. Yes, well, they are... I don't think it's... well, yeah, it does have some kind of, but all of them are, I would say, cyclobutanes.

Okay, well, thank you very much, we really appreciate it. Thank you.