Hi, everyone, and good morning, good afternoon, or good evening, whichever time zone you're in. Thanks for joining us today. Our panel today is about automated screening of COVID preprints and the ScreenIT pipeline. My name is Halil Kilicoglu. I'm an associate professor at the University of Illinois School of Information Sciences, and I'll be moderating the panel. To make the session more interactive and fun, we'll have polls throughout the panelists' talks today. To participate in the polls, you can use the QR code on the slide that you see right now, or you can go to menti.com, the Mentimeter polling tool, and enter the code on the slide, which we'll also post in the chat. Rene Bernard, from the QUEST Center for Responsible Research, will be helping us with monitoring the Mentimeter. Thank you, Rene. Do I sound okay? Okay, somebody posted that I do. All right.

So before I give the floor to the speakers, let me give a quick overview of our panel today. We have four speakers. Our first speaker is Tracey Weissgerber from the QUEST Center for Responsible Research in Berlin, and she will introduce the Automated Screening Working Group and the ScreenIT pipeline. Next up, we have Peter Eckmann from the University of California, San Diego, and he will talk about some of the more in-depth details of the tools and the pipeline. Our third speaker is Colby Vorland from Indiana University, and he will discuss some of the lessons we learned from COVID-19 preprint screening. And last but not least, we have Anita Bandrowski from SciCrunch, and she will talk about how authors have responded to their preprints being screened. So without further ado, I'll let Tracey take the floor. Tracey?

Thank you, Halil. I just need a moment to share my screen. Okay, everyone should be seeing slides now. So I'm going to provide a brief introduction to what these automated screening tools are and what kinds of things they can screen for, and then I'll give a quick overview of our Automated Screening Working Group as well as the ScreenIT pipeline, which Peter will describe in more detail after me. The first question you might be wondering about is: what are automated screening tools? Automated screening tools are simply bots that are developed to detect common problems or beneficial practices in scientific publications, and that can include preprints as well as published papers. Some examples of things that automated tools might screen for include open data and open code; blinding, randomization, or power calculations per the NIH rigor criteria; limitation sections; or bar graphs of continuous data, which are a potentially misleading way to present data. There are tools for many other things as well, and Peter will give you a complete list of the tools in our pipeline in the next talk. The other thing the tools do is generate reports to share customized feedback with authors, readers, editors, or reviewers. This information can help authors to improve their preprint or manuscript before submitting it for publication, ideally, and it can also provide information that may be helpful to different types of readers when they're examining a paper. What are the potential benefits of automated screening tools? Well, the first is that the tools can screen many papers very quickly.
And this makes them a scalable solution, because we can examine a lot of papers in a very short period of time. They draw the attention of authors, editors, readers, and reviewers to things that can affect transparency, rigor, or reproducibility, or other aspects of good scientific practice that we would like to highlight. They can also identify problematic practices, and one area where they may be particularly powerful is identifying problems that are widely accepted as normal within certain fields. We all know as meta-scientists that poor reporting practices, or a lack of transparency about how critical methodological choices were handled, are fairly common. And unfortunately, if you wait for reviewers to flag these things, you may be waiting for a very long time, because it's unlikely that you'll be asked to change something that's a standard practice in your field. So the tools can raise awareness by flagging some of these things and directing people to educational resources that can help them understand better practices and implement them.

Everything is not perfect with automated tools; the tools have significant limitations, and I think it's important that we consider those limitations at the beginning of the session as well. The first is performance: tools are not perfect, sometimes they make mistakes, and some of those mistakes can be resolved while others can't. We always work to make sure our tools have the best performance possible, but we know they will never be perfect. The next thing is that tools can't always tell whether a particular item is relevant to a particular paper. We are working on refining the pipeline to get better information on this, but as of now the tool may screen for something that isn't in fact a relevant or important item for your particular study design. Another thing is that tools can't detect all potentially important factors: some factors may be too complex or too nuanced to develop a tool for, and they may not be things we can train a bot to detect. And the factors for which we have, or can create, tools may not be the most important factors. We will show you a number of tools in Peter's talk, but there may be other tools or things that you would like to see, and you'll have opportunities through the Mentimeter polls a little bit later in the session to tell us what you might like to have in the pipeline. The last reminder is that tools are not a replacement for peer review, so we really encourage readers to interpret the reports with an understanding of the limitations of the tools, and to recognize that the tools are designed to be interpreted by an intelligent human who knows more about the paper than the tool does.

So, with that in mind, we formed the Automated Screening Working Group, which was founded in January 2019 with five members and three and a half tools. The goal of the working group was to bring together scientists who have tools for screening the scientific literature. We wanted to bring together people who had tools in order to develop a community of tool developers, and we hoped that this might help us facilitate collaborations, shared problem solving, sharing code and solutions to issues that people are having, as well as collaborative projects that might lead to better and more complex tools.
We also hope that we will be able to create an "uber tool", or an "improve my research" button, by bringing all of our tools together, because no one wants to go to 20 different sites to screen their paper for one thing on each site. It will be much more efficient to access all of those things in one place. Part of the idea of this community as well is knowing when we have duplication of effort, where multiple teams are working on the same tool, and seeing if there are ways they can work collaboratively to build a better, stronger tool. We also want to know whether feedback from tools is effective in improving reporting, and to set standards for tool creators or for adding tools to the shared pipeline. Our group currently has around 25 to 30 members from the US, Europe, and Australia, and we certainly welcome additional members, so if anyone is interested you're more than welcome to contact me.

So we were working on networking and these other activities when the pandemic occurred. Many of you will know that a lot of scientists shifted their focus during the pandemic, especially in the early days, to more pandemic-related topics when they were able to do so. At the time the pandemic started, about 25% of COVID publications were preprints, and many of those were posted on medRxiv; some were also posted on bioRxiv. I'm sure we all remember that preprints were getting a lot of discussion in the press, because they were some of the earliest information available to us about the pandemic, which was urgently needed by health care providers. However, many of you will also remember that the scientific community was very concerned about the quality of COVID-19 preprints. Many scientists hadn't yet embraced preprints, or weren't aware that preprints existed. This was a challenge and a discussion both for the scientific community and for journalists and the public, because these studies were being reported in the media. Preprints are particularly interesting to us because the fact that they're not yet published means they offer a unique opportunity to improve reporting. If we screen published papers, the paper is already published and there's no opportunity for the author to react to the report. Whereas if we screen preprints, the authors potentially have an opportunity to make their manuscript more transparent before it's published. Our automated screening tools aren't perfect, but they allow us to intervene on a large scale, which the COVID preprint literature certainly requires. And hence the Automated Screening Working Group started working on this problem of screening COVID-19 preprints early in the pandemic.

So in response to the pandemic, we added some additional tool makers and tools to expand our pipeline. We pulled all of our tools into a single automated screening pipeline called ScreenIT, so that you get the results all in one place. And then we began using that pipeline to screen COVID-19 preprints posted on bioRxiv as well as medRxiv. The reports are posted publicly using the web annotation software Hypothes.is, and then we tweet out links to the reports via the Twitter account @SciScoreReports. We published an article detailing findings from a little less than the first year, 8 to 10 months, of screening, which appeared in January of 2021. So where are we today?
Well, today we've screened and posted reports on more than 18,000 COVID-19 preprints, and Colby will tell you about some of the results we found later in the session. We have a vibrant community of tool makers and other scientists who are interested in automated screening, and that community is divided into four main working groups. The pipeline group focuses on maintaining and updating the pipeline, as well as developing standards for tools to be added to the pipeline. The research and applications group focuses on how we're using the tools, exploring new applications or ways to use them for meta-research studies. The developers community and the statistics tool developers community are really designed to facilitate networking, collaboration, and shared problem solving. Together we're working to improve the pipeline, develop new tools, and make more tools accessible to scientists.

So we have now reached our first poll question. Hopefully some of you have Mentimeter open already, but for those who don't, you're welcome to follow the QR code here or just go to menti.com and enter the code. We have a multiple choice question there for you, and we're interested in knowing whether you would want your paper or your preprint to be screened. We'll give everyone a couple of minutes to answer that question. I am going to stop sharing my screen, and Rene will share some results with us in just a minute. If you're having trouble accessing the poll, just go ahead and post a message for us in the chat.

So we have a few answers coming in so far. It looks like many of you who have responded would be interested in having your paper screened, someone definitely doesn't want their paper or preprint screened, and some of you are uncertain. And nobody answered "I don't know", so that's good: everyone has an opinion. Okay, we'll give people just a little bit longer to enter their answers to the poll question, and then we'll go ahead and move on to Peter's talk. There's a question from Olavo to everyone about the sample here being a little bit biased. I think that is a fair point; it's very likely that a sample of people watching our presentation on a Saturday, as Olavo has pointed out, might be a bit biased in favor of screening, and it's an audience of meta-researchers, furthermore. Okay, we have time maybe for a few questions; if you have questions for Tracey, this would be a good time to ask them. I'm not seeing anything yet, so I think we can move on to Peter's presentation; people are more likely to have questions once they have more information.

Okay, can you see my screen? Okay. So hi everyone, I'm Peter Eckmann at UC San Diego, and I'm going to talk about the ScreenIT tools and the pipeline. Each preprint that we screen is downloaded, parsed, and then analyzed by a set of tools; these are the tools in blue on the slide. SciScore screens for rigor criteria defined by the NIH, and for the resources used in a paper, like software or cell lines, and whether they are identified correctly. ODDPub (Open Data Detection in Publications) checks for the presence of open data and code reported by the authors. The limitation recognizer reports study limitation statements made explicitly by the authors. Barzooka screens for bar graphs used for continuous data, which can be a misleading way to show continuous data. JetFighter screens for rainbow color maps, which can be hard to see for colorblind readers.
The trial identifier tool searches for and verifies clinical trial numbers, and then also reports things like the title and the status of the clinical trial. scite Reference Check checks for any references with editorial notices, like retractions or corrections. rtransparent reports conflict of interest, funding, and registration statements made by the authors. And Seek & Blastn checks for incorrectly identified nucleotide sequences, but that's a semi-automated tool, so it's not really part of the pipeline; a human has to go in and look at it.

So this is a general overview of the pipeline. We start with the preprint dataset from bioRxiv and medRxiv, and from that dataset we get a list of preprint identifiers. For each preprint we extract the text and the PDF; both are available on bioRxiv. The text is extracted from the full-text HTML page, so it's very clean and doesn't have anything else mixed into it, and it's fed directly into the set of tools that are text-based. We also have a study type classifier, where the text is used to classify whether a study is a modeling study or not, which can change what we actually show. The preprint identifier is also used to download the PDF, and this PDF is first fed directly into Reference Check. Then images are extracted out of the PDF, and those images are fed to the graph analysis tools. The results from all these tools are combined into one HTML report that's posted on Hypothes.is and then tweeted about on Twitter.

So, the preprint dataset: bioRxiv and medRxiv, which are the two most popular repositories of COVID-19 preprints, have a hand-curated set of COVID-19 preprints. Here's the page where they host that; it's updated daily and allows programmatic access, so it's easy for us to get access to the preprints.

Let's talk about extracting information from the preprints, and about the actual tools. For each tool, the pipeline provides three sources of input: images, which are extracted from the PDF; the raw PDF file; and text, which is extracted from the full-text HTML page. A tool can take any of these as input, or all of them, whatever it wants. And each tool must output an HTML summary of its findings, so it can be displayed in the report, plus data for insertion into a database, where we keep track of all the preprints that we've screened and all the results we've gotten from them, for later analysis.

We also classify study types. Using the text input, we classify whether a study is a modeling study or not. Right now we use a support vector machine to do that, based just on the frequency of words in the text, so it's not a very advanced system, but it seems to work well enough for us. We implemented this after researchers we talked to on Twitter told us the SciScore criteria did not apply to their preprint; they said, why are you telling us we need these things even though our study clearly does not deal with any of that? So we implemented it after that, and right now it just turns the SciScore criteria on or off: if it's a modeling study, the NIH rigor criteria don't apply, so we exclude them, but if it's not a modeling study we still include them.
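[Editorial aside: Peter doesn't show the classifier code here, but as a rough illustration, a word-frequency-plus-SVM text classifier of the kind he describes might look something like the following sketch, assuming scikit-learn. The training examples, labels, and function names are illustrative only, not the pipeline's actual implementation.]

```python
# A minimal sketch of a word-frequency + SVM study-type classifier,
# in the spirit of the one Peter describes. The training data shown
# here is toy data; the real pipeline's data and settings may differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training examples: full text of preprints with known labels.
texts = [
    "We fit a compartmental SEIR model to reported case counts ...",
    "Mice were randomized to treatment and control groups ...",
]
labels = ["modeling", "not_modeling"]

# Bag-of-words frequencies feeding a linear support vector machine.
classifier = make_pipeline(CountVectorizer(), LinearSVC())
classifier.fit(texts, labels)

def sciscore_criteria_apply(preprint_text: str) -> bool:
    """Turn the SciScore/NIH rigor criteria off for modeling studies."""
    return classifier.predict([preprint_text])[0] != "modeling"
```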
We're currently working on expanding this classifier, and eventually we'd like to be able to toggle specific tools based on the study type classification: if the study is of a certain type, we want these tools, because their results are still useful, but not those tools.

Finally, I'll talk about the HTML report and releasing it to the public. We had the question of how we can let users know that a report exists. bioRxiv and medRxiv have a heavily moderated comment section that bots cannot use; we tried to post there, but our comments were just removed. So we decided to make the results public through Twitter and Hypothes.is. The benefit of using Twitter is that the preprint web page displays a "tweets referencing this article" section where you can see all tweets that link to the article, so we can put our tweet there. Hypothes.is is a web annotation tool that can be used on any web page, and for us it overlays the report on top of the preprint (a rough sketch of what posting a report via the Hypothes.is API could look like appears below). So here's an example preprint on bioRxiv, and you can see a little Twitter icon here. Users can click on that Twitter icon, which pulls up an "evaluation/discussion of this paper" section with our tweet at the top. Users can then click on the link in the tweet, which pulls up the report; it comes in on the side of the web page. The report has the SciScore results, the limitations, all of that, and users can look at it side by side with the actual preprint. If they want, users can also click through to our Twitter account and see all the papers we've screened.

Other groups actually pick up our reports as we publish them on Hypothes.is and Twitter. Sciety, which is a review platform and "the home of public preprint evaluation", as they call themselves, publishes our evaluations on their website, so Sciety users can click on a preprint within Sciety and then see our report. And bioRxiv will hopefully soon also display our reports in an automated evaluations tab. Right now we're in the Twitter tab with a bunch of other people who are just talking on Twitter; hopefully we'll get our own special little place here, just for automated evaluations.

So you may wonder, how do I add my tool? I hope that our framework makes adding new tools easier, so that you don't have to worry about text parsing, making results public, or storing them; you can just focus on the actual tool. We have a verification process before we include tools in our production pipeline that goes on Twitter, but you're welcome to use our code for your own evaluations if you want, and add your own tools to the pipeline as well. Here's a link to the code, and you can also contact me at this email if you want to include your tool, especially if you already have performance metrics for it, and we can talk about including it in the pipeline.

So here's my Mentimeter question: which existing tools do you know of that you would like to add, and what else do you think would be useful to screen for? You can scan this code or visit the website here. All right then, if you're already on Mentimeter it should update as well.
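[Editorial aside: here is a minimal sketch of what posting a ScreenIT-style report as a public Hypothes.is annotation could look like, assuming the public Hypothes.is REST API (POST /api/annotations with a developer token). The token, preprint URL, and report text are placeholders, and this is an illustration rather than the pipeline's actual code.]

```python
# Minimal sketch: publish a report as a public Hypothes.is page note.
# API_TOKEN and PREPRINT_URL below are placeholders, not real values.
import requests

API_TOKEN = "YOUR-HYPOTHESIS-DEVELOPER-TOKEN"
PREPRINT_URL = "https://www.biorxiv.org/content/10.1101/2020.01.01.000000v1"

annotation = {
    "uri": PREPRINT_URL,  # the page the note is attached to
    "text": "ScreenIT report: ... (summary of tool findings) ...",
    "tags": ["ScreenIT", "automated-screening"],
    "group": "__world__",  # Hypothes.is's public group
}

response = requests.post(
    "https://api.hypothes.is/api/annotations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=annotation,
)
response.raise_for_status()
print("Posted annotation:", response.json()["id"])
```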
So we're interested in knowing what other tools you know of that might go into this pipeline, and what other characteristics of a publication you would be interested in screening for. Go to menti.com and give us your answers. I've stopped sharing. Yeah, I think answers are coming in now; we'll take another minute. Rene, do you want to share your screen? Okay, we're getting some answers. And if you're seeing something there that you haven't thought about and you're interested in it, maybe you can add it as well, just to make it bigger in the word cloud, so we can see which tools are most interesting to the audience here. So we see CONSORT criteria, statcheck, ethics approval, plagiarized figures, data sharing. We actually have data sharing in the pipeline, if I'm not mistaken. Yeah, so we have data sharing, we have conflict of interest screening, and SciScore also has power calculation, since that's one of the NIH criteria. So some of these are already in the pipeline, but others are not. "Fake address", I'm not sure what's meant by that, but it sounds interesting. There is the detector for plagiarized or bot-created papers, I always forget the name of it, but that is one of our consortium tools that we may potentially put into the pipeline. And there is a question for Peter, I believe; it's more of a comment from Mario, who earlier asked whether we planned to let users self-screen before posting preprints. Yeah, so that's one of the things we're looking at doing. We didn't choose to do that initially just because it was a public health crisis, so we wanted to simply make our results public, but that's definitely something we're thinking about doing as we go forward. I'm going to point out the "tortured phrases" paper, which makes sense in the case of the fake addresses.

All right. Great, so our next speaker is Colby, and he will talk about some of the results that we got from the ScreenIT pipeline.

Great. Let me rearrange my windows here; hopefully I'm on the right screen. All right, thank you, Halil. So I'm going to start by continuing from what Tracey and Peter introduced, that is, the COVID-19 preprint screening. As they explained, the group has been feeding preprints from bioRxiv and medRxiv through the pipeline to get a sense of how COVID-19 preprints are reporting certain items. These preprints, as Peter explained, are from lists that bioRxiv and medRxiv have hand-curated as preprints related to COVID-19. We're early in our exploration of the complete data, but I will report what we have so far. And just to put this forward right at the start: these results are the raw results of what the algorithms identified, and while I'll show some early analyses with exact percentages, that precision needs to be interpreted in the context of what Tracey introduced as the limitations of these tools. As of September 4 of this year, over 18,500 papers have been screened, and you can see the cumulative preprints published in the figure on the left. On the right, you can see the number of preprints published per week that have gone through the screening pipeline, and I'll orient you to this graph briefly, as you're going to see it a lot. The x axis starts at January 1, 2020; near the center is December of 2020; and the far right is this month. The y axis reflects the count of preprints per week. So in this figure we see preprints that have a statement about open data.
Dark purple are preprints with open data statements, and light purple are those without one identified. An example sentence that was picked up by this algorithm is on the right: it says "we have posted our data analysis at the Open Science Framework", followed by a link. Overall, the algorithm identified that 16% of preprints have data statements, and this didn't vary much between the first three months of the pandemic, January to March of 2020, and the most recent three months, June to September. Code sharing reflects a similar pattern. Here's another example on the right of a sentence that was picked up: "data and all relevant code is available on GitHub", followed by the link. Here we see that overall 12% of preprints were identified as having open code, and this also did not change much when comparing the first and most recent three months.

If we look at registration identifiers, I'll note that this does not necessarily reflect preregistration; it may reflect registration after data collection, and that's a difference we haven't looked at quite yet. An example statement, again on the right, is "we conducted two RCTs", with the ClinicalTrials.gov identifiers listed in that sentence. Overall, registration identifiers were flagged in 4% of preprints, with a very slight bump in the most recent three months compared to the first three. Here you could imagine that the types of studies that would be registered tend to take more time to plan and wouldn't be published in the first few months of the pandemic, so it's interesting that the difference between the beginning and more recently is still small.

Here are statements identified as conflict of interest or funding statements, and you can see that the dark purple, the preprints identified as containing these, are in the majority. Example sentences from the screened preprints are below each figure: on the left, "the authors have no competing interest to declare", and on the right, "this work is supported by X sources". Overall, 98% and 92% of preprints contained conflict of interest and funding statements, respectively. Conflict of interest statements in particular rose from 80% in the first three months to 99% in the most recent three months. Sentences that indicate study limitations were also identified. Here is an example sentence on the right: "some limitations should be considered when interpreting the results of this systematic review". That's an example that would be flagged by the algorithm. Overall, 41% of preprints were identified as having at least one limitation statement, and this increased from 31% in the first three months to 45% in the most recent months.

One of the tools in the pipeline screens for rainbow color maps in figures, so let me give a very brief overview of why we and others are interested in this. As Peter said, those who are color vision deficient are unable to process much of the information in rainbow color map schemes, and even for those with typical vision, there are issues with how interpretations of figures and data change when a rainbow color scheme is used; sometimes these interpretations change in misleading ways. The algorithm identified that rainbow color maps were present in 5% of preprints overall, and there was not much variation over time.
Another tool in the pipeline screens for what we call problematic graphs. Plain bar graphs should not be used to present continuous data, because the mean and error shown in a plain bar graph can result from many different data distributions, and other forms of showing the data are much more informative. Tracey has a great paper on this in PLOS Biology, and that reference is at the bottom right of the slide. An example bar graph, in the middle in gray, is one that is not ideal, with the bar-dot plot in orange showing a better way to reflect the data, with individual data points. Overall, 7% of preprints were flagged with at least one such bar graph, and more informative graphics such as bar-dot plots or dot plots were observed in 3 to 5% of preprints.

Ethical statements are also identified, and here I show a figure on the left that reflects ethical statements in general, which include both animal and human statements, and on the right, IRB statements, which are specific to human subjects research. You'll notice that these look a little wonky, and that's just because there was a change in the way the data was stored, which I haven't yet accounted for in the analysis; but within the two periods that we do have, we see that about 50% of papers contain an ethics statement and 33% an IRB statement. And with the goal of increasing the study of sex differences in biology, one algorithm looks for whether sex is reported. An example of this is shown on the right, where both males and females were reported in this particular sentence. Overall, 11% of preprints address this, and this increased to 27% among preprints where an ethics statement is also detected.

Next we'll look at items related to design rigor: randomization, blinding, and sample size calculations. There's a lot on this slide, so we'll start on the left with randomization. The first panel shows the percentage of preprints overall that include a mention of randomization. Interestingly, there's quite a big jump from 3% in the first three months to 21% in the last three months, which may reflect a change in the types of study designs preprinted over time, although this is something we're going to have to look at more closely. Then we have blinding, for which 3% of preprints mention some form of blinding, and this has also increased a bit over time. And finally, just 2% of preprints report a sample size calculation, and this has risen in a similar way to blinding, to 5% in the last few months. At the bottom of each of these you see overall estimates for each item when an IRB statement is also detected, which points to preprints that include human subjects research. These numbers are just a bit higher than the overall totals: 14% for randomization, 6% for blinding, and 3% for sample size calculations.

I've just thrown a whole bunch of numbers at you, and you might be wondering how they relate to the broader scientific literature: are the preprints better, worse, or about the same? There are a couple of previous surveys that applied these same algorithms to the PubMed Central Open Access subset, so we can make some direct comparisons between the literature overall and this set of COVID-19 preprints. And so, going through them one at a time, we see that for open data and registrations,
these statements are present at about the same rate in COVID-19 preprints as in the broader literature. Statements on open code are, interestingly, 9% higher in the preprints. Conflict of interest and funding statements are 78% higher in preprints than in the overall literature. Rather interestingly, randomization, blinding, sample size calculations, and sex as a biological variable are all quite substantially lower in COVID-19 preprints compared to the more general literature; in fact, they reflect numbers more like what the general survey observed in 1997, not in more recent years. We have a number of items yet to dig into, such as the proportion of preprints that use authenticated cell lines, but an earlier analysis by the group based on these COVID-19 preprints, up until July 17 of 2020, indicated that it was quite low, at 7%. So there will be much more to come as we look at these and other items more closely.

So what have we learned from this process? One, it's feasible to use automated tools to conduct near real-time, large-scale screening of preprints and provide rapid feedback to authors and readers to complement peer review. We do these large-scale surveys, and we found in this case that items that reflect transparency and facilitate reproducibility are overall quite poorly reported in this sample of COVID-19 preprints, and that items that reflect rigorous research designs are similarly poorly reported. And finally, we want to again acknowledge that there are limitations to this approach, as Tracey introduced. Some items are not relevant to every paper; a systematic review does not have any people randomized, for example. So we're working on approaches to better tailor what is screened in each paper based on what type of paper it is. And tool performance is not perfect: some reporting criteria are easier to screen than others, but no tool reaches 100% accuracy. We plan to continue iterating and trying to get the best results that we can, and we really welcome you to join our group to help us improve this.

At this point, I will pose a question to the audience and leave it up for about 30 seconds, and that is: how can we make these results more useful for medicine? There are a couple of questions in the Q&A, but I will come back to them after the poll; maybe Rene can share the screen. One response I'm seeing right now is tailoring, increasing compliance; that's one of the goals of the pipeline, for sure. But also, from the pipeline we're generating a lot of data, and that might be something very useful and relevant for medicine as well. And it looks like that was one sentence: "run trials that show that this increases compliance", I think, is one answer. Yeah, and that's something we've been discussing in the group, so we hope to conduct a study of whether it does or not. Yeah, I think that was actually one of the most important things that we wanted to do when the group was founded, and it just got put off to the side when the pandemic hit and we decided that our time was perhaps better spent pulling all our tools together and setting up the screening pipeline to run as quickly as we could.

So, a question from Olavo; Tracey, you might be in the best position to respond to this: what are the criteria to establish that a preprint has open data and code? Does it have to have a working link?
Since this is ODDPub, I think, yes: ODDPub does not check the link, or whether it's functioning or not. It simply looks for statements that would suggest that data or code have been deposited somewhere. So it doesn't go out to other websites to check that the link is valid, or that the data or code are actually there, or that they're actually useful. The one caveat is that it performs much better for data and code deposited in repositories, as opposed to data or code made available in the supplement of the paper. And here is one place where we actually have two tools that can shed light on this; one of them is not yet incorporated into the pipeline, but we're working on that. SciScore has additionally added another verification that does go out to the repositories and check for those IDs. But this is one of the works in progress, where we're not quite sure yet what the visualization will look like when we put those tools together.

Yeah, there's a question for Colby from Gabrielle: when analyzing registration and randomization, did you somehow filter for RCTs only, in a sub-analysis? I think you can perhaps answer this question. We did not filter for RCTs. The algorithm, and Anita can comment more on this because this is part of the SciScore tool, picks up any mention of randomization, correct, Anita? It's not necessarily finding human RCTs; it's not necessarily finding randomization of well plates, or things like that. That's something we could try; there are some automated RCT filter algorithms out there that we could use as a first approach, to do an initial filter and then run these algorithms on the filtered set. But yeah, there are a lot of things we can still experiment with. Anita, did you want to add to that? No; there's no real way, other than looking at the paper, to determine whether an IRB approval is also provided. So when we have the ethical IRB statement, we presume it's some kind of human trial, although we don't know if it's a randomized controlled trial or not. We can only pick up the fact that it has an IRB and it has randomization, not that it is specifically an RCT. So we don't have that classifier yet, and I don't think we have one in the group of tools yet either. As Colby said, it would be great to see that. Thanks, Anita. There's another question, from Richard, about authors, but I think this actually provides a nice segue into Anita's talk, so we'll come back to it after Anita's presentation.

Thank you very much, and I will attempt to actually share my screen now. Here we are. Okay, so what I wanted to take a look at is how authors are responding to this screening, and this is actually one of the less easy things to really quantify, but I've made some attempts in this presentation. And by the way, I do have a conflict of interest disclosure: I am both UCSD Department of Neuroscience research faculty and also SciCrunch Inc. co-founder and CEO, so that is my conflict. So here is basically some user feedback, and I'm going to break it down into three different classes of feedback, and then I'll give you some examples.
So the first class is "thanks for checking my paper", which is really nice; we love that class, and I love that class particularly. The second is the "you have missed something" class, and I put that in a kind of middle area. And of course the third is the red area, which has some expletives redacted out of this presentation. By and large, the biggest class of things that I have seen, and I've actually been looking at the Twitter feed for the last year or so, is really "you have missed something", so it's in this middle portion. Sometimes that statement is more or less negative, but it's generally that your tools have failed to do something that they should have done. But let's look at the first category, and there are really two classes of these: there are some authors who are very concerned, which I put in the "heart attack" category, and then there are the others, these people here without the heart attack. Generally speaking, we get questions or statements from authors who are basically saying, hey, I'm going to re-score my paper, and thanks for actually scoring it, which is nice. Another one here says, thank you very much for reviewing the article, can you give me more information? All of these are responded to within a reasonable period of time. I actually had not responded to some of the very earliest ones, but I changed that after gaining control of the account, because these are real authors, real people, real papers, who are trying to do a good job of understanding COVID, and the least we can do, or at least that I can do, is respond to them.

There is one recent interaction I had with an author who is basically saying, hey, there actually is an explicit limitation section in the manuscript; and he's very nice about it. Again, this is that most prevalent category. So I'm responding to this person here: thanks for setting the record straight; sometimes the tools don't pick up everything that they're supposed to, as anyone yelling at Alexa might have experienced; we will update the note. And here I also updated the tweet. One of the things that you can see here is that the tweet will tell you that we detected three out of five rigor criteria, and one resource; in this case, that no limitation statement was found. Different things come into these tweets based on the reports. This is obviously very highly abbreviated; they would have to click through to the full report to actually see the whole thing. But this author was really happy, and basically said thank you for the response, and thank you for complimenting the paper; I actually did read the paper, and it was very good. And then I went ahead and updated the report and the tweet to reflect that there actually is a limitation statement.

Here's another one, another case of setting the record straight. This author is saying that yes, there actually is open code, and here it is, which is great. And this author is saying, not sure how this is generated, but it is not accurate, and here are all of the things that we actually did and did not address within this particular paper. So again, this is the most prevalent kind of interaction.
This is another interaction, one of the first interactions that we actually had that was not redacted, and which I found to be a little bit funny. This is a researcher who called us cyberbullies, with our "unsoliciting, defaming bioinformatics studies": you know, why would we randomize or use sex as a factor in this metaviromic work? So I read the paper that he pointed to, and I also read the papers that provided the original data for his paper. And my response was: your sample looks to be three affected male patients, 24, 37, and 74 years old; I can't find the sex of the controls in the original study; but why does it hurt to add demographic information to this kind of data set? I'm not sure if he understood that this might actually be a nice way to improve the transparency of his manuscript, and I have not found out whether this particular study now includes a better description of the patients. Many of the angry tweets end up being deleted by the author. We believe, and I believe, that people react very angrily at first, they might put that in public, and then they think about it for a second. So we do respond to all of them, and try to respond in as neutral a tone as possible: thank you for reporting a bug in our code, here is what I can do to fix this. I tend to apologize. But generally speaking, I think people usually start to think about what kind of message they're putting out to the public, and then they often remove those interactions. So that has been the more prevalent pattern for the "how dare you score my paper" kind of angry tweets. I have also gotten one very, very long email thread. So the reaction is definitely not completely neutral, is what I would say.

So, besides this handful of reactions, some positive, some negative: is anybody actually looking at these things? This is the largest subset of the data that you can reasonably pull out of Twitter, which is three months' worth of data on their analytics platform. In the last three months, our tweets have received 75.2 thousand impressions, and this is when we were posting a lot of content, and here is when it was being viewed. Now, I was able to pull more information out of these figures, and I've summarized it here. One of the things we have done is tweet 19,000 times, so that's just over one tweet per article; some of those are duplicated, but not many. The total number of impressions as of last week was over 750,000, so these things have been viewed a lot. Interestingly, if you look at the profile visits, the visits to the profile on Twitter, that number is more than the number of tweets. So it really does suggest that, while maybe not many people are responding or liking or doing the traditional Twitter things, they are absolutely seeing this, and they are actually seeing the reports.
There are about 150 mentions, so people are retweeting and tweeting about our bots, and we have 269 followers, which doesn't sound like much, especially when you're looking at these kinds of numbers for how many people are seeing this. And then, looking at the different metrics here in terms of the number of tweets: in blue you have the number of tweets, and this is on a log scale, so that some of the smaller quantities, like mentions or new followers, can be visualized on the same graph as some of the bigger metrics. Here in June of 2020 we posted a lot of things all at once, and that just reflects that we had scored everything beforehand, and by that time we had understood that we couldn't post to bioRxiv comment sections automatically. So this is when we really turned on our pipeline, having scored everything before this point. And then here we were generally posting about 1,000 preprints per month in some months, and this just reflects the state of the pipeline. Here the pipeline was turned off for some upgrades, so we posted those reports later on. And you can see that I didn't see much of a bump, maybe a little bit of a bump, in terms of views and interactions and mentions after the paper came out; the paper came out here, and there's not really a huge bump in followers or anything else. But it certainly does show that things are going on and people are actively looking at these reports.

And definitely, under that category, we have also had a little bit more support from the bioRxiv staff, so Richard Sever now comes to our meetings often. Also, just a few days ago there was a very long thread about preprints and whether they are good or bad, and Richard Sever was basically talking about SciScore reports and the ScreenIT pipeline, and how there are already ways to start to think about the quality of these preprint manuscripts. So in terms of entering the consciousness of our audience, I think we're starting to get there.

And this is just a little contrast: there's another bot that we have created, thanks to Peter; actually not we, Peter has created it. This is just a little bot, the "thank you for using RRIDs" bot, and essentially that's all it tweets; it just says, hey, we found this paper, it has RRIDs, thank you for using those, they improve transparency in science. And this one tends to get much more positive responses; we still haven't gotten a negative response on this particular account. So we know that people will not necessarily discount all bots and see them as negative: when they're thanked by bots, they're okay with them, but when bots are telling them that something is wrong with their paper, they're a little bit less sanguine, or a little more concerned about what the bots are saying. What I wanted to bring up now on the Mentimeter, if you could go to menti.com, is this question: now that you have heard us talk and you know a little bit more about these tools, are you more likely or less likely than before the session to want to have your paper screened? So I'm going to go ahead and stop sharing and yield the floor
to my colleague with the Mentimeter. Anita, while people are answering the Mentimeter poll, can you briefly address how many responses you get in an average month? Because by far the most common response, I think, is no response at all. Yeah, absolutely. So the average seems to have been going down a little bit; the average number of responses is on the order of maybe a couple a week. It's not a huge amount, it's absolutely not a huge amount, and what I count as a response there is not liking or retweeting; it's something that I have to attend to: something has gone wrong with the pipeline, you have made a mistake, or some such. If it's just saying, hey, you guys are doing a great job, I'm not going to count that; although, you know, maybe I should.

Anita, there was a question from Mario, I'm not sure if this was addressed, which is: do you remove tweets, or do you just update them? I cannot update tweets; tweets are not updatable once they're posted. I can remove one and put up a new tweet with the updated information. So I will literally copy the tweet, update the information, and repost it. And do you ever completely remove it? What was that, I'm sorry? Do you ever completely remove it, on the author's request? Um, I have not. I have removed parts of a tweet. For example, if they say, this is not relevant, this part of the scoring is not relevant to my paper, I will look at the paper, and I'll say, yes, that is absolutely not relevant to your paper, so it's something that I will remove as a part of the response, and I'll retweet it without that piece, and also change the Hypothes.is report. Okay.

So there's a question from Richard, and Anita, you might want to take this one: do you have any thoughts about how to motivate authors to take action on the tools' findings? I think this is for the entire panel. I really don't know how to do better. You know, there is this kind of naming-and-shaming concept, and we could say, oh, we can name and shame, but I'm not sure that's always the best strategy. I mean, I would have loved to have made this a little bit more silent, and not so public, but with the pandemic I think we were fully justified in going public, especially with preprints. And, you know, several of the people in this group have done large analyses on top of PubMed Central, and I am very hesitant to release those results, because, I mean, in aggregate everything has been released, but the individual results are more of the naming-and-shaming variety, and I don't love that strategy. I think we need to figure out how to encourage authors to pay more attention before the paper is published. One of the things that we've been doing inside SciCrunch, and this is on my commercial end, is working with some of the society journals to screen manuscripts before they come out as published manuscripts. Our tool can be run, or actually must be run, within some of the publishers, at least once, and in some cases many times, during the review process, and recently the American Association for Cancer Research has actually mandated that a particular score is necessary before they'll publish the paper. It's not a high score, but it is a score, so I think more people are starting to pay a little bit more attention to that.
But I don't know. Tracey, what do you think? So, I think one of the things that we really want to do with this is use it as an education tool, because many authors simply aren't aware that these things are important, and they may not be commonly reported in their field. If you look at the reports, they actually have links out to resources that can help authors understand why something is a problem and how to implement better practices. We have been discussing putting together some more concrete educational materials, things like two-to-three-minute YouTube videos for each item reported, or other, potentially more engaging formats for sharing this information with authors, but we are certainly open to suggestions as to how we can engage authors more in this process, and I would encourage everyone to post suggestions. Another challenge we have here is that some authors don't understand that we're doing this with every COVID preprint, and so there's a feeling of being singled out: they've never seen this before, and why my paper? I think if screening were more normalized, and it became commonly known that this is just a thing that happens, and that these tools have limitations, that they're there to help but they're not perfect, it might be easier for people to just look at the report. There was a comment in the chat about people taking it as personal criticism, and I think that really is an issue for some people. Some people get it and understand our intent in creating it, but it's hard to really convey that in a tweet, so I certainly understand why people might misunderstand. One of the things Anita has mentioned previously is that when she replies, it helps people realize that there are actually humans behind these tools, and that we're trying to make science better. It's not just a faceless robot that tweets things randomly and doesn't care what it's doing; we really are trying to make things better. But that takes time, and we're not going to be perfect from the beginning, and we probably won't ever be perfect; it's just a matter of going through the stages to get better as quickly as we can.

So before answering the other questions in the Q&A, which are broader questions, we'll do the last poll of the day, which is on future directions. You can go to Mentimeter and give us your opinion about what the future directions for this pipeline should be. Some of these already came up in some ways, but we're interested in knowing which ones to prioritize. And I'll just raise this, Halil, while people are doing the poll, because something was mentioned in the chat about whether we can do a follow-up study to determine whether the preprints that we screened subsequently improved their reporting. One of the challenges there is that we have no control group: we have screened every COVID-19 preprint. So we would essentially have to compare to non-COVID preprints, which is a bit of a challenge for a study design. Yes. Okay, you can perhaps go ahead and share the screen with the Mentimeter. And also, there was another question in the chat about whether we'd consider a pilot in which we would see whether we'd get better responses or better compliance if we sent the report by email instead of posting it publicly. Yeah, this was a pretty intense area of discussion.
So, one of the first tools that was developed was a tool that screened bioRxiv preprints. What happened with that tool was that it sent emails to authors, and authors who liked the tool emailed the tool creator, while authors who didn't emailed bioRxiv and sent bioRxiv some really unpleasant things, not understanding that the tool wasn't part of bioRxiv. It was essentially a separate thing that bioRxiv had nothing to do with. That created a lot of complications in our relationship with bioRxiv early on, and we would like to avoid that in the future. So we've been a little bit cautious about going to the email approach, and we have been trying to work closely with bioRxiv and medRxiv to find an approach that works for us, that works for authors, and that works for them. It's a very complicated set of relationships for everyone to navigate; it's not a simple issue.

Okay, so our Mentimeter poll results show that better study-type-specific screening is the top choice for future directions, and I'm happy to report that we've already started working on this, using machine learning tools to try to identify the study types. And just going back to the chat: there are also questions about seeing whether this kind of automated screening tool could lead to changes even when the intervention is not planned as an RCT, and that's certainly something for the future.

So now it's time for any other questions you might have, so I'll go back to the Q&A. The first question we have, from Richard, is: is there any indication that journal editors look at the reports when evaluating preprints for publication in their journals? I'm not sure that we did any follow-up on that, but Tracey, Anita, do you have anything? So we do have some heartening results from the Sciety integration, because this now becomes part of Sciety, which is a platform put together by some of the eLife staff; it's specifically not branded as a part of eLife, because they want to make it much more open. So we do know that they are picking up these reports and using them. Again, we haven't done any specific outreach to see who's using them as they publish eLife papers, and as you know, those eLife papers are evaluated by real peer reviewers. I do know of another group, JMIR, one of the publishers; they're looking at a similar type of integration, and they're also very interested in these kinds of automated screening results on top of preprints. So that's another group that really wants to use the information; I don't know how much they have done and how much they're giving to their reviewers. And for SciScore itself, we do have interactions with editors, but those are direct interactions with our tool through the back-end systems. So I'd say we're still a little bit early, and we haven't done the outreach that we need to do in order to really answer that question adequately. Anything anybody wants to add to that?

Thank you. Another suggestion in the chat is whether we would get more understanding from the authors if bioRxiv and medRxiv officially endorsed the pipeline and alerted authors during submission that they will get an email from us. That sounds pretty reasonable. That's been explored; as I mentioned, it's a complicated relationship.
I think the important thing to remember here is that bioRxiv and medRxiv have a lot of different resources and third-party applications doing things with data based on papers posted there, so for them the issue is much more complicated. It's hard to make a decision about one individual thing without opening the door to all of the other individual things. So, yeah: complicated.

Another question in the Q&A, from Michael Andradez: interesting tool; going beyond the COVID-19 papers, do you think this tool, or SciScore, is already mature enough to be employed in curriculum evaluation, grant distribution, and other scientometric scenarios?

Yeah, so the SciScore tool itself, and I think all the other tools as well, have been run on manuscripts. If we're talking about curriculum evaluation, one thing I would stress is that it would have to look at papers from before the particular curriculum and then at papers from the students, faculty, or whoever participated in that curriculum afterward. Those are not going to be fast evaluations, but they certainly can be done. From the SciScore side we've definitely looked at some of that; we're looking at some departmental data, and for our 2020 paper, Menke et al., we actually looked at the effectiveness of the Nature checklist, because we knew when it was implemented. We pulled out the data for that particular journal, looked at all the criteria we knew were associated with that checklist, and saw how they changed before and after the checklist was introduced. You're welcome to see all of that; I can post the paper again. There's a nice figure in there. So yes, I think it's definitely sufficiently mature for that kind of testing, but only on papers. If you're talking about evaluating grant documents, I haven't tested it on those, and we haven't done anything of that kind. Any evaluation on papers would be right up our alley. I also know that there have been other large-scale trials that Tracy and Tracy's colleague Nico have been part of, and I think rtransparent has also been run on a large swath of the literature, but not all of the tools have. Any comments from the panel on that?

I see there's a raised hand from Manuel Rush; I'll let him talk after Tracy. Manuel? Tracy?

No, I can maybe comment on a couple of things that were raised in the chat earlier, because it was quite busy during the talks. I think there was a question about whether we have done validation studies on the tools. There are validation studies on the individual tools, and I will post a link in the chat to a place where you can find that information. It describes some of the limitations of each tool, gives performance criteria from evaluation and validation studies, and provides links to those studies where they are already published and publicly available; you just need to click on the information for each tool to get that data. With regard to doing an assessment specifically on COVID-19 preprints, and whether all the tools work well on that particular corpus, that's something we'd like to do but haven't had time for yet.
In the chat there's certainly a lot of interest in evaluating the tools by involving the users, treating the report as an intervention, and looking at before-and-after differences. I would say that's not very easy to pull off, but it would certainly be a very useful thing to do. Any other questions from the audience? You can also raise your hand to speak directly. Comments from the panel? Maybe a question? I don't think the audience is able to unmute themselves. If they raise their hand, I can do it. Oh, then you can do it. Okay, got it. Oh, you have special privileges. Got it, I remember now. Any other questions? Go ahead, Tracy.

I was just going to say, a couple of other things are important to remember here. Our pipeline includes both transparency tools and tools that screen for the quality of particular criteria. For example, SciScore is a transparency tool; it just wants you to report things. If you say "all my animals were male," you get credit for reporting that. Whereas ODDPub, the open data and open code detector, is more about whether your paper actually has open data and open code, not just whether there is a data availability statement. So when we think about this, we're thinking about how the tools will evolve as the needs of the community change. As people get better at reporting things transparently, we will hopefully be able to implement new functions that start to check the quality of those statements and encourage people to implement better practices or further improve reporting. The other thing we've been nervous about so far is including, or even thinking about, tools that check for misconduct, because we don't feel that's information that should be posted publicly; we feel it would be much better handled through private mechanisms. So right now the group continues to feel strongly that tools related to misconduct should not be part of the pipeline; they should be handled in a different, less public, more private way.

I have a question for Peter. Peter, you said that you made the tool available on GitHub and that others can run it but also incorporate their own tools. Is that an easy thing to do? Is there an API that one can simply plug into?

So yeah, there's not a really easy-to-use API yet; that's something we're working on. But it's not too difficult: right now the tools are implemented as Python methods, so the process that runs the pipeline just calls a different function for each tool. It takes whatever input you want, PDF input or text input; that goes into the function, and the function just returns its results to the pipeline. So if you know Python, it's not too difficult, but improving the API to interface with different types of tools is definitely something we're working on. Thank you.
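[To make Peter's description concrete, here is a minimal sketch of what a tool plugged into the pipeline might look like under the convention he describes: the runner calls one Python function per tool, hands it the extracted paper text, and gets back a plain dictionary of results. The function name, report fields, and keyword patterns are all hypothetical, not the pipeline's actual code; as a toy payload it also illustrates Tracy's transparency-versus-quality distinction, checking for actual repository links rather than just an availability statement, in the spirit of (but far simpler than) the real ODDPub.]

```python
import re

# Toy repository patterns in the spirit of an open-data check: look for
# links to known repositories, not just a data availability statement.
# The real ODDPub uses a much richer, validated pattern set.
REPOSITORY_PATTERNS = [
    r"zenodo\.org",
    r"osf\.io",
    r"figshare\.com",
    r"github\.com",
    r"doi\.org/10\.5061/dryad",  # Dryad dataset DOIs
]

def screen_open_data(paper_text: str) -> dict:
    """Screen a paper's text and return a small report dict that a
    pipeline runner could merge with the output of other tools."""
    hits = [p for p in REPOSITORY_PATTERNS
            if re.search(p, paper_text, flags=re.IGNORECASE)]
    has_statement = bool(re.search(r"data availability", paper_text,
                                   flags=re.IGNORECASE))
    return {
        "tool": "toy_open_data_check",
        "data_availability_statement": has_statement,  # transparency only
        "repository_links_found": hits,                # actual sharing
        "open_data_detected": bool(hits),
    }

if __name__ == "__main__":
    # A runner would simply loop over registered tool functions like this one.
    text = "Data availability: all data are deposited at https://osf.io/abc12/."
    print(screen_open_data(text))
```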
Rachel Janssen has a question in the Q&A. She says: I really appreciate the use of tools focused on images, namely the one identifying rainbow color maps. Is there any other initiative aimed at making papers more accessible and avoiding ableism?

I would say that right now that is the only tool we have in that space; we are somewhat dependent on the interests of our tool developers and our tool developer community. So if there are people working in that space who would like to join us, I would certainly encourage them to do so. One of the nice things about our tool development group is that there are a lot of other tool developers, so when you have questions or need help or support with something, there are other people there with experience, people who either have tools in the pipeline already or are developing tools they hope will ultimately meet the criteria and be added to the pipeline, whom you can get advice from and collaborate with. It's a group whose meetings I really enjoy, hearing the exchanges and conversations between people. So yes, if there are aspiring tool developers out there, or people who already have something partially developed, I would really encourage them to contact us and think about joining that group.

Thank you, Tracy. And one of the things I found in the chat was a couple of questions about statcheck and other statistics-validation topics, which I don't think we've really addressed yet.

So statcheck is actually part of the group of statistical tool developers, and rtransparent, excuse me, statcheck, is definitely also part of that group of tools. One of the issues with statistics is that it's incredibly complex, especially for a set of tools like ours. statcheck can work really well when authors report in APA format. However, we know that a set of journal editors has run statcheck on non-APA-formatted journals, and statcheck basically came up with nothing. The problem is that statcheck expects certain things to be in certain places: degrees of freedom and other numbers in a particular format. Unfortunately, not all scientists report in that format, and when they don't, their papers can't be tested with statcheck. So Colby and others are looking at many other methods of pulling out statistical information; there are a lot of papers about statistics. This is going to be an ongoing set of tools that will be added to iteratively over many years, because the problem is very, very complex.

Yeah, I think another challenge there, particularly for small-sample-size studies, is that statcheck checks for concordance between p values, degrees of freedom, and test statistics; if those things aren't being reported, statcheck has nothing to check (a toy sketch of this kind of concordance check appears below). I'll just put a link in the chat to a paper we published called "Why we need to report more than 'Data were analyzed by t-tests or ANOVA'", which basically found that in small-sample-size studies, fewer than 25% of papers report exact p values and fewer than 3% report the test statistics and degrees of freedom. So statcheck would not be useful until the quality of statistical reporting improves, or, as a colleague of mine summarized it, in order to check statistical reporting, there needs to be some statistical reporting going on.

Thank you. Thanks, Tracy, for that clarification. We're at the top of the hour, so we will end the panel here. I want to thank all the panelists for their informative presentations, and all of you who joined today for your comments, questions, and suggestions; you've given us a lot to think about for what to do with the pipeline next. Thank you, and have a nice rest of the day and a nice rest of the conference.
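[As a postscript to the statcheck discussion above, here is a minimal sketch of the kind of concordance check described: parse an APA-style t-test report, recompute the two-sided p value from the reported test statistic and degrees of freedom, and flag a mismatch. This is an illustration of the general idea only, not statcheck's actual implementation; the real statcheck is an R package with far more careful parsing, covering t, F, chi-square, r, and z tests, one- versus two-sided reporting, and rounding bounds, which is exactly the parsing fragility discussed above.]

```python
import re
from scipy import stats

# Toy statcheck-style check: find APA-like t-test reports such as
# "t(24) = 2.53, p = .018", recompute the implied two-sided p value,
# and flag inconsistencies. Pattern and tolerance are illustrative.
APA_T_PATTERN = re.compile(
    r"t\((?P<df>\d+(?:\.\d+)?)\)\s*=\s*(?P<t>-?\d+(?:\.\d+)?)\s*,"
    r"\s*p\s*=\s*(?P<p>\.?\d+(?:\.\d+)?)"
)

def check_t_tests(text: str, tolerance: float = 0.005) -> list:
    """Return one concordance report per APA-style t-test found in the text."""
    reports = []
    for m in APA_T_PATTERN.finditer(text):
        df = float(m.group("df"))
        t = float(m.group("t"))
        reported_p = float(m.group("p"))
        # Two-sided p value implied by the reported t statistic and df.
        computed_p = 2 * stats.t.sf(abs(t), df)
        reports.append({
            "reported": m.group(0),
            "computed_p": round(computed_p, 4),
            "consistent": abs(computed_p - reported_p) <= tolerance,
        })
    return reports

if __name__ == "__main__":
    # The second result is deliberately inconsistent.
    sample = "Group A > B, t(24) = 2.53, p = .018; C > D, t(10) = 1.20, p = .001."
    for report in check_t_tests(sample):
        print(report)
```

[Note how, as Tracy points out, the check is silent whenever the test statistic, degrees of freedom, or exact p value is missing: there is simply nothing to recompute.]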