Hi, everyone. Welcome to this presentation on detecting health advice in medical research literature. I'm Yingya Li, a doctoral student at the School of Information Studies at Syracuse University. This paper is collaborative work with research scientist Dr. Dream Wang and my advisor, Professor Bei Yu; we are from the Lentis Lab at Syracuse University.

To set the stage, I will first introduce what health advice is. In this study, we broadly define health advice as actionable clinical and policy recommendations. In medical literature, a recommendation can be weak, when it is an indirect suggestion for health-related behavior change, or strong, when researchers give direct clinical or policy suggestions.

So why does it matter to detect health advice? Evidence-based health advice plays a critical role in medical practice and public health policy. Health researchers and practitioners often rely on information services such as PubMed and the Cochrane Library to access such advice. However, navigating the large number of publications can be a daunting task. For example, during the COVID-19 pandemic we observed an explosion of research papers about the disease. The strong need to make sense of this fast-growing scientific evidence prompted specialized data hubs and search platforms such as LitCovid by NIH. This platform organizes research papers by topics like transmission, diagnosis, and treatment. Currently, however, it does not provide a function to support direct retrieval of health advice, even though navigation and summarization of health advice would be very useful.

Let's look at one example, about whether to use hydroxychloroquine (HCQ) as a treatment for COVID-19. The paper shown here is a widely cited and reported study on using HCQ to treat COVID-19. The study recommended using this medicine, but it was later criticized for the lack of randomization in its design. With new evidence and research methods, another group of researchers later recommended against using HCQ for COVID-19. Without support for direct retrieval of advice, researchers and practitioners still need to spend a lot of time gathering such conflicting and changing advice.

Besides the need for accessing health advice, whether to give advice at all, especially based on individual study results, is a controversial issue. On the one hand, medical experts argue that health advice in individual papers may lack a comprehensive review of all evidence and alternative practices. On the other hand, researchers have been encouraged to translate their research findings into actionable practice.

Medical researchers also face the question of where and how to give advice. They can give advice in the conclusion subsections of structured abstracts; toward the end of unstructured abstracts, where study conclusions are reported; or in the discussion sections. The location of advice can affect its reach and impact: for example, clinicians are more likely to read the full text, or to rate a treatment as beneficial, if the abstract has discussed its significance or made a weak recommendation.

So, given both the need for health advice and the debate over its validity, a new information service that allows direct access to advice could reduce the information barrier and help with verification. Compared to other NLP applications, detecting health advice has not been well explored.
How to apply language technologies to automatically detect health advice and its related language phenomena, such as advice strength, has not been well researched either. We therefore aim to develop a computational approach to automatically detect and categorize health advice. We seek answers to the following research questions: To what extent can NLP models identify health advice? After building the prediction model, we also conduct a case study to see what health advice has been given regarding the use of HCQ for COVID-19 treatment.

In the following slides, I'm going to describe our research methods. Sentences containing health advice account for a very small portion of the medical literature. To avoid annotating a large number of non-advice sentences, we annotated a sample of sentences from the conclusion subsections of structured abstracts, which are usually only a few sentences long. Study designs lead to different levels of evidence for medical decision making, so we sampled equal numbers of sentences from RCTs and observational studies and labeled each sentence as no advice, weak advice, or strong advice.

After corpus construction, we fine-tuned a BERT-based model to identify advice. We then extended the prediction model to unstructured abstracts and discussion sections to evaluate its performance. After the model evaluation, we conducted a case study, applying the model to identify advice on COVID-19 treatment in the LitCovid literature.

As described, our training corpus contained about 6,000 annotated sentences. The Cohen's kappa value is 0.86, indicating almost perfect inter-coder agreement. For prediction model development, we trained and evaluated both traditional machine learning models and transformer-based models using five-fold cross-validation. Our experimental results show that the BioBERT model performed best; compared to traditional machine learning methods, the transformer-based approach is the better choice for our task.

Among the misclassified cases, we found that a no-advice sentence could be misclassified as weak advice because it contains language cues like "is suitable for." Similarly, a no-advice sentence could be misclassified as strong advice because it contains cues like "should" and "must," which are indicators of strong advice.

Now that we have a successful model that can detect health advice in structured abstracts, we would like to see how it works on unstructured abstracts and the discussion sections of full-text content. We randomly sampled 100 papers with unstructured abstracts and full-text access in PubMed Central, and we manually annotated each sentence by its advice type.

Let's first look at the model's performance on unstructured abstracts. Unstructured abstracts do not have subsection headers, and advice normally appears near the end. We therefore applied a simple location-based filtering technique, assuming all sentences in the first half are non-advice. With this technique, the prediction model's performance improved and was comparable to its performance on the training data.

Unlike unstructured abstracts, the distribution of health advice in discussion sections is more arbitrary. Although health advice, especially strong advice, tends to occur in the second half of discussion sections, about 30 advice sentences still appeared in the first half, so simply applying the location-filtering technique would not work there.
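To make the fine-tuning and location-filtering steps concrete, here is a minimal sketch using the HuggingFace transformers and datasets libraries. The checkpoint name, the file name advice_train.csv, and the hyperparameters are illustrative assumptions, not the exact setup from the paper.

```python
# Minimal sketch: fine-tune a BioBERT-style sentence classifier for
# no/weak/strong advice, then apply the location-based filter.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dmis-lab/biobert-v1.1"  # assumed BioBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3)  # 0 = no advice, 1 = weak, 2 = strong

# "advice_train.csv" is a hypothetical file with "sentence" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "advice_train.csv"})

def tokenize(batch):
    # Advice sentences are short, so 128 tokens is a generous cap.
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="advice-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
)
trainer.train()
trainer.save_model("advice-model")

def first_half_filter(labels):
    """Location-based filtering for unstructured abstracts: treat every
    sentence in the first half of the abstract as non-advice (label 0)."""
    half = len(labels) // 2
    return [0] * half + list(labels[half:])
```

Note that the filter simply overrides predictions for first-half sentences rather than changing the model itself, which is why it can sit on top of any classifier but, as discussed above, does not transfer to discussion sections.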
Looking at the misclassified sentences in discussion sections, the non-advice sentences were very similar to the advice ones: they contained linguistic cues of either weak or strong advice. However, these non-advice sentences were usually in the past tense, describing the study process or results, or citing recommendations from prior studies, which does not count as advice by our definition. To improve performance, we took the following steps. We first added the annotated discussion sentences to the training data of our BioBERT model trained on structured abstracts. We then added language style features and section information to each sentence. With this setting, our model is able to generalize well to discussion sections for health advice detection.

As mentioned in the motivation, a successful prediction model could support many downstream applications. Now I'm going to show a case study we conducted to demonstrate the model's usefulness for retrieving health advice, especially in combination with existing health information services like LitCovid. We applied the health advice prediction model to research articles in the LitCovid corpus to find health advice on HCQ for COVID-19. Using the MeSH term for HCQ, we retrieved 3,000 HCQ-related papers with 10,000 sentences tagged with HCQ. After applying our advice prediction model, we obtained 600 sentences predicted as strong advice and 800 predicted as weak advice. Zooming into the predictions, we were able to see recommendations to use HCQ, objections to using HCQ, advice on doses when using HCQ, and warnings from researchers about this medicine. Although summarizing the advice by opinion and topic is beyond the scope of this study, combining the current prediction results with other NLP techniques, such as stance detection, would let us further aggregate the advice by whether it supports HCQ use.

Regarding the prediction results, we would also like to stress that the current model is designed to identify health advice in medical literature. It cannot verify the validity of that advice, so further verification by health professionals is needed before applying the advice in clinical use. For real-world applications, we recommend that developers provide a function to flag or remove inaccurate advice upon request from authors and health experts. We also recommend that users discuss with their doctors whether to follow the advice found by this model, and we hope users understand that the model does not provide perfect recall: it does not retrieve all the health advice that might be needed for health-related practice.

In future work, we will combine the model with claim and stance detection for more fine-grained advice detection, link the advice in different sections of medical literature to measure the consistency of advice giving, and extend the prediction task to news and social media platforms for advice analysis. Our code and data are available on GitHub. This project is supported by the following funding agencies, and we thank our annotators for their help with corpus construction and the reviewers for their feedback on the ethics statement. Thank you.
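As a closing illustration of the case-study inference step, here is a minimal sketch that loads the fine-tuned classifier and labels HCQ-tagged sentences. The saved-model path and label mapping are assumptions carried over from the earlier sketch, and the example sentences are made up for illustration, not real predictions from the paper.

```python
# Minimal sketch: label HCQ-related LitCovid sentences with the
# fine-tuned advice classifier saved earlier as "advice-model".
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = {0: "no advice", 1: "weak advice", 2: "strong advice"}  # assumed mapping

tokenizer = AutoTokenizer.from_pretrained("advice-model")
model = AutoModelForSequenceClassification.from_pretrained("advice-model")
model.eval()

def classify_advice(sentences, batch_size=32):
    """Return an advice label for each HCQ-tagged sentence."""
    predictions = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], truncation=True,
                          padding=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**batch).logits
        predictions.extend(LABELS[p] for p in logits.argmax(dim=-1).tolist())
    return predictions

# Illustrative inputs; real inputs would come from the MeSH-filtered corpus.
hcq_sentences = [
    "Hydroxychloroquine was administered to 26 patients in the study.",
    "Hydroxychloroquine should not be used for COVID-19 outside of trials.",
]
for sentence, label in zip(hcq_sentences, classify_advice(hcq_sentences)):
    print(f"[{label}] {sentence}")
```

Aggregating these predictions by stance (for or against HCQ use) would require an additional stance-detection step, as noted above.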