Welcome to GCC 2022. It's great to see everyone. We're going to go ahead and get started, but before we do, I just wanted to welcome everyone, and thank you all for journeying out of your Zoom rooms and joining us in person. My name is Jim Wilgenbusch. I direct research computing at the University of Minnesota, and I'm one of the co-sponsors of this event. As I mentioned, my area is supercomputing, but my actual field of work is phylogenetics, and I've been doing that for close to 30 years now. So the work that Galaxy is doing is especially meaningful to me, because I recognized very early in my career that not everyone was comfortable with the command line, and that things needed to be done to lower the barriers to resources that would help advance people's work. But it's not really about lowering barriers to high performance computing for its own sake. It is fundamentally important, I think, for us to be able to make discoveries, specifically those that we might call grand challenges. Without interfaces that can serve as a nexus for different communities, it's going to be impossible, I think, for us to make astounding discoveries. So the work that you're doing and the discussions that you have today are really essential for what is increasingly seen as team science, and essential for the grand-challenge discoveries of today. So just to get a little bit active here, I'm going to ask a question. Who among you would consider yourself classically trained in, let's say, computer science or software engineering, that's your first language so to speak? Raise your hand nice and high so everybody can see.
And then who among this group would consider yourself a domain scientist first, trained as maybe a biologist or chemist or what have you? Wow, that's about an even split. That is pretty awesome. That's great. And so one of the things that I think Galaxy has done, and certainly this community promotes, is lowering barriers around the language that we use. I suspect that many of you work together and have probably spent a good amount of time with your virtual dictionary, aka Google, looking up terms that people throw out there. And I think much of this will happen this week as we get to understand what each of us is doing. So again, thank you for attending in person. We certainly appreciate your presence here. I just want to say a few quick thanks to some of the meeting organizers, and then Tim is going to talk a little bit more about the scientific committee and thank them. In particular, I want to thank Jen Vessio. Are you here? Raise your hand. In addition, JJ Johnson; I think you're here, JJ, in the back. And [name unclear], who was not able to come; she was traveling last week, or the week before, and she tested positive. She regrets not being able to be here in person to see what her work has done. Last, we also want to thank the Sapphire group; are other members of the Sapphire group here? They're probably out at the desk, but they have also played an important role in helping put this event together. So thank you again for being here, and with that I'm going to turn it over to Tim.

All right, thank you, Jim. Hi, I'm Tim Griffin. I am a faculty member here at the University of Minnesota in biochemistry, and one of those great unwashed non-computer-science-trained people who try to convince people I know something about bioinformatics, and Galaxy helps me do that, I guess.
It's really wonderful to see everybody. I think it's been three years since we had an in-person Galaxy meeting, so it has been quite a while, and it is great to see you all here. I just want to follow up on what Jim said with the thank-yous; it takes a lot of people to put together a conference like this. I got the news, when we were starting to plan this, that Dave Clements, who's here, had decided to take on a new challenge and go to Anaconda, and so after some initial panic of oh my God, what do we do without Dave? Jen, she's already been thanked, but she's in the back here, thank you, helped us along the way, and the transition I think has been made, so I'm very happy about that. I also want to thank the scientific committee that really worked hard to put together the program. I'm not sure if everybody who was on the scientific committee is here in person, but Saskia was part of that, Beatriz Serrano-Solano was also part of that, and [name unclear], who you're going to see up here in a few minutes, and also Subina Mehta, back here. So maybe we just thank all of them. We are requesting that you wear a mask unless you are eating or drinking. Thank you to Wenglin Biosciences, or Wenglin, or is it NEB? I don't want to misspeak their trade names, but NEB helped with the tests that you all have, so we're requesting you use those; we're just doing whatever we can to try to minimize risk. Just a couple of other things: if you are presenting up here, we need you to sign a waiver, which is the permission to record; we're trying to record the talks as we go. And also, if you could, today, meet with Jen or Subina and get your talks uploaded so they can get them onto the computers; it would be super helpful to have those in place. That's the request.
I hope you know the white tables are supposed to be reserved for eating at, but when we have the sessions in here, you can sit up closer, so definitely make your way up; that's the request we've had around eating. Lastly, I'll just thank the sponsors; you're going to hear from them. We have a good list of sponsors helping support the conference, so we're very much thankful to the sponsors as well. And that's all I have. Did I forget anything back there? We good? Excellent. Okay, so welcome. The conference is officially opened. Enjoy yourselves, and we will move on to the opening plenary lecture.

Good morning. Just so we're all on the same page: it is my pleasure as well to welcome all of you once again to GCC 2022. My task here is to introduce the scientific program that we have lined up for the next four days, as well as our opening keynote speaker. All the content is linked from our schedule site, which is linked here on screen, and you probably know of it already. Overall, we have four fellowship awardees, sponsored by an anonymous donor in memory of our beloved late James Taylor. We have 66 abstracts for the posters and talks, the most ever; of those, 45 are talks, equaling about seven and a half hours of content that will be presented over the course of the next four days. We have 51 posters and demos; you can see many of them already on the side of this room. So if you have a poster, you're welcome to hang it up at any point from now on. Just please take it down on the last day of the conference; otherwise someone will throw it in the trash, and it is your precious work. Overall, the content has been peer reviewed, and all told, about 208 authors contributed. All the abstracts are at the link, so you can get the full content if you miss anything in the presentations or the slides. The Galaxy Training Network (GTN) organized 20 training events, so we have five parallel sessions for each of the four days.
And all this is brought to you, or has been vetted, by 38 scientific committee members who reviewed the abstracts, decided how things got incorporated, and suggested some areas for improvement. I'd also like to thank my partners on the scientific organizing committee, Subina, Saskia, and Bea, for sifting through all this, organizing it, and putting it into a format that we all will appreciate over the next four days. So again, the full program is online. We have grouped the talks based on their themes into focus areas. Again, great tables, but please move forward; the screen is not particularly large, so it may help you to see the bottom of the content by sitting up closer. Something that I want to highlight from the scientific program is the fact that we have three keynote talks over the next four days. In a few seconds, we'll have our opening keynote, Hongfang Liu, who is sitting here; I'll tell you a little more about her in a second. Tomorrow at 4:45 pm, there's the project update by the project PIs, and then on Wednesday, the last day of the meeting, at 4:45 pm, we have Chris from the U of M doing the closing keynote. As you will see from these talks, the inspirational conference theme, as we look to be inspired by the keynote speakers, is this notion of translational and clinical science. Galaxy has been around for about 15 years, and it is increasingly seeing adoption outside of training and academia alone. So these talks are here to help us see the future. And with that, I'd like to introduce our opening keynote speaker, Hongfang Liu. She earned her PhD in 2002 from the City University of New York. She is the Richard Amsterdam Professor and director of biomedical informatics at the Center for Clinical and Translational Science at the Mayo Clinic, which is in Rochester, Minnesota, about an hour south of here.
Her research focuses on facilitating the secondary use of clinical data for clinical and translational science and healthcare delivery. She uses techniques from data science, machine learning, and natural language processing to make that data more available in a point-of-care environment. She is a fellow of the American College of Medical Informatics and a leader of the Natural Language Processing Working Group at the American Medical Informatics Association. She also co-organizes a conference, so she is familiar with what goes into organizing one of these; hers is the IEEE International Conference on Healthcare Informatics, which was held last month. And with that, I would like to invite Dr. Liu and ask her to open the conference on a scientific note. Thank you.

Wonderful. Thank you for the introduction, and thank you to the organizers for inviting me here. This is actually probably my first in-person keynote in the last few years, so it is great to be here in person. When the organizers reached out to me about the keynote, I thought maybe they were making a mistake; I know Galaxy, but I have never contributed. Later I checked the program, and I thought, maybe I can come here to learn from you all, and see how we can bring what we learn here to the open community I lead, called the Open Health Natural Language Processing Collaborative. So this is also an open science journey towards the real-world implementation of clinical NLP. As you will hear, this work comes from Mayo Clinic, and it would not be possible without a healthcare institution, which serves as a nice experimental test bed for a lot of the work. I will first discuss clinical natural language processing, followed by clinical applications in the context of clinical research; this is actually more complicated than I originally thought as a researcher.
Then I will talk about our effort in Open Health Natural Language Processing, which tries to set up an open science community for us to collaborate and move the field forward. And lastly, some of the lessons learned, some of the factors in how we want to move this community forward. I see the vibrancy the Galaxy community already has, with this very nice kind of training, which I will probably borrow from to try to start the Open Health Natural Language Processing community forum. As many of you know, with the HITECH Act in 2009, many healthcare organizations had their data digitalized; we call this the digital transformation of healthcare. Right now, over 90% of healthcare practices and providers use electronic health records for their practice data. This gives a great opportunity to advance the field towards what we call a learning healthcare system. Basically, the point-of-care data captured through electronic health records and other means can be leveraged to generate data-driven insights and knowledge about the community for research, and those insights can be turned into evidence for practice, and it all follows a loop to keep advancing; that is a learning healthcare system. Now, one kind of data in the electronic health record, which a lot of you are probably familiar with, is the clinical notes, or clinical documentation. This is fundamentally different from the experimental data we generate through high-throughput technologies in experimental settings. In the clinical domain, where we deal with EHR data, a dramatic amount of the information is actually unstructured, not computationally accessible. So the question becomes: how do we get information out of that unstructured EHR data? And the popular technology adopted for that is natural language processing.
From the computer science side, when we look at natural language processing, we may look at the left figure, where we have text data, and from the text data we go through the natural language processing steps, syntactic processing, pragmatic processing, and come up with a structured representation. That's generally what we see from the NLP side. But in the clinical setting, the majority of the time, when people talk about natural language processing, it can be anything along the lines of the right figure. When you see an article say "we used NLP techniques," it can be anything from machine learning to information extraction to information retrieval; it can be text mining technology or text classification. All of those, in our learning healthcare system context, may be called natural language processing. Just to give more detail regarding one of the common tasks: the task is relatively simple. You have natural language, here "severe fatigue with muscle weakness," and the task itself is to summarize it into a computable representation that can be used for downstream applications. For this community, I don't need to emphasize open science; we always say standards and interoperability are key for open science cooperation, and the same holds in the clinical domain when we have data. Standards are necessary for both syntactic interoperability and semantic interoperability. Here, the popular standard on the clinical data side is HL7 FHIR. It captures syntactic standards as well as semantic standards; we say about 80% of the semantics can be instantiated through HL7 FHIR. And the task itself is to transform the text data into that structured representation. Let's take a little bit of a deep dive into the task.
It follows the traditional natural language processing steps, but informed by the terminology and ontology standards. The processing steps shown here involve some syntactic processing, such as dependency parsing, where you establish the dependencies among the tokens you have. Then you do concept normalization; this goes back to the Unified Medical Language System concept IDs, which we call CUIs. You extract those concepts, then they go to relation detection; you can do relation classification, and after that you can do the alignment with HL7 FHIR. So in general, there are a lot of clinical applications which say "we are doing standardization," and this is what they are doing, basically. Another popular task in the clinical domain is application-driven: I have a specific task, such as cohort identification, and how do I extract the relevant information from the unstructured data? For structured data there is a general querying process, but for unstructured data the technology is information extraction. Let me give you an example of one condition. Some of our clinical researchers are interested in what is called silent brain infarct. This is a silent stroke, an incidental finding: basically, when you take a head CT or MRI, you can see a silent stroke that happened at some point in the past, which you never knew about.
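To make the concept-extraction step concrete, here is a minimal Python sketch of it: a dictionary lookup standing in for UMLS concept normalization, plus a crude window-based negation check. The dictionary entries, the placeholder CUI values, and the 20-character negation window are all invented for illustration, not taken from any real system.

```python
# A toy concept dictionary mapping surface forms to placeholder IDs.
# These are NOT real UMLS CUIs; a production system (e.g. cTAKES)
# resolves mentions against the full UMLS Metathesaurus.
CONCEPT_DICT = {
    "muscle weakness": "CUI-PLACEHOLDER-1",
    "fatigue": "CUI-PLACEHOLDER-2",
}

# Naive negation cues; real systems use algorithms such as NegEx.
NEGATION_CUES = ("no ", "denies ", "without ")

def extract_concepts(sentence: str) -> list[dict]:
    """Dictionary lookup plus a crude negation check: a stand-in for
    the concept-extraction and normalization steps described above."""
    text = sentence.lower()
    results = []
    for surface, cui in CONCEPT_DICT.items():
        idx = text.find(surface)
        if idx == -1:
            continue
        # Crude negation scope: a cue inside the 20 characters
        # immediately before the mention.
        window = text[max(0, idx - 20):idx]
        negated = any(cue in window for cue in NEGATION_CUES)
        results.append({"text": surface, "cui": cui, "negated": negated})
    return results

print(extract_concepts("Patient denies muscle weakness."))
```

A relation-detection or FHIR-alignment stage would then consume these structured mentions rather than the raw text.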
Those are silent; we call them silent brain infarcts. In those cases it is not codified, because for incidental findings you don't have a code in the structured diagnoses. So if a researcher wants to study silent brain infarct, and whether we should give those patients stroke-prevention medications or not, in order to do those kinds of studies they need to identify those patients. But that's not easy, because it's not codified; the only way they can get it is from the radiology reports, the neuroimaging reports. The technique used here is to have a template for silent brain infarct with, say, four slots we want to fill, and from the reports we try to fill those four slots. So those are places where we see the value that text data and these techniques bring to clinical research.

In 2019-2020, as a research team, we started to ask: we have been doing clinical natural language processing for over 20 years; what is the current landscape, and what has been used? So we did a literature review, which is generally shown in that article. What we did was, basically, get all the articles that talk about NLP, about EHRs, about clinical data, and try to see what the situation is. This figure shows the increasing number of clinical NLP publications. Unfortunately, the majority of them are still published in clinical medicine journals and informatics journals; those are the red line and the green line. Many of you are probably aware that, with deep learning and transformer-based technology, the general NLP community has greatly increased its publication volume, and it is very difficult to get in. But here in the clinical literature, as of 2017, this was the situation: not many applications drawn from the traditional computer science side.
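The four-slot template filling described above for silent brain infarct reports can be sketched with plain regular expressions. The talk does not name the four slots, so the ones used here (status, location, size, chronicity) and their patterns are assumptions for illustration only.

```python
import re

# Hypothetical slot patterns: the talk mentions a four-slot template
# for silent brain infarct but does not name the slots, so these four
# are assumptions for illustration.
SLOT_PATTERNS = {
    "status":     re.compile(r"\b(infarct|infarction)\b", re.I),
    "location":   re.compile(r"\b(frontal|parietal|occipital|temporal|cerebellar)\b", re.I),
    "size":       re.compile(r"\b\d+(?:\.\d+)?\s*(?:mm|cm)\b", re.I),
    "chronicity": re.compile(r"\b(acute|subacute|chronic|old)\b", re.I),
}

def fill_template(report: str) -> dict:
    """Fill each slot with the first matching span, or None if absent."""
    template = {}
    for slot, pattern in SLOT_PATTERNS.items():
        m = pattern.search(report)
        template[slot] = m.group(0) if m else None
    return template

report = "Incidental note of an old 8 mm lacunar infarct in the left parietal region."
print(fill_template(report))
```

A real system would layer negation and uncertainty handling on top, but the template output is already the kind of structured record a cohort query can consume.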
As for the systems used by those publications: prior to 2000, most publications used two systems, one called MedLEE and one called MetaMap. Those are two expert-rule-based systems; MedLEE is from Columbia University and MetaMap is from the National Library of Medicine. Both were developed in Prolog. How many of you know Prolog? From 2003 forward, we see adoption of infrastructure platforms, including UIMA and GATE, popular natural language processing development frameworks. We also see machine learning start; from 2003 forward there is a lot of machine-learning-related technology. And of course, from 2017-2018 forward, the field dramatically advanced using various deep learning and transformer-based techniques. This gives you some indication, from the methodology review, of how the field of clinical natural language processing has been developing; you can see the dramatic peak at 2018. Our previous review was done in 2017, so we could then see the huge rise from 2018-2019 forward, with many deep-learning-related technologies being used. These are the popular deep learning and large language models that people are using moving forward. In the same review we also see that, although the articles are changing, we do not actually see the deep-learning-related technology translated into the informatics journals or the clinical journals. This shows the dominant approaches people are taking: a lot use a combination of rule-based and machine-learning-based technology, some in hybrid, but deep learning makes up only about 8% of the applications. Of course, clinical people use NLP for all different kinds of purposes.
Some use it for clinical research in particular disease study areas, some for clinical workflow optimization, and you see drug-related studies as well as work on social determinants of health. All kinds of document types are being leveraged: again, beyond clinical notes, you see radiology reports, and some use discharge summaries, and the information extracted comes along those lines. When you talk to clinical researchers, for people just getting started with clinical NLP, it might seem quite simple: give me the data, apply the technology. But the data, in clinical NLP or in general NLP, is the most expensive resource. What we found is that for any of the clinical projects we work on, any of the applications, we spend the majority of the time on how to formulate the clinical research question into a reasonable clinical NLP task, and then how to create a gold standard to handle that. And there is a significant challenge here, because it involves three different parties. One is us, the computer scientists, engineers, and data scientists. Then there are the clinical researchers. And we also have the origin of the data, the people who wrote those documents. This is secondary use: the researchers are not the intended consumers of those documents. We as researchers, as computer scientists, don't know the language used on either side, and the authors of those documents had their own original purpose, simply care, not your community's research. So you have significant data quality issues, and formulating a reasonable task and meeting expectations requires a lot of negotiation and discussion. So clinical NLP in general involves, first, task formulation.
As already mentioned, the task formulation itself, plus generating the data sets for model development and evaluation, is over 80% of the effort. Even if you have advanced learning, like few-shot learning or fine-tuning, you still need to evaluate the performance; you still need to create the resource. The number of training examples in a clinical NLP gold standard is generally only a couple of hundred documents, and even for those few hundred documents, doing the clinical annotation yourself adds a lot of work on top. Meanwhile, on the general NLP side, we also looked at what made the methods work, and there are shared tasks that have advanced the science in the field. This article discussed a lot regarding the community challenges seen by medical text mining over ten years. The article's key message is that the resources created for those community challenges are critical to advancing the field, and important for training and education. This figure shows the biomedical NLP challenges, and many of you are probably aware of these, along with the associated clinical challenges. You may wonder why we have so few clinical challenges. The first issue is related to clinical data: text data especially contains PHI, protected health information, and due to privacy issues we don't have much public data available in the clinical domain. The only one that is popularly available is the MIMIC data sets. And this is the list of the clinical NLP challenges organized over the years and the general purpose of each task: some are classification tasks, some are information-extraction tasks, some are coreference resolution, all those different tasks. So why is the annotation task quite a challenge?
Going back to the purpose of annotation itself: it requires manual chart review, and we are not just marking up linguistic information; sometimes it is also clinical information. So the task often requires domain experts. You may hear of people trying to create corpora using crowdsourcing; in the clinical domain that is a challenge too, because the data cannot be sent to, for example, Amazon Mechanical Turk for crowdsourced annotation. Second, the interpretation of this clinical information actually requires people with clinical knowledge, and that is an additional challenge. So we generally spend a lot of time talking to the clinical research team, trying to tell them how to get that information; you need to start by telling them it's not magic, you need to train the system. And then let's see how people do on those tasks. A lot of times you hear people say, "I want your system to achieve over 90% sensitivity and specificity," while when we ask two independent physicians to do the annotation, we get about 50-60% inter-annotator agreement. So we actually use that as a benchmark: the goal for the algorithm is to achieve what a human can achieve, rather than perfection. The system development itself can use many different technologies, and we found that when we work with clinical research teams, sometimes the rule-based approach works much better. What happens is that with a learned model alone, the model cannot directly fix an error that the clinical researcher detected; it's not fixable, and then they will not trust the system. In the first iteration they say, "you made a mistake, I'm telling you the mistake," and the new model generally still makes the same mistake.
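The 50-60% agreement between two independent annotators mentioned above is usually quantified with a chance-corrected statistic such as Cohen's kappa. Here is a small self-contained sketch; the two physicians' label lists are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators: observed agreement corrected
    for the agreement expected by chance (assumes chance < 100%)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each label's marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two (made-up) physicians labelling ten charts positive/negative:
a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
print(round(cohens_kappa(a, b), 2))
```

Raw percent agreement here is 70%, but kappa is lower (0.4) because some of that agreement would occur by chance; this is why annotation studies report kappa rather than raw agreement.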
Trust in the technology will be gone if we don't deal with that. However, if you use post-processing rules, translating the researchers' knowledge into the algorithm, they find it acceptable; a lot of times they feel they themselves were able to fix those errors, because they detected them. On the evaluation side, this is very different from traditional NLP tasks. When we evaluate clinical projects, sometimes what we need is patient-level information: does this patient have the diagnosis or not? Some tasks are at the document level, meaning the document passes or not. Some are concept level, which is similar to named entity recognition and those types of tasks. Some are even episode level; an episode is basically encounter-type information. So we use the traditional NLP measurements, but a lot of times we have additional considerations in the approach we take. First is robustness: the performance needs to remain stable over time. Then portability and generalizability: the algorithm needs to be deployable at a different institution. Transparency and interpretability are also measurements; the reason they are critical here is that you want to communicate the results to the clinical researchers. As I already mentioned, if clinical researchers cannot understand the system's output, they may not trust it. And of course, the standard performance metrics, F1 score and the like, are considered. So, in reality, these are the different NLP applications for clinical research. Clinical research generally goes from study design, feasibility assessment, cohort identification, data collection, and data analysis through to outcome dissemination. In the middle layer, you can use NLP techniques to help with the feasibility assessment. So we have the electronic health records.
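The different evaluation levels can be made concrete by rolling mention-level NLP output up to the document and patient levels. The records below and the "any non-negated mention counts as positive" rule are illustrative assumptions, not a prescribed method.

```python
# Mention-level NLP output (invented records) rolled up to the
# document and patient levels used in clinical evaluation.
mentions = [
    {"patient": "P1", "doc": "d1", "concept": "SBI", "negated": False},
    {"patient": "P1", "doc": "d2", "concept": "SBI", "negated": True},
    {"patient": "P2", "doc": "d3", "concept": "SBI", "negated": True},
]

def doc_positive(doc_id: str) -> bool:
    """A document is positive if it has at least one non-negated mention."""
    return any(m["doc"] == doc_id and not m["negated"] for m in mentions)

def patient_positive(patient_id: str) -> bool:
    """A patient is positive if any of their documents is positive."""
    docs = {m["doc"] for m in mentions if m["patient"] == patient_id}
    return any(doc_positive(d) for d in docs)

print(patient_positive("P1"), patient_positive("P2"))
```

The point is that precision and recall computed at the mention level can differ from the numbers at the document or patient level, so the evaluation level has to match what the clinical researcher actually needs.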
We can know how many patients meet a given set of inclusion and exclusion criteria. You can also do eligibility screening and information extraction, and the last piece is related to systematic review and meta-analysis, because the evidence for clinical practice generally comes from many trials. Those trials generally have published articles, and the systematic review of their findings is a critical part of evidence generation for clinical practice. We mentioned the systematic review part; this work was actually done this year, the paper was written this year. The review covered the collection from January 2009 to September 2021. We tried to see, for those EHR-based studies, how many of them used natural language processing for clinical research; I emphasize, this is for clinical research. We identified studies and excluded pure methodology studies; only studies with clinical conclusions were considered. Of everything we retrieved, at the end only 50 articles were actually using NLP techniques for their clinical findings. We have about 10 papers on extraction of mental and behavioral health information. A lot of the studies are retrospective cohort studies; some are cross-sectional and case-control studies, and the majority are observational research. The targeted concepts are mostly diseases; about 4% concern medications. But this, again, demonstrates the same thing about NLP-enabled clinical research and which systems are used in those applications. And if you check panel B, it shows, on a log scale, how many articles use the EHR to do observational research; you see a dramatic increase, and the blue line is those which utilize NLP techniques.
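The feasibility-assessment step mentioned at the top of this passage, counting patients who meet inclusion and exclusion criteria, can be sketched over a toy patient table. The field names, the criteria, and the combination of an NLP-derived flag with a coded diagnosis are all assumptions for illustration.

```python
# Toy patient table combining coded diagnoses with NLP-derived flags.
# Everything here (fields, the ICD code, the age cutoff) is hypothetical.
patients = [
    {"id": "P1", "age": 67, "dx_codes": {"I63.9"}, "nlp_flags": {"silent_brain_infarct"}},
    {"id": "P2", "age": 64, "dx_codes": set(),     "nlp_flags": {"silent_brain_infarct"}},
    {"id": "P3", "age": 71, "dx_codes": {"I63.9"}, "nlp_flags": set()},
]

def eligible(p: dict) -> bool:
    """Include: age >= 60 with NLP-derived evidence of silent brain
    infarct. Exclude: a coded stroke diagnosis (already clinically
    recognized, hence not 'silent')."""
    include = p["age"] >= 60 and "silent_brain_infarct" in p["nlp_flags"]
    exclude = "I63.9" in p["dx_codes"]
    return include and not exclude

cohort = [p["id"] for p in patients if eligible(p)]
print(cohort)  # the feasibility count is len(cohort)
```

The design point is that the NLP flag surfaces patients who are invisible to a purely code-based query, which is exactly the silent brain infarct scenario from earlier in the talk.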
It's really small. If we believe 80% of clinical information is in clinical notes, then you can assume a lot of EHR-based publications are using incomplete information to do their clinical research. So there is a significant gap. Besides that, we also checked the reporting in those observational studies regarding transparency- and reproducibility-related information. Not to our surprise, though it may surprise you, many publications don't really say what they are doing with their NLP. NLP is one term, but this is a field with many different techniques. 50% of the articles didn't mention what model was used; the model definition is missing. 74% did not mention any normalization, and 58% did not discuss what kind of environment the NLP system was developed in. Of course, when we did the review, we also looked at the references, the appendices, and all the material by which we could identify this reporting. There is very poor reporting in clinical research using NLP techniques. Why does it matter? Because if the reporting of how the NLP got to those data elements is not clear, how could we reproduce the study? As for reproducibility, for this community I don't need to emphasize it: scientific research is always about reproducibility; it is the foundation of trusted science and discovery. Many of you probably saw, through the COVID pandemic and real-world evidence, so much conflicting information going around about clinical evidence. To us, this is a failure of our community to ensure transparency and reproducibility of research. In the secondary use of EHR data, we have inconsistency in data documentation, subjective priorities, vague definitions, very high heterogeneity of EHR processes, and practice variation.
EHR data are not experimentally acquired data. What if the data have confounders that are not explicitly captured and available to the data scientist? Even worse, because of HIPAA regulation, and also because healthcare data is treated as a business asset, there is potentially a lack of provenance associated with it, with confidentiality on top. So the process of how the data eventually became the data set, and the policies around it, are all missing. And there is the information quality, the described practice variation, the complexity, and the representation. So this is the data provenance gap, and at the same time we have the reporting gap and the accessibility and information-quality gaps.

If we truly believe in clinical research and want the data to be there to support it, in the clinical domain you may in the end just have to tell people the data sets are not available for reproducibility analysis. Sometimes we go even further: okay, the data are not available, but can you tell us how exactly you got that data? That is also not available for a lot of research studies, because to get that data, many different teams actually need to be involved: you probably involve the source system vendors, and the people doing the data retrieval. There are so many different layers around the data, and in the end it is a nightmare for reproducibility, or replicability, if you try to think about replicating the process somewhere different.

Even beyond that, for clinical NLP solutions, a big chunk of why we have not really translated to real-world implementations supporting clinical research studies is that there are many contextual factors around it. There are many human factors; here I am showing the research stakeholders, and the distinct disciplinary expertise that needs to be involved when you do an implementation.
There are many contextual factors. Um, one thing we noticed when trying implementations is that one algorithm fits one specific research study; it is not generalizable across different contexts. And sometimes, because of missing information, you don't really have a way to know whether the output will work without truly evaluating it. So: many, many contextually heterogeneous factors. We believe that if we can document those explicitly, we can improve the reproducibility of those systems, so we are recommending some of those tasks to be, you know, documented. Those recommendations are also available on our GitHub page. To give you an example: developing the model is truly not enough. You also need to evaluate, do error analysis, and convene with the clinical research team to decide whether the errors are acceptable. Some of the errors may not be fixable, and some of the errors can be fixed by adding additional post-processing rules.

Now, let me go to Open Health Natural Language Processing. All of this is to say: you cannot improve the trustworthiness of clinical research, or the reproducibility of the research, without things being open. One of the efforts toward that is the Open Health Natural Language Processing (OHNLP) consortium. This was established in 2009 through a Mayo Clinic and IBM collaboration, and many of you have probably heard of the popular systems: one is cTAKES, which came from Mayo Clinic, and the other system, for the oncology side, is from IBM. That was 2009; you may wonder why there were two different systems. Those are UIMA-based infrastructures. From 2009 to 2013, through this project, the OHNLP effort was focused on documenting and representing NLP structured results to facilitate semantic interoperability.
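The talk mentions above that some NLP errors can be fixed by adding post-processing rules. A minimal sketch of such a rule, assuming a simplified mention format (the `sentence` field and the family-history cue list are invented for illustration): after a system extracts disease mentions, drop mentions that occur in a family-history context, a common source of false positives.

```python
# Invented cue list for detecting family-history context.
FAMILY_CUES = ("mother", "father", "brother", "sister", "family history")

def postprocess(mentions):
    """Keep only mentions whose surrounding sentence lacks family cues."""
    kept = []
    for m in mentions:
        sentence = m["sentence"].lower()
        if not any(cue in sentence for cue in FAMILY_CUES):
            kept.append(m)
    return kept

mentions = [
    {"concept": "diabetes", "sentence": "Patient has diabetes."},
    {"concept": "breast cancer",
     "sentence": "Family history of breast cancer in her mother."},
]
clean = postprocess(mentions)  # only the diabetes mention survives
```

The value of such rules, as the talk notes, comes from the iterative error analysis with the clinical team that identifies which error classes are fixable this way.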
You may not see this diagram clearly, but it is showing the type system representation. The semantic representation is shown in this big ball, and it is actually similar to the FHIR standard. We incorporated that into a data normalization architecture for the EHR. For the technology, we are using a NoSQL database store, with ETL processes and parallel processing.

From 2013 to 2017 the project was funded as an R01 project. Back then the focus was: okay, we have so many of these systems, but they are basically doing similar things; we want to adopt and reuse them, so how do we adopt and reuse them? Those were the efforts to move toward, you know, using Docker containers and wrappers to wrap those same systems and bring an NLP front end to any user. We tried to study the interoperability and usability in real-world use cases, and to move to the cloud. Then, receiving an innovation award, we tried to move all those things further: what about generating data that the general community can share, using the data for privacy-preserving computation related to phenotyping, and how to partner with various entities.

Aim 1 of this work was done at the University of Minnesota; the NLP team wrapped all the popular clinical NLP systems into Docker solutions so that the whole process can be handled in parallel with high performance, with results from many NLP systems available to be processed at the same time. The work on Aim 2 generated synthetic data for use in training machine learning algorithms. What we found is that we actually can use synthetic data generation techniques to generate corpora available to partners, so that people can use them to test various machine learning technologies and applications.
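A shared representation of NLP results, in the spirit of the common type system described above, can be sketched as follows. The engine names and field names here are hypothetical, chosen only to show the normalization idea: each engine's native output is mapped into one schema so downstream tools need only one format.

```python
# Hypothetical adapters mapping two engines' native outputs
# into one shared mention schema.

def from_engine_a(row):
    # engine A (invented) reports cui, start, end
    return {"concept_id": row["cui"], "begin": row["start"],
            "end": row["end"], "source": "engine_a"}

def from_engine_b(row):
    # engine B (invented) reports a span tuple and a code
    begin, end = row["span"]
    return {"concept_id": row["code"], "begin": begin,
            "end": end, "source": "engine_b"}

a_out = [{"cui": "C0011849", "start": 12, "end": 20}]
b_out = [{"code": "C0011849", "span": (40, 48)}]
unified = ([from_engine_a(r) for r in a_out]
           + [from_engine_b(r) for r in b_out])
```

Real type systems (UIMA CAS types, FHIR resources) carry much richer semantics, but the adapter pattern is the same: one normalizer per engine, one schema for everyone downstream.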
We actually found the systems performed better with the synthetic data than with the original raw data, the original gold standard, because the generation techniques helped smooth out the noise. In the privacy-preserving field, we did a lot of work trying to build a distributed filtering framework and to do data characterization for privacy-preserving collaborative health data. There are all kinds of techniques that can be leveraged, but the condition is that the data institutions allow the data to be computationally accessible; we are not talking about human-accessible. We also worked with the various clinical data research network communities to help push NLP forward through partnership.

One of the projects we have been involved in over the last two years is the National COVID Cohort Collaborative (N3C). This is a partnership across several distributed data networks. If you are interested, I strongly encourage you to join this community. This is the largest collection of patient records available anonymously to researchers, as long as your institution signs the data use agreement. We have 5.5 million COVID-positive cases among, I mean, 14.3 million people. So there are many patients there, and you can see it has national representation.

I was asked to contribute the natural language processing techniques, and it was actually quite a challenge. In June 2020, when we were asked to do that, we assessed which institutions had the capability. We found that only a handful of institutions had NLP capability in their clinical data research warehouse, very few had the ability to extract notes and run NLP, and very few institutions had the computing infrastructure for it. Even though we have a lot of models and algorithms, from a people-process-technology standpoint there was nothing to start from.
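Surrogate replacement is one simple ingredient of the privacy-preserving synthetic corpus generation discussed above. This sketch uses toy PHI markers (a literal `[NAME]` placeholder tag and bare four-digit years) rather than a real de-identification model, so every name and pattern here is an invented stand-in.

```python
import random
import re

# Invented surrogate pool; a real system would draw from large name lists.
SURROGATE_NAMES = ["Alex Doe", "Sam Roe", "Pat Poe"]

def surrogate_note(note: str, seed: int = 0) -> str:
    """Replace toy PHI markers with surrogates (deterministic per seed)."""
    rng = random.Random(seed)
    # swap each [NAME] tag for a random surrogate name
    note = re.sub(r"\[NAME\]", lambda _m: rng.choice(SURROGATE_NAMES), note)
    # mask bare four-digit years (1900-2099)
    note = re.sub(r"\b(19|20)\d{2}\b", "YYYY", note)
    return note

out = surrogate_note("[NAME] was admitted in 2021 with pneumonia.")
```

Real de-identification first has to *find* the PHI with a trained model; the surrogate step shown here only matters once detection is reliable.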
So to do that, we actually ended up trying to build an end-to-end minimum viable product for sites to deploy on their own. We hand-held a dozen sites through the deployment. The technology basically uses Apache Beam and Flink to set up this kind of environment for people to run their NLP. Today we have deployed at about 11 sites; there are 74 sites in total at N3C, and we only have NLP data from five sites in the Data Enclave. Only 5% of the patients have computable NLP results. It's a big gap; I just want to emphasize that.

How do we move forward? This conference is giving me some inspiration. First, we need to train people. We need to make sure we have the platform, the infrastructure, the best practices, all of those; but if you don't really have a training program to help the sites with the necessary technology staff, they will not be able to do it. This also reflects that even though there is so much EHR-based observational research, a very small number of studies use NLP, and I want to emphasize that a very small number are truly multi-site; the majority of those studies are single-site. We know there is population heterogeneity across different healthcare systems, so there is a significant opportunity for us to work as a community to improve multi-site research capability.

So these are the thoughts. We can no longer do what we as NLP researchers usually do; we need to take a somewhat different approach. We cannot ask people to give us data; we need a federated development and evaluation framework. We need a trusted process to help bring the domain experts and the team together. And on the tooling side, we probably need to set up a common environment and toolkits for the community to adopt. We call this a human-centered AI framework to advance the field.
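The deployment described above runs NLP over notes in parallel via Apache Beam on Flink. As a dependency-free stand-in for that fan-out, this sketch does the same thing with a thread pool from the standard library; `extract` is a placeholder for a real NLP engine call, and the toy finding vocabulary is invented.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(note: str) -> dict:
    """Placeholder NLP step: pick out tokens from a toy finding vocabulary."""
    findings = [w.strip(".,") for w in note.lower().split()
                if w.strip(".,") in {"cough", "fever"}]
    return {"note": note, "findings": findings}

notes = [
    "Patient reports fever and cough.",
    "No acute distress.",
    "Persistent cough for two weeks.",
]

# Fan the extraction step out over worker threads, preserving input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract, notes))
```

In a real Beam pipeline the same shape appears as a `ParDo` over a collection of notes, with the runner (Flink, Spark, etc.) handling the parallelism and fault tolerance that a bare thread pool does not.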
So there are two principles we think need to be there. One is the FAIR data principles, and the other is implementation principles for analytics to be deployed in learning healthcare systems and for clinical knowledge derivation. You need the process to be correct and the results to be correct; make the process transparently implementable and make the results clinically reproducible. It is crucial for clinical researchers to trust the system and trust the data scientists. We are actually exploring this more, covering what our institution can transfer and offer, because we want to bring it up as a service to the community. And of course that's fine, because they know what they document and where they document it, but we need to preserve those, and we need the tools to be human-centric, to empower them. We can no longer say, "Give me the data and I'll develop it for you." That no longer works; it needs to be team-science collaboration. And, as we already emphasized, there are many contextual variations across the different institutions, but anything we deploy at the local sites needs to have a sort of standard created for it. So we have this annotation tool, which we developed just last year, to help people build federated NLP connections.

In general, as our community moves forward, we want to sit in the middle to help the clinical research community translate their data and to have scalable solutions. We can no longer afford so many research studies based on EHR data that don't use as much data as they could, miss a lot of information, and produce results that may not be accurate.

And I think that's the end. These are the lab members, and I also thank all the funding agencies as well as many collaborators. Thank you. Open for questions. Any questions?

[Audience question about the process for extracting information from EHR records without violating privacy.]
So, basically, for clinical notes, most of the clinical records are allowed to be used for clinical research, but with IRB approval for human subjects research. Most of the data needs to be handled in a high-trust manner to ensure there is no privacy violation.

[Audience comment that veterinary medicine is more accessible and less privacy-sensitive.]

They are not parallel; studies there may not generalize back. Veterinary medicine, actually, their data is a gold mine to a lot of the community, because they have less of the data fragmentation issue than we see elsewhere. And I think for a lot of the clinical modeling work, the VA is actually an organization using those, for example, for mental-health-related support. Oh, yes, sorry, we actually wanted to look at that; back home I had a PhD student who was helping with veterinary, animal EHR records. There is still a privacy issue, because there is neighborhood information and other such information in the animal EHR. But there is the One Health initiative, which is about how we use animal and human data together to improve health, because a lot of things, you know, infectious disease and environmental factors, are shared between humans and animals. So it's good you mentioned that.

[Audience question:] Hi, over on this side of the room. You mentioned this adoption gap a few times, and it's really striking; I think you showed maybe just a few percent of groups are using it. You mentioned a few barriers, such as having the technology in place to run the extraction algorithms, and training. I'm wondering if there are other kinds of issues you've identified, and then, broadly, how can Galaxy help?
So, you know, basically, the adoption gap is a mixture. Most healthcare organizations have vendor-related systems; they probably buy products, but the products don't solve the problem when used without the context, so there are still a lot of challenges, because the performance will not reach the level you want for your downstream application. At the same time, we don't have enough expertise in training, because traditional graduate schools are mostly not inside healthcare organizations and don't have access to healthcare data. That creates a gap, and at the same time healthcare organizations may not want to share; they consider it business-confidential information, and there are also legal worries, though sometimes we feel that's not the primary reason for so much of the adoption gap.

Meanwhile, as I mentioned, because natural language processing is automatic, and the output generally used for clinical research needs proper semantic processing, we view it like any AI function: the only way to handle that is to give it context, to be specific about evaluating it for the purpose it was created for. Without evaluation, you cannot deploy it. As with any AI model, you cannot simply deploy without evaluation. But how many institutions have the expertise to do that evaluation?

[Audience question:] You mentioned, in speaking about validation, that if you ask two physicians to come up with the algorithm, they may not agree very well. What do you do when you see that? So humans can't do the job? I'm asking because we face these challenges in a slightly different setting.

So, I think we do the adjudication, and the consensus building, and all those things around it. So, the best practice.
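Inter-annotator agreement of the kind this answer describes is commonly quantified with Cohen's kappa; a plain-Python sketch, on invented labels, for two annotators labeling the same items:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label lists (assumes p_e < 1)."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = set(a) | set(b)
    # observed agreement
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each annotator's label frequencies
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(ann1, ann2)  # 2/3, "substantial" on common scales
```

The move "from moderate to high agreement" the answer describes corresponds to kappa rising after guideline training and iterative consensus rounds; re-running this calculation after each round is how that improvement is tracked.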
We have those best practices available, with all the details, on our GitHub page. So, basically, most of the time when annotators have disagreement it's because they don't understand the task. So you create the task guideline, you train them, you bring them back to do iterative refinements until the agreement is what you want to call acceptable. The training part of the annotation can improve that dramatically: it can improve from, you know, a moderate level of agreement to high agreement. But the thing is also, they need to know they have this agreement with each other; if you don't do that consensus building, that disagreement is what the system will inherit. With those best practices implemented, the systems actually can do the task, because the whole team has a common understanding of what the system can do. The algorithm actually achieves much better performance and becomes very useful for the clinical research, because it is replicating their chart review process. You would be surprised: many clinical research data sets, I personally don't trust to have high quality. We notice a lot of mistakes in manually chart-reviewed study cohorts, so a lot of the time we do iterative error analysis with them. We noticed that half of the time it was a false positive; basically, the human-made gold standard contained half of the mistakes.

One more question? We are at time, so if you have questions, I'll be here. Yes, we have a break, so come up and see what else she can offer you. We have a break and poster session starting about 15 minutes from now, so if you have a poster scheduled for today, please hang it up during that time. Thank you again.