Welcome, everybody, to the second workshop of the HeSANDA Consultation Series. Today we're going to be discussing what data content conventions and standards we require in the data asset. To begin, I acknowledge and celebrate the First Australians on whose traditional lands we meet, and pay my respects to Elders past and present.

Today's workshop will run for 90 minutes. We'll start with three short presentations. We'll hear from Adrian, who will summarize some of the outcomes from our last workshop. I'll then explain how we build the foundations a data sharing platform requires so that we can find the data we need, in the format we need it. And Dr Melina Wilson will give us some insights into the issues she faces in locating and making use of the data she needs when reviewing clinical trials. We'll then break out into groups for around an hour to get your feedback on the topics raised.

A quick reminder that your mics will be muted during the presentations, and it might be handy to turn off your cameras. Today's program won't feature a Q&A session; however, you'll have plenty of time for discussion during the breakout groups, and depending on how the workshop plays out, we may do a brief wrap-up on the issues raised at the very end.

On screen now are the two focus questions we'll be asking for your feedback on during the breakout groups. In addition to the reference document for today's workshop, you should also have received a worksheet to use during the breakout groups; it was attached to either an email or a calendar invite. Please make sure you have that ready to go. This time around we won't be sending out a post-workshop survey. Instead, we're dedicating the majority of the workshop to the breakout sessions, and completing the worksheet will form part of those sessions. But if you choose, you'll have seven days to provide your own written submission via our feedback site.
So now, having gone through the housekeeping, I'll hand over to Adrian, who'll provide an overview of what came out of our first workshop.

Hi, everyone. I'm Adrian Burton. I work at the ARDC. Just as a reminder, the ARDC, the Australian Research Data Commons, is the collaborating glue bringing together the HeSANDA initiative. The ARDC is guided by an advisory committee for the HeSANDA initiative that includes representatives from ACTA, ARRA, the NHMRC, the ANZCTR, Research Australia, and Cochrane Australia. We are in the middle of the first phase of the HeSANDA initiative, dubbed Data Development, because before we build an infrastructure or make new policies, we are getting an idea of what data we're actually talking about in this new national data asset for health studies. There's a process of four workshops, which you can see on screen now, each looking at a different aspect of the data asset itself. In the last workshop we focused on the purpose: what research would you be able to do with this data asset? What on earth are we trying to achieve here? What kind of data you would need for those purposes is the question that follows. So theme A was our last workshop; theme B is highlighted here in orange. We're now asking: given those purposes, what content would the data need to contain, and what policies would it need, to support those research purposes? There's a further workshop looking at some existing data standards and practices, and a fourth workshop on the systems and the barriers and enablers to such an initiative. What we hope to get out of this first consultation process is what we might loosely call a set of business requirements for an infrastructure: what kind of infrastructure should it be? What scope should it have? What content should it hold?
That will be the output of this set of consultations, the data development phase. That sketch of the HeSANDA business requirements will then go off in two directions. One is an infrastructure arm: what kind of infrastructure could we build that would actually address those business requirements? We will also take that sketch of HeSANDA out for much wider stakeholder engagement, asking: if we bring together this kind of national data asset for these research purposes, what do the patient groups think about it? What do the trialists think? What about the institutions, and so on? That's a much broader exercise, more about the culture and the framing of such an initiative with these key stakeholder groups. So that's where we are in the process. We've done theme A; you are now in theme B. If you've joined the wrong Zoom, now's the time to leave.

So we are now in the second stage of this data development process. What did we learn from theme A? We have an editorial board working in the background; I won't mention their names here, but we thank them for their contributions. They have been capturing some of the discussions from theme A. The first thing to report is that the research purposes that were proposed have been identified, confirmed and, if you like, endorsed by this consultation process: building a national data asset from the outputs of clinical trials and other health studies would allow you to perform meta-analysis, clinical guideline development, and study design. Just to remind you, at the previous workshop there were 41 respondents, and we received an additional set of structured responses through the survey, plus other general feedback. This is quite important, because for all the decisions we make into the future, we will come back to this to ask: what kind of data is needed?
Well, there's no right or wrong answer on quality or conventions or information systems in general; you always have to come back and ask what you want the system to do. This is what we want our national data asset to do: we want it to support these kinds of research purposes. So we will keep referring back to these as our guidance when we have to make calls about the scope of the initiative.

Tiffany and a few others in the editorial group did some really nice work looking through those use cases. If you take the 48% researchers, 15% trialists, and 4% systematic reviewers here, a good chunk of the use cases that came from those research purposes aligned with what health studies researchers need and want to do in this area, with a further healthy 25% or so of other applications to do with broader impact and providing value to health consumers and a number of other stakeholders. What that shows is that this is a research infrastructure initiative with a strong research focus, and that it has application elsewhere.

As for the kind of data people said would be required to do that research, you can see here at the bottom that individual patient data (IPD) comes out strongly as a requirement for the HeSANDA initiative to focus on if we want to support those kinds of research. It's the IPD that will enable it, along with a second category: the details around the protocols and methodologies, and the details around the data, such as the data dictionaries. Those two things clearly come out as the key priority areas, along with a smaller hump around standards, which we'll get back to as well.
When we take just the researchers and their requirements, the themes we saw on the previous slide are magnified even more: well over 75% of researchers said that IPD was the key data component they were looking for.

We also asked: if we did build up this national data asset from health studies data, what kind of value would you expect to see from it? About a quarter of the responses highlighted efficiencies, cost savings, and research productivity. That's probably also related to the other two big categories, a standardized platform and standardized ethics and consent; standardization comes through as one of the key efficiencies that could be captured from such an initiative.

The initial observations from theme A, then, are that a number of very important research uses have been confirmed, and those use cases line up very strongly with researcher needs. The HeSANDA initiative should look at promoting data standards and embedding them in efficient research practices, while taking care not to increase administrative burden, and those standards should catalyze secondary use of data into the future. To sum up: for the purposes that were identified, there is a strong need for individual patient data, and for methods, protocols, and data dictionaries to support them. That's a quick overview of where we got to from the outputs of our previous workshop. I'll hand over to Kristen to cover where we're going in this particular workshop.

Great, thank you, Adrian. Today I'm going to be discussing the foundations for building a data sharing platform, now that we have some idea of the research purpose and use.
Designing any kind of platform starts with identifying the purpose, as I said, and that's what we did in the first workshop. Once we know this, we can pin down what kind of information the platform needs to provide to its users, and what traits, qualities, or conventions need to be present in the information for it to be findable and usable. In the context of the data development process, this is what we refer to as the data content and the data quality of the asset. I should point out that the term "quality" here should not be confused with the statistical or scientific concepts of reliability or validity of the data; that's definitely not something we decide. What we mean by quality is the characteristics and conventions used in the data.

That may sound a bit abstract, so let's discuss it in terms of some existing data sharing platforms and what they do. When using a data sharing platform, the first thing most users interact with is the search or browse functions. On the Vivli platform, apart from searching via keywords, we see a number of search filters that allow users to search for trials using fields like study design, trial phase, and sample size. On YODA and clinicalstudydatarequest.com, we see search fields for medicine or treatment type and medical condition. The categories used in these fields influence the findability of data, as they determine the search results we receive back from the platform. Here on Vivli, viewing our search results, we see additional categories of fields applied to each trial; again, we see information about the medical condition and treatment studied in the trial. If we click on one of those search results and look further into that trial's record, we see not only additional fields for categorizing the trial, but also an indication of what categories of data and documents are available from that trial.
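As a toy illustration of why those structured fields matter, here is a minimal sketch, with invented field names and records rather than any platform's actual schema, of how filterable metadata turns a pile of trial records into a searchable collection:

```python
# Minimal sketch: structured metadata fields make trials filterable.
# Field names and records are invented for illustration only.
trials = [
    {"id": "T001", "design": "RCT", "phase": 3, "sample_size": 1200},
    {"id": "T002", "design": "cohort study", "phase": None, "sample_size": 450},
    {"id": "T003", "design": "RCT", "phase": 2, "sample_size": 80},
]

def search(records, **filters):
    """Return every record whose metadata match all supplied filters."""
    return [r for r in records
            if all(r.get(field) == value for field, value in filters.items())]

phase3_rcts = search(trials, design="RCT", phase=3)
print([t["id"] for t in phase3_rcts])  # -> ['T001']
```

The point of the sketch is that a filter like `design="RCT"` only works if every record populates the same field using the same vocabulary, which is exactly the convention question this workshop is asking.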
On YODA we see something similar, albeit with a different layout, and it's worth noting the predefined categories at the top, those green buttons on the screen, which indicate whether things like the study summary document are available, or whether the data specification or data dictionary is available. If we now look at an example from a platform used by dementia cohort studies, so not clinical trials, we see that they list the specific kinds of patient data available: for example, the variables relating to physical health information, cognitive testing, and so on. On the other platforms we just saw, that information wasn't available, and you would need to read through a trial's study summary document to find out what kinds of data had been collected. On the other hand, when we look at how the dementia platform lists study documentation, we see that there are no predefined categories for things like the study protocol, data dictionary, case report forms, and so on. In the example on screen, this particular researcher has chosen to upload their data dictionary, but it's not a predefined category that the platform requires. This dementia platform has a different research purpose to Vivli and YODA: it has a narrower focus on providing participant data from cohort studies that can be pooled for data harmonisation and mega-analysis, so for them, information types like adverse event reports are not important.

The point I'm trying to draw out is that having clarity on the categories of information we need about a trial impacts both the findability of the data and its usability relative to our nominated research purposes. If we go back to our clinical trials platform and take a look at one of the study reports it holds, we see the next layer down, a finer grain of detail about one of the trials.
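The difference between predefined document categories (the green buttons) and free-form uploads can be sketched in a few lines. All records and category names below are invented, but the contrast is the one just described: with predefined flags, "which studies supply a data dictionary?" is answerable directly; with uploader-chosen files, you are reduced to guessing from file names.

```python
# Invented records illustrating predefined categories vs free-form uploads.
studies_with_flags = [  # predefined document categories, per study
    {"id": "S1", "study_summary": True, "data_dictionary": True},
    {"id": "S2", "study_summary": True, "data_dictionary": False},
]
studies_free_form = [  # uploader-chosen file names only
    {"id": "S3", "uploads": ["variables_final_v2.xlsx", "protocol.pdf"]},
]

# With predefined flags, the question is a one-liner:
has_dictionary = [s["id"] for s in studies_with_flags if s["data_dictionary"]]
print(has_dictionary)  # -> ['S1']

# With free-form uploads, you can only guess from file names, which is
# the manual reading-through described above:
maybe_dictionary = [s["id"] for s in studies_free_form
                    if any("variable" in f or "dictionary" in f
                           for f in s["uploads"])]
print(maybe_dictionary)  # -> ['S3'] (a guess, not a guarantee)
```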
This is a screenshot of just the first page of this particular report, but we see that the document includes information like the trial's objectives, and on later pages, which I won't display, it goes on to summarize the protocol, the outcomes, and so on. If we go back to the search results and click on the study report for another trial, we see a similar-looking document that provides similar information, but it uses different conventions for reporting that information. So on this particular platform, a clinical study report is a standard category of output, or information, from a trial, but the information contained within those reports is not standardized.

Similarly, if we look at an example of one of the data specification documents, or data dictionaries, we see a PDF that lists particular sets of traits for all variables present in the patient data: variable name, variable type, and so on. But if we open up the data dictionary from another trial listed on that site, we see an entirely different layout, this time a spreadsheet. It contains similar, but not identical, information: in addition to variable name and type, it also reports the number of observations, or sample size, for each variable, and it reports variable type using a numeric code. So again, the thing I'm highlighting is that this platform has developed its preferred approach for categorizing the research outputs of trials, and uses that as a standard for reporting, but as you get more granular, there is less standardization in how the information is reported within each category. That will inevitably be the case with patient data itself, as here we move to the finest level of granularity and the greatest variety of information types. Most of us would know that there are many different ways to code something as basic as age or education, and this kind of variety in standards is present in most patient data types, whether it be blood tests, clinical assessments, or
what have you. There may be some methodologies and data types for which there is a widely recommended best practice or standard, and it will be important to get your feedback on this. But our goal isn't to list out the possible standards for every kind of data and information you might possibly encounter, thank goodness. Instead, it is to decide what categories and standards we require, or prefer, or at least find acceptable when finding and reusing data, and what minimum amount of information we require to meet our nominated research purposes of systematic reviews, secondary analysis, and so on. Defining these things is how we will build the foundations of HeSANDA. They will dictate how we design our repositories and catalogs so that they are the most effective for researchers, and the quality of these foundations determines the kind of tools we can then build on top of them in the future, whether that be search engines, data request application forms, or, potentially down the track, even more advanced technologies like secure virtual analytical workspaces.

In our first workshop we started to discuss data types, but in our breakout sessions today, and in the feedback we're looking for from you, we really need to get to the specifics of what conventions we should use. What are the best ways to categorize trials so that we can find the projects and data we're interested in? What is the minimum amount of information those trials need to provide to meet our needs for secondary use of their data? And are there standards that are required, or at least recommended, for recording or reporting that data? Because we need to approach this practically, we need to prioritize: which information and standards are essential to our purpose, and which are desirable things we could consider working towards in the future as the standard grows. Okay, so that's my overview of data content and data quality, or qualities, in data sharing
platforms, and I'd now like to hand over to Melina, who's going to give us some examples of the issues she faces when working with shared data.

Thanks, Kristen, and hello, everyone. Can everyone see that? Yes? Okay. Today I'll briefly share some of the data challenges and needs from the perspective of conducting a Cochrane systematic review. As many of you know, Cochrane has a very long list of mandatory requirements when developing and reporting the findings of a systematic review, and these requirements exist so that our judgments and decisions about the trial data, and our confidence in the overall evidence, are transparent to the reader. Today I'll highlight two key challenges that we come across: whether we have found all the evidence on the systematic review topic, and whether we can use the data for the systematic review. This presentation will focus on systematic reviews and meta-analyses of aggregate data rather than individual participant data, or IPD. This is because in reality most of us work with aggregate data: we're usually working with limited resources and need to complete the task within a specified time frame, which is usually shorter than would be required for an IPD review.

Some translations before I begin. From my perspective, when I think about "data", I'm translating that into a trial protocol, a clinical trial registry record, a conference abstract, or a journal publication. And when I hear "standards required of the data", I'm translating that into reporting standards or guidelines; for example, there are the SPIRIT recommendations, which set out what needs to be addressed in a clinical trial protocol. Finding the existing clinical trials on the systematic review topic can be quite a lengthy process. The process begins by searching a wide range of databases, as you can see here, and also some regional
databases such as LILACS, and subject-specific databases. Before searching these databases, one thinks about what type of approach to take, and about the different types of syntax you might end up using. I've highlighted here some clinical trial registries or platforms, such as ClinicalTrials.gov and the WHO's International Clinical Trials Registry Platform, because they all play a role within a systematic review: they capture information about completed trials, ongoing trials, and trials that have stopped prematurely. For example, if a trial has stopped prematurely, we will still include the trial, and report its data if available, in a Cochrane review. The COVID-19 pandemic has brought to the forefront for us the value of preprint servers such as medRxiv, where findings are available online without having been through the peer review process. In addition, when we're looking for clinical trial data, we also contact experts in the field, we screen the citations within existing systematic reviews, and then the systematic reviews buried within clinical practice guidelines, and we look for unpublished documents through various means, including Google Scholar. Overall, what I'm trying to say is that it's quite a fragmented process, and as far as I'm aware there's no one-stop shop that would allow one to search for studies in a way that would also meet the requirements for Cochrane. Whatever databases do exist around the world, as a systematic reviewer the preferred scenario would be that I could search those repositories in an intuitive fashion, for example through using some controlled vocabulary.
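One small chore inside that fragmented process is merging hits from several databases and collapsing duplicates of the same registered trial. A minimal sketch of that step, with invented records (the source names are real registries, but the registration numbers and hits are made up):

```python
# Sketch of de-duplicating search hits from multiple databases by
# trial registration number. Records are invented for illustration.
hits = [
    {"source": "ClinicalTrials.gov", "registration": "NCT00000001"},
    {"source": "WHO ICTRP", "registration": "NCT00000001"},   # same trial
    {"source": "ANZCTR", "registration": "ACTRN12620001234567"},
]

def deduplicate(records):
    """Keep the first record seen for each registration number."""
    seen, unique = set(), []
    for r in records:
        if r["registration"] not in seen:
            seen.add(r["registration"])
            unique.append(r)
    return unique

print(len(deduplicate(hits)))  # -> 2 distinct trials
```

In practice this is far harder than the sketch suggests, because many outputs, such as conference abstracts and publications, don't carry the registration number at all, which is part of the manual stitching-together problem.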
One intermediate step that is often left undescribed is the manual process a systematic reviewer undertakes to stitch together all the data for a single trial. We do this so that we have the most complete view of the trial data under investigation, and so that we can assess the trial's quality. This is what it typically looks like in a Cochrane review. Here we have a trial on the right-hand side, called the ABCTCG trial, and we have all the different outputs from that trial: the clinical trial registry record, a conference abstract that, from the systematic reviewer's perspective, reports an important outcome that might not be reported in the trial publication, and then the trial publication itself. The challenge is that links between the registered clinical trial and the trial publication are uncommon, so there's still a high degree of manual linking. A preferred scenario would be automatic links between the registered clinical trial record and the data outputs from the clinical trial.

The other common challenge, or I should probably say common observation, when working with trial data is that we'll often start with a large number of included studies in the systematic review. In this case, for example, we ended up with 15 studies, with over 11,000 women included in the review. But by the time we drill down to the data in the trial publications, the number of studies and the number of people contributing information for an outcome may be a lot lower than expected: for an important outcome, we ended up with six studies with just over 5,000 women. This is a pretty good example, actually. Some of the reasons why we might see this:
there are unclear methods in the trial publication, or in the information that we can find; the outcome data may not be reported in a useful format, so there's no numerical information but only a single summary sentence; and the summary statistics may not be reported, as is the case for some oncology trials. The strategies we end up using to overcome these are: we contact trialists with specific data requests and ask them to reply within a stated time frame where possible; we transform the data reported in the trial publication into a useful format for meta-analysis; and, more recently, there are new methods called synthesis without meta-analysis, so we can try to use at least some of the information from the trial publication.

The best-practice scenario would be that clinical trial data are reported in line with internationally recognized reporting standards, and I'm referring here to the many reporting standards that exist; some of these have been around for quite a number of years, though fidelity to these standards can be variable. I've already mentioned the SPIRIT recommendations. We also have CONSORT, which I'm sure everyone on this call knows about; it's been around since 1996, and that's for RCTs. There are also the CONSORT extension statements, which cover, for example, cluster randomized controlled trials and crossover trials; the list goes on. There's the Template for Intervention Description and Replication, the TIDieR checklist, for reporting the intervention in sufficient detail that someone could replicate it if needed. And then there are standards for diagnostic accuracy studies.

A final point around the role of clinical trial registries and their usefulness when conducting a systematic review: the WHO requires that trials registered on a clinical trial registry comply with the minimum data set that you can see here, and as you can see there, the
number 23 says that a minimum requirement is summary results reporting. Here you can see the view of the US-based clinical trial registry, and down the bottom there is the study results tab. When you click on that, we find the outcome data reported with the number of participants and those who have had the event, and we also have the summary statistics and the statistical analysis performed. So yes, this is perfect, so to speak. In terms of the preferred scenario, it would be great if, and I think there's work going on in this area, the clinical trial registries also put together some guidelines on the essential criteria for results reporting. So, in terms of a wish list, in sum, it would be: better linkage of the data outputs from a clinical trial; improved implementation of, or compliance with, those existing standards that I mentioned; and some guidance developed on the essential criteria for results reporting on clinical trial registries. That's it from me, Kristen. Thank you.

Great, thank you, Melina. I might now hand over to Roxanne Foster from the Australian Institute of Health and Welfare. I think you're muted, Roxanne. Yeah, I think that's sorted now, sorry, having some technical difficulties unmuting there. Can everyone see and hear me? Yep? Wonderful. Okay, let's just check that the screen is right; we've got data content and quality requirements up. So thanks, Kristen, for handing over to me, and thanks to Melina for your insights on data sharing from a clinical trials perspective. I'm Dr Roxanne Foster from the Australian Institute of Health and Welfare, from the Metadata and METeOR unit; we're providing expertise in data development principles and processes to inform the HeSANDA consultation phase. Before moving into the breakout rooms, I'll briefly reiterate what we hope to achieve with today's workshop. Today's breakout sessions are an opportunity for
open-ended discussion and further exploration of the use cases that came out of the first workshop. We've considered your feedback and extended the sessions to about 50 minutes, providing more time for meaningful discussion. The aim is to gather feedback on three topics: HeSANDA's data and metadata scope; the minimum information requirements to facilitate data reuse for your research needs; and existing standards that can be leveraged, or required standards that need to be developed. "Standards" in this context refers to common or routine reporting practices and data definitions. This includes stocktaking what information and data standards are already out there and how comparable they are across institutions, and understanding how information capture might be aligned and standardized across the diverse systems from which data are drawn.

Discussion will be guided by two key questions. Question one will consider existing data sharing solutions that you've come across and assess their pros and cons: what works well, and what doesn't, when you try to share or use data or trial information? This exercise will identify and confirm the common issues confronting researchers in data sharing, and inform a list of requirements for the HeSANDA solution. Question two will capture the range of data content required, using a shared worksheet, and expand on the specific requirements, considering the issues raised in question one. The aim is to document and prioritize the requirements needed to overcome data sharing issues and inform HeSANDA's development. Part of this involves canvassing existing conventions or standards that work well for you, which HeSANDA could leverage. So without further ado, I'd like to ask you all now to move into your assigned breakout rooms, and please let us know if you have any technical difficulties. Thank you.