All right, everyone, that's 3:31, so it looks like it's time to get started with the next one. I'm Brian, I'm a data scientist, and I'll be moderating this final portion of today. The next presentation is on automated data quality assessments for observational studies, presented by Lisa.

Hello, everyone. I am Johnny Marino, and with me is Lisa Casper. We are delighted to be presenting at this meeting, on behalf of our collaborators, on conducting automated data quality assessments for observational studies with dataquieR. In our talk, we will first give a quick introduction to dataquieR and then go over the required metadata setup. Then we will show specific examples of data quality checks that can be conducted with dataquieR, and finally how to put everything together by creating a report.

dataquieR generates extensive data quality reports based on the input of the study data, of course, but it also relies heavily on metadata, which are attributes that describe the expectations about the study data. Such expectations can be quite diverse, ranging from the number of expected observations in a data set to properties of single variables, such as data type or inadmissible values. The main strength of data quality reporting with dataquieR is that it is based on a formal data quality framework, which provides a workflow for data quality assessments. The framework has a hierarchical structure, starting with dimensions, which we can see here: integrity, completeness, consistency, and accuracy. These contain domains at the next level below, with indicators nested within each domain. Indicators show the actual or potential deviations from the requirements specified in the metadata, for example regarding the completeness of the study data. In this way, the basis of the data quality indicators are checks of observed data properties against formalized expectations, and this is the main idea behind the implementations in dataquieR.
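To make the core idea concrete, here is a minimal, hypothetical sketch of checking observed data properties against formalized metadata expectations. The function name, the metadata keys, and the example variable are all made up for illustration; this is not dataquieR's API, just the general pattern it builds on.

```python
# Hypothetical sketch of the core pattern: observed data properties are
# compared against formalized expectations stored as metadata.
# All names and metadata keys here are illustrative, not dataquieR's schema.

def check_against_expectations(values, expectations):
    """Compare a variable's observed properties with its metadata expectations."""
    issues = []
    # Expectation: data type of every observed value
    expected_type = expectations.get("data_type")
    if expected_type and not all(isinstance(v, expected_type) for v in values):
        issues.append("unexpected data type")
    # Expectation: admissible value range
    lo, hi = expectations.get("admissible_range", (None, None))
    if lo is not None:
        n_bad = sum(1 for v in values if not (lo <= v <= hi))
        if n_bad:
            issues.append(f"{n_bad} inadmissible value(s)")
    return issues

# Example: age is expected to be an integer between 18 and 90
print(check_against_expectations(
    [25, 40, 130], {"data_type": int, "admissible_range": (18, 90)}))
# → ['1 inadmissible value(s)']
```

The indicators in the report are then essentially aggregations of such deviations across all variables and expectation types.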
To be usable, the study data and the metadata must be organized in a structured form. For dataquieR, a spreadsheet-type structure with several tables is necessary. The first table, here on the left, corresponds to the study data, where columns correspond to the variables and rows represent the records or observations per participant. In the metadata, the rows now contain the information on the variables; this is the table on the top right. We can see that we have variable names, variable and value labels, missing codes, and so on. The metadata here also includes information to control the output of the report. It is important to highlight that this is enriched metadata: it is more than a normal data dictionary, as it includes additional columns that relate specifically to data quality. In this way, dataquieR uses the relations between the study data (for example, a table that contains IDs and clinical measurements) and the metadata attributes to conduct data quality reporting.

Because dataquieR is based on a data quality framework, there are different metadata requirements for the different data quality checks, which need to be conducted in a sequence. Starting from the top level, for the integrity dimension we have the data frame level metadata, which refers to descriptions and expectations about the provided study data frames; these are the actual tables with the study data, or just the tables that need to be checked initially. Then we also have the segment level metadata, which includes descriptions and expectations about study segments. These are, for example, the different examinations of a study, where variables are nested depending on which part of the examination they were measured in. These tables depend on the study organization, and this is what is checked after the initial data frame level check. Next we have the item level metadata, which refers to descriptions and expectations about single data elements.
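As a rough illustration of what "enriched" means here, the sketch below contrasts a plain data dictionary row with the extra data-quality columns. The column names and codes are invented for this example and do not reproduce dataquieR's actual metadata schema.

```python
# Hypothetical item-level metadata rows. A plain data dictionary would stop
# at names, labels, and types; the data-quality attributes (missing codes,
# value labels used for checks, ...) are the enrichment described in the talk.
# Column names and missing codes are illustrative, not dataquieR's schema.
item_level_metadata = [
    {"var_name": "sbp_0", "label": "Systolic blood pressure (baseline)",
     "data_type": "float", "missing_codes": [99997, 99998],
     "value_labels": None},
    {"var_name": "sex_0", "label": "Sex (baseline)",
     "data_type": "integer", "missing_codes": [99998],
     "value_labels": {0: "female", 1: "male"}},
]

for row in item_level_metadata:
    print(row["var_name"], "->", row["label"])
```

Each row describes one variable of the study data, so the metadata table has as many rows as the study data table has columns.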
These would be the variables, or items: for example, every column in the study data table. In addition, we have a missing table that allows us to indicate how the missing values were coded in the data. In the consistency dimension, we have item level metadata as well, but we also have cross-item level metadata, which contains descriptions and expectations about how to use groups of two or more data elements jointly for the purpose of data quality assessment. For instance, this can be contradiction checks, where we compare two variables to see whether there is a contradiction between them.

Now that we have a basic understanding of the metadata schema that is needed to use dataquieR, we will continue with some examples that illustrate how this can be put into practice. We have prepared five examples of data quality checks that you can perform with dataquieR. We use here a data set from our local cohort study, SHIP, but we have added noise and some artificial data quality issues. At the top, you see metadata about the provided study data frame; we call this the data frame level metadata. We can give the expected number of variables as the element count and the expected number of observations as the record count. You can also link to a reference data frame, which contains the IDs of participants, and specify whether this list should be matched exactly or whether your data set should contain at least a subset of these identifiers. We can also specify whether the entries in the ID variables may be repeated and whether there can be identical observations for different identifiers.

A large study might consist of several parts: for example, you could first have an introductory interview, then a physical examination and a questionnaire. We call these the segments of a study. Here you can see the segment level metadata.
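The data frame level checks just described can be sketched generically as follows. This is a simplified, hypothetical illustration of the idea (expected element count, expected record count, duplicate identifiers), not dataquieR's implementation; all names are made up.

```python
# Hypothetical sketch of data-frame-level integrity checks: expected number
# of variables (element count), expected number of observations (record
# count), and duplicated identifiers. Not dataquieR's implementation.

def dataframe_level_checks(rows, id_var, element_count, record_count):
    """rows: list of dicts (one per observation); returns unexpected findings."""
    findings = {}
    n_vars = len(rows[0]) if rows else 0
    if n_vars != element_count:
        findings["unexpected_elements"] = n_vars - element_count
    if len(rows) != record_count:
        findings["unexpected_records"] = len(rows) - record_count
    ids = [r[id_var] for r in rows]
    dup_ids = len(ids) - len(set(ids))
    if dup_ids:
        findings["duplicated_ids"] = dup_ids
    return findings

# Example: one record too many, and participant 2 appears twice
data = [{"id": 1, "sbp": 120}, {"id": 2, "sbp": 118}, {"id": 2, "sbp": 118}]
print(dataframe_level_checks(data, "id", element_count=2, record_count=2))
# → {'unexpected_records': 1, 'duplicated_ids': 1}
```

An empty result would correspond to the "no issues in this domain" summary shown for the example data.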
Here you can enter the expected number of participants for a specific segment of the study, their identifiers, and again the possible presence of identical observations across different participants. This metadata enables a number of data quality checks in dataquieR. For example, a single function call performs several checks on the data frame level and produces a summary table on unexpected data elements, data records, and duplicates. As you can see here, there were no issues for the example data in this domain. The check on the segment level is quite similar: again, it is a single function call, which performs the checks for each segment of the study. Here you can see that there were some unexpected data records for some of the study segments.

Another important issue for observational studies is missing values. Ideally, the reason for a missing value should be recorded during data capture, so that you know whether a value is missing because there was a technical problem, or because a question or examination was not applicable or was declined by a participant. With dataquieR, it is possible to list these missing values with their labels, or you can link to a missing table, which maps these missing value codes to codes from the American Association for Public Opinion Research (AAPOR). With this, you can calculate rates such as the refusal rate or the non-response rate. For our example data, you will get an output like this. If you look into such an output, you should look out for large deviations, because these might point to an issue with a specific examination or question.

As you have heard before, it is possible to check for contradictions with dataquieR. The contradictions are stored in the cross-item level metadata. With the newest version of dataquieR, you can use a REDCap-inspired notation to specify these contradictions.
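Here is a simplified sketch of why labelled missing-value codes make such rates computable. The codes and labels are invented, and the rate below is a deliberately simplified illustration of the idea, not the exact AAPOR definition of the refusal rate.

```python
# Hypothetical sketch: once missing-value codes carry labelled reasons,
# rates such as a refusal rate become computable. Codes and labels are
# made up, and this is a simplified rate, not the exact AAPOR definition.
MISSING_TABLE = {
    99996: "technical problem",
    99997: "not applicable",
    99998: "refused",
}

def refusal_rate(values):
    """Share of eligible responses that were coded as 'refused'."""
    refused = sum(1 for v in values if MISSING_TABLE.get(v) == "refused")
    eligible = sum(1 for v in values if MISSING_TABLE.get(v) != "not applicable")
    return refused / eligible if eligible else 0.0

# Mixed observed values and missing codes for one variable
values = [120, 118, 99998, 99997, 135, 99998, 99996, 124]
print(round(refusal_rate(values), 3))
# → 0.286
```

Computed per variable or per examination, unusually large values of such a rate are exactly the "large deviations" one should look out for in the report.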
For example, the first contradiction rule here states that the age of participants at the follow-up examination should not be below their age at the baseline examination. For our example data, you can see that this error occurred for 5% of the participants, as shown in the bar chart, which was generated by the function call on the left.

Finally, let me show you two data quality checks from the accuracy dimension. Let's look at outliers. This information is stored in the item level metadata, and currently there are four methods available for outlier detection; dataquieR uses them in combination. If not all of them should be performed, you can specify the subset of the available methods that you want to use in the univariate outlier check type column. The n rules column states the number of methods that are required to agree in order to flag a single observation as an outlier. For our study data, here for age at the baseline examination, you can see in the plot how many rules flagged each observation as an outlier. This helps to develop a better understanding of the gravity of the problem.

Another issue might be that measurements are affected by the different examiners or devices used in the study. You can store the information on the different examiners and devices in the grouping variable columns of the item level metadata. Additionally, there are modeling approaches that require you to specify co-variables, and you can add them in the co-variables column. For example, you can see that we have added age at baseline and sex at baseline as co-variates. Here, for example, is the analysis of the systolic blood pressure measurements. We check for a possible bias introduced by the different examiners by estimating marginal means. Estimating marginal means means that you first fit a regression model using the grouping variable and the co-variates; then you estimate marginal means for the grouping variable from this model.
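The consensus idea behind the n-rules column can be sketched generically: several detection rules vote, and an observation is only flagged when at least n of them agree. The four rules below are common generic choices (3-sigma, Tukey fences, robust z-score, a wide-fence rule) picked for illustration; they are not necessarily the four methods that dataquieR implements.

```python
# Hypothetical sketch of consensus-based univariate outlier flagging:
# several rules vote per observation, and only observations with at least
# n_rules votes are flagged. The four rules are generic illustrative
# choices, not necessarily dataquieR's four methods.
import statistics

def outlier_votes(values, n_rules=2):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    q = statistics.quantiles(values, n=4)          # [Q1, median, Q3]
    iqr = q[2] - q[0]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    rules = [
        lambda v: abs(v - mean) > 3 * sd,                          # 3-sigma
        lambda v: v < q[0] - 1.5 * iqr or v > q[2] + 1.5 * iqr,    # Tukey fences
        lambda v: mad > 0 and abs(v - med) / (1.4826 * mad) > 3.5, # robust z
        lambda v: abs(v - med) > 4 * iqr,                          # wide fence
    ]
    votes = [sum(rule(v) for rule in rules) for v in values]
    return [(v, n) for v, n in zip(values, votes) if n >= n_rules]

# Age at baseline with one implausible value
ages = [34, 36, 38, 40, 41, 43, 45, 47, 49, 51, 130]
print(outlier_votes(ages, n_rules=2))
# → [(130, 3)]
```

Note that the 3-sigma rule alone misses the value 130 here, because the outlier itself inflates the standard deviation; the vote count per observation is what conveys the gravity of the problem, as in the plot shown in the talk.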
Here, the estimated marginal means for the examiners are shown by red or blue diamonds with a confidence interval. This is plotted on top of violin plots and box plots, which are derived from the data directly, so that you can compare the estimates from the model with the actual trend in the data. This helps to spot whether there is perhaps just a mis-specified model, or whether there really is a trend in the data that you have identified. On the right-hand side, we also show the density across all examiners, and the solid red line shows the overall mean. The deviations from the mean, shown by dashed lines, are derived from user-defined protocols.

So you have seen that you can call a lot of data quality checks on their own as single functions, but you can also compile a complete report using the code that I show here. You can see that it takes only a few lines of code to create a full report. Which data quality checks are included in the report depends, of course, on the available metadata. For the metadata argument, you can pass the name of a data frame as a string, a file name, or a URL. These tables are cached when they are looked up, so if you use one again, it will already be there in the cache. You can also populate the cache manually.

If you would like to know more, please have a look at our documentation. You can also check for updates on our GitLab project. And if you are interested in this topic in general, please consider joining our data quality user group. We are currently setting it up, and we plan to have monthly meetings to discuss new developments and existing tools. So thank you very much, and we are happy to answer any questions about dataquieR.

I did see one question in the chat from Dr. Higgins. I saw Johnny was in the chat earlier. Did you see that, and are you able to unmute yourself?
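The estimated-marginal-means procedure described for the examiner analysis can be sketched generically: fit a linear model with the grouping variable plus the co-variates, then predict each group's mean at the average co-variate values. Everything below (data, variable names, the simulated +5 mmHg examiner bias) is made up for illustration; this is the general technique, not dataquieR's implementation.

```python
# Hypothetical sketch of estimated marginal means for an examiner effect:
# fit a linear model with the grouping variable (examiner) and co-variates
# (age, sex), then predict each examiner's mean at the grand means of the
# co-variates. Simulated data; not dataquieR's implementation.
import numpy as np

rng = np.random.default_rng(0)
n = 300
examiner = rng.integers(0, 3, n)        # grouping variable: 3 examiners
age = rng.normal(50, 10, n)             # co-variate
sex = rng.integers(0, 2, n)             # co-variate
# Simulate systolic blood pressure with a +5 mmHg bias for examiner 2
sbp = 120 + 0.5 * age + 3 * sex + 5 * (examiner == 2) + rng.normal(0, 8, n)

# Design matrix: one dummy per examiner (no intercept) plus the co-variates
X = np.column_stack([examiner == g for g in range(3)] + [age, sex]).astype(float)
beta, *_ = np.linalg.lstsq(X, sbp, rcond=None)

# Marginal mean per examiner: prediction at the grand means of the co-variates
emm = [beta[g] + beta[3] * age.mean() + beta[4] * sex.mean() for g in range(3)]
print(np.round(emm, 1))
```

Because all three predictions use the same co-variate values, differences between the marginal means isolate the examiner effect from differences in the age and sex composition of each examiner's participants; here the estimate for examiner 2 recovers roughly the simulated 5 mmHg bias.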
Question from the chat: this data-check approach will generate data queries for certain measurements, such as outliers that might be real. Is there a way to record that such data have been manually confirmed, so that rerunning the checks will not generate the same queries again?

I cannot start my video right now, but I hope you can hear me. Yes, that's a nice idea. Currently that is not possible. But the outliers would not be removed; you would just get a red dot for that observation, so it does not affect the further processing of the data. That's a good question, though; thank you for the questions, really nice.

And it looks like someone else is just looking for the link to the GitHub repo; maybe you can drop that in the chat. Yes, we will do that. Thanks. Thank you.