Beth, thank you for the introduction, and I want to thank the organizers for inviting me to give this presentation. This was really an interesting time for me, because when Quarto came out and I saw some announcements, I thought it had a lot of promise, and I thought a good way to learn it would be to learn it under fire by writing a book while I was learning it. That's what I did, and it also taught me various ways I could extend Quarto. I would say it turned out to be a very successful experiment, and I've been very happy with it. Since then I've rewritten two other books that are much larger, more than 500 pages, in Quarto, converting from either LaTeX or Markdown, and that conversion process was very successful too.

So what I'm going to be talking about in our workflow are these areas here, including a little bit about report formatting, a little bit about importing data, and getting an overview of data, including filtering of observations and missing data patterns, and learning about your data in sort of a meta sense: can we do an overall or higher-order analysis of our data? Then a little bit about data processing, and this is where the talk actually gets the most controversial, because R gives you so many different ways to process data. What I'm suggesting in this electronic book and in this talk is really a way of living. It's a way of living as a data scientist, statistician, or data analyst that is, to me, a very logical and very linear way to think. Those of you who use the tidyverse will see a drastic difference between this approach and the tidyverse approach, so that's probably where a lot of the controversy is.

I'll get a little bit into descriptive statistics and analysis. I won't really cover very much of that; it's so project specific. But I will talk about some principles, and a key principle is staying close to the data. One of the ways we translate that phrase is that we never dichotomize ordinal or continuous variables when we're analyzing data. That's the data analysis sin we want to avoid, which mainly means that tables are no longer viable as a tool in statistics unless all of your variables are already categorical, especially your baseline or independent variables. So there's controversy there. And then there are chapters each on caching, parallel computing, and simulation. I won't really cover any of that; those are more standalone topics.

If you look at the preface to this e-book, there's a lot of motivation and a lot about what I'm trying to accomplish, and then resources for learning Quarto and some other details. Then there is chapter two, which is a very detailed case study. You see this diagram that comes up here? This is Mermaid. You can use this book as a template for Quarto, because if you click on that Code button you can see the source, and you see how easy it is to specify these little relationships. In Quarto these are rendered instantly, because Mermaid is pre-installed with Quarto, so you immediately get these very simple flow diagrams. This case study illustrates a lot of the things that are in the book and a lot of the philosophy of preparing data for analysis and analyzing data, and I would encourage you to go through the case study first before really looking at the individual chapters. But what I'm going to be doing is going through things in individual chapters. I'm not going to cover any of the basics, but this is one way to help people learn R for those who are not already very exposed to it.
So let's get into report formatting. There are many aspects to formatting, including how you lay out figures, and Quarto has some really wonderful capabilities there. As we saw earlier in the conference, you can also make customized HTML tables that are as advanced as you want. And then there are ways to place things: do you want to place things in marginal notes? Do you want to place them in tabs? Do you want to expand or hide material? I have a GitHub repository with a file called reptools, for reporting tools. These are helper functions that were written largely to extend what Quarto can do, such as making dynamic tabs and dynamic marginal content, and I'll show you a little bit about how those are used.

Then you have decisions to make about your overall format. I didn't list Microsoft Word, because I use that as seldom as I humanly possibly can, but Quarto has really great support for Microsoft Word. And there is a section in here about how to do multi-format reports, because some of your content needs to be dynamic. For example, when you're producing PDF you can't have an interactive graphic, so for PDF you might use ggplot2, but if you're producing HTML the graphic can be interactive, so you can use plotly. You might have code in your report that is sensitive to which format is currently being produced and makes different graphics calls for PDF versus HTML.

A big theme is metadata, or the data dictionary. At the simplest level this involves variable labels and units. There is a section in the book making the claim that you really don't want highly descriptive variable names, because that means the variable names are so long that they're unwieldy and your code is actually much harder to read. But you do want fully descriptive variable labels. Those labels are attached to the names in a variety of ways and can be dynamically looked up and popped up in RStudio windows, and I go into that in some detail. So I use the variable labels for the full description, and I also make heavy use of units of measurement for better annotating plots and tables. So that is about annotation of reports, and there's other material in this chapter that goes into detail about tables and multiple output formats.

The next thing I want to cover is analysis file creation. We typically import things like binary files, which could be a SAS, Stata, or SPSS dataset, or we import CSV, or we have an API that goes directly into a system such as REDCap, and we may want to import 50 files at one time in a loop; I give examples of how to do that very easily. When we're importing, we want to have manageable variable names and we want to add data dictionary content, or metadata, so we can import labels. We can have external metadata stored in another place that is then associated with your primary data to define labels, units, and comments about variables. And then we need ways to view data dictionaries, so the book tells you how to call a function that will pop up a data dictionary in the RStudio viewer window. You can use that data dictionary as a guide to help you remember variable names and help you formulate your analysis code.
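Here is a minimal sketch of that idea, with hypothetical variable names: it attaches labels and units using Hmisc's upData (described in more detail in the next section) and then prints the resulting data dictionary with contents():

```r
library(Hmisc)
# A minimal sketch: attach metadata with upData(), then view the data
# dictionary with contents(). Variable names here are hypothetical.
d <- data.frame(sbp = c(120, 135, NA), sex = c(1, 2, 1))
d <- upData(d,
            sex    = factor(sex, 1:2, c('male', 'female')),
            labels = c(sbp = 'Systolic Blood Pressure', sex = 'Sex'),
            units  = c(sbp = 'mmHg'))
contents(d)   # names, labels, units, levels, numbers of NAs
```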
So there is a section about importing, and I just want to mention that this is one way you can import a dataset that doesn't have any metadata with it: you can associate the metadata on the fly very easily using an old function in the Hmisc package called upData. You can see that we're starting with the dataset d and we're renaming various variables; the .q function is just a quoting function in the Hmisc package, so you don't have to use as many quote marks. Then we define a little function, because we have a whole list of yes/no variables that are coded exactly the same way and we want to make them factor variables with levels yes and no, so we're transforming those as upData is building the dataset. We're defining levels of a variable, we're dropping variables, we're defining labels, and then we're defining units of measurement. This is what you might do when you don't have any external metadata and you don't have any metadata coming directly from your data system. We'll see in a minute that when you're importing from REDCap you'll have metadata, so you won't be doing this, except for units of measurement. REDCap used to fully support units of measurement; Paul Harris took that out a number of years ago, and I've been lobbying him ever since to put it back in, because I think units of measurement are really fundamentally important attributes for continuous variables.

There are separate sections in here on how to deal with various kinds of input data, whether it's binary files from Stata, SPSS, or SAS, Excel files, or reading multiple files. And for REDCap, I'll just mention this is for people who are not using one of the at least two R packages for reading directly from the REDCap API. This is using the importREDCap function in our GitHub repository. When I say importREDCap(), that will read in the last exported file from your directory; it will read the R code and the CSV file and clean them up. You know, REDCap makes two copies of all your variables if they're categorical; we don't really want two copies of variables, and we don't want the word "factor" to be part of a variable name. So this cleans up a lot of things and creates a very streamlined data frame with all the metadata we can get out of REDCap when you're not going directly through the API. It's very simple to get streamlined, ready-to-analyze REDCap files just using the standard REDCap export into CSV and .R files. So that's all I'm going to cover on analysis file creation.

Then the missing data section: we all have to deal with missing data, and there are a lot of ways you can understand your data. We can look at the extent of missing data per variable or per observation; both are very important. We can look at patterns: how does missingness cluster? If you were sequentially excluding variables, which variable is missing the most, and then if you were excluding on the basis of missingness on other variables, how many exclusions would you get? And then there are interesting patterns you can look at, such as the association between values of non-missing variables and the number of variables missing per observation.
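As a minimal sketch of the basic quantities just listed (the book's missChk function packages these, and much more, automatically):

```r
library(data.table)
# Count missingness per variable and per observation; toy data.
d <- data.table(age  = c(70, NA, 55),
                crea = c(NA, NA, 1.1),
                sex  = c('m', 'f', 'f'))
colSums(is.na(d))           # extent of missingness per variable
rowSums(is.na(d))           # number of missing variables per observation
table(rowSums(is.na(d)))    # distribution of per-observation missingness
```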
So this is actually a statistical analysis that's automatically done, where you run an ordinal logistic regression to understand the strong predictors of the number of missing variables an observation has, and whether you have more missing data for the sicker patients or for the less sick patients or whatever. That gives us a statistical analysis of missingness patterns. So there's this reptools file that defines a lot of helper functions; we're importing a dataset from our dataset repository that's fully annotated, the support dataset, converting it to a data.table, and running the missChk function on it, saying that we do want to predict the number of missings per observation. There's a variable we want to omit while we're doing that, because it's really kind of redundant. You see the number of missings per variable and per observation, and the minimum, maximum, and mean numbers. So out of a thousand patients, we tend to have 141 people missing on a variable, and we have the whole distribution of the number of missings per variable or per observation.

Then the missChk function creates all these tabs for you automatically; it gives you a sort of packaged analysis. The number of missing values per variable is a dot chart, and the number of missing variables per observation is a different dot chart. This one has been handy for me: the mean number of other variables that are missing when the indicated variable is missing. So when serum creatinine is missing, that bottom dot there, on average nine other variables are missing on the same patient. It's just one of many ways to understand the missingness patterns. You can look at sequential exclusions: adlp has the highest number of missings, but you have additional patients that are missing on other variables. We can look at all possible combinations of missingness, whoops, using this kind of dot chart, and if you hover over things you get a lot more information about the combinations. This is using plotly-type graphics, so you can see marginal counts, combination counts, and more information at the top. And this is where a model is run to predict the number of missing variables on the basis of the values of the non-missing variables, the variables that are never missing. You can see that time until death has the strongest ability to predict the number of missing values; it turns out people who die quicker have more missing data at baseline, because a lot of them are on a ventilator and unable to be interviewed to get certain data points. So that gives you a lot of different ways, in these tabs, to explore missing data.

Then data checking: we'd like to do range checks and cross-variable consistency checks, and we'd like to do this with minimal coding and get listings and summaries. If you look at the way most people do this, it takes a lot of coding, and you can really capitalize on R to do it with a lot less, because you can create expressions for your checks. These are conditions that should not happen very often. So in this dataset, someone who's younger than 30 years old or older than 90 should not occur very often in the data. Somebody who's female with a maximum heart rate above 170: I just made this up, I don't know if that occurs often or not, but it's just an example of a consistency check. Or ejection fraction between 72 and 77, or greater than 77.
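Here's a minimal sketch of that idea in plain data.table, storing checks as unevaluated expressions and evaluating each against the data (the book's dataChk function wraps this up with per-check tabs and ID-level reports; variable names are illustrative):

```r
library(data.table)
# Store data checks as expressions, then evaluate each within the data.
d <- data.table(id = 1:5, age = c(25, 40, 95, 60, 33),
                sex = c('f', 'm', 'f', 'f', 'm'),
                hr.max = c(150, 120, 160, 180, 140))
checks <- expression(age < 30 | age > 90,
                     sex == 'f' & hr.max > 170)
for (ck in checks) {
  flagged <- d[eval(ck)]            # rows meeting the (undesirable) condition
  cat('\nCheck:', deparse(ck), ' n =', nrow(flagged), '\n')
  if (nrow(flagged) > 0) print(flagged)
}
```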
You just have this whole array of expressions, and then you pass it to dataChk and it will execute the array, passing through the observations. We can tell it to also give a report by ID, and you get a separate tab for each check with the list of observations that met the criterion. See, that was age less than 30 or age greater than 90, and here's one for a combination condition. You can also get a summary by ID and site, with the check that was triggered and the values of the variables that led it to trigger as a positive check, and you can get a summary giving counts of how often certain checks were triggered. Those are all produced automatically by virtue of having dataChk create tabs that Quarto knows about. So that's data checking.

Then the data overview: we want to look at observation filtering, so we need to create diagrams into which we insert computed counts. We can do that many different ways, but two ways are with consort diagrams and with general-purpose Mermaid. We also want to get more data about the data. We have various missing value snapshots, some of which were presented just a few minutes ago, and we can have data characteristics; hopefully that will make sense in a minute when I show you an example. That includes breaking variables down into discrete versus continuous. How many ties are there in the data? How much information is in the data? This is something you're probably not used to seeing, but we have an information measure that goes from zero to one: a variable with near-zero information would be a binary variable where almost every observation is a zero, a very low information content, whereas a high-information variable would be a continuous variable with no ties in the data. Also measures of symmetry of the distribution, rare values, and common values.

For filtering of observations, here's an example of using a data.table to simulate a clinical trial where, as different criteria are met, people are randomized to treatment A or B. That's our simulated clinical trial data. Then we set up the counts we're going to need, which are passed to the consort package using the consort_plot function, giving it the labels for the various branches or nodes and certain other information, and we get a classical consort diagram. You have options for putting in sort of large categories to organize your chart, but that's just a standard consort diagram. You can actually more naturally make a consort diagram if, instead of using consort_plot itself, you use all these building blocks that the consort package comes with, like adding a box and adding a side box. I just found it a little more intuitive to calculate things myself and pass them to these building blocks, and I get the same output; just the inputs are much different.
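A minimal sketch of the building-block style, assuming the consort package's add_box, add_side_box, and add_split helpers and made-up counts:

```r
library(consort)
# Build a simple consort diagram box by box, with counts computed by you.
g <- add_box(txt = 'Assessed for eligibility (n=200)')
g <- add_side_box(g, txt = 'Excluded (n=40)')
g <- add_box(g, txt = 'Randomized (n=160)')
g <- add_split(g, txt = c('Treatment A (n=80)', 'Treatment B (n=80)'))
plot(g)
```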
And then with Mermaid you can make all sorts of diagrams, but Mermaid is going to construct the diagram the way it wants to, so it won't meet your actual consort rules. One of the things that makes this pretty general: when I define the Mermaid markup here, it's all in quotes under this object x, and this brace-brace notation is a macro. Using the knit_expand function in the knitr package we can pass macros, just like passing R variables: with {{ }}, the variable values are inserted at that point in the chart. This is a very general way to associate variable content with nodes, or parts of nodes, because the values become parts of the node labels. If you use this little makemermaid function that I wrote, this is where you define all the variables that are referred to by these symbols up here. Like this one, which is just a count of how many people were on treatment B and had the response variable measured, whereas this one is a count of people on treatment B whether or not they had the response variable assessed. When you pass that through makemermaid, it creates the Mermaid diagram. This allows you a lot more flexibility than a consort diagram; it doesn't meet the classical formatting criteria, but I think it's still useful for a lot of things.

You can also have nodes that are more dynamic, with pop-up content. This is done with a callback, where I have a table that pops up when I click on a certain area. This is one of the few problems I've had with Quarto; it's a little bit buggy with things like this. It's supposed to work so that you hover and see a pop-up table, but I found it really browser dependent, so that's not quite there yet; I've had it work with some browsers, but not with the main browsers that are used. So that's supposed to be a live link. You can also have URLs and link to things by clicking within the nodes of the Mermaid chart.
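Here's a minimal sketch of the {{ }} macro substitution with knitr::knit_expand(); the counts are made up, and in a Quarto document the expanded text would be fed to a Mermaid chunk rather than just printed:

```r
library(knitr)
# Mermaid markup held as a string, with {{ }} macros for computed counts
x <- "
flowchart TD
  A[Randomized<br>n = {{nrand}}] --> B[Treatment B, response measured<br>n = {{nB}}]
"
cat(knit_expand(text = x, nrand = 160, nB = 74))   # values replace the macros
```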
Descriptive statistics is a very wide area, and we need to really respect the nature of the data. There are various tools I use over and over for descriptive statistics, the primary one being the describe function, a very old function in the Hmisc package for univariate summaries. It gives you statistical summaries and also graphical summaries, and for continuous variables it gives you these little spike histograms, which are nearly full-resolution displays of your distribution; you'll see examples in a minute. For categorical variables we just have frequency tables and frequency dot charts, and for continuous variables, besides spike histograms, we use extended box plots. Then for longitudinal data there are special displays you need that I won't really be covering now, and there are also special displays for multiple-category event charts and timelines, and various displays for showing relationships, such as variable clustering.

So let's look at a couple of examples very quickly. This is reading the stress echo dataset, and I've made it so that Quarto hides the output by default; I click and I can expand it. I see statistics that are appropriate to each type of variable, and you see it uses a different font for the label compared to the variable name, and a different font for the units of measurement. You see quantiles, Gini's mean difference, which is better than the standard deviation for measuring dispersion, the mean, the information measure, and how many distinct values there are. And then the most important output is the raw data distribution at high resolution, with one or two hundred bars. You can see extreme skewness, and you can see extreme digit preference here in blood pressure. You won't see digit preference when you make the regular plots most people use, such as histograms with broad bins or CDFs; you won't see it in a CDF very well. But you can see bimodality and digit preference and all kinds of things when you show the data at high resolution, and it doesn't take much space in the report to do that. This is just capitalizing on HTML output, which can contain little widgets, little graphs that are converted to base64 encoding for use in standard self-contained HTML. So that's what the describe function looks like, and you see that for binary variables and so on, and then there are ways to pop up various things in windows to help you view the output, which is a guide when you really get to the analysis.

Here's an example where we use the maketabs function in the reptools repository, where we're plotting something that produces two plots: one labeled categorical variables and one labeled continuous. For the categorical variables you just see proportions in each category; this is a variable with two categories, here's one with three categories, and you have these three proportions for history of cigarette smoking. When you have continuous variables, you see the spike histograms. Now, that gets more interesting when you say: let's make these plotly graphics, make tabs like we did before, and make them dynamic tabs. Now when you hover over something you see a lot more information; you see the statistics that were in the table the describe function created. These are the categorical variables, and you always see the numerator and denominator, so there's never any question about what the fractions represent; if there were missing data, these dots would be color coded by the amount of missing data. Then for continuous variables you see what we saw before, but you can see the individual bars. This is at 138; these were binned to the nearest two millimeters of mercury, and there were eight patients in the 138 bin. But then, importantly, when you go over here you see all the detailed distribution characteristics, the number of missing values, the quantiles, and so on. So that's the plotly version, which works in any HTML report and gives you this hover ability to look up much more information.
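A minimal sketch of the describe() call behind these displays, assuming the dataset name stressEcho in the hbiostat data repository that getHdata pulls from:

```r
library(Hmisc)
# Univariate summaries; in HTML output the continuous-variable summaries
# include the spike histograms discussed above.
getHdata(stressEcho)
describe(stressEcho)
```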
There are a lot more ways to get summary data, and the summaryM function is for summarizing multiple variables. I give a formula like this that has lots of dependent variables, and I'm making history of myocardial infarction the independent variable, so it is my stratification variable and the rest of the variables are analyzed separately by levels of history of MI. That formula is passed here, and we're going to use maketabs, where you can say the first tab is going to be empty; if you look down here you see a blank tab, which just means that by default nothing shows until you click on one of the tabs. The first tab is called table one, which is the HTML version of the summaryM output. Let's see what that looks like: when you show this on a wider screen, all of the formatting is automatically adjusted to be very beautiful, like you're used to seeing in a table, using different fonts and so on. So that is the HTML version of summaryM. If we want to look at a categorical variable plot, we see something like this, and if we want continuous variable plots, we see extended box plots. If you hover, you'll see the definition of what the corners represent: this corner represents the 95th percentile, and here's the 5th percentile down here, which is 104, and you have the mean, the median, and the intervals that contain 25%, 50%, and 75% of the data. That's how you look up what the points in the extended box plot mean; that's a lot more information than a regular box plot. You can also do interactive spike histograms that have the same information on them as a box plot, and with plotly you can turn certain things off: if you decide you don't currently want to display something, you just click in the legend and that takes it away, and you can click to put it back. So that is a whirlwind tour of the various descriptive statistics I use as standard.
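Here's a minimal sketch of a summaryM call on the support dataset, assuming its disease-group variable dzgroup as the stratification variable:

```r
library(Hmisc)
# Summarize several variables, each stratified by disease group.
getHdata(support)
s <- summaryM(age + sex + meanbp ~ dzgroup, data = support)
print(s)    # plain-text version; html(s) gives the formatted table in Quarto
```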
Data manipulation and aggregation is where we tend to spend more of our time, and here is where data.table is really the central component. The biggest take-home message of this talk is: if you haven't already learned data.table, you need to start today, because it's probably by far the most important add-on package in all of R. It gives you a logical way to work; it is super logical, coherent, consistent, and blazing fast. It has nothing but pluses, but you do have to learn it. I think one of the things we face as data scientists is that there are people who want to get up and running as a data scientist without mastering their trade, and to me, if you don't master base R methods and code, and if you don't master data.table, you're really not mastering your trade. Data.table is not the fastest thing you'll ever learn, but every hour you invest in learning it will pay off with multiple hours of personal efficiency.

Using data.table and base R we can do subsetting, we can modify variables, and we can do various recodes and reshapes, and this chapter goes through all of that. This is the schematic that really explains data.table in a nutshell. You have a data table, which looks a lot like a data frame; it's rectangular. You have something that tells you what rows you're operating on, what to do and what columns you're operating on, and then grouping; you can also have an on operator when you're matching on certain variables that might have different names, and then you can have lots of extra arguments. This chart goes into a little more detail to help you with that. When you type the name of a data table and a bracket, you're entering the environment of that data table, and everything you do is within that environment. So you enter the environment and you're dealing with rows, columns, and by or on: the rows to fetch or to change; the columns to fetch, create, or redefine, where creating and redefining columns is done with a colon-equals; and then by for grouping or on for matching. When you get out of that environment, you have to figure out what you're left with. When you leave the environment by closing the bracket, if you didn't have a colon-equals, you might just have printed or plotted the output, or assigned the result to a new data table. If you had a colon-equals, that means you changed the data table in place, and if you have a very large data table, being able to add a new column in place is a major time savings. So you can add and modify columns, leaving the original data table in place. That's really the setup for data.table.
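A minimal sketch of the d[i, j, by] pattern and := on a built-in data frame:

```r
library(data.table)
d <- as.data.table(mtcars)
# i selects rows, j computes, by groups
d[cyl == 4, .(mpg.mean = mean(mpg), n = .N), by = gear]
# := adds or modifies a column in place, without copying the table
d[, wt.kg := wt * 453.6]   # wt is in 1000-lb units
```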
Then, in writing this book, I had a bunch of notes dating back 10 and 20 years on how to do common tasks. I had all these code snippets lying around, and I decided they weren't useful to me or anyone else unless I organized them, so I brought them in from a variety of little files, annotated them, and tried to write down the tasks we need to do all the time: finding observations in November, finding all the observations from any month with "mb" in the name of the month, combination criteria, pulling one or two variables out of a data table. These are things we do all the time, and I try to show prototypical examples of them. So you'll see a lot of examples here; this one is just translating an existing variable to make it uppercase, and how to analyze selected variables in subsets. This is just how you change variable names using data.table, and you can use the .q function to keep from quoting so many things, so renaming variables is very easy, and we can rename variables by matching pieces of strings, which is also very easy to do.

But when we get into recoding, you have so many options, and one of the most powerful and logical ways is with the fcase function in data.table. It recodes according to the first condition that's met: if the patient died, if this is true, the result for this x variable is going to be "death"; if x2 was either "stroke" or "MI", we assign the category "stroke/MI"; if the patient was not one of these but was symptomatic, it's this; otherwise we default to "none". fcase is just one of many ways to recode, but it's a very logical and powerful one. Sometimes your recoding is just a simple table lookup, and you don't need data.table or anything else for that; you can just use named vectors. So we can take state abbreviations and look up the state names. And then there's an example here of hierarchical recoding: we want to recode things where we have plants and animals; within plants we have vegetables and fruit, and within animals we have domestic and wild. So we have a hierarchical tree, and we can navigate that tree to do various recodes; without going into the detail, this shows you how to do that. So recoding is a big topic.
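A minimal sketch of both ideas, with illustrative variable names:

```r
library(data.table)
# fcase(): the first true condition wins, with a default
d <- data.table(death = c(TRUE, FALSE, FALSE, FALSE),
                x2    = c(NA, 'stroke', 'none', 'none'),
                sympt = c(FALSE, FALSE, TRUE, FALSE))
d[, outcome := fcase(death,                     'death',
                     x2 %in% c('stroke', 'MI'), 'stroke/MI',
                     sympt,                     'symptomatic',
                     default = 'none')]

# simple table lookup with a named vector, using base R's state data
lookup <- setNames(state.name, state.abb)
lookup[c('TN', 'CA')]
```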
Operating on multiple data tables is a little more advanced topic, but there are a lot of capabilities there that you can look at later. To get into summary statistics: we can summarize all the variables, we can summarize subsets, we can summarize using functions that return multi-dimensional results, and we can do marginal summaries, and data.table really shines for this. There's also a newer package that works well with data.table and other systems, called collapse, and I'll be adding that to this chapter, because it is a very general way to compute summary statistics and it's ultra-fast for huge datasets. But what's built into data.table is pretty amazing.

One of the things that's tricky for people to learn, but once you learn it: .SD stands for the whole data table, so you can do operations on all the variables, each variable separately. The operation we're doing here is to run the nd function, for number of distinct values, separately for every variable in data table d, and we just get a count of the unique values. That way we can say: do it for the variables that match these patterns, the variables that have "hx" in their name or have either "d" or "m" (that's an or symbol there), so we're running that analysis on just the variables that match these criteria, here just two variables. This is creating a function to calculate the mean, ignoring missing data, and we're going to run that on every variable in the dataset that is numeric, so it's very easy to put conditions on the variables you run over. Let's look at this one: we're defining the function cmult, which says the variable is not numeric and it has more than two distinct values, so it's a true/false function, and we run it to see which variables qualify for the analysis. That's like saying is.numeric, but with more general criteria. Then we summarize each qualifying variable with a very brief frequency table, hoping it doesn't have too many levels, and that's the output you get.

Just more examples of creating functions, and I'm going to make a major point here; it won't seem like much of a point. When you write a little function like this and get it debugged, you can use that function in a more complex context. Debugging it here means you can understand it and make sure it absolutely works on one variable, so you only have to debug it on one variable to make sure it's right, and then you use the same function for multiple variables. You'll see that philosophy pay off in many different ways with this approach, and I'll show you a more advanced example later. So we're just subsetting data and doing lots of different operations. And when you're using data.table you also have access to multi-way summarization with the cube function. Say we're summarizing by gender, or by gender and history of myocardial infarction: we're going to summarize baseline heart rate with the mean and with the number of non-missing observations, so we define these little functions up here and then run it by gender and history combinations. But we're doing it with cube and not a regular data.table operation, because with cube you get all possible combinations, like you expect, but then you also get all the marginal summaries: here's a summary by gender ignoring history, a summary by history ignoring gender, and the summary ignoring both, which is your overall grand summary. If you want more control over that, you can use the groupingsets function in data.table and tell it restrictions on which cross-classifications to use when summarizing the data. So that was summary statistics.
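A minimal sketch of .SD and cube() on built-in data:

```r
library(data.table)
# .SD: apply a function to every column
d <- as.data.table(iris)
d[, lapply(.SD, uniqueN)]                      # distinct values per variable
d[, lapply(.SD, mean), .SDcols = is.numeric]   # restrict to numeric columns

# cube(): grouped summaries plus all marginal and grand summaries
m <- as.data.table(mtcars)
cube(m, j = list(mpg.mean = mean(mpg), n = .N), by = c('cyl', 'am'))
```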
For merging data, I'm just going to show this: a front end to both the regular merge and the data.table merge, called Merge with a capital M. It helps you merge lots of datasets at one time, but more than that, it gives you a report on how the merge went. First of all, how do you do merging in base? It's very easy and very fast. If you want to use this front-end capital-M Merge: I'm merging baseline with longitudinal data, merging on ID, and unless I say print equals false, I get a report. In the baseline dataset there were two variables and four observations, with four unique IDs; in the longitudinal data there were three variables, 12 observations, and four unique IDs, and only three of those IDs were in the baseline table. In the merged table there were four variables total, 13 observations, and five unique IDs, and four of the IDs in the final table were in the baseline table. And then you also have this information that's reported. So this is just to help you feel good that the merging went well.

Now, longitudinal data is a huge topic that I'm not going to cover very much of, but we have various needs for processing longitudinal data, and data.table is exceptionally powerful for it. You can deal with longitudinal data of various types and convert between types. You might have a uniform number of rows and want to do something like last observation carried forward, which we usually frown upon as statisticians, but there are built-in functions in data.table, nafill and setnafill, that make that really easy. In other cases we have a variable number of rows, and carrying forward in that context means you need to create new observations that didn't exist before, so you're going to expand the number of observations. This just shows you how to create observations past the last observation, out to the maximum possible day, and combine those with an rbind onto your original dataset.

But this is where data.table shines even more. You want to say: what is the first day on which y was greater than or equal to three? So, what is the minimum day such that y >= 3; we'll call that first3. And what's the first day on which y was greater than or equal to seven? We do that separately by patient. This is what happens when a patient never passed the threshold of seven: min() puts in an infinity, which we can go to a little trouble to convert to an NA instead.

Where this philosophy really shines is when you have more complex conditions. If you wanted the days to be consecutive, you can say that: let's say we needed z to be greater than 0.5 and the z on the previous day to also be greater than 0.5, separately by patient; that's just an and condition with a lag of your measurement. But that's still not the most general way to think about it. The most powerful and most general way is to write a function that counts. Even though you might be interested in the second day on which something happened, let's count how many days in a row something was satisfied. This function does that very concisely: when something happens two days in a row, it gives you a count of two, but then the count starts over, so this would be a count of one and a count of two there, then it starts over, and this is a count of four. So it calculates the number of consecutive true values. For your data.table operation you can say: give me the first day, the minimum day, for which the number of consecutive occurrences of z > 0.5 equals two. Once you write this function, get it to work for one patient, and get it debugged, you're using it separately for all the patients. Instead of writing a condition that says z > 0.5 and the previous value was also > 0.5, use it as a counter, because you might want three days in a row, and then you just change that two to a three. So this means the second consecutive day with z > 0.5, and once you learn how to write counter functions, and how to debug a function for one patient, which is much easier than debugging it on the whole dataset, you just plug those functions in and use them in the data.table context.
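Here's a minimal sketch of that per-patient logic; seqcount is an illustrative name for the run-counting function:

```r
library(data.table)
# Count consecutive TRUE values; the count resets to zero at each FALSE
seqcount <- function(x) {
  r <- rle(x)
  s <- sequence(r$lengths)            # 1, 2, ... within each run
  s[rep(!r$values, r$lengths)] <- 0   # zero out the FALSE runs
  s
}
d <- data.table(id  = rep(1:2, each = 5),
                day = rep(1:5, 2),
                z   = c(.2, .7, .8, .1, .9,   .6, .4, .6, .7, .8))
# first day on which z > .5 held for the 2nd consecutive day, per patient
d[, .(firstday = min(day[seqcount(z > .5) == 2])), by = id]
```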
When you're doing overlap joins, things can get a lot more complicated, and data.table is extremely powerful for that. Without going into details, what we're trying to do here is this: we have an events dataset E with zero or more records per subject, containing start and end times S and E, and a measurement X representing the daily dose of a drug given to the patient between times S and E. You have a base dataset B with one record per subject, with times C and D, and you want to compute the total dose of the drug received between C and D for each subject. You do that by finding all records in E for the subject such that the interval C to D has any overlap with the interval S to E, and for each one you compute the number of days in the interval S to E that are also in C to D. That sounds pretty foreign, but once you get your datasets set up, we run the foverlaps function in data.table; for the left-hand data table we go by ID, low, and high, and for the right-hand one by ID, start, and end, with type equals any. This calculates the elapsed time for the overlapping intervals and the total dose over that elapsed time. It sounds like a pretty complex thing, and this code is not the easiest code to read, but you can see that once you get the hang of it, the amount of programming is pretty minimal.
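A minimal sketch under the same setup, with made-up column names and data (foverlaps requires the second table to be keyed on the id and interval columns):

```r
library(data.table)
# Dosing intervals [s, e] with a daily dose, and an observation window [c, d]
ev <- data.table(id = c(1, 1, 2), s = c(1, 10, 5), e = c(4, 12, 9),
                 dose = c(2, 1, 3))
bs <- data.table(id = 1:2, c = c(2, 1), d = c(11, 6))
setkey(bs, id, c, d)
o <- foverlaps(ev, bs, by.x = c('id', 's', 'e'), type = 'any',
               nomatch = NULL)                 # drop non-overlapping rows
o[, days := pmin(e, d) - pmax(s, c) + 1]       # overlap days within window
o[, .(total.dose = sum(dose * days)), by = id]
```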
There's lots more about manipulating longitudinal data. And then graphics is a very general chapter, with some graphics principles mainly and a few tricks for using ggplot2, which I use very heavily, along with plotly. The analysis chapter is really showing you how to do better than table one by staying close to the data, and I'll sort of stop with an example of what I think is a pretty rich presentation that's a lot better than table one. This is using ggplot2. Instead of showing the age distribution in treatment A and treatment B: it's a randomized trial, so we don't need to know the age distribution by treatment; it was randomized, so it's irrelevant. What we need to know is how age relates to the outcome of the trial. The relationship between age and the probability of dying in the hospital is given by this nonparametric estimate, and then, using this addggLayers function, an extended box plot and a spike histogram are easily added to the standard ggplot2 output. So now we have the univariate distribution of age, which is the combined treatment A and B distribution, the raw data for age. Instead of stratifying by treatment and giving people confusing data, we give them the overall trial patient characteristics in one chart, augmented with the relationship to the outcome, which is data that table one doesn't usually try to show but is a lot more valuable. And this uses a little function called meltData, which is just a way to say: you have one dependent variable and lots of independent variables, and you reshape that into a tall and thin dataset to pass directly into ggplot, so that you're faceting on the variable. You get a separate analysis for each variable, with a free x scale, so as you go across variables they don't need to have the same x range.

So, sort of to conclude: I think data.table is one of our most important tools by far. I think you've seen how you can put things together with Quarto, and what I really didn't show you is how, in some of these chapters, I produce marginal graphics with some functions that make it really easy to put things in the margin. Quarto has all these beautiful formatting capabilities, and Quarto, like data.table, gives you a unified way to think about reporting: whether you're making a web page, a book, a report, or a slide presentation, Quarto can do all of that without needing new R packages. So I'll stop there and see if I've stimulated any comments from anyone. Thanks for listening.

Thanks, Frank, for an always interesting presentation; there's lots of chatter on the different things with data.table. One of the questions is: when will we move beyond static paper and PDFs for research manuscripts and FDA submissions?

That's a wonderful question. I think FDA submissions are probably going to move faster than journals. I can't really speak for FDA, even though I work for FDA now, but I think FDA will eventually have more flexibility, because it helps their reviewers review more easily: you can do a global search within an HTML report, and you can drill down to see details. I think journals are hopeless and are not going to change, and the best way to deal with journals is to quit submitting papers to them and take things under your own control. When you're able to self-publish, or put something in a preprint archive, or write a blog article, you have total control and you can use the latest and greatest interactive graphics.

Another question: do you have a catchy name for your replacement for table one? I need a catchy name or it's not going to catch on, and I'd be open to suggestions. But the purpose of the replacement for table one is to show things that are not preordained. We look at a comparison of treatments A and B that's really irrelevant in a randomized trial, especially one with more than 30 or 40 patients, but what we don't get is who's in the trial and what happened to them. If death is your outcome, was that mainly in those over 70, or were the deaths spread across all the ages? That's information people are not giving that is absolutely interesting and sometimes unexpected.

Peter suggested baseline outcome plot, BOP. Yeah, and it needs to connote that it's the baseline distribution, because it's a study population description, but also the outcome relationships; a combination.

And then Peter says he agrees on self-publishing, but that it's easy for him since he's senior, and hard for trainees and mentees who are trying to build up their reputations and get tenure. Any thoughts on how to do this without hurting a generation of junior people?

That's probably the trickiest of all the questions. I think promotion and tenure committees are doing a major disservice now by promoting the wrong people: they're not promoting people whose research is reproducible, they're just promoting people who have high-volume output or splashy results. I think the tenure and promotion process needs to be totally redone. But there was an announcement this week from NIH that's going to have a lot of ramifications: NIH is really changing how they want reporting to be done, and they want it to be much faster and not wait on journals; they want the results to be usable more quickly. So I think the step that NIH took this week, which I need to read more about, may get us going.

Yeah, and so many institutions use easy metrics, counting publications and things like that, for reviews and promotions, and it's not necessarily quality, right. So, have you considered using the code-link true setting in the Quarto YAML?

It's hard to know; that's my next thing to do. The only reason I haven't used code-link already is that I think it has a dependency I didn't have installed, so I need to look at it one more time. For those of you who don't know code-link: when you have a function call in the code that's listed in your report, I think it lets you click on the name of the function to jump to documentation for that function. I think that's pretty cool, so that's on my to-do list.

That's now on my to-do list as well, because sometimes it's hard to know what's coming from Hmisc or data.table, where things are coming from, so that might help with some of that. And I've also really appreciated being able to collapse code, so that I can insert a figure but not disrupt the flow of the document; people can see the code of how things were created, but not necessarily all the details if they don't want to.

And let me show you just one thing in this analysis chapter. There are certain analyses where you want some supplemental information; you see here this margin option, and these are some denominators that define how the smoothing was done, which will appear in a marginal table. So sometimes you put things outside the body that are supplemental information and don't get in the way of the body of the report, and other times you'll use a construct in Quarto to have an optional show/hide, collapsible section with more information. I think those are very valuable for reports too.

Yeah, and I like the use of having the figure legends in the margins; I really like that. Okay.
From Travis: I don't know why we would want to analyze baseline variable relationships with outcomes in trials that aren't designed to evaluate those relationships; you could get misleading results, for example if there's an interaction with treatment assignment.

I never want to criticize doing an analysis because something more complex might be in play, because if you don't try to analyze it, you'll never learn anything about that interaction. The presentations we do in clinical trial reports now do nothing but hide interactions, so I can't really agree with the premise of that question. But I think most trials are actually designed to do exactly what I'm suggesting: they usually have some kind of age distribution and some severity of disease at baseline, and the study report doesn't even tell us whether the patients with minimal severity of disease ever had any of our endpoints. By showing the relationships between severity of disease and the outcome, we're giving new information. So I think most clinical trials have the information you need to create these relationships. The question I'm always left with in a clinical trial is: what happened to the patients, and which patients did it happen to? And this is an attempt to answer that.

Right. Any other last questions for Frank? All right, again, thank you very much for this presentation, and we will take a quick break until 26 after the hour, coming back with Peter talking about wrangling medical data.

Thanks for having me, and thanks for the great comments and discussion.

Yeah, generally, we've generated lots of chat.

Great, I'll have to go back and look at some more of it.