Hello. Today I'm going to be presenting our package, the ctrialsgov package, which provides access, visualization, and discovery of the ClinicalTrials.gov database. This package is a collaboration between myself, Taylor Arnold at the University of Richmond, Michael Kane at Yale University, and Auston Wei at Cleveland Clinic. It can currently be installed using devtools from our GitHub website and should be installable from CRAN by the end of the month.

ClinicalTrials.gov: what is it? It's a website which provides a database of clinical studies from around the world. It's maintained by the US National Library of Medicine and the NIH. The website does a really great job of documenting and making available different forms of this database; you can easily download a dump of the entire thing online. However, there are two challenges that we have found in working with this database if you want to do visualizations and analysis of many different trials. First, the data are normalized across dozens of different tables, and you need a lot of expertise to actually put them back together. Second, a lot of the most interesting data that we want to work with is contained inside free text fields, and extracting it requires some text analysis techniques.

Here's an example of the schema. I've rotated it 90 degrees just so that we can see the entire thing, so you get a sense of the complexity and the number of different tables contained in the ClinicalTrials.gov database. It's very well documented, but you have to put it back together to really do much analysis with it. And here's an example of one trial, taken from the website interface. You can see that it's, again, very well documented, but a lot of the interesting information is in free text fields such as the brief summary and the study population.
So the goal of the package is to provide a set of tools that allow us to easily query the ClinicalTrials.gov database from within R and return a single combined data frame that we can then use for analysis and visualization. The package also provides some helper functions that allow us to visualize and explore the results, in particular some functions for working with the free text fields. I'm going to give a brief overview of the usage of the package and some of the most common options that you might be interested in, and then I'll explain some of the post-processing text analysis tools.

The first step in using the package is to load a version of the database into R's memory. There are three ways to do this, and I'll explain them in increasing order of complexity. The simplest one is to use the ctgov_load_sample function. This loads a small 2% sample of the database, which we ship inside the package. This is really great for prototyping; we use it in all of our tests and all of our examples, and I also use it in this presentation. It requires no external resources and it runs pretty quickly.

The second way to build a data set to query from, and this is the one we expect most people will want to use, is to load the entire data set. Usually if you're working with this you want to see all the clinical trials, but rather than having to produce the data yourself, you download a static version from our GitHub website; we create this static version every month. The function ctgov_load_cache will download the large files from our GitHub package release, if it hasn't already, then load them into memory and run all of the queries off of that.

The third one is the best if you really want an up-to-date version of the database, say because you want to run a report on a daily basis.
This one actually connects to a database back end, here a PostgreSQL back end, which has been populated with the ClinicalTrials.gov database. This could be done either through their API or by loading the entire data dump and running it locally. This takes a few minutes because it actually has to do some processing of the data, but then you have a completely up-to-date version of the database, and this is actually the function we use to create the other ones. Regardless of which way you populate the database (the sample, the static dump, or creating it from scratch), all the other functions work the same way. They're just working off of different versions of the database.

So that's loading the data. Once we've loaded it, everything works through the function ctgov_query, which is the main function within the package. We can provide different options to this query to search different fields of the database and return different subsets of clinical trials. I'll show you the three different types of fields, although I won't go through the more than 20 different things that you can search by; that's all in the documentation.

Some of the fields are categorical. Study type, for example, has three or four different values that it can be equal to. Here, as an example, we'll search for interventional studies, and it will return a single data frame of all of the interventional studies in the sample data set. Another categorical field is the sponsor type, and here we would get all of the interventional studies that are sponsored by industry partners.

The other kind of query that we might do is on some of the continuous variables, for example the number of people enrolled in each trial. Continuous values are queried by providing a vector with two elements, the lower bound and the upper bound, so this would give us all studies that enrolled between 40 and 42 patients.
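As a rough illustration of the field queries described above, here is a minimal sketch in Python. It is not the package's code or its R API: the `query` function, its argument names, and the four records are all hypothetical, and the sketch only shows the idea of filtering one combined table by categorical values and by a two-element range, where an open-ended bound is written as `None`.

```python
from typing import Optional, Sequence

# Hypothetical miniature table of trials; every record here is made up.
TRIALS = [
    {"id": "T1", "study_type": "Interventional", "sponsor_type": "Industry", "enrollment": 41},
    {"id": "T2", "study_type": "Interventional", "sponsor_type": "Other", "enrollment": 1200},
    {"id": "T3", "study_type": "Observational", "sponsor_type": "Industry", "enrollment": 40},
    {"id": "T4", "study_type": "Interventional", "sponsor_type": "Industry", "enrollment": None},
]


def in_range(value: Optional[int], bounds: Sequence[Optional[int]]) -> bool:
    """Check value against (lower, upper); a bound of None is not applied."""
    lower, upper = bounds
    if value is None:
        return False  # records with no recorded value never match a range
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


def query(trials, study_type=None, sponsor_type=None, enrollment_range=None):
    """Return the trials that satisfy every filter that was supplied."""
    result = []
    for trial in trials:
        if study_type is not None and trial["study_type"] != study_type:
            continue
        if sponsor_type is not None and trial["sponsor_type"] != sponsor_type:
            continue
        if enrollment_range is not None and not in_range(trial["enrollment"], enrollment_range):
            continue
        result.append(trial)
    return result


def ids(trials):
    """Helper for display: just the identifiers of the matching trials."""
    return [trial["id"] for trial in trials]
```

In this sketch, `query(TRIALS, enrollment_range=(40, 42))` keeps the trials with 40 to 42 participants, while `query(TRIALS, enrollment_range=(1000, None))` applies only the lower bound.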
Setting one of the bounds to missing searches on only the other bound, so this would return all studies with 1000 patients or more. A similar interface exists for dates, where we provide the dates in ISO format.

The third way that we can query the database is using keywords. Here we'll search for the keywords "lung cancer" in order to return all of the studies which use the phrase "lung cancer" in their description field. By default this is case insensitive, but we can make it case sensitive by setting a flag. We can also pass multiple keywords at once. By default the query takes the union of those search results, but we can set a flag to take their intersection instead.

We can also put many of these together if we would like. One way is in a single query call, so with this we get all of the cancer studies with a certain enrollment range and a certain date range. Or we can use the ctgov_query function in a pipe. If we pipe a data set into the query function, then rather than working off the entire data set it will work off of something we've already queried. Through user testing we found this is actually a popular way of using the package: slowly paring down the set of studies we're looking at, one step at a time, rather than writing one single query. And by pasting together these results we can get more complex types of queries than we can get from just a single call.

I'll finish up by showing some of the tools in the package for doing text analysis on the results of the query function. The first of these text analysis tools is keywords in context.
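Before the text tools, the keyword search and the step-by-step narrowing described above can be sketched as follows. This is a hedged Python illustration, not the package's implementation or its R interface: `keyword_query`, the `ignore_case` and `match_all` flag names, and the three descriptions are assumptions made up for the example.

```python
# Hypothetical records; the descriptions are invented for illustration.
TRIALS = [
    {"id": "T1", "description": "A phase 2 study of Lung Cancer treatment."},
    {"id": "T2", "description": "Screening for colon cancer in adults."},
    {"id": "T3", "description": "Lung function in asthma patients."},
]


def matches(text, keywords, ignore_case=True, match_all=False):
    """True if the text contains any keyword (or every keyword when match_all is set)."""
    haystack = text.lower() if ignore_case else text
    needles = [k.lower() if ignore_case else k for k in keywords]
    hits = [needle in haystack for needle in needles]
    return all(hits) if match_all else any(hits)


def keyword_query(trials, keywords, ignore_case=True, match_all=False):
    """Keep only the trials whose description matches the keywords."""
    return [
        trial for trial in trials
        if matches(trial["description"], keywords, ignore_case, match_all)
    ]


def ids(trials):
    """Helper for display: just the identifiers of the matching trials."""
    return [trial["id"] for trial in trials]


# Narrowing step by step: the second call only sees the output of the first,
# which mirrors piping one query result into the next.
cancer_trials = keyword_query(TRIALS, ["cancer"])
colon_cancer_trials = keyword_query(cancer_trials, ["colon"])
```

Because each call takes a list of trials and returns a list of trials, the calls compose in the same way as the piped queries in the talk.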
The keywords-in-context function takes a character vector of text and shows us all occurrences of a certain term. Here, for example, we can see all occurrences of the word "bladder" in the brief titles of the interventional studies from the sample data set. We can also attach an ID if we want to link each occurrence back to the original study.

The next one is TF-IDF. This uses a computational technique to automatically generate keywords for each trial. Here, for example, we'll pass it the descriptions, and it will figure out which words are most specific to each trial and then generate a set of keywords for it. We can choose to respect case if we want, and we've actually found this is often more readable than ignoring case. This is a very fast way of looking at the text fields without having to read them in full; we can summarize them and visually inspect which ones we want to look at more closely. We can also set parameters to get rid of particularly rare terms. A drug that only appears once is maybe something we don't care about as much, and we want to look at terms that occur in multiple studies. There are a number of different parameters to control this based on what you're looking for.

Finally, we have a technique for document similarity. This takes a set of text fields and tells us which ones are most like the others. It's particularly useful if we have a study that we're interested in: we can find all the other studies which, based on the text in their fields, tend to be the most similar, and then we can visually inspect them and find other trials that might be interesting to us based on their description, their design, or their interventions.

So that's a very brief introduction to the ctrialsgov package. Remember, you can download it from our GitHub website.
And if you have any questions or comments we welcome them through GitHub issues or directly.
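
For readers who want a concrete picture of the three text tools mentioned in the talk (keywords in context, TF-IDF keywords, and document similarity), here is a minimal, self-contained sketch in Python. It is only an illustration of the general techniques, assuming a plain term-count times log inverse-document-frequency weighting and cosine similarity; it is not the package's implementation, and every function name, parameter, and title below is hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical study titles and identifiers, made up for this sketch.
IDS = ["T1", "T2", "T3", "T4"]
TITLES = [
    "Bladder cancer immunotherapy trial",
    "Bladder cancer chemotherapy trial",
    "Lung cancer screening study",
    "Overactive bladder drug study",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (case is ignored here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def kwic(term, texts, ids, width=20):
    """One line per occurrence of term, with up to `width` characters of context per side."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    lines = []
    for doc_id, text in zip(ids, texts):
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(f"[{doc_id}] {left:>{width}}|{m.group(0)}|{right}")
    return lines


def tfidf_vectors(texts, min_df=1):
    """Sparse TF-IDF vectors; terms found in fewer than min_df documents are dropped."""
    docs = [tokenize(t) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: count * math.log(n_docs / doc_freq[tok])
         for tok, count in Counter(doc).items() if doc_freq[tok] >= min_df}
        for doc in docs
    ]


def tfidf_keywords(texts, top_n=2, min_df=1):
    """The top_n highest-weighted terms per document (ties broken alphabetically)."""
    return [
        sorted(vec, key=lambda tok: (-vec[tok], tok))[:top_n]
        for vec in tfidf_vectors(texts, min_df)
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(index, texts, ids):
    """Identifiers of the other documents, ordered from most to least similar."""
    vectors = tfidf_vectors(texts)
    others = [i for i in range(len(texts)) if i != index]
    others.sort(key=lambda i: -cosine(vectors[index], vectors[i]))
    return [ids[i] for i in others]
```

In this sketch, `kwic("bladder", TITLES, IDS)` lists each title containing the term alongside its identifier, `tfidf_keywords(TITLES)` picks the terms most specific to each title, and raising `min_df` drops terms that appear in only a few documents, which corresponds to filtering out rare terms as described in the talk.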