Hey friends. A while back over on Twitter, a good friend of mine, Dr. Mark Martin of the University of Puget Sound, asked this question: how can I make a plot of the number of times a particular word is used in the literature? Now, I have used this type of plot before in talks and in grant proposals to try to justify that the area I'm interested in is relevant. And what is the term that I usually use? Well, microbiome, right? As you can imagine, that term has grown exponentially in the literature over the past 20 or 30 years. In this episode, I am going to show you two different ways that we can make that plot: one using PubMed directly, and another using a package in R that accesses the data that's in PubMed.
So let's start over here in PubMed. If you go to the pubmed.gov website, it'll look something like this. It gives you a box where you can plug in your search term, so I'm going to go ahead and put in microbiome. What we find is that, of the papers that are indexed in PubMed, there are 116,375 that use the word microbiome in the title, the abstract, the keywords, or whatever other information PubMed uses in its search engine. If you look over on the left side of the screen, you'll see this handy-dandy plot, where you can see that the number of publications that use microbiome is clearly growing, and you can hover over the plot to get a tooltip. Here you see that in 2018 there were 13,261 papers published with microbiome in them.
Well, what I would like to do is get the data that's behind this plot. The easy way to do that is to click on this download button, and you'll see that this downloads a PubMed timeline results by year CSV file. And then I've got this literature analysis directory that is synced with GitHub.
So if you want to get a copy of this file, as well as all of the code that I'm working with, go down below in the description, where there's a link to a blog post. As you know, I've also got a video that'll show you how to get caught up. This is going to be the first episode in this new series of videos.
So again, if we look in data, we've got this PubMed timeline results by year CSV file. Over in RStudio, I've already got library(tidyverse) in my file called npubsplot.r. I could of course do read_csv, because it's a CSV, right? And then we'll give it data, forward slash, and the name of the PubMed timeline results by year CSV file. What I'm finding is that it isn't parsing the file properly. It's a comma-separated values file, so read_csv should be splitting the two values into separate columns, and it's not. I think there's some type of header at the top of the file. So let's go ahead and open the file up to view it. And yeah, sure enough, we've got the search query as the first row.
So if I go to help and search for read_csv, I expect to find an argument called skip. Yeah. I can set skip = 1 to skip the first row of the file. And sure enough, now that reads in the data, going from 2022 back quite a ways. Let's see how far back it goes. We can go ahead and do tail, and we see it goes back to 1956, where there was one publication.
So let's go back to the website and use the slider to narrow down to 1956 to 1958. And sure enough, those papers use microbiota. So I think PubMed's search engine has gotten smart enough to know that microbiome and microbiota can be used more or less interchangeably in these searches. That's kind of cool, right? I encourage you to go back and check out these papers. I was looking through them earlier and they're pretty cool.
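
Here is a minimal sketch of the import step just described. To keep it self-contained, it writes a tiny stand-in file instead of reading the real download, which is assumed to sit at something like data/PubMed_Timeline_Results_by_Year.csv. The column names Year and Count and the exact text of the first row are assumptions, and the only two counts included are the ones quoted in this episode.

```r
library(readr)

# Stand-in for the downloaded file. The real file has one row per year;
# the header text and column names below are assumptions.
csv_file <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Search query: microbiome", # extra first row that confuses read_csv()
    "Year,Count",
    "2018,13261",
    "1956,1"
  ),
  csv_file
)

# skip = 1 drops the search-query row so the real header is parsed
timeline <- read_csv(csv_file, skip = 1, show_col_types = FALSE)

# tail() shows the last rows, which are the earliest years in this file
tail(timeline)
```

With the real download, you would swap csv_file for the path to your own file and leave skip = 1 in place.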
Just because of the way they're written, nothing's changed. It's pretty cool. Go check them out.
We could think about making a plot from this. So we could do ggplot with aes, x equals year, y equals count, and then add a geom_line. This gives us a plot of the number of papers published and indexed in PubMed that use the word microbiome or microbiota, okay? So this is kind of a two-step process: you think about your search term, plug it into the website, and download the CSV, and then you read the CSV into R. There's a bit of manual work going between the website, the browser, the data, and R.
What if we want to do this all without having to go to PubMed? Well, thankfully, there's a really nice package called rentrez. Entrez is the tool that the NCBI website at NIH uses to search all of their different databases, and PubMed is one of those databases. So what I would like to do is use rentrez, which is the R package for accessing NCBI, to regenerate this plot and perhaps do a slightly more sophisticated analysis along the way. That's what we're going to dig into in the rest of this episode.
So we'll come over to the Packages tab and click Install. The package we want is R-E-N-T-R-E-Z, and this then installs the package. If I want to use it, I of course need to do library(rentrez). Very good. We now have access to all of the search tools that are part of the Entrez system for getting at all the great databases in NCBI.
If you did this search along with me and you look up at the URL, you'll notice that to the right of pubmed.ncbi.nlm.nih.gov there's a question mark, then term equals microbiome, and then filter equals years, period, 1956 to 1958. Again, this is because we had narrowed down to those years to see those first two papers that were indexed in PubMed. So this term equals microbiome is exactly what I typed into the search box.
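
To make the address bar a little less mysterious, here is a small sketch that rebuilds the URL described above from its parts. The variable names are made up for illustration; the pieces themselves (the term parameter and the years filter) are the ones read off the browser in this episode.

```r
# Pieces of the address from the narrowed-down search
base_url   <- "https://pubmed.ncbi.nlm.nih.gov/"
term       <- "microbiome"
first_year <- 1956
last_year  <- 1958

# Everything after the question mark is the query:
# term=... is the search box, filter=years.X-Y is the slider
search_url <- paste0(
  base_url,
  "?term=", URLencode(term, reserved = TRUE),
  "&filter=years.", first_year, "-", last_year
)

search_url
```

This is only the website's address. The rentrez functions talk to NCBI for you, so you never have to paste URLs together like this by hand.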
So what's going on up in the URL is that it's making use of the API, which is the programming interface between the browser and the database, the front end being the browser and the back end being the database. What rentrez allows us to do is basically skip the browser and use R as our front end. We're going to use rentrez to create these terms, send the query to the database, and get results back. It's a really nice package because it figures out all that jargon for us, so we don't have to make up our own URLs. We'll rely on rentrez to generate those URLs for us and get access to the NCBI API.
So coming back here to RStudio, let's start to use rentrez and learn a little bit more about its functions. We'll start with entrez_dbs. This is a function that tells us all of the different databases that we have access to through this API. The first one is pubmed, which is what I'm most interested in for today's episode. But know that you could search nucleotide, protein, SRA, and so on: any database that you can get to through the browser you can get to through the rentrez package.
The next command that I want to show you gives a summary of a database. We can do entrez_db_summary with db = "pubmed". This gives us the database name, the menu name, the description, the build, the count, which is basically the number of papers that are in it, and the last time it was updated.
Next, I might want to know which fields I can search on within the PubMed database. To figure those out, I can do entrez_db_searchable, again with db = "pubmed". This gives us all of the different fields that we can search on in PubMed, right? I can search across all of the fields. I can search on the unique identifiers, the PubMed IDs.
I can give author names, journal abbreviations, all sorts of different things. I can search on the date of publication. So there are just a lot of powerful tools.
Normally when I do a PubMed search, I do it like I did on the website: I type microbiome into the search box, and that basically searches all of the different fields. To simulate that, we can do entrez_search with db = "pubmed" and term = "microbiome". This tells me that there were 114,664 hits. The output object contains 20 IDs, so it only returned the 20 most recent or most relevant IDs from the search. It also tells me that the search term it actually used was a combination of microbiota and microbiome.
I can get a better sense of this if I save the result as, say, s for search, and then do glimpse on s. We've seen in previous episodes that we can use glimpse or str to look at the overall structure of a list object returned from a function. We can see that the 20 IDs are stored in an element of the list called ids. We can get the count, so I could do s$count to get 114,664. The maximum number returned, retmax, was 20, and you can change this to return more or fewer values. Then there's the query that was actually used in the search: microbiota as a MeSH term, or microbiota in all fields, or microbiome in all fields. So conveniently, it threw all of those together for me. Finally, there's an element called file that holds this information as an XML document. I'm not really interested in that.
Really, for today's episode, what I'm most interested in is this count value, and it's nice to know that the search returns it for me. So let's play around with this search term a little bit more before we go to our specific question of counting the papers that use the term microbiome by year. Let's go ahead and grab this entrez_search call.
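
Before changing the search term, here is a rough sketch of the rentrez calls walked through so far. The two wrapper functions, pubmed_overview and search_pubmed, are hypothetical names added for this example so that nothing contacts NCBI until you call them; the rentrez functions inside them are the ones used in the episode.

```r
library(rentrez)

# Hypothetical wrapper: what can we reach, and what can we search on?
pubmed_overview <- function() {
  list(
    databases  = entrez_dbs(),                        # all NCBI databases
    summary    = entrez_db_summary(db = "pubmed"),    # description, build, count, last update
    searchable = entrez_db_searchable(db = "pubmed")  # fields such as AUTH and PDAT
  )
}

# Hypothetical wrapper: run a search and keep the parts discussed above
search_pubmed <- function(term, retmax = 20) {
  s <- entrez_search(db = "pubmed", term = term, retmax = retmax)
  list(
    count = s$count,            # total number of hits
    ids   = s$ids,              # up to retmax PubMed IDs
    query = s$QueryTranslation  # how PubMed expanded the term
  )
}
```

With a network connection, search_pubmed("microbiome")$count should give a number in the neighborhood of the one quoted above, although it will have grown since the episode was recorded.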
And in term, instead of microbiome, let's go ahead and put in Schloss PD. That's me. I can limit the search to papers that I have co-authored by adding AU in square brackets, or I could use AUTH, which I think was the actual field name. When we go back up to the output from entrez_db_searchable, yeah, we had AUTH there. If I do that search, we see that I have 123 papers, and again it's giving me the 20 most recent ones. Again, I could do $count to get that 123 out as a value.
Let's go ahead and copy this again, and let's look at the papers that I've authored where I used microbiome. I can use AND and then microbiome. This is going to take those 123 papers that I authored and the 116,000 or whatever microbiome papers, and look at the overlap, the intersection. We see that there were 62 papers in that intersection.
To build a bit more complexity into our search, I'm going to copy this again, and let's look at the papers that either I or my colleague Vince Young authored. So we'll do OR, Young VB, with AUTH in square brackets. We see there are 320 papers where either of us was a co-author. And as we saw in the previous search, we could use AND to look at the intersection, and we see there are 15 papers that we co-authored, right?
We could make this even more complicated by asking which of these 15 had the word microbiome in them. What I can do is wrap Schloss and Young in parentheses and then say AND microbiome. We see there are 13 hits. So there are two papers where we were co-authors but we didn't use the word microbiome, and I'm not really sure what those two papers would be. So what we could do is say NOT microbiome, and we see there are two hits, right? And so then I could say, well, give me those IDs.
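
Here is a sketch of the sequence of searches just described, written out as query strings. The counts in the comments are the ones reported in this episode and will have changed since. The helper count_each is a hypothetical name; it simply runs entrez_search on each string and needs a network connection.

```r
library(rentrez)

# Parentheses group the two-author intersection before AND / NOT is applied
both_authors <- "(Schloss PD[AUTH] AND Young VB[AUTH])"

queries <- c(
  schloss             = "Schloss PD[AUTH]",                    # 123 in the episode
  schloss_microbiome  = "Schloss PD[AUTH] AND microbiome",     # 62
  either_author       = "Schloss PD[AUTH] OR Young VB[AUTH]",  # 320
  both_authors        = "Schloss PD[AUTH] AND Young VB[AUTH]", # 15
  both_microbiome     = paste(both_authors, "AND microbiome"), # 13
  both_not_microbiome = paste(both_authors, "NOT microbiome")  # 2
)

# Hypothetical helper: number of PubMed hits for each query string
count_each <- function(queries) {
  vapply(
    queries,
    function(q) entrez_search(db = "pubmed", term = q)$count,
    numeric(1)
  )
}
```

Calling count_each(queries) returns a named vector, which makes it easy to see how AND, OR, and NOT change the size of the result.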
And so I might want to figure out, well, what are those papers? Let's go ahead and call these IDs non_microbiome. Then I could do entrez_fetch with db = "pubmed" and id = non_microbiome. I also need to specify the return type, so I'll add rettype and say "abstract". This gives me a lot of output to the screen, right? The first one I can see is "Decade-long bacterial community dynamics in cystic fibrosis airways". So this was a microbiome paper, even if we didn't use the word microbiome or microbiota in it. So we see that the search isn't perfect. The second one is down here; it seems like in abstract mode it has concatenated all the records together. This is, again, another cystic fibrosis microbiome paper, even though we didn't use microbiome in the paper.
I think this highlights the need to use, in your paper, the words that you expect people to search for your paper with. We would expect someone searching for cystic fibrosis microbiome papers to find these two, and for whatever reason, we forgot to use microbiome in them.
But again, what I want to do is go back to my good friend Mark's question: how can we build a plot in R using search results from PubMed? What I effectively want to do is take this entrez_search, where I'm using the pubmed database and the term microbiome, and add in a specific year. So I might say AND 2020 and then PDAT in square brackets. This tells me that there are 21,285 hits from the year 2020. If I want to double-check that, I could pipe this to glimpse to make sure that the search term was correct, and sure enough I can see 2020 with PDAT. I could always copy and paste this up into the PubMed website to make sure everything works well, but I'm reasonably confident it will.
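
The two steps above, fetching abstracts and restricting a search to one year, can be sketched roughly like this. All three helper names are hypothetical and exist only for this example; the query format with AND and [PDAT] is my reading of the term described in the episode.

```r
library(rentrez)

# Hypothetical helper: abstracts for a vector of PubMed IDs, as one string
fetch_abstracts <- function(ids) {
  entrez_fetch(db = "pubmed", id = ids, rettype = "abstract")
}

# Hypothetical helper: restrict a search term to a single publication year
year_term <- function(term, year) {
  sprintf("%s AND %d[PDAT]", term, year)
}

# Hypothetical helper: number of hits for a term in one year (needs network)
count_in_year <- function(term, year) {
  entrez_search(db = "pubmed", term = year_term(term, year))$count
}

year_term("microbiome", 2020)
```

To read the two non-microbiome papers, you would pass the ids element from the NOT microbiome search to fetch_abstracts and wrap the result in cat() so the line breaks print properly.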
And then I could add $count to pull out that count value, because that's really all I need to be able to plot the data. So what I want to do is basically make about 70 different copies of this, one per year. I could write a for loop to iterate over the years. But we know about the map functions, so let's go ahead and use those and keep everything within the tidyverse.
To build out the 70 or so different search terms, I'm going to use a function called glue, which comes to us from the glue package. glue is automatically installed with the tidyverse, so if I do library(glue), I will have all the great functionality from glue available. To get started, I will create a vector called year, which will run from 1950 to 2022, so through this year. And then I will make the search terms. I'll call this ubiome_search, in case I want to search for something else along the way. I'm going to grab the search term, pop it in as the argument to my glue function, and instead of 2020, I'm going to put a pair of curly braces with year inside them. So now if I look at ubiome_search, I see that I've got all of these different great search terms, right?
Now what I can do is make a tibble. I'll do tibble with year = year and ubiome_search = ubiome_search, and so now I've got a data frame with the year and the search term. I can pipe this into a mutate to create a ubiome column. The ubiome column will hold the count, which I get by mapping over each value of ubiome_search and passing it as the argument to entrez_search. I'm going to use map_dbl because I'm getting back a count value. I give it ubiome_search, then a comma, then a tilde and entrez_search with db = "pubmed".
And then term = .x, because the ubiome_search value sits in that .x argument slot. The output I want from this is the count. And I seem to have misspelled mutate. All right, we'll fix that and run it again. Very good. Now we see we've got our table, and yeah, the first publication was in 1956. I would like to save this as search_counts.
So now I can take search_counts and pipe it to ggplot. x will be year, y will be ubiome, and then we'll add geom_line, and we get our curve, right? We see it's very flat, and then starting at about 2000 the numbers go upwards over the past 22 years or so. There's not a lot going on between 1958 and 2000. So to help this go a little bit faster, I'm going to change my start year so the vector goes from 2000 to 2022.
Another question I might ask is, well, what does something like cancer look like over that period? So I'm going to add another search, which will be cancer_search. I'm going to copy this glue and, instead of microbiome, do cancer. We'll then add cancer_search = cancer_search to the tibble. And we will do the same thing in the mutate, but instead of ubiome, we want cancer and cancer_search. We'll look at search_counts, and we see we've got those different searches, and then we've got ubiome and cancer as our columns.
To plot this, let's do a select for year, ubiome, and cancer, and then pivot_longer on everything but the year column. This gives us year, name, and value. All right. So x is going to be year, y is going to be value, and then we will group by name (I don't want the quotes, so just name) and color by name. As we can see, there are far more cancer papers than microbiome papers. The other thing I see is that, wow, it looks like papers are just crashing for these two terms in 2022. Now, we're only, you know, five months into 2022.
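
Pulling the last few steps together, here is a sketch of the pipeline as I understand it from the narration: glue builds one search term per year, map_dbl collects the counts, and pivot_longer reshapes the table for plotting. The names get_search_counts and plot_search_counts are hypothetical wrappers added for this example so that the part that queries PubMed only runs when you call it.

```r
library(tidyverse)
library(glue)    # installed with the tidyverse, but loaded separately
library(rentrez)

year <- 2000:2022

# One search term per year for each topic; glue fills in {year}
search_terms <- tibble(
  year = year,
  ubiome_search = glue("microbiome AND {year}[PDAT]"),
  cancer_search = glue("cancer AND {year}[PDAT]")
)

# Hypothetical wrapper: one PubMed query per row and topic (needs network)
get_search_counts <- function(search_terms) {
  search_terms %>%
    mutate(
      ubiome = map_dbl(ubiome_search, ~entrez_search(db = "pubmed", term = .x)$count),
      cancer = map_dbl(cancer_search, ~entrez_search(db = "pubmed", term = .x)$count)
    )
}

# Hypothetical wrapper: long format, then one colored line per search term
plot_search_counts <- function(search_counts) {
  search_counts %>%
    select(year, ubiome, cancer) %>%
    pivot_longer(-year) %>%
    ggplot(aes(x = year, y = value, group = name, color = name)) +
    geom_line()
}
```

To draw the figure, you would run search_terms %>% get_search_counts() %>% plot_search_counts(). That sends 46 queries to NCBI, so expect it to take a little while.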
And PubMed probably doesn't have everything indexed all the way up to today. So let's go ahead and filter out 2022: we'll do filter with year not equal to 2022. Now we lose that fall-off, right? And we can see that as popular as the microbiome is, it ain't cancer.
All right. Something else you might think about is that the total number of papers being published is also increasing quite rapidly over time. So let's go ahead and add a third search term, which I'll call all_search. This is going to be a glue where I look for everything published in a year, so I'm not going to include a search word at all, just the year. We'll throw this into the tibble as all_search = all_search. Then in the mutate we'll also add all, and instead of cancer_search, we want this to be all_search. And I've got an M here instead of a comma. Then my search_counts has those three different searches, as well as the counts. Again, I will add all to my select, and let's see what they all look like together. We can see that, yeah, all is just constantly going up.
So we might ask: is microbiome growing faster than all? Is cancer growing faster than all? Or is the rise really being driven by the total number of papers being published? Maybe there are just a lot more papers being published today than there were 20 years ago, and it's nothing special about the microbiome. To look at that, let's go ahead and grab these first three lines of the pipeline, which will again give us our data for ubiome, cancer, and all for each of the years from 2000 to 2021. I'm mainly interested in ubiome, so let's do a mutate where rel_ubiome equals ubiome divided by all. Again, we get that column. Let's go ahead and multiply this by 100 so it's in percent terms. And then we can pipe this to ggplot with aes, x = year, y = rel_ubiome.
And again, we'll add geom_line. We can see that even though we've had huge increases in the number of papers being published, the number of microbiome papers has been rising far faster than we would expect from that alone, right? We could put this on a log scale on the y-axis. I don't want a pipe here, I want a plus sign, so I can add scale_y_log10. We can see that it's rising fairly linearly on a log scale, which tells you that it's actually growing exponentially. So relative to all other papers, papers with microbiome in them are growing exponentially.
That's pretty cool, because it tells us that yes, the microbiome field is very popular and there are a lot of papers being published in it. I don't know if you've noticed, but I find it impossible, frankly, to keep track of all the papers in the microbiome literature, and if you're in this field, I'm sure you do too. That makes it really appealing to put a plot like this into a grant proposal or a presentation where you're trying to justify the relevancy of studying the human microbiome. You can say: this is a really popular area, we're in an up-and-coming area of research, you should be too, and you should be throwing buckets of money at us to study this more.
Anyway, you could again go ahead and clean this up a bit and make it look more attractive. I will leave it to you as an exercise to add the cancer line to this. Is cancer growing faster than microbiome? You tell me. Tell me what you find down below in the comments and I will tell you whether or not you're correct.
All right. Well, I hope you found this interesting, and I hope my good friend Mark can get something out of this to make a plot for his own search terms. I think this is a really powerful tool to show people the relevancy of an area of research.
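
The normalization and log-scale steps can be sketched like this. It assumes you already have a search_counts table with year, ubiome, and all columns, built the same way as the microbiome and cancer counts. The all_search term with only the year and [PDAT] is my reading of "everything published in a year", and add_relative and plot_relative are hypothetical helper names.

```r
library(tidyverse)
library(glue)

year <- 2000:2022

# Search term with no keyword: everything PubMed indexed for that year
all_search <- glue("{year}[PDAT]")

# Hypothetical helper: drop the incomplete year, then express the
# microbiome count as a percentage of all papers from the same year
add_relative <- function(search_counts, incomplete_year = 2022) {
  search_counts %>%
    filter(year != incomplete_year) %>%
    mutate(rel_ubiome = 100 * ubiome / all)
}

# Hypothetical helper: a straight line on a log10 y-axis
# is the signature of exponential growth
plot_relative <- function(relative_counts) {
  relative_counts %>%
    ggplot(aes(x = year, y = rel_ubiome)) +
    geom_line() +
    scale_y_log10()
}
```

For the exercise, one way to start would be to add a rel_cancer column in the same mutate and then use pivot_longer on the two relative columns before plotting.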
As always, it's a bit challenging to pick the right search terms, so I would encourage you to go back and forth between the browser approach and using R like this. Certainly the browser has its advantages for testing things out, but if you're trying to automate things, doing it in an R script like this is clearly the way to go. All right. Practice with this. Tell your friends about what we're doing, and I'll see you next time for another episode of Code Club.