Today we have two presenters. Dr. Grey Nearing is an assistant professor in Geological Sciences here at UA and is currently serving a temporary appointment at Google as a visiting faculty researcher. We also have Mashrekur Rahman, a PhD student working with Dr. Nearing. What they're going to talk about today is some really fascinating research they've been doing, which is essentially focused on mapping the literature of hydrology. So without any further introduction, I'll just turn it over to one of you.

Okay, I'll be the one to present, so I'm just going to start sharing my screen. Can everyone see it all right? Yeah, looks good. Awesome. Grey, do you want to say something before I start? Yeah, I'll just say a few words. Mash is going to present; Mash is my PhD student, he's been here since 2018, so he's just started his third year. Mash has really taken over this project, and he's going to tell you all about it, so I'll just turn it over to you, Mash.

Okay, so today we're going to talk about leveraging data science toward a deeper understanding of scholarly literature. This is the project I've been working on recently: topic modeling of the hydrology literature. Before I get started, I want to thank the NASA Advanced Information Systems Technology program for funding this work, and the Department of Geological Sciences and the University of Alabama Libraries for their support. So without further ado, let's get started.

I'll start with the background. Let's look at the problem we're facing nowadays. There is a huge amount of text data available to us, and we want to make the best use of it: to put the knowledge embedded in that textual data to work supporting policymakers and scientists. And since our work relates to hydrology, it's worth noting that there is currently a drive for hydrologists to advocate for their own work, so we want to equip everyone with that knowledge and with tools for their work. Consider the number of papers published in hydrology between 1991 and 2019 in six major journals: Hydrology and Earth System Sciences, Hydrological Processes, Hydrological Sciences Journal, Journal of Hydrology, Journal of Hydrometeorology, and Water Resources Research. By sheer numbers, those six journals published about 42,154 articles between 1991 and 2019. For anyone who wants to get into this field, or to synthesize this material, that volume is a nightmare. Ideally we want to make use of all this textual data, but we want to do it faster, and in a way that is interactive and comprehensible to everyone.

Our work lies at the intersection of natural language processing and hydrology, so let's take a quick look at what natural language processing is. If you consider computational linguistics and artificial intelligence, natural language processing lies somewhere in the middle, and it divides into natural language understanding and natural language generation. Natural language understanding, in turn, intersects with machine learning and deep learning, and that's the part we're interested in.

A quick history of natural language processing: it started in the late 1940s with Claude Shannon, whose paper on entropy treats words as discrete sources of information that can be encoded, communicated over some medium, and decoded. In the 1950s came the IBM-Georgetown experiment, the first use of machine translation, which translated sixty Russian sentences into English. The late 1950s and 1960s brought Noam Chomsky's famous work on the structure of natural language. Development continued from the 1960s through the 1990s, and in the late 1990s we saw the advent of the first word embeddings. Word embeddings, also called word representations or word vectors, are statistical representations of words: human language converted to statistics. Around 2008 and into the early 2010s, natural language processing met multi-task learning, and we saw word2vec, then sequence-to-sequence models, then convolutional neural networks. Nowadays we use transformer models based on the attention paper, and very recently we have huge pre-trained models such as GPT-2 and Megatron, the large transformer model that NVIDIA trained at a cost of millions of dollars. In short, natural language processing is taking off.

If you consider the conventional applications of natural language processing, we're using it all day long: when we do Google searches, when we're texting someone, when we're talking to Alexa or some other AI. These applications include information retrieval; sentiment analysis, which deals with human emotion in text; information extraction; question answering, which covers chatbots and automatic answering systems; and machine translation, which translates between languages. For the project we're working on, we are interested in information retrieval and information extraction.

Moving on to the approaches adopted in natural language processing. There are distributional approaches, which make proper use of statistics; these are the machine learning methods that make use of big data. There are frame-based approaches, which are essentially stereotyped situations, as when clues provided within a sentence about a crime point to a particular frame. There are model-theoretical approaches, which take into account model theory and compositionality. And there are interactive learning approaches, a very interesting family in which the machine learns by interacting with humans. What we are working with is distributional approaches.
Moving on, let's look at data acquisition and preparation. We used two sources. The first is the Web of Science Core Collection: if you can access it, you can download abstracts and their corresponding metadata in the form of BibTeX (.bib) files, so the abstracts and metadata are pretty easy to acquire. The second source is the journal websites themselves, which is where the full text comes from. For the full-text data we used custom web scrapers that iterate through DOIs and download the full-text PDFs. Some of these journals require a click-through license, which you can acquire through the Crossref Text and Data Mining API: you accept the click-through license, obtain an API token, and embed the token in your web scraper so it can download the files iteratively. It's a very useful tool.
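To make that loop concrete, here is a minimal sketch in Python of the kind of scraper described above. It's a reconstruction, not the authors' actual code: the token value and DOI list are placeholders, the Crossref TDM click-through token is conventionally sent in a `CR-Clickthrough-Client-Token` header, and any real scraper must respect each publisher's rate limits and license terms.

```python
import time
import requests

CLICKTHROUGH_TOKEN = "YOUR-CROSSREF-TOKEN"  # placeholder; obtained via the Crossref TDM click-through service
DOIS = ["10.1029/2018WR022643"]             # placeholder DOI list, e.g. built from the Web of Science metadata

def fetch_fulltext_pdf(doi: str):
    """Resolve a DOI via the Crossref API and download its full-text PDF, if one is advertised."""
    meta = requests.get(f"https://api.crossref.org/works/{doi}").json()["message"]
    # Publishers that deposit full-text links expose them under the "link" field.
    for link in meta.get("link", []):
        if link.get("content-type") == "application/pdf":
            resp = requests.get(
                link["URL"],
                headers={"CR-Clickthrough-Client-Token": CLICKTHROUGH_TOKEN},
            )
            if resp.ok:
                return resp.content
    return None

for doi in DOIS:
    pdf = fetch_fulltext_pdf(doi)
    if pdf:
        with open(doi.replace("/", "_") + ".pdf", "wb") as f:
            f.write(pdf)
    time.sleep(1)  # be polite: throttle requests to stay within publisher limits
```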
Okay, moving on to the next part, which is topic modeling. Here's the interesting part. Topic modeling has been around for a while, and people have been using several different approaches: latent semantic analysis, latent Dirichlet allocation, lda2vec, and so on. You see topic modeling applied in many fields, including transportation research, social science, hydropower research, the history of German studies, and cloud computing, and we are doing the same kind of work, in a different way, for the hydrology literature. The screenshot here is from EarthArXiv; our paper is posted there as a preprint and is in review at Water Resources Research (WRR).

Let's go through the preprocessing routine. The precursor to any good topic model, or any model really, is good data: you have to feed it data in a canonical format to get good-quality topics. Here is a simple schematic of how we did it. After downloading the data, we converted it into text files, cleaned it by removing punctuation, nonsensical text, symbols, and stop words, then prepared it by tokenizing, building bigrams and trigrams, and lemmatizing, and finally converted it into the term frequency-inverse document frequency (TF-IDF) representation consumed by the LDA model inside a library called Gensim.

In more detail: you can use various libraries to do this, and people can choose different programming languages, but in our case we used Python, with Gensim, the Natural Language Toolkit (NLTK), and spaCy. spaCy is a much faster tagger, and you can use it to do lemmatization and to form bigrams and trigrams very efficiently. What you see here is us defining the cleaning and preparation functions in Python. We remove the stop words, which are small words like "in", "the", and "on" that don't actually contribute to the quality of the topics, so taking them out is the better option. Then we form bigrams: given words such as "climate" and "change", it's more intuitive to combine them into "climate change". Trigrams do the same for three words, combining "climate", "change", and "impacts" into "climate change impacts". You can actually keep going to four or five words, or even phrases, but there's a problem with that: with text data there is a high chance of data sparsity when you form these n-grams and phrases, so it's imperative to be careful. Finally, lemmatization prunes words back to their root: "going", "go", and "gone" all get slashed down to just "go", which removes variants that wouldn't contribute to a better model.
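A minimal sketch of that pipeline using the libraries named above. The two-document corpus and the `min_count` threshold are illustrative placeholders, not the values used in the actual study.

```python
import gensim
import spacy
from gensim import corpora, models
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

docs = ["Climate change impacts streamflow ...", "Snow cover and snow melt ..."]  # placeholder corpus

stop_words = set(stopwords.words("english"))
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # the tagger is enough for lemmas

# Clean and tokenize: lowercase, strip punctuation and symbols, drop stop words.
tokens = [
    [w for w in gensim.utils.simple_preprocess(doc) if w not in stop_words]
    for doc in docs
]

# Build bigrams ("climate change"), then trigrams ("climate change impacts").
bigram = models.Phrases(tokens, min_count=5)
trigram = models.Phrases(bigram[tokens], min_count=5)
tokens = [trigram[bigram[doc]] for doc in tokens]

# Lemmatize: "going"/"gone" -> "go". Phrase tokens like "climate_change" pass through unchanged.
tokens = [[t.lemma_ for t in nlp(" ".join(doc))] for doc in tokens]

# Map to Gensim's dictionary / bag-of-words, then to the TF-IDF representation.
dictionary = corpora.Dictionary(tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
tfidf_corpus = models.TfidfModel(bow_corpus)[bow_corpus]
```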
Okay, that's preprocessing; now let's move on to our model, latent Dirichlet allocation (LDA), which was first published by Blei et al. in 2003. What is the idea behind the LDA algorithm? Think about it like this. Consider this document from the genetics literature, about whether some genes contribute to the survival of a species over time. The idea behind LDA is that documents exhibit multiple topics. In this picture you can see words such as "computational", "numbers", "computer", "analysis", and "predictions" colored in blue, and they indicate data analysis and computer science. The words colored in yellow, such as "genome", "sequenced", "genetic", and "genes", pertain to genetics. And words such as "organisms", "survive", and "life" pertain to biology. Now imagine coloring all the words in the same way, throwing out words such as "in", "the", and "on", and squinting at this paper from a distance. You would say: I don't really know what this paper is about, but looking at the words, it seems to blend together data science, genetics, and biology. What latent Dirichlet allocation does is take this intuition and cast it as a formal probabilistic model of text.

We have a quick question in the chat: how did you go from a PDF to text? We used a PDF-to-text converter; there's a library available for that, and we ran it from the command line, probably through a shell script. What I'll do is take the rest of the questions after the presentation, if that's okay. Yeah, sure. Okay, cool.

So let's try to understand the intuition behind LDA a bit more. As we previously discussed, documents contain topics. Let's say we consider 100 topics. Each topic is a mixture of words, or in formal terms, a distribution over words within a fixed vocabulary. The best way to understand LDA is to understand the generative process it assumes. Each of these colors is a topic, and each topic has some words associated with it. The process iteratively picks one color, then picks a word associated with that color, and it keeps doing that for every word within the document; then it goes to the next document, and it keeps going and keeps going.

Now let's put it formally. Latent Dirichlet allocation, as we already discussed, assumes a generative model: it tries to model how the corpus was produced. The graphical-model representation here is an intuitive way to see this. We are considering three hidden random variables: the per-document topic distribution, the per-topic word distribution, and the per-word topic assignments. Alpha is the Dirichlet prior on the per-document topic distributions, beta is the Dirichlet prior on the per-topic word distributions, and z_{d,i} is the topic assignment for word i of document d. Ideally we want to infer the per-document topic distributions, the per-topic word distributions, and the per-word topic assignments, that is, their posterior expectations, and then we can use those posterior expectations for all sorts of analysis: information retrieval, information extraction, posterior analysis, a whole lot of things. That, in a nutshell, is LDA.
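Written out, the generative process just described is the standard (smoothed) LDA model of Blei et al. (2003), in the talk's notation, with alpha and beta the two Dirichlet priors:

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{per-topic word distribution, } k = 1,\dots,K\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{per-document topic distribution}\\
z_{d,i} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assignment for word } i \text{ of document } d\\
w_{d,i} &\sim \mathrm{Multinomial}(\phi_{z_{d,i}}) && \text{observed word}
\end{align*}
```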
Moving on: when we are training the model, we're looking to form good-quality topics, and for that we need evaluation metrics. There are fundamentally two different types, intrinsic and extrinsic. Among intrinsic evaluation metrics, a very popular one is topic coherence. It is a measure of the semantic similarity between the high-probability words in a topic, or put more simply, a measure of the quality of the topics: whether a word that LDA thinks belongs within a topic actually belongs there. Ideally we want the coherence to be as high as possible. The other metric is perplexity, which is basically a measure of how surprised the model is by the introduction of new data. A lower value indicates a better model, so we want perplexity to fall as the number of topics increases.

We want to find the optimal number of topics, and this is very important, because you don't want to overshoot or undershoot; you want to reach that sweet spot. The way to do it is to look at the coherence and perplexity scores across multiple topic counts. What we did is train the model at topic counts from zero up to 50 and observe how perplexity and coherence varied. That gave us a ballpark of 25 to 30 topics as optimal, and then we used an extrinsic evaluation metric, our own subjective perception, to settle on the optimal number of topics.

Let's have a quick look at model training. There are multiple Python libraries you can use for training LDA, the most popular of course being Gensim, which has both the default LdaModel and an LdaMulticore model. What LdaMulticore does is parallelize the training routine, so it's much faster; if you have the computational resources available, it's always better to use it. There are certain hyperparameters you can adjust to get good-quality topics, chunk size and passes being the most important. Chunk size is the number of documents the model works through per update, and the model updates the topic weights after each chunk, so you have to choose it carefully or the weights won't update properly. Similarly, you want to set the number of passes as high as you reasonably can. One might then think: if I tune these hyperparameters too high, if I set the passes and iterations too high, is there a chance of overfitting? There is very minimal risk of that with LDA, because it follows Bayesian statistics, so the risk of overfitting is much lower.
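A sketch of that topic-count sweep, reusing `tokens`, `dictionary`, and `bow_corpus` from the preprocessing sketch above. The grid and hyperparameter values are illustrative, not the study's settings.

```python
from gensim.models import CoherenceModel, LdaMulticore

scores = {}
for k in range(5, 51, 5):  # illustrative grid; the talk scanned counts up to 50
    lda = LdaMulticore(
        corpus=bow_corpus,  # the talk describes feeding TF-IDF instead; Gensim accepts either
        id2word=dictionary,
        num_topics=k,
        chunksize=2000,     # documents per weight update
        passes=10,          # full sweeps over the corpus
        workers=4,          # parallel worker processes
        random_state=42,
    )
    coherence = CoherenceModel(
        model=lda, texts=tokens, dictionary=dictionary, coherence="c_v"
    ).get_coherence()
    # log_perplexity returns a per-word likelihood bound; lower perplexity is better.
    perplexity = 2 ** (-lda.log_perplexity(bow_corpus))
    scores[k] = (coherence, perplexity)
    print(f"{k} topics: coherence={coherence:.3f}, perplexity={perplexity:.1f}")
```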
Okay, now let's look at the topics that were generated by our training routine. First of all, we need to identify those topics. We used a Python library known as wordcloud; these agglomerations of words show the most likely words appearing within each topic. "Model" seems to be the most popular word within this topic, so it's drawn larger; if you consider snow hydrology, you can see the word "snow" appearing in that topic much more frequently, along with "cover" and "snowmelt". The larger words are the ones most likely to appear in the topic.

We identified the topics in multiple ways. We did a subjective analysis: we sent the topics over to experts, hydrologists in the United States and all over the world, and their input helped us identify the topics. We also looked at the trends of the topics, i.e., how they vary over time. In the six journals in question, you can see topics such as precipitation variability and extremes, climate change, water management, and precipitation observation increasing in popularity over time. Conversely, soil moisture is decreasing in popularity, as are statistical hydrology, sediment and erosion, hydrogeology, and subsurface flow and transport. Several of the declining topics are subsurface topics. For those not in hydrology: as hydrologists we think in terms of surface hydrology and subsurface hydrology. Water that flows over the surface of the ground is surface hydrology, and water that flows below the surface, which includes groundwater, is subsurface hydrology.

Here is an example of the data we sent over to the experts for identifying the topics. As you can see, each word within a topic has a relative strength of relationship with the topic, and we also backtracked into our corpus, the corpus being our dataset; "corpus" is the natural language processing term for a set of text documents. By the way, the plural of corpus is not "corpi", it's "corpora"; it's Latin, so if you're saying "corpi", you didn't study Latin in high school. These are the papers associated with the topics. We sent this over, got back expert feedback, and incorporated that feedback into our findings.

There are also interactive exploration tools such as pyLDAvis. What a pyLDAvis visualization does is take this high-dimensional topic problem and project it into a two-dimensional intertopic distance map. Each of the circles is a topic; you can browse over the circles and look at the words within each topic, and you can adjust a little slider to change the relevance metric. With a higher relevance metric, the words more likely to appear within the topic rise to the top, and the converse is true as well.
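For reference, a sketch of both visualizations from a trained model `lda` as above. The module name `pyLDAvis.gensim_models` is current in recent pyLDAvis releases (older releases used `pyLDAvis.gensim`).

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from wordcloud import WordCloud

# Word cloud for one topic: word sizes follow the topic's word probabilities.
topic_words = dict(lda.show_topic(0, topn=30))
WordCloud(background_color="white").fit_words(topic_words).to_file("topic_0.png")

# Intertopic distance map with the relevance-metric slider, saved as a web page.
vis = gensimvis.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")
```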
Okay, moving on to the next section: the results and the analysis. This is interesting. Let's look at the evolution of topics. We wanted to understand for ourselves, and also give other hydrologists and stakeholders in this area an idea of, how the topics are segregated by the model as the number of topics increases. So we trained the model at different topic counts: two topics on the left here, then five topics, ten, fifteen, and twenty-five, twenty-five being our optimal number of topics. When we train the model on two topics, you can see a clear distinction between surface and subsurface hydrology; the two are cleanly segregated by the model. Then surface hydrology splits into modeling, terrestrial hydrology, and climate change. Notice that climate change does not have a subsurface branch merging into it. You can also see modeling containing both surface and subsurface processes. For those not in hydrology, modeling is, in a very crude sense, computer simulation of natural processes. We can see other nuances happening: climate change splits into forecasting, extreme events, and soil moisture, and these nuances keep increasing. But we also see merging. Hydraulic modeling and flow and transport, which are basically subsurface processes, combine into flow, transport, and modeling, and you can see the same for uncertainty research, where subsurface uncertainty research and surface uncertainty research merge. This gives us an intuition about how the model "thinks about" the topics, and it helps us understand them better.

Moving on to the next analysis, and this is a very interesting one: inter-topic correlation. We consider two topics at a time across the entire dataset, and we want to understand how likely the two topics are to appear together within the corpus. The left diagram here shows positive correlations, and the one on the right shows negative correlations. You can see things that make sense: hydrogeology and subsurface flow and transport appear together quite a bit, and the same goes for groundwater, hydrogeochemistry, and water quality, because these are all subsurface processes. Very interestingly, you can also see a strong relationship between climate change and human interventions, and between climate change and precipitation variability, since a lot of precipitation research goes toward climate change research as well. The two modeling topics, modeling and forecasting and modeling and calibration, are connected with uncertainty and sit together, which shows there is scope for modeling, calibration, and forecasting to be applied in other disciplines. You can also see some correlation between uncertainty and subsurface flow, because groundwater researchers and hydrogeologists use subsurface uncertainty methods quite a bit.
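The talk doesn't spell out how these correlations were computed; one plausible reconstruction is to build the document-topic matrix from the LDA posterior and correlate topic columns across documents, as in this sketch (the topic indices at the end are illustrative):

```python
import numpy as np

# Document-topic matrix: one row per document, one column per topic.
theta = np.array([
    [p for _, p in lda.get_document_topics(doc, minimum_probability=0.0)]
    for doc in bow_corpus
])

# Pairwise Pearson correlation between topic proportions across documents:
# positive values mean two topics tend to co-occur in the same papers.
topic_corr = np.corrcoef(theta.T)
k1, k2 = 3, 7  # illustrative topic indices
print(f"corr(topic {k1}, topic {k2}) = {topic_corr[k1, k2]:.2f}")
```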
In the negative correlations you can see a lot of pairs that are negatively correlated, and again groundwater, hydrogeology, and subsurface flow and transport are interconnected. What this analysis does is help us see which topics are communicating with each other more, compared with the topics that are not, which is some interesting insight.

Okay, moving on to the next analysis, which is journal diversity. Before I start explaining it, I want to quickly go over entropy. This is not Boltzmann's entropy; we're considering Shannon's entropy. According to Shannon, improbable events always carry more information. So if we suddenly had an earthquake here in Alabama, that would be an improbable event, since there aren't many plate boundaries near Alabama, and it would probabilistically contain more information, and thus more entropy. We applied the same principle to understand whether the distributions of topics in different journals are similar. This is the Journal of Hydrometeorology, and it has some dominant topics, which is kind of obvious, because it's a hydrometeorology journal dealing with atmospheric science and precipitation; topics such as precipitation observation and precipitation variability and extremes dominate this journal. But that also means we can better predict which topics are going to appear in this journal, which reduces the amount of information, i.e., the entropy, for this journal. Conversely, for the journal Hydrological Processes you see a much more uniform distribution of topics; it's harder to predict which topic is going to appear in that dataset, so its entropy is higher. And you can see that Journal of Hydrology, Hydrological Processes, Hydrology and Earth System Sciences, and Hydrological Sciences Journal all have roughly the same entropy.

The next analysis is journal uniqueness. What we're trying to show here is the distance between journals: how related two journals are, whether they have similar or dissimilar topic distributions. This matrix shows the pairwise distances, where a darker shade of blue means a larger distance. If you consider the Journal of Hydrometeorology against the Journal of Hydrology, one being predominantly a hydrometeorology journal and the other a hydrology journal, you can see the distance between them is much higher, and the same holds against Water Resources Research. That means the distributions of topics within those journals are farther apart. This gives us an insight into how the topics are distributed within the journals.
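A sketch of both measures, assuming `journal_topic_dist` maps each journal to its average topic distribution (for example, the mean of the `theta` rows for that journal's papers; the four-topic vectors below are toy placeholders). Shannon entropy follows the talk; the Jensen-Shannon distance is one common choice of between-journal distance, since the talk doesn't name the exact metric used.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Placeholder per-journal topic distributions.
journal_topic_dist = {
    "J. Hydrometeorology": np.array([0.55, 0.30, 0.10, 0.05]),    # a few dominant topics
    "Hydrological Processes": np.array([0.25, 0.25, 0.25, 0.25]),  # near-uniform
}

# Shannon entropy: topic-dominated journals score low, uniform journals score high.
for name, dist in journal_topic_dist.items():
    print(f"{name}: H = {entropy(dist, base=2):.2f} bits")

# Journal "uniqueness" as a pairwise distance between topic distributions.
d = jensenshannon(
    journal_topic_dist["J. Hydrometeorology"],
    journal_topic_dist["Hydrological Processes"],
    base=2,
)
print(f"Jensen-Shannon distance = {d:.2f}")
```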
Okay, moving on to the next result or analysis: here we look at the temporal trend of each journal's distance to the entire corpus, and the term we use for this is uniqueness. Again, the Journal of Hydrometeorology is more unique compared with the other five journals, but you also see the uniqueness index decreasing over time. For some journals, such as Hydrological Sciences Journal, uniqueness is increasing, but for three others, Hydrology and Earth System Sciences, Water Resources Research, and the Journal of Hydrology, uniqueness is decreasing, which means the topic distributions within them are becoming more uniform. That brings an end to the results and analysis section.

Now we move on to the application section. We are working on building an interactive web application for exploration of the hydrology literature, and we've named it HydroMind; this is the logo I designed very recently for the tool. Let's look at the features of HydroMind we are considering. It will have an interactive web-based interface, so anyone can access it over the internet and search and explore the knowledge in hydrology. The literature will be represented through coherent networks in two-dimensional space. The user will be able to do journal-, time-, and topic-based exploration, and each document will carry some auxiliary/ancillary information to help the user navigate. It will also have a section for user feedback, and we will incorporate that feedback into our research, which will ensure the modular nature of our tool.

As for the framework of this web application: what HydroMind does is accept information from the user, retrieve information from a database, create, update, and delete information in the database, and display the information back to the user. The front end is powered by HTML, CSS, and Bootstrap files. The back end of the program is in Flask, which is our framework for the web application; some people prefer Django, but I found Flask to be the more conducive framework for building this application. At a very high level, all the data from our topic modeling research is stored in the database through SQLAlchemy, which with Flask is basically our SQL layer, and there is interconnectivity between the Python/Flask back end and the front end that the user sees and interacts with. That's the high-level overview of HydroMind.
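A minimal sketch of that architecture with Flask and Flask-SQLAlchemy. The `Paper` model and the `/papers` route are illustrative stand-ins, not HydroMind's actual schema or endpoints.

```python
from flask import Flask, jsonify, request
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///hydromind.db"  # placeholder database
db = SQLAlchemy(app)

class Paper(db.Model):
    # Illustrative schema: one row per paper, tagged with its dominant topic.
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String, nullable=False)
    journal = db.Column(db.String)
    year = db.Column(db.Integer)
    topic = db.Column(db.String)

@app.route("/papers")
def papers():
    # Journal/time/topic exploration: /papers?journal=WRR&year=2015&topic=snow+hydrology
    q = Paper.query
    for field in ("journal", "year", "topic"):
        if field in request.args:
            q = q.filter_by(**{field: request.args[field]})
    return jsonify([{"title": p.title, "year": p.year} for p in q.all()])

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run(debug=True)
```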
Okay, let's look at the visualization engine we have chosen for this web application: d3-force, a library in JavaScript. What it does is really interesting, because it mimics Newtonian physics. Before I go into depth, I just want to quickly say that each of these nodes is a paper, each color is a major topical theme, and the distance between nodes reflects similarity: a larger distance means less similar papers, and closer nodes mean more similar papers. I'll get to the actual visualization in a bit, but first, how does the engine work? It assumes a constant unit time step and a constant unit mass, so the force acting on each node is equivalent to its acceleration over unit time. For those of us who remember high school physics, it uses force equals mass times acceleration; with the mass constant and the time step constant, the acceleration is defined entirely by the weight we put on the links between nodes, and that weight is a measure of similarity. For that weight we can use multiple methods: we can use only the probabilistic similarity that comes from the posterior expectations of our LDA model, or we can combine it with bibliographic coupling and co-citation analysis and explore whether that works better. But that's a question for a later date.
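Spelled out, that update rule is a standard statement of the velocity Verlet step under d3-force's unit-mass, unit-time-step assumptions (d3 additionally applies a velocity decay factor each tick, omitted here):

```latex
F = m\,a,\quad m = 1,\ \Delta t = 1
\;\Longrightarrow\;
v_{t+1} = v_t + F,\qquad x_{t+1} = x_t + v_{t+1}
```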
Let me close the presentation by quickly showing you the web application development in progress. What we're doing here is enabling the user to choose the journal, say Water Resources Research, the Journal of Hydrology, and so on, then the year they want to search by, and then the topic. After they click submit, the app generates this interactive visualization. Notice how it arranges itself, mimicking Newtonian physics. When you browse over each of these nodes, it shows you the paper being represented, and again, each color is a major topical theme. You can see a cluster here where all of these papers belong to one topical theme, and these other papers cluster together under another theme. You can also do a fun thing: just play around with it and watch how it rearranges itself. And whenever you click on one of the nodes, you'll get ancillary information about authors, co-authors, and other details we can extract from the metadata. Awesome. That brings an end to the presentation, and we can take questions.

Nice, Mash. Okay, cool. Should I stop screen sharing? Whatever you want to do; if you think you might need to refer back to it, it's fine to leave it up, but feel free to drop it. I'll just leave it up for the time being in case I want to refer back to it at some point.

Do we have any questions about the processes, or anything else Mash has been discussing? I'll go ahead, sorry. My name is Heather Templeton, I work in the College of Nursing at UA, and I appreciate you sharing this very interesting presentation. I got here late, and I apologize, I had a conflict, but do you have any publications we might be able to look at that share a little more about your process? Yes, we have it on EarthArXiv. I probably should have shared it before the meeting, but I'm going to forward it to Kevin, and Kevin, if you can, please send it over to the participants later. Yes, I can send out a follow-up email. We did record this session today, so we'll also be posting that somewhere, and I'll let everyone know and provide that link as well as the link to the publication. That paper is in review at Water Resources Research, but the preprint is uploaded on EarthArXiv, so you can access it and have a look, and if you have questions, feel free to email us; we'll be happy to answer them. Thank you.

We have another question in the chat: were there any comparisons done between how much more efficient the retrieval of good journal articles was using this, versus using advanced searches on the same topics through a library with subscriptions to all these journals? I think the question is: did you compare the information you were finding through proprietary routes versus more open-source routes? Am I getting that right, Margaret? Yeah. Okay, so we haven't done that analysis yet, but it's on the cards. Right now we are exploring; if I understood the question correctly, you're asking whether we compared the full-text articles versus the abstracts that we acquired? Well, I think what she's asking is more than that: did you see differences between the information you can access through the proprietary databases that UA provides through its libraries versus more open-access journals, or through something like Google Scholar, which won't necessarily connect with a library database; I guess it depends, if you're on campus it will. Yeah, we have not explored that yet, but it's a really good arena. What we are trying to do here is enable a contextual understanding of the thematic structure of the journals in question: what the topics are and how those topics vary over time. We want to be able to build more refined topics and segregate the data accordingly, and, at least in my mind, I want to be able to predict the topics that are going to trend, or be popular, in the future. That's from a strictly research perspective. I also want to build a sort of Library of Alexandria of hydrologic stressors. So those are the things we are looking at.
But we haven't had a chance to compare it with the tools that are already available.

You're muted. Yes, sorry, I thought I was unmuted and then I muted myself to speak; it was strange. Do you have a hypothesis yet regarding how a tool like this might affect the field? What I mean is: your research thus far seemed to indicate that, thematically, most of the journals are getting less unique. Would uniqueness increase, or continue to decrease, with more access to the information and a greater understanding of how everyone is approaching these things differently? Do you have any thoughts on that? Yeah, the thing is, these are only six journals, and we want to bring more and more journals into the analysis to understand whether the uniqueness is decreasing because authors are preferring more specialized journals, or vice versa. So we have to do some analysis on that; we have to study it, taking the entire hydrologic corpus into consideration.

It looks like we have another question: what kind of software have you utilized for this project, Python or other programs? The programming language we used is Python, but you can also do it in R or JavaScript; there are multiple libraries available online, and there is so much material on the internet now that everyone is trying to do machine learning and natural language processing, so make use of those resources. For me, I find Python the most conducive to data science and JavaScript more conducive to visualization, so those are some of the languages you might consider.

And Vincent is asking a question that I think you and I discussed the first time you came in to talk about this project: is the data in the abstracts enough to capture the thematic information you're looking for, or is full text really what you need? Yeah, that is something I'm still exploring, full text versus the abstract corpus. Right now we're also exploring dynamic topic models, which take the timestamps into account, and we're also looking to build a priori networks for training these topic models. Apparently one of our colleagues at the University of Maryland, Baltimore County did this, and she thinks it gave her better topics. Those are the things we are exploring right now; it's a dynamic process, and we're learning as we go.
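For the dynamic-topic-model direction mentioned above, Gensim ships an implementation of Blei and Lafferty's dynamic topic model. A minimal sketch, assuming the corpus has been sorted chronologically and `time_slice` counts the documents per period (both placeholders here; the counts must sum to the corpus size):

```python
from gensim.models import LdaSeqModel

# Documents must be ordered by time; time_slice gives the number of documents per period.
time_slice = [120, 150, 180]  # placeholder: e.g., papers per time period; sums to len(corpus)
dtm = LdaSeqModel(
    corpus=bow_corpus,   # chronologically sorted bag-of-words corpus
    id2word=dictionary,
    time_slice=time_slice,
    num_topics=25,       # the optimum found for the static model
)

# How topic 0's top words drift across the three time periods:
for t in range(len(time_slice)):
    print(dtm.print_topic(topic=0, time=t, top_terms=5))
```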
I guess one question I have, since you were the first person I've worked with who utilized the Crossref API this extensively: how was that process? Did you think it was easy enough to work with? If you know how to build a web scraper, it's easy, in the sense that once you have the API token you can put it inside the scraper. Grey is the expert in this; he built those scrapers, and we have the papers in our repository. So if you know how to build those scrapers, you're in luck. Do you have any other questions?

This is Grey, since Mash mentioned my name. We didn't have any problems with the APIs once we got everything approved through the library. We had to be careful about maximum download volumes, staying under the daily limits, and I think you helped us work through that, but other than that the process was pretty seamless. It's been a while, so I don't remember all of the details, but once we got things worked out with you at the library, it worked pretty well. The only other thing, and Mash sort of said this, but I just want to clarify: some journals are really easy to write web scrapers for, and we did that just because we could figure out how to construct the PDF links. In retrospect we probably should have worked with you to do that, but we did all of it before we met you. And then, once we found some journals that were harder to download, Kevin and other people at the library were really helpful in getting us all the resources we needed. Thank you.

Let's see, another question, from Adrian: could this be used to streamline literature reviews? What do you mean by streamlining, sorry? What I'm asking is: could this be used to gather a bunch of the background information you need to write another paper, without really reading the scientific literature? Okay, so what you're talking about is synthesizing or summarizing the information. This is a precursor to such a tool; if you want to do that, this acts as a precursor. When you're exploring by topic, time, and journal, and you're looking for a certain topic, you have this visualization space where you can see that certain papers sit together, so their authors must also be related to each other. Over time, as we develop this tool further and attach more metadata, maybe we can build a synthesis engine to synthesize or summarize the information. There's still a fair bit to go before we reach that point, but yes, it's a precursor to that. Got you, thank you.

Anyone else? Well, one thing I'll mention, since Grey brought this up. Several years ago, when folks started doing this type of work, in the early days of web scraping, there weren't quite as many copyright protections in place that would stall your efforts or cause problems in the process. In recent years all of the publishers have caught on to the things that were being done, not just for legitimate research but also for nefarious reasons, such as gaining access to content they want to charge money for and making it freely available, and that has changed things a little bit. I'll just let you know that the University Libraries is committed to supporting text and data mining, but there is a process that has to occur these days, where we get in touch with the publishers and work through them to make sure we can get access. So if this type of work is of interest to you, just feel free to reach out to me.
And if I'm not the best person, we've got several folks within the library who work with this type of text, so I'll do my best to point you in the right direction, and we'll work with vendors if we can to gain access where we don't already have it. Through the Crossref API we can access a lot of the Springer and Elsevier content, which, as you know, covers several thousand different journals, so we've got some good access there. In any case, just reach out if you want to start a project like this and you're not really sure where to start.

If there are no other questions, I'd like to thank Mash for his time and a really interesting presentation. Thank you, Mash. It was a pleasure; thanks to everyone for coming in and listening to me. I really love this project, so shoot me an email with any questions if you have any, and thanks, Kevin, for arranging all of this. No problem, I'm happy to help communicate science in any way possible. Great, and I will send everyone a follow-up email with a link, in case you want to share it with any colleagues who couldn't make it, and I can share Mash's email address as well if you're looking to reach out to him with additional questions. Thanks so much; we'll see you next time. We've got several sessions coming up, and I'll be in touch. Thank you so much. Thank you.