All right, I believe we are live, so let me bring everyone into the session room. I'm glad everyone got the memo about wearing a checkered shirt and drinking a ton of coffee; that works for me. Hello, everyone, and welcome to the second talk of the meeting. It is my absolute pleasure to introduce a squad from Arizona State, out of the orbit of Manfred Laubichler's group: Cody O'Toole, Ken Aiello, and Michael Simeone, presenting work they did with Manfred Laubichler at ASU. They'll be talking to us today about the thematic evolution of human genome research. With that, take it away.

Awesome, thank you, Charles. I'm Cody O'Toole, and together with my colleagues I'll be talking about the thematic evolution of human genome research. I'm a PhD candidate at Arizona State University in biology, with a special focus on biology and society. Part of my motivation for this research is that I'm looking to go into industry, so I'm interested in the efficacy, reproducibility, and scalability of the model we build here. I'm joined by Ken Aiello, a postdoctoral researcher with ASU's Global Biosocial Complexity Initiative (GBCI), and Michael Simeone, director of Data Science Analytics at ASU. Our PI is Manfred Laubichler, who holds many titles, among them director of the School of Complex Adaptive Systems and director of GBCI at ASU.

To start with a short abstract: human genome research is an important landmark in biomedical research and has received considerable resources and attention thanks to the Human Genome Project and the important role of genomics in modern medicine. Because of that attention, there are long-standing conjectures about which themes are present at specific times and which themes will emerge in the future. To examine those, we take a human genome corpus and apply a novel dynamic topic model, using domain expertise, via two landmark papers, as a guide to what themes should appear in the corpus at specific time slices. Overall, our results show the evolution, duration, and trajectory of the topics and themes in the corpus.

First, a short timeline of the Human Genome Project. The papers-per-year graph here is made from our data: you can see the count increase over time, then plateau and dip slightly toward the end. In 1989, the National Center for Human Genome Research (NCHGR) was established, which in 1997 became what we now know as the NHGRI. The first big milestone paper we see, in 1994, is the Human Genome Project reaching its first mapping goal: a comprehensive human genetic linkage map. These linkage maps show the relative order of, and approximate spacing between, specific DNA markers positioned on chromosomes, and they were one of the first tools researchers could use to find disease-causing genes, which is really important.
Later, in 1996, came something that happened during the Human Genome Project that still shapes how genome research is done today: the Bermuda Principles, which encouraged the open-access release of data to maximize the benefit of human genome research to society. Toward the end of the project, in 1999, the first human chromosome was sequenced; seeing the organization of a human chromosome at this level for the first time paved the way for the rest of the Human Genome Project. Then in 2001 we see two draft human genome sequences: one from Venter et al. at Celera, and one from the International Human Genome Sequencing Consortium under the NHGRI. Two years later the final data came out and the Human Genome Project concluded. At the same time, Collins et al. published a blueprint for their vision of the future of genomics research, the first landmark paper we'll be talking about. Then in 2011, Green et al. reflected on the previous ten years of genomic research and charted a course toward an era of genomic medicine; that's our second landmark paper.

Overall, the impact of the Human Genome Project on human genome research as we see it today: it spurred technological advances in sequencing as the NIH and Celera raced to the finish. In the end they combined forces to complete the project, but while they were racing, the competition drove big advances in how sequencing is done, which let them finish in a shorter time than initially predicted. Additionally, a new era of digital biology emerged from parallel developments in computational power and Human Genome Project sequencing technologies, helped along by the Bermuda Principles and their push for open-access data. The power of those advances was revealed in a series of post-Human Genome Project efforts: the International HapMap Project, the 1000 Genomes Project, The Cancer Genome Atlas, and the Human Microbiome Project, among others. These projects were emblematic of the advances in scaling, digitization, and sharing that the Human Genome Project sparked.

We also see key shifts in other fields. In medicine, publicly available resources have identified multiple genes associated with disease, enabling more accurate and objective diagnoses, and diverse types of cancer tumors are more easily identified; all of this came from what we learned in the Human Genome Project plus the increase in technology, which let us learn more about these genes and identify diseases more easily. In forensics, we have CODIS: we can now identify people from extremely small samples, just a few skin cells, saliva, hair, blood, or semen, with sensitivity that has increased over time.
In anthropology, one of the really cool things I learned about: through open-access comparative DNA samples, we can learn where modern populations came from, which let molecular anthropologists confirm that Africa is the cradle of our modern species, Homo sapiens, which I thought was very interesting. And of course the big one is biology. The project was essential for the emergence of systems biology, giving researchers all the parts of a complex biological system; it also led to the emergence of proteomics and transformed our understanding of evolution. Genomes provide insight into how diverse organisms, from microbes to humans, are connected in the genealogical tree of life, clearly demonstrating that all species existing today descend from a single ancestor.

Our first landmark paper, which is really important for the rest of this research, is "A vision for the future of genomics research" by Collins et al. in 2003. It lays out the blueprint for the future of genomics research, built on three foundational interactions, as you can see in the picture on the right: genomics to biology, then to health, then to society. The idea is that the more we learn about biology, the more we can learn about and help human health; and once we have those facets, we can build on them and ask how the work benefits society. It's a stepping-stone structure: biology first, then health, then the larger society. The blueprint is based on the themes of identifying the structural and functional components encoded in the human genome; defining genetic networks and protein pathways and how they contribute to organismal phenotypes; understanding how genomic variation correlates with phenotypic differences; and understanding evolutionary variation across species. One of their last points is developing policy options that facilitate the use of genomic information. The paper calls for interdisciplinary collaboration to reach these goals, and additionally for many parallel developments so that genomics research can progress: computational power, storage facilities, more sensitive sequencing technology. Those technological advances are what the field needed to tackle these goals, because they would require massive amounts of data and more sensitive equipment. The paper suggests that large-scale genomic strategies will be imperative to empower advances in human health, with special emphasis on the genetic contribution to disease and how it can be used to improve drug development and predict drug response.

Our second paper, "Charting a course for genomic medicine from base pairs to bedside," is more focused on the future of genomic medicine. It's really about taking the next step from everything we've learned in genomics and applying it to human health: making healthcare more accessible, diseases more easily diagnosed, things like that.
The 2011 paper is essentially a vision for the next ten years of genomic research, and a bit beyond. It argues that genome research will advance by deepening our understanding of the biology of genomes, the genetic basis of disease, the science of medicine, and the effectiveness of genomic healthcare. It builds on the 2003 paper we just discussed and sees a fuller understanding of genome biology and its implications for human health as the next stepping stone. It lays out a much more articulated plan for how an improved genomic understanding of disease can improve healthcare in diagnostics, therapeutics, and clinical trials. In diagnostics, the aim is to increase accuracy and make it easier to determine which disease a person has. In therapeutics, the aim is more effective drug development through pharmacogenomics: mapping the response a person will have to a drug, so we can predict whether a drug will be effective for that person without adverse side effects. In clinical trials, the focus is efficacy through genomics-based stratification: improving validity by stratifying trials on participants' genomic backgrounds, so we get a better reading of how a drug actually performs. Right now trials take a more homogeneous approach, implicitly assuming everyone has the same genome, which doesn't give the best results and leaves some people with serious adverse side effects. The paper again calls for many parallel developments in computational power and accuracy to handle the big data of genomics. It also emphasizes further understanding of the human microbiome for robust disease treatment, noting its important role in genetic interaction networks. The authors highlight the emergence of metagenomics, which in their words offers unprecedented opportunities for understanding the role of endogenous microbes and microbial communities in human health and disease, since many diseases are influenced by the microbial communities that inhabit our bodies. Overall, they focus strongly on developing drugs for subsets of human populations based on genetic similarity, informed by a deeper understanding of the human microbiome and its impact on disease.

Now, our data. We gathered 7,965 documents from PubMed Central ranging from 1989 to 2018, a 30-year period, using the search term "human genome project," which gives us a window into what human genome research was doing. Overall we have about 25 million tokens, that is, all running words of all types, and about 422,000 unique word types, for a type-token ratio of 0.0165. The type-token ratio indicates how lexically diverse a corpus is: the closer to one, the more diverse. With such a small value, our corpus is lexically rather uniform, which is likely due to the solidified nature of the field. As we saw in the papers-per-year graph, the later years contain a great many papers and growth has plateaued: the field is more solidified and its ideas more cohesive than in the early years, when many new ideas were appearing, and that later cohesion swamps whatever lexical diversity the earlier years had.
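Just to make that ratio concrete, here's a minimal sketch in plain Python. It's illustrative only: the toy sentence stands in for our actual token stream, and the corpus-scale numbers in the comment are the approximate counts quoted above.

```python
def type_token_ratio(tokens):
    """Type-token ratio: unique word types divided by total running tokens.
    Values near 1 mean high lexical diversity; values near 0 mean a large
    corpus that keeps reusing a settled vocabulary."""
    return len(set(tokens)) / len(tokens)

tokens = "the human genome project mapped the human genome".split()
print(type_token_ratio(tokens))  # 5 types / 8 tokens = 0.625

# At corpus scale, the approximate counts from the talk give:
# 422_000 types / 25_000_000 tokens, i.e. about 0.017: low lexical diversity.
```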
For our data analysis methods, we use basic corpus-linguistic methods such as word frequency and collocation; a natural language processing method, TF-IDF; and a combined machine learning and NLP method, the dynamic embedded topic model, which I'll get into later.

To start, here's our word frequency analysis. This is a really important step for getting a high-level understanding of the corpus: it helps us check whether the data is clean, lets us see how things change and what's important, and lets us learn about the corpus overall. There's nothing really unexpected here, but it's a helpful way of verifying cleanliness and seeing which words emerge from raw frequencies, as you can see on the right. After the Human Genome Project finishes, "cancer" really enters the corpus, around the same time the 2003 paper came out, which is interesting to see. In the later years, "clinical" starts to appear in the top 20 words, which is historically validated by the predictions of the 2011 genomic-medicine paper and the continued use of genomics in therapeutic research. So our data looks clean, and it follows the historical trajectory we expected.

Next, TF-IDF, term frequency-inverse document frequency: we compute the term frequency and the inverse document frequency and multiply them together, which gives us the most important words in a document relative to a collection of documents. In this analysis, we treat each time slice as its own document and compare it to the collection of the other time slices. Doing it this way lets us find the most important words in each time period relative to the other 25 years of human genome research. For example, in the final years, when "clinical" starts to appear, we can see just how important "clinical" was in that period compared with every other year in which it was used. On the left we have our TF-IDF table, and on the right, Jaccard similarity, which shows how similar consecutive time slices are over time. We take the 1989-1993 slice, compute the Jaccard similarity of the 1994-1998 slice against it, and so on, always comparing the slice at time t+1 to the slice at time t; that's why the graph starts at 1994. The slices are fairly similar across time, but there's a drop in similarity in 2004-2008, directly after the first landmark paper and the finish of the Human Genome Project. Both computations are sketched in code below.
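Here's a minimal sketch of both measures, assuming tokenized text already grouped into five-year slices. It's a simplification of our pipeline: real TF-IDF variants differ in weighting and normalization details.

```python
import math
from collections import Counter

def tfidf_by_slice(slices):
    """slices: dict mapping a time-slice label (e.g. '1989-1993') to the
    list of tokens from papers in that slice. Each slice is treated as
    one 'document', as described in the talk."""
    n = len(slices)
    df = Counter()                      # in how many slices does a word occur?
    for tokens in slices.values():
        df.update(set(tokens))
    scores = {}
    for label, tokens in slices.items():
        tf = Counter(tokens)
        total = len(tokens)
        scores[label] = {w: (tf[w] / total) * math.log(n / df[w]) for w in tf}
    return scores

def jaccard(a, b):
    """Jaccard similarity of two word sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Comparing each slice's top-20 TF-IDF words against the previous slice's
# yields the similarity curve in the talk, starting with 1994-1998 vs 1989-1993.
```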
What we learn from that is that the Human Genome Project imposed a kind of cohesive focus, and once it finished, the space of possible ideas really expanded, which caused the dip in similarity. Similarity then increases again and starts to plateau: the corpus comes back together and becomes more cohesive, the field stabilizes, and the final two time slices suggest it has solidified, as I just mentioned. In the TF-IDF results, "cancer" appears as an important word with a rising score, and "patient" also rises in score, even though it's only highlighted in one spot. On the other side, "map," "clone," "health," and "family" drop out of the top 20 TF-IDF scores, and "health" is replaced by more specific variants such as "cancer," "patient," and "clinical." The general idea of health is disappearing while more focused versions enter the corpus, which suggests that genomics-based medicine is converging on specific applications as technology advances, rather than on the overall concept of human health.

Next, we looked at collocates. As J.R. Firth put it, in what I think is the best explanation of collocation, "You shall know a word by the company it keeps." Collocates provide contextual information about words, capturing meanings not explicitly stated in the text, and they are often used to show variation in language use and context. These are words that co-occur with a node word repeatedly throughout the corpus, and the significance of those combinations is measured through association measures. We use the t-score association measure, since it's robust to breaches of the normality assumption and tends to give higher scores to collocates with higher co-occurrence frequency, as we found in our previous research in the Laubichler lab. Here, collocates are used to validate the historical trajectory of human genome research and to learn how specific words change in meaning over time. Based on our two landmark papers, our word frequency list, and our TF-IDF words, we selected seven node words to analyze through collocation with the t-score: clinical, disease, drug, gene, health, microbiome, and policy. The t-score computation is sketched after this paragraph.
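A rough sketch of that t-score computation, assuming a flat token list per time slice. The window size and the top-20 cutoff are illustrative choices, not necessarily the settings we used, and the expected-count formula is the common corpus-linguistics variant that scales by window span.

```python
import math
from collections import Counter

def t_scores(tokens, node, window=4, top=20):
    """Collocation t-score: t = (O - E) / sqrt(O), where O is the observed
    co-occurrence count of the node word and a candidate within +/- `window`
    tokens, and E is the count expected if the two words were independent."""
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, w in enumerate(tokens):
        if w != node:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[tokens[j]] += 1
    span = 2 * window
    scores = {}
    for c, obs in cooc.items():
        expected = freq[node] * freq[c] * span / n
        scores[c] = (obs - expected) / math.sqrt(obs)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]
```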
Our first node word is clinical. For clinical's collocates we see a high degree of similarity over time, but in the highlighted cells we can verify the suggestions of the 2011 paper: the genomic-medicine era focusing on improving the validity of clinical trials, with pharmacogenetics as an important driver of that move. The 2011 paper's predictions are happening at roughly the time it anticipated, and clinical's collocate similarity keeps increasing even though there's a massive increase in the raw frequency of the word.

Disease is a really interesting one to analyze. The similarity of disease's collocates increases over time, but we can clearly see shifts in which specific diseases mattered most to human genome research. Alzheimer's keeps its importance over time, but Huntington's, which appears in the first column, really disappears afterwards. We then see a wider range of diseases appearing, spanning the entire body, indicating that parallel developments in technology and a deeper understanding of genome biology and genetic networks have given researchers the opportunity to tackle a vastly wider range of disease-related problems.

Next we have drug, which is a really important part of both papers. The similarity of drug's collocates increases as the Human Genome Project finishes and the 2003 paper is introduced; even though there's a substantial increase in the raw frequency of the word, the top 20 collocates stay similar over time. At the beginning, in the 1989-1993 slice, HIV is an important topic for drug development; in the last slice, anti-cancer drugs become a much more important focus, which is historically verified by the substantial increase in cancer research. An interesting finding is the appearance of pharmacogenomics and pharmacogenetics in the 1999-2003 slice: the 2003 paper put some emphasis on further developing that field, and the 2011 paper placed even stronger emphasis on its importance for the future of genomic medicine, so it's really interesting that it appeared much earlier than the 2011 paper would indicate.

Next we have gene. Gene was the one I was most excited to analyze, because the gene is a really variable concept, and it's important to see its influence in this era of human genome research. What I've seen in my previous research is that there are multiple gene concepts over time, so finding that the collocate similarity stays essentially constant in this corpus is really interesting. In the highlighted words we see the emergence of a need to understand heritable variation in genes across populations, which becomes an important aspect of the research on the timeline the 2003 paper suggested and the 2011 paper emphasized. Again, we're seeing the predictions of the landmark papers coming to fruition, visible just from the collocates.

Next we did health. Health stabilizes in similarity over time as the field solidifies, and disease and disparity emerge as collocates after the finish of the Human Genome Project and the 2003 paper. This is historically verified through our landmark paper, which called for this research to be used in a way that reduces health disparities across the globe.

Microbiome was also really interesting, because the word didn't appear in the corpus at all until 2007, which is why there are no collocates before then; that's just before the official start of the Human Microbiome Project, though the NIH was already funding microbiome research heavily at the time. We then see similarity increase, and the word's relation to health starts to appear in the top collocates.
But by the time our analysis window ends in 2018, there wasn't as much microbiome material as we'd have liked to see; I assume it continues to grow through 2021. The last word is policy, which both papers emphasize at the end: the need for policy on open access to data and on drug development and technology. Policy increases in collocate similarity over time, and after the 2003 paper, the implications of drug development, technological advancement, and open data release become an important part of the conversation; we can really see that change here.

To summarize what we've done so far: human genome research advanced through the completion of the Human Genome Project and parallel developments in multiple fields, leading to an era of genomic medicine aimed at improving human health. The analysis to this point has verified that our corpus is clean; it has also been validated against the historical trajectory of human genome research and shown that the predictions of the two landmark papers came to fruition. Next, we use a novel dynamic topic model to track the thematic evolution of the human genome corpus. This analysis will be validated by the word frequency, TF-IDF, and collocation results, along with the two landmark papers as guides, and it will also let us see how influential each paper was in the field.

Real quick, on topic models: they characterize the themes found in a corpus by highlighting words that co-occur in clustered ideas or phrases, revealing themes, language context, and high-level knowledge, and helping us find distinctive patterns of words that appear together. As the picture shows, the model finds groups of words following the same ideas, based on their co-occurring neighbors, and places them into topics. The most common model, LDA, is an unsupervised machine learning method that treats each document as a mixture of topics and infers the topic proportions of any document, typically with Markov chain Monte Carlo methods such as Gibbs sampling, or with Bayesian nonparametric extensions; I can explain those later if anyone has questions. A downside of this model is that it produces inaccurate topics on corpora with large, diverse vocabularies. To combat this, researchers often prune their datasets, but that can limit the scope and miss hidden themes, and finding those themes is the whole reason we're doing this research, so we didn't want to do that. There's another method, the embedded topic model (ETM), which essentially marries traditional LDA with word embeddings. Word embeddings in a low-dimensional space preserve the semantic properties of words, increasing the coherence of topics and enabling the use of larger vocabularies. The ETM assigns topics based on a word's location in the embedding space, using a log-linear model that takes the inner product of the word embedding matrix and a topic embedding. In this form, the ETM assigns high probability to a word under a topic by measuring the agreement between the word's embedding and the topic's embedding. But the ETM still shares a key limitation with LDA: neither can model topics over time.
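To make that inner-product idea concrete, here's a toy NumPy sketch of the ETM-style topic-word distribution. The dimensions are placeholder values, not our model's.

```python
import numpy as np

def topic_word_distribution(rho, alpha_k):
    """ETM-style topic-word distribution: beta_k = softmax(rho @ alpha_k).
    rho:     (V, L) word embedding matrix (V vocabulary words, L dimensions).
    alpha_k: (L,)   embedding of topic k in the same space.
    A word gets high probability under topic k when its embedding agrees
    (has a large inner product) with the topic's embedding."""
    logits = rho @ alpha_k          # (V,) inner products
    logits -= logits.max()          # subtract max for numerical stability
    expl = np.exp(logits)
    return expl / expl.sum()

# Toy check: 5-word vocabulary, 3-dimensional embeddings, one topic vector.
rng = np.random.default_rng(0)
beta = topic_word_distribution(rng.normal(size=(5, 3)), rng.normal(size=3))
print(beta, beta.sum())  # a proper probability distribution over the vocabulary
```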
Both models are also static and fuzzy, giving different results each run because of their probabilistic nature, and being static, they can't help us understand how meaning changes over time. To analyze the evolution of topics, we need a dynamic model. The first dynamic topic model, by Blei and Lafferty in 2006, is an extension of LDA: essentially the same model, but with a probabilistic time series that lets the topics vary over time, achieved by chaining the parameters of each topic in a state-space model. As you can see in the figure on the right, from their paper, the model runs as it normally would, but the parameters are chained across time slices, so each topic's probabilities can change with time. Since this model is still an LDA at heart, it suffers from the original LDA's problems, so we need the dynamic embedded topic model (DETM), by Dieng, Ruiz, and Blei. The DETM uses a time-varying vector in the word embedding space to model each topic over time, producing more accurate topics and better topic coherence and diversity, because semantically similar words, whose representations are close in the embedding space, get assigned to similar topics.

We split the documents into sentences and treat each sentence as its own document, which makes the model easier to work with and keeps words from other sentences out of our skip-gram contexts. After the skip-gram step embeds the vocabulary, we ran the DETM with 10 epochs, a learning rate of 0.001, and a batch size of 200. Normally we would run at least 100 epochs, but we trained this on my laptop, which has only eight gigabytes of RAM, with a massive dataset, so the model ran for about five days and we couldn't run it any longer before this talk. For corpus sizes: 718,798 documents, which is essentially all sentences; we trained on about 611,000 and validated each epoch on 35,941, checking whether the perplexity keeps dropping and reducing the learning rate when it rises again; we tested on 71,879; and our vocabulary size was about 14,000.
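Here's a minimal sketch of that preprocessing and skip-gram step, assuming gensim's Word2Vec (sg=1 selects the skip-gram architecture). The naive sentence splitter and the hyperparameters are illustrative placeholders, not the exact settings of our run.

```python
import re
from gensim.models import Word2Vec  # sg=1 gives the skip-gram architecture

def to_sentence_docs(texts):
    """Split each paper's text into sentences and treat each sentence as
    its own 'document'. The regex splitter here is deliberately naive;
    a real pipeline would use a proper sentence tokenizer."""
    docs = []
    for text in texts:
        for sent in re.split(r"(?<=[.!?])\s+", text):
            tokens = sent.lower().split()
            if tokens:
                docs.append(tokens)
    return docs

# Hypothetical usage; `corpus_texts` stands in for our extracted papers:
# sentences = to_sentence_docs(corpus_texts)
# w2v = Word2Vec(sentences, vector_size=300, sg=1, window=5, min_count=5)
# The resulting word vectors are what the DETM builds its topics on.
```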
On the right are the datasets from the Dieng et al. 2019 paper that introduced this model; the second table gives perplexity ratings and the third gives topic coherence, diversity, and quality, the quantitative validation metrics, for the UN dataset, which is the closest in size to our corpus. Our perplexity, a measure of predictive performance where lower is better, is 2065, which is better than almost all of their configurations except the DETM on the UN dataset; we believe that with more epochs ours would come close to the UN perplexity. Topic diversity, the middle metric, is much worse than theirs. Topic diversity is the percentage of unique words among the top 25 words of each topic, and we found that 82% of our topics shared 95% of the same top 20 words, which accounts for the low diversity. On the other hand, we see extremely high topic coherence: theirs is only about 0.1, while ours is 0.78. Topic coherence is a quantitative measure of a topic's interpretability, and since so many of our topics were essentially the same, that accounts for the high coherence. Together these give a topic quality of 0.13, which is on par with, even a little higher than, theirs; both metrics are sketched in code after the results below.

With the model validated, we can look at what influence the papers and other events had. There's an exponential increase in the probability of "drug" and "network" beginning one year after the introduction of the 2003 landmark paper. On the left we have that paper's citations, extracted from Scopus, and its largest single-year citation count came in 2004. The probability jump, coupled with that citation spike in the same year, suggests the paper was one of the stronger drivers of this change, alongside other parallel developments; drug development and genetic networks are among the main themes of the 2003 paper, together with technological advancement and the need for a deeper understanding of genetic variation and evolution.

Next is the DETM for the cancer topic we found: cancer-related terminology increases over time while "protein" decreases. This is historically verified through our TF-IDF and word frequency analyses, so it's also a validating factor for the quality of the DETM: we expected cancer-related terminology to rise, and we saw "protein" fall, so the DETM is on par with our other analyses.

Then the microbiome topic: "gut" and "microbiota" increase exponentially two years after the 2011 paper. It's unlikely the paper itself was a strong influence; rather, the Human Microbiome Project approaching its proposed end date is what really started this, since the probabilities of those words increase dramatically just after the HMP finishes, which is something interesting we might analyze at a later time. The HMP ran from 2008 to 2016, published over 650 papers, and has, I believe, over 75,000 citations so far; its purpose was the comprehensive characterization of the human microbiota and their role in human health and disease. So the HMP is probably one of the bigger drivers of this topic, with the 2011 paper a smaller one. And while the 2011 paper is strongly focused on genomic medicine, the idea of genomic medicine increases throughout the entire period of the corpus, and one paper would not have the strongest influence in a field that was already well developed by then. These findings are also qualitatively verified through the frequency and TF-IDF results.
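Going back to those quantitative metrics for a moment, here's a minimal sketch of topic diversity and topic quality as Dieng et al.'s evaluation defines them; coherence itself, an average mutual-information score over each topic's top words, is omitted for brevity.

```python
def topic_diversity(topics, top_n=25):
    """Fraction of unique words among the top-N words of every topic
    (the DETM evaluation uses the top 25). A value near 1 means topics
    rarely share top words; a value near 0 means they largely repeat
    the same vocabulary, as ours did."""
    top_words = [w for topic in topics for w in topic[:top_n]]
    return len(set(top_words)) / len(top_words)

def topic_quality(diversity, coherence):
    """Topic quality: the product of diversity and coherence, so a model
    must do well on both to score well."""
    return diversity * coherence

# Illustration with two heavily overlapping toy topics:
t1 = ["cancer", "tumor", "patient", "clinical", "therapy"]
t2 = ["cancer", "tumor", "patient", "clinical", "gene"]
print(topic_diversity([t1, t2], top_n=5))  # 6 unique / 10 total = 0.6
```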
So we can really see that while many of the 2011 paper's predictions came to fruition, it was not a main driver, in contrast to the 2003 paper, after which we see an exponential increase in the relevant words. To conclude: using a combination of digital and traditional methods together with domain expertise, we were able to assess the thematic evolution of human genome research. Our digital analysis triangulates multiple methods, namely corpus linguistics, NLP, and machine learning, and we see that the two papers we used as domain expertise were among the main contributing factors in a highly interdisciplinary field of research. We found that the 2003 paper had more influence on the trajectory of human genome research, and that finding was supported by all of our results; the 2011 paper, while it predicted many of the themes correctly, was a smaller driver of influence compared with other parallel developments such as the Human Microbiome Project. By the time of the 2011 paper, the domain had a certain momentum and inertia: it was growing and starting to solidify, and part of that inertia is common research questions, methods, key concepts, and debates that stabilized and grew over time, which would prevent the 2011 paper from deterring or redirecting the field. As the field and the corpus grew, it became harder for any single paper, or even any single factor, to exert the influence the 2003 paper did: a bigger corpus means less influence for a single paper. Since the field wasn't yet solidified at the time of the 2003 paper, that paper had more room and more chance to influence the field, which it most likely did. Overall, the combination of digital and traditional methods lets us model and characterize the evolution, duration, and trajectory of themes in an entire scientific field more effectively than either methodology alone, and it also let us verify and measure the influence of domain experts before and after the field of human genome research solidified. I'd like to acknowledge GBCI, the School of Complex Adaptive Systems, and the Santa Fe Institute; to thank my colleagues for all their help putting this together; and to thank the PennSLAB and DS2 for having us.

Fantastic, thank you so much. What a great talk. Combining citation-network-type analysis with text analysis is something we've been thinking about a lot, and that's a really cool way to do it. Let me go to some questions from the audience. From Stefan Lindquist: very impressive work; I'm interested in the null expectation for that similarity measure. Do you expect it to generally go up or to decrease as a project matures? What would you expect to see?

Yeah, so I'm assuming the similarity measure in question is the Jaccard similarity. There isn't really a formal null expectation, because the measure is just the intersection divided by the union.
We were expecting to see some sort of change, or maybe stabilization, but the way this is really validated is through the traditional methods, where we build a deeper understanding of the corpus before running these analyses. With health, we genuinely expected similarity to increase over time, because we knew health was an important part of human genome research, and we knew that as the field grew it would become more cohesive and solidified; like a citation network, it grows and grows as the field gets bigger, so we expected something similar. That was our working expectation for most of the words. The one that went against my expectation was gene: I thought the Jaccard similarity of gene's collocates would be highly variable over time, because of all the different gene concepts, the classical gene concept, the molecular gene concept, and so on. I expected the Human Genome Project to take some of those on and be highly variable, but we saw it stay very static over time, which I thought was very interesting.

Cool, thanks. Next question, from Stefan Hesperken: have you validated some of these results against other existing keyword sets or taxonomies? I think you mentioned some of this at the end, about bringing in other kinds of domain knowledge; can you say a little more about how that would work?

Yeah, we didn't really find anything out of the ordinary there. Because this is such an interdisciplinary field we expected things to be highly variable, but what we actually found is a lot of words about medicine, honestly more than we found words about genes toward the end, which I found really interesting. Since it's human genome research, you'd expect a lot of talk about the components of genes, chromosomes, alleles, loci, things like that, but instead we found a lot of health-, cancer-, and disease-related terminology.

Cool, thanks. Next question from John Redisky, who says: great talk. Do you have any information on word frequency within the different sections of a paper, so introduction, methods, results, discussion? Is that something your corpus has access to?

That's not something we have easy access to right now. The way we do it, we take the full PDF versions from PMC and convert them to text using Python, and what that produces is essentially paragraphs separated by newlines, so there's no easy way to recover section structure. But there are really good methods coming out now for going straight from PDF into Python with much better fidelity, and I think we could use a regex-based approach to split the text into sections. That's a really good direction for the future of this kind of corpus-linguistics research: it would be interesting to see how word frequency in the introduction differs from the methods and from the conclusion.
Cool, thanks. A question from Chris from Maltaire: with the DETM method, do you specify a total number of topics, or do you focus on specific words for which you get embeddings, which would explain why your topics aligned with your focus terms?

Yeah, so the way their method works, they do their preprocessing, separating the text into sentences, then use a skip-gram model to embed it, which saves some time. The DETM then also does topic embeddings, and you do set the number of topics; there's a long list of things you can set, the alpha values, the activation functions, the batch size, the learning rate, all of that. It gives you complete freedom, but the topic count is something you set explicitly. We used 50 topics, and we could only run 10 epochs because it's such a massive model: it has an LSTM plus linear transformations, and the LSTM already runs slowly, so coupling that with learning multiple kinds of embeddings, assigning topic probabilities, and modeling a long time period makes it hard for the model to move quickly, and I only have eight gigabytes of RAM on this computer. We were only able to get 10 epochs done in time for the conference, which we're a little sad about, but we did get four really good topics, diverse ones showing interesting dynamics, which shows how effective this model can be even with limited training time.

Great. The last question is from me; I don't see any more coming in, so I've inserted myself at the back of the line. A similarly nerdy technical question: I've done some collocation analysis before, and those are exceptionally clean collocation results. If you're going from PDF to text, how much really nasty cleanup did you have to do? That feels like a very, very clean dataset.

Yeah, it is really clean. The nice part is that my colleague Ken and I both bring domain knowledge: I do my dissertation research on gene drives, so biotechnology and biomedicine, and he does his on the microbiome. We have two really good stop lists, a few thousand entries long; we've gone through with WordSmith, looked at the first ten percent or so, a few million words, and removed all the broken words. We find that the same conversion code tends to break words in similar ways, which helps. I also built a regex-based filter: since I had the list of all the author names, I could go through and say, if there are five or six author names in this string, just get rid of it. And I found that most sentences in scientific papers aren't going to be shorter than 60 characters, so I removed everything under 60 characters. That removes names and other unhelpful fragments, which also contributes to how clean the collocation analysis looks. A rough sketch of those heuristics is below.
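A rough sketch of those cleanup heuristics, assuming lines of PDF-extracted text and a hypothetical `author_names` list of known surnames for the corpus; the thresholds mirror the ones described in the answer but are otherwise illustrative.

```python
import re

def clean_extracted_text(lines, author_names, min_len=60):
    """Heuristic cleanup of PDF-to-text output: drop short fragments and
    lines that look like author lists. `author_names` is a hypothetical
    list of author surnames gathered for the corpus."""
    pattern = "|".join(map(re.escape, author_names))
    name_re = re.compile(pattern, re.IGNORECASE) if pattern else None
    kept = []
    for line in lines:
        if len(line.strip()) < min_len:
            continue  # real sentences in papers are rarely this short
        if name_re and len(name_re.findall(line)) >= 5:
            continue  # five or more surnames in one string: likely an author list
        kept.append(line)
    return kept
```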
Very nice. Yeah, speaking of Dr. Burner mentioning tacit knowledge during her keynote, these are all the little pieces of tacit knowledge we all have to figure out how to manipulate.

Great. I don't see any further questions, so, a fantastic talk, thank you so much. Oh, wait, never mind, one more came in. Christoph Mullington asks whether you tried comparing your model against a good old-fashioned vanilla LDA topic model with 50 topics or so, just to see what that would produce.

Yeah, so while that would be an interesting comparison, a vanilla LDA wouldn't let us model over time. We also found that the embedded topic model is somewhat more efficient than LDA, because it works from the inner product of two embeddings: one embedding captures the semantic properties of words really well, and the other captures the co-occurrence structure, and those meld together really well. What I've found is that these dynamic embedded topic models, while they run long even on small datasets, are really accurate, because they hold onto those semantic properties so well that the topics come out much better overall. Let me go back to the quantitative validation slide: the way Dieng, Ruiz, and Blei did this, they compare the original LDA, an improved dynamic LDA, and the DETM, and the DETM almost always does better on the bigger datasets, which have more diverse categories. Sometimes the topic coherence is a little better for the dynamic LDA, which is still a really good model, but topic diversity and overall topic quality are usually better for the DETM on large datasets. So we judged it would work better for our large dataset, which we also expected to be really diverse at the beginning, and we wanted to make sure we captured the ideas present at the start and could see how they translate later in the corpus.

Very cool. All right, thanks so much. That's about our time as well.