Hello, everyone. My name is Christopher Donohue, and I am a historian at the National Human Genome Research Institute, one of the institutes and centers at the National Institutes of Health. In 2012, along with Kris Wetterstrand and Eric Green, the NHGRI director, I co-founded the NHGRI History of Genomics Program. The history program was and remains the only scholarly and public communications effort focused on the history of all aspects of genomics and the Human Genome Project. The history program, under the Office of Communications, has also over the last year made a particular effort to use the history of genetics, genomics, and molecular biology to confront the history and present-day realities of eugenics, scientific racism, ableism, and heteronormativity. Such efforts to address complex and difficult histories, both within archives and beyond them, within genomics and beyond that revolutionary inquiry, need increasingly sophisticated and responsive tools and methods. Today you will hear from Luis Amaral and Spencer Hong, especially on how machine learning tools can organize archival materials in an extremely efficient way, reducing the need for manual, physical review of materials, and, as importantly, how machine learning tools can help us see extraordinary connections and develop extraordinary stories. These archival materials, I want to underscore, are essential. The total archive as it stands now, with hundreds of thousands of PDFs, Excel files, PowerPoints, and Word documents of all kinds, represents the most complete account of the history of the Human Genome Project and ongoing genomics efforts in the United States, and as such it is an invaluable resource. Nevertheless, there is only so much material that even 100 individuals can comprehend by simply reading. This is where Spencer's work, and the work of the Amaral lab, based on a much smaller subset of the archive, is so brilliant at unlocking the hidden patterns and stories in these files: it makes connections visible, connections that would be difficult or impossible to detect, or would take years of work to understand, in just a few short weeks of work or less. The order, or orders, of magnitude of improvement in efficiency and in clarity in representing the history and the science itself is really a testament to Spencer's work over the last year. I am unbelievably excited for new insights and adventures in this data, and I cannot wait to see what Spencer brings to us in this lecture, and certainly in quite a few more lectures. In closing, I would like to thank Spencer, Luis, and the Amaral lab, who all patiently addressed all of my questions and who have made me feel welcome to share my expertise with theirs. I would also like to thank Thomas Stoeger of the Amaral lab, who served as a key liaison, and with whom I look forward to collaborating on even more projects. I would also like to acknowledge our moderator, Sarah Bates, who is not only brilliantly leading the office in these very difficult times, but who is also a constant source of excellent advice and leadership as the history program reaches its 10th anniversary and moves to fundamentally reshape how the history of the Human Genome Project, and communications around that deep, rich history, are understood. Lastly, I would like to thank our Office of Communications and our IT and AV staff, without whom this event would not be possible.
And now, for a really fascinating general overview of the Amaral lab's many projects and some introductory context for Spencer's brilliant talk: over to you, Luis.

Thank you. So, hopefully everyone can see the slides. Chris has asked me to give you a little bit of an overview of how my lab came to address this issue. My lab is a little bit strange in the sense that we have a very broad set of interests and directions, and this particular direction was prompted by the question of how we can get better at producing knowledge, at doing science. Like many things in my lab, it has been a winding path; there are papers that address different aspects, and I want to give you a flavor of all those aspects and of how, even though judging by the titles they would seem to be very different things, there is actually a thread connecting all of it. If I am asking how we can produce better science, two hypotheses have been posed. One is that collaboration is a way to produce better science: by bringing together different insights, different backgrounds, and different techniques, you can tackle more complex problems, or old problems in original ways. Another way we can improve science is by improving the process of mentoring, how we train the next generation of scientists: what tools, insights, and approaches are we giving them that will enable them to tackle new problems? So, the first study we did tried to figure out the role of collaboration. We looked at data across three disciplines, whose names you can see at the top of the graphs: social psychology, economics, and ecology. We asked whether teams that publish in journals of different impact have different characteristics, and essentially across all three disciplines we found trends, which are illustrated by the arrows. We found that teams producing higher-impact work had more experienced team members, which is not surprising. What was surprising is that those teams also had a decreasing probability of repeating collaborations. So there was more new blood brought into the more productive teams, and this also created a greater degree of connection within the community of scientists. This was very exciting to us. But one question remained: do we even have the right measure of impact? Impact in science is typically measured through citations, one scientist's paper citing another scientist's paper. We tried to find out whether this is a good measure, because it is very hard to read millions of papers and decide which ones are better. So we went and studied movies, and the ratings of movies. In that context, filmmakers and directors make references to prior movies in their own movies, so there is a sort of citation network there. And we could show that this citation network contained very good information that matched what experts were able to do, increasing our confidence that citations, even though not perfect, are a good measure for quantifying impact. So that is related to collaboration, but what about mentorship? Keeping track of who trained whom is hard. Fortunately, there is something called the Mathematics Genealogy Project, which keeps track of who was trained by a given mathematician, so we could look at things like how many other mathematicians a particular one trained.
We have large databases from these, and because in this particular case we did not have access to publications or citations for all the papers, what we looked at was the number of protégés, the number of people that someone trained. We saw that there is a correlation between these two quantities when comparing high-performing mathematicians who were elected to the National Academy against the rest of the population. And here we found something really interesting, and it is a little bit complex, so I am going to break it into pieces. We looked at each mathematician and broke their career into three thirds: early stage, middle stage, and final stage. We also grouped mathematicians by how many people they trained: some trained a small number, some a medium number, and some a large number, and those last would be the very successful ones. Then we asked: what was the outcome for trainees, based on when they were trained and by whom? Very interestingly, people who trained a small number of others always had a positive impact on their trainees. The average people had an average impact. The really successful people were the really interesting case: in the first third of their career they had a very positive effect on their trainees, but in the final third they did not. This is counterintuitive, and it made us wonder what is going on, but I will not have time to discuss that because I should get to other things. So we have talked about collaboration and about mentorship, but we are not truly able to see the collaboration taking place, so how could we find out something about collaboration? That is where soccer comes in. I bet you were not expecting that. When you watch a soccer game, you can actually see the collaboration and interaction between players; it is being recorded. From it, you can create graphs of the interactions between the players, and those interactions tell you whether people are doing a good job, whether there are good outcomes or not. We took this idea to papers, to the realization of papers, in which authors (the circles here represent authors) are interacting, signified by the arrows that connect them, not only to one another but also to outcomes like the completion of a task or the arrangement of a meeting. We can build these graphs of scientific collaborations, see that they can have different structures, and get an idea of the role of people within them. So, interaction, communication: what can we learn about communication between scientists? Understanding how people communicate is something that fascinates historians of science, and people in general; there is an entire project to keep track of the letters, the correspondence, of Charles Darwin, which has been digitized. You can look at things over time, like how many letters were sent by Charles Darwin, and you can see that as he became more famous, he wrote more and more letters. But you can ask: are there patterns to this correspondence? We gathered correspondence information from many different people, from many different areas, over different periods of time. I do not have time to go into the details, but I want you to look at the bottom-right diagram.
You see a bunch of separate curves that all fall on top of one another. This is an indication that there are universal mechanisms at work: people communicate in the same way, with the same frequency, and even though we send many more emails than we used to write letters, the patterns are the same; just the time scale has changed. There is all sorts of knowledge that can be extracted from these things, and that is really what has excited us about trying to automate this, to enable more people to learn about these processes. And so we have been thinking about a "born physical, studied digitally" infrastructure and framework, in which people would provide digital documents to a system that would then process them for analysis and share them to advance our understanding. The work that Chris has talked about, and that Spencer is going to describe in more detail, falls within this effort to create an infrastructure that will enable more people to learn about all of these things. So I will pass it on to Spencer now, who is going to give you more details.

Thanks, Luis. Thank you to Chris, Sarah, and the rest of the History of Genomics team at the NHGRI for inviting us to speak today. I am really excited to share with you what the Amaral lab has been able to do with the breathtaking collection that makes up this part of the national archive. In the next 10 minutes, you will hear about what makes up this core collection and how we have overcome some of its challenges computationally; the subsequent 15 minutes will be dedicated to the theme of communication. We have chosen this topic as an example for today's presentation because, even though it is such an important aspect of organizations, it would be very difficult to study at scale without the computational approaches you will hear about today. You will see, through the example of communication, several aspects of our work that could also be helpful for pursuing other questions, questions that might particularly trigger your interest. So, without further ado: this is the history of genomics told through machine learning. The first part of the talk today is about unlocking new data. It represents our efforts to extract and uncover information that, prior to these computational models, would have been very time-consuming to draw out. We will outline some of the computational efforts, powered by machine learning, that save a lot of time compared to a traditional close-reading approach. The next section of the talk will be dedicated to pursuing new directions. Our shovels have dug up a powerful fuel by extracting new data; with this fuel, where will it now power us to go? We will showcase some of the many questions that are now within our reach. The entire archive sits at around 2 million pages of material, making it the largest digital archive in the contemporary biomedical sciences focused on genomics. It contains material spanning scanned PDFs to PowerPoint files from many key leaders at the NHGRI and on the NIH campus. The materials of the archive narrate the foundational role of NIH in the US effort to sequence the human genome, from the late 1980s to the early 2000s, and the subsequent programs that arose from it. Alongside this enormous archive sits a highly curated core collection of about 3,000 files, curated manually by the History of Genomics team here. It has searchable metadata and encompasses some of the most essential and relevant documents in the history of genomics.
External researchers can apply for access to this core collection by emailing the History of Genomics team directly. We start with this core collection because it can validate a lot of the models that we build; the curated documents ensure that we are not just developing blindly. The core collection is what powers all the models and analyses that you will hear about today. So what is inside this core collection? There is correspondence among leadership, researchers, and external parties. On the left is an email that arrived at the desk of Francis Collins, the former director of NHGRI from 1993 to 2009 and the former director of NIH from 2009 to 2021. On top of the email are handwritten notes by Dr. Collins himself, which we will come back to later in the presentation. They represent very unique metadata and therefore pose a unique computational challenge. On the right is a letter from the University of Connecticut writing in support of sequencing the genome of a model marsupial species. We will also come back to the theme of communicating in support of sequencing projects. For now, these examples represent some of the biggest categories of documents contained inside this core collection. Not only does the archive contain correspondence, it also contains products or deliverables from key projects and individuals. On the left is a proposal prepared for the SNP Consortium, which later evolved into the International HapMap Project. On the right is one of a series of documents that attempted to estimate how much more effort, time, and resources would have been needed to develop the first iteration of the HapMap. All of these represent the working products that scientists at the NIH and the NHGRI have created. The core collection has so much more, from forms that scientists submitted for projects, to newspaper articles that describe these efforts from an external point of view, to PowerPoint slides just like the presentation you are seeing today. So how many documents exist in this core collection? The first thing a computational researcher should do is understand the data. If we understand the core collection, it might help us anticipate challenges we might encounter in the future, especially when dealing with the broader, larger archive. One way to take stock of the core collection is to count the number of files. We have about 3,000 files, with more than half coming from Microsoft Office files, which can be Word files, PowerPoint files, and Excel files. It is really exciting that we have access to these electronic files themselves, and not just PDF prints or exported versions of them, because these file types carry unique metadata, such as revision history: actionable metadata of who wrote what when. This is all very interesting to us because it exposes questions such as who gets the final edits on a collaborative document. We noticed that not every file is unique: there are duplicates in the core collection. Our computational tools can detect these duplicates, which saves a lot of time, because all the subsequent steps apply to each file, and it would be a waste of computational and human resources to apply our models to a file that is a duplicate of another. After merging duplicate files, we find the core collection to be about 3,084 files.
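For readers who want to try something similar on their own collections, a minimal sketch of this counting and duplicate-detection step might look like the following. This is not the lab's actual code; it assumes the collection sits in a local folder, and all paths are hypothetical. Content hashing catches byte-identical copies only; near-duplicates such as re-scans of the same page would need fuzzier matching.

```python
# A minimal sketch: count files by extension and flag byte-identical
# duplicates via content hashing. All paths here are hypothetical.
import hashlib
from collections import Counter
from pathlib import Path

def file_digest(path: Path) -> str:
    """Hash file contents so byte-identical duplicates share a digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

collection = Path("core_collection")  # hypothetical root folder
files = [p for p in collection.rglob("*") if p.is_file()]

# File-type census, e.g. Counter({'.pdf': ..., '.docx': ..., '.xlsx': ...})
print(Counter(p.suffix.lower() for p in files))

seen: dict[str, Path] = {}
duplicates = []
for p in files:
    digest = file_digest(p)
    if digest in seen:
        duplicates.append((p, seen[digest]))  # (copy, original)
    else:
        seen[digest] = p

print(f"{len(files)} files, {len(duplicates)} exact duplicates")
```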
But you might be asking yourself right about now: this core collection is supposedly a curation of some of the most important documents relevant to the Human Genome Project and the programs that followed, so do 3,000 files seem like enough? You would be right to wonder. We found that those roughly 3,000 files actually yield more than 18,000 documents. How is that? During the scanning process, multiple physical documents can get scanned in the same stream, which means that one PDF file can contain multiple documents. Using our custom software to label and split files, we can bring the total count to more than 18,000 documents, with about 17,000 of them arising from PDF files alone. We now know that the core collection has this many documents, but what kinds of documents are they? Let's start giving some context to these documents. We focus first on correspondence and the communication it contains, because communication is essential to exploring why certain decisions were made, what was upstream of those conversations, and who was talking about what in a large organizational context. What you are about to see on the slide is a confusion matrix, which represents the accuracy of our deep learning model that categorizes documents into different classes. It is a four-by-four matrix, with each square representing the accuracy of our model in predicting the correct category of a document. The four main classes we outline here are emails, letters, reports, and others. What separates an email from a letter is that letters are standalone documents on a letterhead, addressed specifically to one person or organization, while emails are often a chain or thread of conversations involving multiple people and sent through a dedicated email program. Here we show how well our model predicts emails: 99% of the emails that we manually labeled to verify our accuracy were labeled correctly. The way to read this matrix is to follow the diagonal. In a perfect scenario, we would see 100% along the diagonal of the matrix, because every document in each category would have been predicted correctly. We use a Document Image Transformer, a powerful pre-trained model, to predict the category of each document. Our deep learning model also categorizes most letters correctly. We have also revealed two other squares in beige, which represent how likely our model is to misclassify a letter as an email and vice versa. In this case, only 1% of the time are emails predicted as letters, and only 2% of the time are letters predicted as emails. Despite the fact that both are types of correspondence, our model is powerful enough to minimize the confusion between the two. Our model discriminates the correspondence categories well from the rest of the collection. This is not necessarily the case for reports and others. We still have some predictive power in classifying reports, but we already know that we can improve the model by creating more fine-grained categories for future training. For example, we could ask ourselves: should we really be binning proposals in the same category as reports? What about reports that have a cover page? Taking these into account will only improve our model further down the line.
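As a rough illustration of this kind of document classifier (not the lab's own fine-tuned model), one could start from a publicly available Document Image Transformer checkpoint fine-tuned on the RVL-CDIP benchmark, whose sixteen classes include email, letter, and scientific report, and then score it with a confusion matrix over manually labeled pages. The image paths and labels below are hypothetical placeholders.

```python
# A rough illustration, not the lab's model: score an off-the-shelf
# Document Image Transformer (fine-tuned on the RVL-CDIP benchmark)
# and build a confusion matrix from held-out, manually labeled pages.
import torch
from PIL import Image
from sklearn.metrics import confusion_matrix
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt = "microsoft/dit-base-finetuned-rvlcdip"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForImageClassification.from_pretrained(ckpt)

def predict(image_path: str) -> str:
    """Predicted class name for one scanned page image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(-1))]

# `pages` is a hypothetical list of (image_path, true_label) pairs.
pages = [("scan_0001.png", "email"), ("scan_0002.png", "letter")]
y_true = [label for _, label in pages]
y_pred = [predict(path) for path, _ in pages]
# Class-name strings must match the checkpoint's id2label mapping.
labels = ["email", "letter", "scientific report", "form"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = true class
```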
As we mentioned before, the archive has a lot of scanned documents that contain handwriting. Normally, handwriting would be ignored, and it even adds uncertainty for printed-text recognition engines. But we think that handwriting is novel data that deserves to be extracted in its own right. If we can lift the handwriting off from the printed text, we can isolate it to be studied further. Handwritten text can also pose a privacy concern in archival documents; for example, handwritten cell phone numbers can appear in them. With the help of Professor Cleber Zanchettin, who was with the Amaral lab during his sabbatical, we developed a novel handwriting extraction model to mitigate these concerns. On the left is the email to Dr. Collins that you saw earlier; as I mentioned, there is some handwriting on it. Using segmentation models, we trained a deep learning model that can isolate handwritten text. The thing is that it is very, very difficult to hand-label images with handwriting, marking which regions contain handwriting and which do not. So we created a custom synthetic data set that artificially places handwritten text on top of clean printed text, and then trained the deep learning network to recognize which regions are handwriting and which are print. The email on the left has now changed, this time without the handwritten text; the lifted handwritten text is isolated on your right. By isolating the printed and handwritten texts separately, we can process them in different pipelines. The isolated printed text can be given to a text recognition engine, which performs better without the additional noise, and the isolated handwriting can be sent to another deep learning model that recognizes handwriting on its own. The original image, with both kinds of text, is preserved for full access and record keeping. Here are more examples of our handwriting isolation results, with varying degrees of performance. On top are the isolated printed and handwritten texts from a draft agenda for an International HapMap Project meeting in 2002. On the bottom is a handwritten memo with no printed text. In both examples, we see that the model isolates handwritten text but fails to extract doodle-type marks, especially crossed-out words, arrows, and circles. We anticipate that our models can improve by artificially adding arrows and circles to the synthetic training set that I mentioned earlier, and by including more documents from outside the core collection. In this first half of the presentation, I hope we excited you with some of the tools that the Amaral lab has built to uncover, unlock, and isolate unique data and metadata present in the core collection. These tools can maximize archival efficiency, save time and labor at scale, and add more context to the history of genomics that is present in the archive.
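A toy version of the synthetic-data idea might look like the sketch below: composite a handwriting crop onto a clean printed page at a random position, and keep a pixel mask of where the ink landed as the segmentation ground truth. Directory and file names are hypothetical, and a real pipeline would add rotations, scaling, and the arrows and circles mentioned above.

```python
# A toy synthetic-data generator: paste a handwriting crop onto a clean
# printed page and record a ground-truth mask of where the ink landed.
# Directory and file names are hypothetical.
import random
from pathlib import Path
from PIL import Image

def composite(page_path: Path, hw_path: Path) -> tuple[Image.Image, Image.Image]:
    page = Image.open(page_path).convert("RGB")
    hw = Image.open(hw_path).convert("L")
    # Treat dark pixels as ink and use them as an alpha mask.
    alpha = hw.point(lambda v: 255 if v < 128 else 0)
    x = random.randint(0, max(0, page.width - hw.width))
    y = random.randint(0, max(0, page.height - hw.height))
    page.paste((20, 20, 60), (x, y), mask=alpha)   # ink-colored overlay
    mask = Image.new("L", page.size, 0)            # segmentation target
    mask.paste(alpha, (x, y))
    return page, mask

img, mask = composite(Path("printed/page.png"), Path("handwriting/note.png"))
img.save("train/img_0000.png")
mask.save("train/mask_0000.png")
```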
Now, with our models powering us, where can we go from here? We have given context to these documents, identified their categories, and lifted handwritten text. What kinds of questions can we ask with this new data? We focused on correspondence earlier because of the rich metadata present in these documents. One application is to recreate the communication network that existed at the NHGRI. We first focused on emails sent from 1996 to 2008, the timeframe of the data given to us as part of the core collection. This is a network consisting of small circles, or nodes as we call them, and arrows, or edges, that connect two nodes together. Each node is a person who either received or sent an email. Arrows point from the person who sent the email to the person who received it. The circle size and the arrow color correspond to the number of emails connecting the two individuals. Using our detection models, we can query emails by who they were sent to, when they were sent, and who they were sent from, allowing us to create this network. Networks are powerful computational tools that allow us to visualize groups of nodes, as well as patterns that might arise. If you remember the two main themes of this presentation, a network like this is a window into pursuing new directions of research. One such direction would be to ask who the main individuals are behind the bulk of the emails in the network. We might also be interested in how different organizations are represented here: if we investigate the domains of email addresses, such as .gov, .org, or .edu, can we get a sense of how government agencies, organizations, and universities interacted with one another, especially during the course and timeframe of the Human Genome Project? We can also resolve the network temporally. Because we can detect the dates on email threads, we can grow the network by the year in which each email was sent. Here we show an animation of a communication network growing year by year. We can investigate the contents of the emails and explore how different topics arise, whether they correspond to programs that came into existence, and from which groups or clusters those initial, nascent conversations emerged. Another example of a communication network that we are able to build is the network of requesters of sequencing projects. Here we show a bipartite network. A bipartite network is one that contains two groups, or partitions, and each node is assigned to one of those partitions. In this case, one partition is the requesters, the individuals associated with having proposed a certain sequencing project, and the other partition is the sequencing projects that those individuals requested. Again, this bipartite network is only possible because our computational models can recognize printed text and identify entities, or names, in the printed text of these documents. We can further contextualize the network by coloring in the working groups that the sequencing projects emerged from. Working groups are committees or panels that often brainstorm, recommend, and propose new candidates for sequencing. Here we highlight three working groups: the Annotating the Human Genome working group, the Comparative Genome Evolution working group, and the Genome Resources and Sequencing Priorities panel. Just from this network visualization, it is evident that some sequencing projects have common requesters. Some sequencing projects even seem to be highly intertwined with a fixed set of requesters, while some requesters are only ever associated with one sequencing project. One such group is the one in the middle, all associated with the Annotating the Human Genome working group. While requesters from other working groups are mainly associated with only one sequencing project, these requesters in the middle have somehow requested most of their projects together. Why would this be the case? Do their academic backgrounds match? Do these individuals represent specific sequencing centers or organizations that were necessary to take on these mammalian sequencing projects?
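To make the two network constructions concrete, here is a minimal sketch using the networkx library, assuming the emails and sequencing requests have already been parsed into simple records; every name and address below is a hypothetical placeholder. It also computes betweenness centrality, one standard way to surface the kind of bridge nodes discussed next.

```python
# A minimal sketch of both networks described above; all names and
# addresses are hypothetical placeholders, not real archive data.
import networkx as nx

# Directed email network: sender -> recipient, weighted by email count.
emails = [
    ("alice@nih.gov", "bob@genome.gov", 1999),
    ("alice@nih.gov", "bob@genome.gov", 2000),
    ("carol@university.edu", "alice@nih.gov", 2000),
]
G = nx.DiGraph()
for sender, recipient, year in emails:
    if G.has_edge(sender, recipient):
        G[sender][recipient]["n_emails"] += 1
    else:
        G.add_edge(sender, recipient, n_emails=1, first_year=year)

# Who sends the bulk of the mail? Weighted out-degree gives a first answer.
top_senders = sorted(G.out_degree(weight="n_emails"), key=lambda kv: -kv[1])

# Bipartite requester-project network.
B = nx.Graph()
requests = [("Requester A", "dog genome"), ("Requester A", "cow genome"),
            ("Requester B", "cow genome"), ("Requester B", "bee genome")]
for person, project in requests:
    B.add_node(person, bipartite="requester")
    B.add_node(project, bipartite="project")
    B.add_edge(person, project)

# Nodes sitting on many shortest paths are candidate "bridge" nodes.
bridges = sorted(nx.betweenness_centrality(B).items(), key=lambda kv: -kv[1])
print(top_senders[:3], bridges[:3])
```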
Another very interesting window is the existence of bridge nodes: individuals who seem to partner across, and bridge, different sequencing projects, even across different working groups. Here we have magnified one such node, connecting a sequencing project from the Annotating the Human Genome working group and one from the Comparative Genome Evolution working group. These bridge nodes appear elsewhere in the network, connecting different areas of research with one another. Who are these nodes? Why are they the ones to link different projects? Do these chaperones make a difference in the end outcome of whether such sequencing requests get recommended for funding? All of these are fascinating questions that are now possible and within reach. Through the lens of communication networks, you have seen how individuals and projects can be connected to one another and grouped in potentially meaningful ways. What you are about to see in the next few slides is that this is possible not only for individuals: documents and files themselves can be grouped together as well, and we can compare how they fare against the manually curated structures in the core collection. As I mentioned, the core collection is a carefully curated library of files and folders, each labeled with valuable metadata from the History of Genomics team. The roughly 3,000 files were organized into specific folders based on their merit and type of content. The folders then reveal something common to the files contained in them: for example, we could expect all the files inside a folder named "mammalian sequencing" to be related to the sequencing of mammalian species. We then asked ourselves: could we recreate this structure computationally? That is, could a computational model, powered by all the printed text in the collection that we have recognized through our pipeline, group all the documents in the same way that the historians at the NHGRI have? What you are going to see is something called a cluster map. In the center is a matrix rendered as a heat map, with colors denoting the similarity between two documents. The diagonal represents the similarity between a document and itself; you see a clear white diagonal line because you would expect the similarity of two identical documents to be one. The map is symmetric across the diagonal because the similarity of document A to document B is the same as the similarity of document B to document A. You can also see some block-like patterns in the heat map, which correspond to groups of documents that are more similar to one another than to the rest of the collection. We now introduce tree-like structures on the left and on top. These trees, also called dendrograms, represent the hierarchical relationships of how similar the documents are to one another. The way to read them is from top to bottom, branch by branch, starting from the one common branch. Here, at the first branch of the left tree, the first split occurs, and the chunk in black that we have highlighted is completely separate from the rest of the tree. We can interpret this placement as indicating that the documents that fall under that branch are substantially different from the remaining documents in the collection. We have clustered the documents in the core collection by how similar they are to one another, that is, by the text and the content contained in them. So, in some sense, we have put them in quote-unquote folders.
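A small sketch of how such a cluster map might be assembled, under the assumption that each document has already been reduced to plain text: TF-IDF vectors give the pairwise similarities, Ward linkage gives the dendrograms, and seaborn's clustermap draws the heat map with the trees attached on the left and on top. The three documents below are placeholders.

```python
# A sketch of the cluster-map idea; the documents are placeholders.
from scipy.cluster.hierarchy import linkage
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns

docs = {
    "elsi_report.pdf": "ethical legal and social implications program ...",
    "hapmap_proposal.pdf": "international haplotype map sequencing ...",
    "gwas_summary.pdf": "genome-wide association study results ...",
}
X = TfidfVectorizer(stop_words="english").fit_transform(list(docs.values()))

# TF-IDF rows are L2-normalized, so X @ X.T is the cosine-similarity matrix.
sim = (X @ X.T).toarray()

# Ward linkage builds the dendrograms; clustermap draws heat map + trees.
Z = linkage(X.toarray(), method="ward")
sns.clustermap(sim, row_linkage=Z, col_linkage=Z)
```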
This is an unbiased way of grouping documents, but how do we know that these groupings make sense, and how do they compare to the original curated structures and folders labeled by the History of Genomics team? We highlight the four main folders that exist in the core collection: ELSI, the Ethical, Legal and Social Implications research program; HapMap, the International HapMap Project; GWAS, genome-wide association studies; and sequencing, which includes all the sequencing target files. In an ideal situation, we would see each color grouped together, with no overlaps, since that would indicate that our clusters had grouped all the documents from one folder together. Here we see that a good number of documents do cluster together. Most of the sequencing target files in purple, for example, fall under the same branch at the top, and most of the HapMap documents in green are also grouped together. Even the areas where they do not perfectly cluster together are points of major interest. Why is it that some of the ELSI files are associated with a long stretch of GWAS clusters? What about those files, and the content in them, makes them more similar to the documents in the GWAS folder than to those in the ELSI folder? Could there be better labels for those files; that is, should there be a different folder to hold them? We are just talking about folder structures here, but I want you to see how exciting the potential applications are. For example, we can take this technology and integrate it into a recommendation and search engine for the archive. Imagine an external researcher coming to the History of Genomics team and asking for papers or documents related to the ELSI program. Instead of having to manually read and find documents one by one, we can recommend the top few most similar documents regarding ELSI using a similarity value (sketched below). This also has potential for the rest of the two-million-page archive, as it would be improbable for the historians to read through every document in the archive to place each one correctly in curated folders. Instead, they can use this approach to first group like documents together in broad strokes and then fine-tune the curation, saving time and labor. This is one of many examples of how the tools we have built are complementary to the domain expertise of the History of Genomics team: they should not replace one another, but work together to validate the areas where they agree and explore the areas where they do not.
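To make the recommendation idea concrete, here is a sketch along the same TF-IDF similarity lines, with placeholder documents rather than real archive text. In a production search engine one would likely swap TF-IDF for stronger text embeddings, but the top-k-by-similarity pattern stays the same.

```python
# A sketch of similarity-based recommendation; `docs` and the query
# are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["ELSI program annual report ...", "HapMap pilot proposal ...",
        "GWAS replication study summary ...", "ELSI grant abstracts ..."]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

def recommend(query: str, k: int = 3) -> list[int]:
    """Indices of the k documents most similar to a free-text query."""
    q = vectorizer.transform([query])
    scores = cosine_similarity(q, X).ravel()
    return scores.argsort()[::-1][:k].tolist()

print(recommend("documents related to the ELSI program"))
```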
The audiences that stand to benefit from the computational and machine learning approaches applied to these documents are broad and diverse. First are scholars of science, the very individuals who wish to study the early history of genomics, the onset of the Human Genome Project, and other perspectives on the biomedical sciences in the United States. The tools we have built not only reveal new data for them to study, but can help them explore and pursue new directions of research, some of which we have outlined today. Second are computational researchers and machine learning experts. The archive is diverse, rich, and carefully curated into specific collections. The metadata and the curation serve as powerful data sets for new machine learning and language modeling tasks that do not yet exist in the field. Lastly, the NIH itself. The archive is a snapshot of how scientists in large institutes work with one another, advance ongoing projects, and set goals for the future. Through the archive, unlocked and contextualized with computation, we can answer questions about how policy decisions come to be, how key decisions have impacted downstream outcomes, and the interactions between the institutes and the external scientific community at large. NIH policymakers can learn from these decisions to create a more efficient, effective, and communicative environment for all the stakeholders of NIH, including you, the public. Today, we have shown how computational tools can unlock new layers of knowledge encoded in the archive and apply this at scale to save a great deal of labor and time. All of this was possible with a very small but mighty and important part of the archive, the core collection. I hope you are excited about the new directions that this extracted data has enabled us to pursue. These computational tools, when wielded alongside the domain expertise of geneticists and historians, can significantly improve our understanding of the history of genomics and let us ask questions that we have not been able to ask and answer before. And with that, I want to thank Chris, who had the foresight to digitize the core collection long before our collaboration and is now a primary advisor on my own PhD committee; Zach, who oversees the digitized collection; and the rest of the History of Genomics team. I also want to especially acknowledge and thank Chris and Sarah for giving feedback, being enthusiastic, and mentoring this work. What you saw today is a major part of my own PhD work, and the mentorship and guidance from the NHGRI have been invaluable to my professional development. And of course, this research would not be possible without the Amaral lab. I want to highlight, as Chris did, Thomas, who with Chris started this entire collaboration and has been mentoring this work ever since; Cleber, whom I mentioned during the presentation; Zoria, my mentee, who is working on some of these algorithms and models and deserves a special welcome and shout-out; and Jeremy, who, as a professor of journalism, brings a different outlook and perspective to this project. I want to thank my funding partners, and I want to thank you all so much for listening. And now we are happy to open up for questions.

Thank you so much, Spencer. I think I can say on behalf of everyone watching that this was an incredibly exciting presentation; both of your presentations were completely wonderful and so exciting. We have gotten some really great questions in the Q&A, so if you have additional questions, please drop them in there in lieu of raising your hands. One of the first questions I had, to kick things off: was there anything unique about this genomics data set that really allowed you to use these tools, and how carefully curated was the data set before you developed your algorithms? I think that is probably a question people will have if they have similar documents they are interested in applying computational tools to.

Yeah, it is a great question. The core collection that the history team has put together is a really fascinating data set, not only because of the content it contains (we are talking about the early history of how the genomics field came to be, so we are reading, digitizing, and analyzing some of the conversations that now have monumental impact on the field of genomics), but also from the perspective of a computational researcher.
The challenges that we have outlined here are also fascinating because they are not present in other types of data sets. For example, as I mentioned, the handwriting is a very unique type of challenge, because the data sets we have in the computational field, the machine learning field, and the NLP (natural language processing) field are usually printed text, and they often come as clean text that no one ever has to doubt. In this case, we have to lift the printed text, isolate it from the other layers of data, and then work with that. That is a challenge, but it is what makes this fascinating, and, as I briefly mentioned during the presentation, it opens up a whole new realm of machine learning tasks. For example, you could imagine the archive one day being the core and center of a new machine learning training set that people use to train their own handwriting recognition and extraction models. So that is a very fascinating part that I encountered while working with this data set. Another aspect of the archive as a whole that I think is really fascinating is that the core collection has metadata, such as the types of documents and their categories. The NHGRI has been involved in different projects, such as ELSI, as I mentioned, the 1000 Genomes Project, and different species sequencing projects, and the core collection is carefully curated and grouped around those. So we can ask tailored questions about the documents inside those folders. For example, we have a folder that contains all the sequencing target files, so we know we can ask questions about sequencing: what influences or motivates sequencing? What is contained in the text, in the letters and the proposals, that makes them more or less viable for potential funding later on? That type of tailored questioning would not be possible without the curation, and that is what makes this core collection and the archive really special.

Yeah, I think there are a couple of questions in the Q&A that tie to some of this. Hongshu asked: can these computational tools be applied to ongoing large-scale research projects?

That is exactly what is driving our intention of building an infrastructure that can be used, in general, to extract more value from archives, all sorts of archives. We have been talking here about biology and scientists and correspondence, but another example in which being able to extract data from handwritten documents is critical is climate change. People are trying to figure out what the height of the tides was, or what the temperature was, and they look at the records of trees and such, which is rather indirect. But there are places, at ports for instance, where people were taking notes every day of the height of the tides and things like that. In England, some of those records go back hundreds and hundreds of years, and right now we cannot use them because they were handwritten. So if we have these tools that can go and extract this information, we can learn so much more. And think about how this can be tied to other tools that are being developed: if you learn to recognize, say, all of
Leibniz's handwriting, and you scan all of this material, and you also have something that translates from Latin or German to English, you are giving many more people access to study the correspondence and the thinking of people in the past. And there is still so much out there: scientists have handwritten notes that we would like to add to the collections of data that researchers study to learn how scientific ideas were developed, and all of that. There are so many fascinating questions. For instance, Sean Alan asked: do you think you could use these approaches to identify points where, or people for whom, increased communication could have been helpful? Yeah, that is the dream, right? If you study these things in a systematic way, you are actually learning how you could improve them. So there are many, many opportunities. You were answering one of the questions; do you want to go ahead?

Oh, yeah, sure. An attendee asked how applicable our deep learning models would be to distinguishing writing in different written languages: is the handwriting we considered only in English, or other languages as well? For now we have considered only English, because the majority of the text is in English. But again, speaking to how unique and fascinating the archive is, we do see instances where other languages are represented. For example, during the human genome sequencing effort and the Human Genome Project, there were obviously collaborations with other countries, and sources from other countries, which are written in different languages. So you could imagine, in the future, us being able to train these models on those target languages and to detect and extract them just as we have with English.

One question I had while I was watching your presentations: what are you finding about the nature of communications when you compare communications from, say, the 1800s with modern communications? You seemed to imply that they were very much the same, or very similar; are you seeing modern communications evolve in an interesting way?

So, it is hard to record patterns of written communication in the past, because most correspondence of most people is not preserved. What tends to be preserved is from people who were famous; they had a sense that they were going to be part of history, and things got preserved, so you have philosophers, scientists, artists, all of those people. For the longest time, you had people writing handwritten letters that would be sent by various methods; if you wanted to write to someone in Europe and you were in North America, you had to get the letter onto a ship. So there were delays in correspondence; you had to wait several weeks to get an answer, and so on. You could imagine that this would lead to very different patterns from, for instance, email, which reaches you instantaneously. And what we found out is that people have these patterns in which you can essentially describe the process as: now I am going to do correspondence. They sit, and for a while they do correspondence, and the correspondence could be email or it could be letters, but during that period they may do several of them. And then they stop.
And the frequency with which they enter these active periods is something characteristic of the person. Some people do it a few times over long periods; others do it frequently, for short periods of time. So the idea is that this mental strategy for handling correspondence was the same with handwritten letters and with email; we could plot all of those curves on top of one another. Now people are using other methods, like Slack and instant messaging, and people may think, oh, that is so different. But we all have strategies: there are some people who do it all the time in little bursts, and there are people who are able to compartmentalize things very strictly. So I do not think that the new medium is changing the strategies; it changes the time scales. Things are happening faster. In the past, writing a letter, you knew it was going to take two weeks to arrive and you would receive an answer in one month. It was a long, thoughtful process; writing each letter took time, so you did fewer of them. Now, with email, if something was not clear, you write another one. But that strategy for handling the process is the same, so I do not expect it to change. And I know my answer is running long, and I should let other people ask other questions.

We do have about five minutes left, just for people to be mindful about submitting questions while you can, and I know Spencer and Luis will type away to answer questions as well. One question I wanted to make sure we address before the time runs out: we have a lot of NIH researchers in the audience. What would you tell NIH researchers who are interested in using tools like this on archives, or on data sets that have a lot of correspondence? First, how should they maintain those sorts of data sets in order to use tools like this? And second, how possible is it to use these tools to be predictive, in terms of keeping on top of emerging science and emerging trends? NIH is always trying to anticipate where science is going and to fund the forefront of different areas, like genomics. So what tips would you have for researchers who are interested in doing that?

Go ahead, Luis.

I just have a couple of things to add. One of the first important problems is that people think that digital is better than paper, but actually that is a dangerous assumption, because paper lasts a long, long time if it is preserved. Digital formats can change. So there is a question: if you have old files that somehow do not get transferred to new formats, they will be lost, and it can become very hard to recover them. Another thing is storage media. How many people still have those five-and-a-quarter-inch floppy disks, and how many can still read them? So this maintenance, this transfer across formats, is important. That is why printing things on paper can actually have a kind of preserving effect; there is something to be said for what your team did in printing material, because if you need it, you can scan it all over again. There is a stable format there. So that is one of the things, and Spencer, maybe you want to take the other part of the question.

I fully agree with that.
On the question of what tips we can offer for using this to predict emerging areas of science, or what topics the NIH was more concerned with at a given timeframe, and whether we can explore that using this historical data: I would say we have to be cautious not to give too much power to what these tools produce. These tools enable, and are currently saving, a lot of time for historians and archivists in exploring and curating collections. But, as I mentioned in the presentation, this is all complementary to the domain expertise that the archivists and historians bring. If our computational models, or computational models in general, produce a result that agrees or disagrees with what an archivist or historian would think, that does not invalidate their findings, nor does it mean that the computational models are always correct. It should simply focus us on the fact that those are areas we should study even more. Why is it that they are not in agreement? There could be something interesting and special about the ways in which they disagree. That, among the many others I have described, is an avenue of further research that these computational models can highlight, including the question of emerging science and emerging topics.

Those are wonderful answers. Luis, I think we may get some people from our facilities office who are concerned about printing everything, but that is really good to know. And Spencer, I am glad that our historians are not going to be out of a job. So, we have time for about one more question, and I think we had a really good question that you may have just typed the answer to; do you want to answer that?

Yeah, I think it is a really, really important question, which is: a lot of what has been done so far analyzes the communication and records of wealthy, famous, powerful people, and this, of course, provides a very biased view of society. The question is what we can do to listen to, and become aware of, the voices of those who were not wealthy and powerful. I think one of the issues is that, because it is so expensive to do these things, it is too big an undertaking to start doing it for more regular people. That is why the vision I presented in my last slide, about creating an infrastructure that would make this feasible, would essentially democratize all of these tools: you could build usable, analyzable records from the archives of non-famous people. That is critically important, because we want to record the history of society, not just the history of the wealthy and powerful. So that is one of the things we have in mind. And I think this is why this work is so applicable in other areas; Spencer is very interested in journalism too, in data journalism, and that is another situation in which having tools that can be deployed on many different problems would be critical. These have been fascinating questions, and I am so grateful to the audience for their engagement; it was really exciting to have so many questions and so much interest. Thank you.

Well, I think your wonderful presentations and work have really sparked a lot of interest; I think that is what has captured people's fascination. I will say, for anyone who has a lot more questions about the archive or about the Amaral lab's work, please reach out to the History of Genomics team.
So thank you so much for joining us here today. I really want to thank Christopher Donohue, our historian, who has helped develop the program over the last 10 years; Kris Wetterstrand, who acts as our archivist; and Eric Green, who helped to establish our history program back at the beginning. Thank you all so much; this has really been an amazing hour. Thank you, everybody.

Thank you, Spencer. Thank you, Sarah. Thank you, Luis. Thank you to our audiences. What a wonderful way to celebrate the 10-year anniversary of the History Program and the history of genomics. We will be seeing you soon. Everyone have a wonderful day. Thanks.