Hello, my name is Zhiru Xie from the University of California, Riverside Library. Today, my colleague Dr. Ilin Chen from Virginia Tech Libraries will join me to present initial results from our project. This project is a collaboration between UCR Library, Virginia Tech Libraries, and the FIB-SEM Collaboration Core at Yale University School of Medicine. The project is funded by the Institute of Museum and Library Services, National Leadership Grants for Libraries, and titled "Librarian in the Loop: Deep Learning to Curate Very Large Biomedical Image Datasets." We will unpack this long title, starting with what we mean by very large biomedical images. In biology and medicine, structure often dictates function. For example, discovering the double-helix structure of DNA helped answer the question of how life reproduces. To demystify more biological grand challenges, scientists rely on better and better microscopes to see finer and more intricate structures. FIB-SEM, short for focused ion beam scanning electron microscopy, has recently been improved to the point that it can now capture 3D images of a sizable sample volume at nanometer resolution. Referred to as a quiet revolution, it was selected by the journal Nature as one of the seven technologies to watch in 2023. As an example of why FIB-SEM is a big deal, let's take a look at this schematic diagram of a cell nucleus. On the envelope of the nucleus, there are many tiny pores that control how genetic information enters and leaves the nucleus. Scientists believe the pores are related to how diseases propagate. Before enhanced FIB-SEM, there existed microscopes that could image the whole cell, but the resolution was too low to see those individual pores. There also exist microscopes, such as cryo-EM, that can image the pores at very high resolution, but the region cryo-EM can image is too small to tell the full story at the scale of the whole cell.
It's like looking at planet Earth through a microscope. Therefore, the schematic diagram to the left existed only in scientists' imagination. Until enhanced FIB-SEM, there was no good way to count the number of pores on a cell and then calculate pore density and size distribution. But without this information, scientists struggle to answer many research questions. Enhanced FIB-SEM now gives us the best of both worlds: not only covering much larger sample regions to see multiple cells, but also at very high resolution. The image on the left is a 3D FIB-SEM image. Imagine slicing a three-dimensional cube into many thin slices, so that each slice is a two-dimensional image. Now we move the frame along the thickness direction of the cube, and these two-dimensional images are animated as shown to the left. The image has been zoomed out to one-twentieth of its original size along each of the width, height, and depth. So what we are seeing on the left is only one eight-thousandth of the original data. This allows us to see multiple cell nuclei. If we zoom in to the original resolution, that's the image to the right, showing the small round objects, which are nuclear pores. Enhanced FIB-SEM can reveal many such important biological structures. With very high resolution combined with a much larger region, the resulting image is also much larger. Even a micrometer-scale biosample can result in many terabytes of image data. This is what we mean by very large biomedical images. With continued improvements of microscopes and other instrumentation, scientists will collect more and even larger image data, which poses even steeper challenges in data curation. What do we mean exactly by curating data? Typically, librarians start the data curation lifecycle at the end of the science pipeline.
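The zoom arithmetic above can be sketched in a few lines of Python. This is only an illustration of the scaling, not project code; the example volume shape is invented.

```python
# Toy arithmetic behind the zoom factors described above: shrinking each
# of the three axes by 20x shrinks the voxel count by 20^3 = 8000x.
def downsample_factor(per_axis: int, ndim: int = 3) -> int:
    """Total reduction in voxel count when each axis is shrunk by `per_axis`."""
    return per_axis ** ndim

def preview_voxels(shape, per_axis: int) -> int:
    """Voxel count of the zoomed-out preview of a volume with `shape`."""
    total = 1
    for n in shape:
        total *= n // per_axis
    return total

print(downsample_factor(20))                    # -> 8000
print(preview_voxels((2000, 2000, 2000), 20))   # -> 1000000
```

This is why a preview that fits on a slide can hide thousands of times more data than it shows.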
After scientists have already finished milking the data for publications, that's when the librarians move in to tackle metadata, long-term storage, preservation, and readying those data for reuse. But this model almost insulates librarians from scientists, rendering much of our work irrelevant to those who initially created the data. Without their participation, it's difficult to imagine how data curation could have a bigger impact. Because scientists collect data for specific purposes, we must understand their rationale to effectively reuse it. To rectify this tendency, we propose to move the data curation process upstream, to be part of the science pipeline, and to make data curation a side effect of doing good science. In this project, images were taken from a group of Yale neuroscientists focusing on a type of epileptic seizure and Alzheimer's disease. The research hypothesis is that the diseases are linked to nuclear pores, and that pore density and size distribution may provide hints about the cellular mechanisms of these diseases. FIB-SEM images were taken from mouse neurons that had been genetically modified to introduce the diseases, as well as from a control group of healthy neurons. Our work on data analysis will provide definitive answers on pore density and size distributions, so it is critically important to the research. Our approach is in line with the general library trend of embedding librarians in research and expanding the library's value proposition from building collections and providing knowledge management services to knowledge creation. This approach was initially proposed back in 2015 and has been tested in many projects. We believe only this level of research embeddedness, where librarians' work is crucial to answering science questions, can give libraries the necessary impact and standing as a qualified campus research partner.
In this particular project, our work started immediately when the images were captured from the FIB-SEM, and our initial focus has been on data analysis. To solve the science problems, we need to start with segmentation, which means that for each pixel in the terabytes of image data, we need to identify which of the thirty-something cell organelles this pixel belongs to. In other words, is this pixel part of a nuclear pore, or part of a mitochondrion, or something else? It turns out there isn't an easy way to do this. With nanoscale images like FIB-SEM's, it has been estimated that it would take about 60 person-years to label a single cell. It took me a whole night just to label the pores in the tiny section shown on the left, which identified about 100 pores. But there are potentially tens, if not hundreds, of thousands of pores on each nucleus. So there's no way we'll be able to answer those research questions by manually labeling and counting. The process has to be automated to some degree. Fortunately, AI and machine learning are here to help, which is why we invoke the power of deep learning. Deep learning-based image segmentation is relatively mature. It belongs to so-called supervised learning, where we start by having some ground truths, then use the ground truths to train a model, based on which predictions can be made on data the training has never seen before. In our case, the ground truths are the handful of tiny slices where I manually labeled the pores. For this type of biomedical problem, ground truth data are unfortunately so scarce that, even though manual labeling is very expensive and time-consuming, we had to do it ourselves. Our goal is therefore to produce as few manual labels as necessary to generate sufficiently good predictions. It takes a lot of tweaking and trial and error to get things just right. I will now explain what we mean by librarian in the loop.
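The supervised-learning idea above (ground truth, train, predict on unseen data) can be shown with a deliberately tiny stand-in. The real pipeline uses deep networks on 3D volumes; here a nearest-class-mean classifier over raw pixel intensities plays the role of the model, and all pixel values and labels are made up.

```python
# Minimal sketch of supervised segmentation: learn from labeled pixels,
# then classify pixels the training has never seen.
def train(pixels, labels):
    """Learn a mean intensity per class from manually labeled pixels."""
    sums, counts = {}, {}
    for p, y in zip(pixels, labels):
        sums[y] = sums.get(y, 0.0) + p
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, pixels):
    """Assign each unseen pixel the class with the nearest class mean."""
    return [min(model, key=lambda y: abs(p - model[y])) for p in pixels]

# Hand-labeled "crop": dark pixels are background, bright ones are pores.
gt_pixels = [10, 12, 9, 200, 210, 190]
gt_labels = ["bg", "bg", "bg", "pore", "pore", "pore"]
model = train(gt_pixels, gt_labels)
print(predict(model, [11, 205, 8]))   # -> ['bg', 'pore', 'bg']
```

The structure is the same as in the real system: a small amount of expensive manual labeling drives predictions over data far too large to label by hand.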
On one hand, it is a special case of the human-in-the-loop machine learning technique, where human insights are taken into consideration to accelerate training and increase the model's prediction effectiveness. In our case, after the initial training and prediction, we use human intuition to select where more ground truths should be manually labeled, and then add them back into further training. These are often the areas with lots of false negatives. A machine would not know a priori where those areas are, but the trained eyes of a librarian can easily identify them. This way, we can achieve the intended goals faster and better than either pure machine learning or manual labeling. On the other hand, the term librarian in the loop also signifies our desire to embed librarians deeper in the science pipeline. Some people may contend that the work we just described belongs to someone in an academic department, not the libraries, and that it is not proper librarian work. Our response is: why not? Not only can we do such work, we also believe we should. Think about it: are there any fundamental differences between labeling image pixels and categorizing, cataloging, or creating metadata? We don't think so. Essentially, this is still information work, and we are professional informationists. So we should not be afraid of stepping out of our comfort zone. Some may worry that librarians do not have sufficient skills, but that is changing. Both iSchools and professional development programs for librarians are now teaching data science and AI. More importantly, we librarians are lifelong learners, and we are eager to learn. But just because we can, does that mean we should? Aren't we trespassing on computer scientists' domain? The reality may surprise you. Data science and AI skills are much needed in science but are often beyond the scope of most domain scientists.
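The librarian-in-the-loop cycle just described can be sketched as a loop: predict, let a human flag the region richest in false negatives, label it, and fold it back into training. This is only a simulation of that control flow; the region names, false-negative counts, and the assumption that retraining fixes most flagged false negatives are all invented for illustration.

```python
# Sketch of the iterative human-in-the-loop cycle: the human picks where
# new ground-truth labels will help most, and the model is retrained.
def pick_region_to_label(false_negatives_by_region):
    """The human-intuition step: choose the area with the most false negatives."""
    return max(false_negatives_by_region, key=false_negatives_by_region.get)

def training_loop(rounds, false_negatives_by_region):
    labeled = []
    for _ in range(rounds):
        region = pick_region_to_label(false_negatives_by_region)
        labeled.append(region)                     # manual labeling happens here
        false_negatives_by_region[region] //= 10   # assume retraining fixes most FNs
    return labeled

fns = {"envelope_top": 120, "envelope_side": 40, "interior": 5}
print(training_loop(2, fns))   # -> ['envelope_top', 'envelope_side']
```

The point of the sketch is the ordering: labeling effort goes first to wherever trained eyes see the model failing worst.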
Indeed, we should not realistically expect every neuroscientist or astrophysicist to also be a computer scientist with sufficient programming skills and hands-on experience to do such work on their own. While there are many computer scientists who can do this type of work, they are often too busy focusing on their own research agenda to care about other domain scientists' specific research needs. This is exactly where we librarians can shine, because we are service-oriented and we don't feel the need to fixate on a predetermined research agenda. And we are eager to become qualified campus research partners. Indeed, there are plenty of opportunities. Even if we just want to do data curation properly, as we previously argued, it is still much better to start from within the science pipeline. All right, enough opinion. So what exactly have we achieved? Here's an example. After manually labeling six tiny crops as ground truths, we trained the model in about a week's time and then ran predictions on the full nucleus. Here on the left is a nucleus in the process of splitting into two, which is why it has an odd shape. The prediction on the left still has much room for improvement. As you can see, there are darker regions on the envelope, which we believe should be fully covered by pores. These are regions with lots of false negatives. We then labeled two additional crops taken from those areas and trained the model for another two days. The results are much improved, as shown in the prediction on the right: the darker regions are now also covered with pores. Continuing this type of iterative training, we believe that in the end the prediction will be much closer to reality. We have now built a fairly stable pipeline, have achieved satisfactory predictions on four nuclei, and are now working on post-processing to count the number of pores and compute their density and size distribution. So now, the key takeaways of this project.
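The post-processing step mentioned above (counting pores and measuring their sizes) typically amounts to connected-component analysis on the predicted mask. Here is a minimal 2D flood-fill sketch of that idea; the real pipeline works on 3D masks, and the toy mask below is invented.

```python
# Count connected "pore" components in a toy binary mask and report sizes.
def pore_sizes(mask):
    """Flood-fill each 4-connected component of 1s; return component sizes."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                stack, size = [(r, c)], 0
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                sizes.append(size)
    return sizes

mask = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 1, 0, 1]]
sizes = pore_sizes(mask)
print(len(sizes), sorted(sizes))   # -> 3 [1, 2, 2]
```

From the list of component sizes, pore count, density (count per envelope area), and the size distribution all follow directly.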
First, librarianship is whatever we make it to be; it is not defined by a fixed set of doctrines. Second, to qualify as competent research partners, librarians need to embed themselves deeper in the science pipeline to help answer critical research questions. Third, focus on our partners' research agendas, and achieve our own agenda only as a side effect of the collaboration. Fourth, practical machine learning and AI work is not all about programming and technical skills; it is a combination of hard and soft skills, including manual labor, human intuition, and trial and error. That's it for today. If you have any questions or comments, please get in touch. Thank you very much.