Welcome to this briefing on ARK identifiers for the Coalition for Networked Information. Since our last CNI briefing in 2020, the world of ARKs has grown considerably. In particular, the number of ARK organizations increased from 600 to over 1,000, and the former ARKs in the Open initiative became the ARK Alliance. We're also happy to see significant new software services and publications. As a reminder, ARK stands for Archival Resource Key, a special kind of URL that can serve as a persistent identifier. Collectively, libraries, museums, archives, data centers, and publishers have created over 3.2 billion ARKs for content such as scanned books, medieval manuscripts, botanical specimens, fine art, scientific data sets, and public health documents. The CNI briefing two years ago covered ARKs for national history, genealogy, and publishing. Today we will cover ARKs for physical samples, biomedical artificial intelligence, and vocabulary terms.

My name is John Kunze, and when I finish this preamble, I'll pass things over to Dave Vieglais from the University of Kansas. Dave has deep experience with earth science and technology and is lead developer of the global ARK resolution system. When Dave is finished, he will turn it over to Tim Clark from the University of Virginia. Tim is an associate professor in the School of Medicine, and he brings tremendous expertise in strategic thinking around a wide variety of biomedical identifiers. When Tim is finished, he will pass it back to me, John Kunze, from the ARK Alliance. I'll close the briefing with a look at permalinks for vocabulary terms, which combines my long-standing interests in systems and standards for metadata and identifiers.

Why should we care about ARKs? The average lifetime of a URL was once said to be 100 days. At the end of its life, a URL link breaks, usually giving you the dreaded 404 Not Found error that most of us have seen. Irritating at best, it's a minor disaster for memory organizations.
ARKs are similar to DOIs, or digital object identifiers. Both are persistent identifiers for accessing content and metadata, and ARKs are found in many of the places where you'll find DOIs. In contrast, ARKs come with no fees, no limits on how many you can create, and no metadata requirements. From the beginning, ARKs were designed to be decentralized and to identify any kind of thing: digital, physical, or abstract.

Here's what an ARK looks like. At first glance, it's a URL that carries this internal "ark:" label. To the right, the five-digit name assigning authority number identifies the organization that created the ARK. Further right, the part after that names the thing that the ARK is assigned to. To the left, the hostname makes the ARK actionable, or something you can click on; it's also known as the resolver. ARKs are unusual among persistent identifiers in allowing organizations to run their own resolvers. The part starting with "ark:" is the core, globally unique identity. It doesn't depend on the host being available or on the future existence of the World Wide Web.

ARK organizations include a wide variety of memory organizations, nonprofits, for-profits, and government agencies. A few are listed on this slide. We've seen accelerating adoption in South America, Africa, and India. To see an ARK in action, an example from the Louvre Museum appears in the lower right corner. That's it for the background. Now I'm going to turn things over to Dave Vieglais to present our first use case.

Hi, my name is Dave Vieglais and I work at the Biodiversity Institute of the University of Kansas. I'm also fortunate to be a principal investigator of the Internet of Samples, or iSamples, project, which is sponsored by the National Science Foundation. iSamples is a standards-based collaboration to uniquely, consistently, and conveniently identify material samples, record metadata about them, and link them to other samples, data, and research products.
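To make that anatomy concrete, here is a minimal sketch of how an actionable ARK URL decomposes into resolver host, NAAN, and assigned name. The example identifier and resolver host below are hypothetical placeholders, not the Louvre's actual ARK.

```python
from urllib.parse import urlparse

def parse_ark(url):
    """Split an actionable ARK URL into resolver host, NAAN, and assigned name."""
    parts = urlparse(url)
    path = parts.path.lstrip("/")          # e.g. "ark:/12345/x6abcde"
    if not path.startswith("ark:"):
        raise ValueError("not an ARK URL")
    # Handle both older "ark:/NAAN/name" and newer "ark:NAAN/name" forms.
    naan, _, name = path[len("ark:"):].lstrip("/").partition("/")
    return {
        "resolver": parts.netloc,          # host that makes the ARK clickable
        "naan": naan,                      # name assigning authority number
        "name": name,                      # the thing the ARK is assigned to
        "core": path,                      # globally unique part, host-independent
    }

# Hypothetical ARK, for illustration only:
info = parse_ark("https://resolver.example.org/ark:/12345/x6abcde")
```

Note how the core identity (`info["core"]`) survives even if the resolver host changes, which is exactly the point made above.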
This effort would not be possible without globally unique, resolvable identifiers such as ARK identifiers. Physical samples taken from the Earth's environment play a critical role in the sciences and in our understanding of the natural world. These samples can vary considerably. For example, a geologist may collect mineral samples or ice cores, an archaeologist may gather various kinds of artifacts, and a biologist may collect samples for natural history catalogs and genetic analyses.

Once gathered, these samples may be utilized in many different ways. Some are simply archived as reference material for later use. Some are analyzed immediately and consumed in that analytical process, with only the digital artifacts remaining. The derived products or subsequent analyses often form the basis for publication, hence the original samples effectively provide a factual basis for earth science knowledge. There is often value in revisiting analyses and other derived products. In these cases, it is often instructive or even essential to access the original samples, tracing the provenance graph backwards from publication to analyses to the sample material, and even to the location on Earth where the sample was originally taken. For this to happen, it is essential for samples and their derived products to have globally unique, resolvable identifiers.

There are several types of identifiers in active use, and ARK identifiers figure prominently in the physical sample communities for uniquely identifying not only the physical objects themselves, but also derived products including metadata, photographs, videos, analyses, and publications. iSamples provides a hierarchical infrastructure of searchable physical sample catalogs, with a top-level collection called iSamples Central, which gathers content from the various other collections and so provides a catalog of all physical samples.
The catalogs provide a simple way for researchers to find samples of interest, searching by time, location, collector, or various other properties. Since each sample has a globally unique, resolvable identifier, it is trivial for the researcher to visit the original catalog and to link additional information to a sample or group of samples. For example, here we use iSamples Central to visit Moorea, a part of French Polynesia in the South Pacific. Each of the colored points represents a physical sample. The towers of points indicate that many samples were collected from the same physical location. Each sample point shown here has an ARK identifier, which is displayed when mousing over the sample points. Clicking on a sample point brings up the summary metadata held by iSamples, and of course the identifier forms a link back to the original collection. Following that link, we visit the Smithsonian Institution to see more details about the specimen, and from there we may traverse cross-references to analyses, publications, and other material. This simple linking between samples is facilitated by ARK identifiers. There are literally millions of these records already in iSamples, and the number of participating collections is anticipated to expand significantly as the project continues.

In conclusion, globally unique, resolvable identifiers are being widely adopted by the earth science community as a common mechanism for referencing physical samples and all derived products. ARK identifiers are a good match for the technical requirements and fill an increasingly prominent role for the earth sciences community. That's it for my part. Now I'm going to turn things over to Tim Clark to present our next use case.

Hi, I'm Tim Clark from the University of Virginia School of Medicine and School of Data Science, and I'm going to talk a little bit about using ARKs in AI for biomedical research.
Many people in my field will tell you that AI is now transforming biomedicine and biomedical and clinical research. And it's doing this by making very complex predictions that are kind of amazing, really, but they can be very opaque as to how those predictions were made, and they require explanation and robust evidence to be presented so that they can be trusted. So for example, an epilepsy detection model here in this diagram reads brain MRI data, processes it through a complex machine learning model, and reports out: patient is diagnosed with epilepsy, 85% confidence. The clinician wants to know: why did you make that prediction? It's not a simple decision tree; it's something much, much more complex. Can I trust the AI models that made these predictions? You can't trust them blindly. It's like working with a colleague. If a colleague told you, "I diagnosed that patient with epilepsy," you would like to know why he made that diagnosis. And you'd also like to know how he was trained. Is he board certified? Where did he go to medical school? And what's his experience in diagnosing that disorder?

So I'm going to talk about two biomedical use cases for very deep AI machine learning. One is a predictive model of cellular response to drugs, which we hope can be used in drug development in the pharmaceutical industry. The other is a predictive model to detect, up to a week in advance, kids in the neonatal ICU that are going to have some life-threatening trouble. The idea here is that the impending problem may not be evident on the surface to physicians and nurses, and they'd like to have some warning, long enough in advance to see if they can avert harm. So all of these things require results, validation, and explainability for the model. And there are ways to do some work statistically at the other end to tie the predictions back to the feature sets.
But these feature sets, especially if they're processed a lot, raise questions, even to the point of: how did I select the feature set? You want to have some robust way of saying how the feature set was derived. The pre-model explainability and the post-model explainability go together. We produce things called evidence graphs for pre-model explainability. They describe how results are obtained, they provide supporting evidence, and they also allow reusability of the components. So we give every graph node in our network diagram an ARK that resolves first to metadata about the node, which could be data, software, or computational parameters, and then to the actual data, software, parameters, models, and so forth for explainability. ARKs have a huge advantage here in cases where you're doing thousands of chained complex computations, because they provide the ability to specify the metadata flexibly, in an open way. Beyond sort of the core bibliographic-style metadata about the data set, you would like to know a lot of other things.

So the first use case I'll go to is predictive response modeling of normal and diseased human cells. This project is part of the NIH Bridge2AI program, which we're doing in association with colleagues at Stanford, UCSD, Yale, and a number of other institutions. The idea is: let's construct an accurate cellular component architecture, that is, what's going on, and what are the structures and protein aggregations and interactions inside the cell, based on high-dimensional data, and use it to construct a deep learning model that can predict the response to biochemical perturbation. And of course, as I said, we have to interpret and explain the model results robustly. This is a montage of some figures generated out of our first paper on this approach, where we have protein interaction data generated using affinity purification mass spec.
We have protein localization using immunofluorescent subcellular microscopy. And we have a whole complex process for taking these, calibrating interprotein distances, placing the proteins in the cell, describing the interactions, and building up communities of components. The second version of this approach is going to add single-cell RNA-seq processing to the microscopy and mass spec lab products, which will add a genetics and genomics component here. And it'll be very, very powerful, we hope. The tools pipeline that you see here, the second column from the right, does integration of all these components to produce the cell architecture. And the last column on the right is FAIR integration; FAIR stands for findable, accessible, interoperable, and reusable, an important acronym in the AI, NIH, and health care worlds. The job there is to generate the provenance graphs, the evidence graphs, as things are being produced, and to track the provenance of data sets.

This is a preliminary map of cell architecture from the 2021 Nature paper, where the first results were published. And this was generated by a complex process. This is a provenance map of that process, and it's actually very formal. The blue rectangles are data sets. The blue ovals are algorithms. The purple ovals are AI models. The green rectangles are hyperparameters for model training. You can see that when we are going down the path to this thing in the red box on the right, the derived cell architecture that's going to be the basis for modeling predictions, it's very complex. And it's quite possible for deviations or variations in the different parameters and data sets and so forth out on the left to have significant effects on the right. And we'd like to know what those are. And we'd like to be able to rely on a well-defined, and actually machine-interpretable, representation of the process.
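The value of a formal, machine-interpretable provenance map like this is that you can traverse it mechanically. Below is a toy sketch, with invented node names, of tracing a derived result back through such a directed graph to every upstream data set, algorithm, computation, and hyperparameter set it depends on; the project's real graphs are far larger.

```python
# Toy provenance graph: each key depends on the nodes in its list.
# Node names are invented for illustration.
edges = {
    "derived-architecture": ["embedding-model", "calibrated-distances"],
    "embedding-model": ["training-computation"],
    "training-computation": ["raw-interaction-data", "hyperparameter-set"],
    "calibrated-distances": ["raw-interaction-data"],
}

def trace_back(node, graph):
    """Collect every upstream node a given result depends on."""
    seen = set()
    stack = [node]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

upstream = trace_back("derived-architecture", edges)
```

Running this shows that a change to the raw interaction data or the hyperparameter set propagates all the way to the derived architecture on the right of the diagram.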
Each one of the nodes in this network is an object that has a set of metadata assigned to it, with an ARK assigned to it. And it resolves to the foundational data, software, hyperparameters, model, computations, and computational conditions that derived this architecture. So ARKs were something that we used in that, and are being used in that.

This is something that we did just this last year with FAIRSCAPE: tracking the vital signs information from 6,000 babies over 10 years, using a large number of algorithms, many, many candidate features, 100 terabytes of data, algorithmic result clustering, and actually 17,000 or more computations per result. And we were able to do predictive analytics on these babies and predict adverse medical events; actually, we were able to predict them seven days in advance. Every patient that generated a result had a graph like this generated that showed the provenance of the results. And these are stacked up for all the babies in the experiment; the complete graph is 17,000 nodes.

We know that it's important to be able to generate all these with persistent identifiers, and ARKs are free to generate. And they allowed us to use very flexible ways of representing the metadata, including a formal ontology called the Evidence Graph Ontology to formalize this representation. So these are all serialized in JSON-LD using terms from schema.org and the Evidence Graph Ontology. You can resolve an ARK in here to return the metadata that gives you formal ontology terms and formal vocabulary terms. You get a directed acyclic graph of provenance slash evidence. We know from this serialized representation that the results were derived by a certain specific computation. You can find information about the computation by resolving the ARK and resolving a further URI in the metadata. And we know that they came from a certain data set. So this is really, really important.
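As a rough illustration of that serialization style, here is a hypothetical JSON-LD node for one computation, mixing a schema.org vocabulary with an evidence-graph namespace. The ARK values, the namespace URI, and the property names are invented stand-ins for illustration, not the project's actual schema.

```python
import json

# Hypothetical node: the ARKs, "evi" namespace URI, and property names
# below are invented placeholders, not the real Evidence Graph Ontology terms.
node = {
    "@context": {
        "@vocab": "https://schema.org/",
        "evi": "https://example.org/evi#",
    },
    "@id": "ark:/12345/example-computation",
    "@type": "evi:Computation",
    "name": "cluster candidate vital-sign features",
    "evi:usedDataset": {"@id": "ark:/12345/example-dataset"},
    "evi:usedSoftware": {"@id": "ark:/12345/example-software"},
}

# Serialize to JSON-LD text and read it back, as a resolver client might.
serialized = json.dumps(node, indent=2)
round_tripped = json.loads(serialized)
```

The point is that each `@id` is itself an ARK, so a client holding this metadata can resolve its way node by node through the whole evidence graph.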
And it was great to be able to do this, so that we know we can rely on this result. So to summarize, why do we use ARKs in our work? We use them as persistent IDs to robustly represent the data in our evidence graphs and all our results, all our computations, and all our primary data. They're free, their metadata is flexible, and there's a large ecosystem of users and developers supporting them. We use ARKs for complex evidence graphs on AI machine learning predictions in biomedical research, in the lab and in the clinic. And every node, every data set, computation, software, model, and hyperparameter set is resolvable through the ARK to its contents. Each node is persistently identified with an ARK.

So in conclusion, I'll just say that for us, ARKs are a very useful, highly flexible, and scalable persistent identifier model. They're especially useful for traceable complex computations. And again, I'll just reinforce: if you're doing work in biomedical research, the predictions you make in AI have to be traceable. They have to be explainable. Otherwise, you can't use them, not in any sense where lives may depend on them. We're happy to chat with prospective ARK users. And thank you very much for your attention. That's it for my part. And now I'm going to turn things over to John Kunze to present our final use case.

This last use case describes work supported by the Metadata Research Center at Drexel University. Why do we want ARK identifiers for vocabulary terms? Because linked data and the semantic web rely on persistent URLs for concepts. Thousands of work groups across the world are inventing vocabulary or modifying existing terms, but we don't have easy ways to share and get feedback on changes. Some of this vocabulary labor adds enormous value of great interest to researchers and scholars globally. So we built a crowdsourced metadata dictionary called YAMZ. It lets anyone look up terms to reference, and you can log in to comment on other people's terms.
You can also upvote or downvote terms, and you can watch them so that you get an email if they change. This particular term is labeled vernacular class, which means it can be changed by its owner. But terms labeled canonical class are done changing. There are also deprecated terms, but they too will always be available. You can add your own term and immediately walk away with an ARK persistent identifier to reference it. And there's no problem creating multiple terms with the same name string; this one has 28 alternates, but each alternate has a unique ARK concept identifier. You can also add tags to terms and import terms in bulk via a CSV file.

Standardized metadata is less common than some people believe. The official story about the metadata standards that one's institution uses often hides an unofficial story. When catalogers, scientists, and archivists really care about accurate description and have internal workflows to support, what tends to happen is that existing standards are modified to suit local needs. Projects all over the world are giving rise to their own dialects of standards, and interoperability suffers. This problem was described in 2004 with the first cross-domain metadata standard. That was the Dublin Core, which would not allow more than 15 elements because the standardization process was so difficult. Domain-specific communities liked Dublin Core and wanted to add a few elements, but they had to create their own vocabularies. As a result, here's the metadata universe that we inherited. Lots of metadata elements, right? But when we zoom in on a given area, we find with some dismay that those aren't elements but entire vocabularies or ontologies, with lots of conflicting, overlapping elements inside them. Each point is an island of non-interoperation. It's not just cross-domain vocabularies that spawn dialects, but also narrow domain-specific vocabularies.
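To picture the bulk-import path mentioned above, here is a minimal sketch of building and reading back a term CSV file. The column layout (term, definition, tags) is an assumption for illustration; the dictionary's actual CSV schema may differ.

```python
import csv
import io

# Hypothetical column layout; the real bulk-import schema may differ.
rows = [
    {"term": "glacier", "definition": "a persistent body of dense ice", "tags": "cryosphere"},
    {"term": "frazil ice", "definition": "loose ice crystals forming in turbulent water", "tags": "cryosphere"},
]

# Write the terms out as CSV text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["term", "definition", "tags"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Read it back, as a bulk importer would.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that nothing stops two rows from sharing the same term name; as described above, each imported term would still receive its own unique concept identifier.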
In cryospheric science, which is all about frozen water, there's widespread disagreement on generalized terms such as glacier and puddle, and on specialized terms such as frazil ice. It happens in any domain where subject experts care about accurate description in their area of specialization. So with so many alternate terms, are all of them necessary? Who decides? Traditionally, busy experts in your fast-moving field sit on standards committees, which can take years to reach consensus, both for the initial draft and for revisions. For logistical reasons, field testing is often very limited before standards are voted on and published.

We think YAMZ is an efficient and interesting alternative. YAMZ is not a standard, nor an ontology. YAMZ is a living dictionary of metadata terms containing all parts of metadata speech, from all domains. Unlike an ontology, a dictionary tolerates unrelated terms sitting next to each other. Each term is like a proposed nano-standard. Some are upvoted, others not. Reputation-based voting, like Stack Overflow uses, resists gaming and helps standards committees choose among the best terms. YAMZ is a microservice that ontologies, software, and linked data can reliably use by referencing ARKs for metadata terms. That's all for now. Thank you for listening. If you have any follow-up feedback, please reach out to us. We'd love to hear from you.