Welcome to the iSchool. I know some of you are not iSchool related, but welcome anyway. I'm also very happy to learn that there are some librarians and library students in the audience. My name is Zhen Chen. I'm a professor at the iSchool, and I'm also the program director for the Master of Library and Information Science. My research is in metadata, so this is my favorite topic, especially when it's related to analytics or data science.

So I'm just going to dive into what I want to talk about today. For those of you who do not have a library and information science background, I need to give a little explanation of what metadata is, what it looks like, why we need it, and what it can do, just to get all of us on the same page so we can move on to talk about analytics. The analytics I'm going to talk about here is a little different from what Professor Jeff Hemsley talked about. It's not so much about rows and rows of data; the data come in different formats.

So metadata, what is it? If you go back more than 20 years, people would tell you metadata is data about data. That's simple. Yes, it is true, but that's not the whole story. The full definition of metadata is the structured or semi-structured description of information and/or data objects in the form of catalogs, indexes, indexing databases, and metadata repositories. And metadata is everywhere.

So why do we need it? Well, whenever you go look for something on the web, you are using metadata. When you are in high school, in college, or even on the job, you constantly search for information. This is the SU Libraries interface, very familiar, right? Each of the fields that you can fill in with some data, those are metadata.
The other thing is that metadata is also very common in our lives. For entertainment, or when you want to learn something, every time you go to YouTube you are using metadata to find and select a particular category of video to watch. Even when you want to buy something, Amazon is full of all kinds of metadata. So metadata is everywhere, right?

But that is the search use, facing the public. Metadata becomes indispensable when you need to look for something on the web, in the library, on YouTube, or anywhere else, because an information system is involved. On the other side, libraries, businesses, government, all these organizations also need metadata to manage information resources. The picture there shows all kinds of information resources, but there are actually many more than are shown in this image.

When we use metadata to manage information resources, it answers questions like these. Who created it? We want to credit the authors or creators of the data, publication, or artwork. When was it created? What is it about: some topic, a person, a literary work? What is it: video, audio, print material, an archive, a sculpture? Who owns the rights? When a book is published, the author may not hold the copyright because it has been transferred to the publisher. How can it be accessed and viewed? Do you need some technology or device in order to view the content of the information object? And how is it related to other relevant objects? A monument may have architectural sketches from different design stages, and then images from different times: when it was finished, and when it was renovated. All of those have been documented by different images.
So when you're creating metadata to describe a monument, you're describing the design sketch or sketches, or the images capturing the state the monument was in at a certain time. And there's much more. Metadata can describe things as small as organisms and DNA sequences and as big as buildings and architecture, and everything in between. So you can see how important metadata is.

Metadata has always been an important content foundation for libraries, archives, and museums, and for any organization that deals with documents and all kinds of information resources. In libraries, you see the public display of a record. This one is for Obama's book, The Audacity of Hope. It's very common. But at the back, the record is actually presented in this way. Who knows what 100 represents, right? Who knows what 245 means? But every one of these numbers has some meaning coded behind it. So this is a kind of proprietary data format.

Moving into the 21st century, metadata has been transformed into linked data. So what does linked data mean? I could give you a whole class in a course to explain what linked data means, what technology and encoding languages are involved, and how to design and model it. But I'll skip all that.

This is the metadata for the same book when it's in linked data format. In the public view, you can sense the difference from the previous slide, what you saw from the SU Libraries catalog. You can see the subject terms are linked, and it has very clear categories. And this is only one format of the record in the system. In the linked data world, you don't say things like record; everything is a statement. The so-called statement is subject, predicate, object, which means, for example, the book has a title.
And the book has a creator, Obama. Or, if the book is not his but is about him, the author is actually somebody else, and then the book has a topic, the topic is a person, and the person is Obama. All of these can be expressed as triples.

So this is the JSON format. When the record is in linked data format, it can also be expressed in XML, the RDF/XML format. If you don't understand what RDF is, there's another course you can take in LIS, the 681 metadata course, where you will learn about RDF, the Resource Description Framework. So there's a whole set of foundational knowledge behind it. There are something like seven or eight different formats, including the JavaScript-based one, and the record can be expressed in many others.

So what does it mean for metadata to be linked data? It means it is structured data, and it is entity based. In the past, when you described an information object, you based the description on that object: who created it, and so on, and you faithfully documented every bit of information in the record. With linked data, this conception has totally changed. The metadata now focuses on entities: persons, events, publications, data sets, artworks, you name it. And entities don't exist in isolation. They are related through different types of relations. So another important aspect of metadata is identifying relationship types. There's a lot of interesting stuff in linked data that I could go on about, but today I'm just going to give you the very basics.

In linked data, each entity is globally uniquely identified. This globally unique identifier uses a web standard developed by a standards organization. Right now, the major one is the HTTP kind of identifier.
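
To make the idea of statements concrete, here is a minimal sketch in plain Python of the book example as subject, predicate, object triples with HTTP identifiers. It is only an illustration, not the record shown on the slide: the `example.org` identifiers for the book and the person are made up, the predicates are borrowed from the Dublin Core and FOAF vocabularies, and the helper names `describe` and `linked_entities` are hypothetical.

```python
# Hypothetical entity identifiers (example.org is a placeholder domain).
BOOK = "http://example.org/book/audacity-of-hope"
PERSON = "http://example.org/person/barack-obama"

# Predicates taken from the Dublin Core Terms and FOAF vocabularies.
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Each statement is one (subject, predicate, object) triple.
triples = [
    (BOOK, DC_TITLE, "The Audacity of Hope"),
    (BOOK, DC_CREATOR, PERSON),
    (PERSON, FOAF_NAME, "Barack Obama"),
]


def describe(subject, statements):
    """Return every (predicate, object) pair stated about one subject."""
    return [(p, o) for s, p, o in statements if s == subject]


def linked_entities(subject, statements):
    """Return objects of this subject that are themselves described as subjects."""
    known_subjects = {s for s, _, _ in statements}
    return [o for s, _, o in statements if s == subject and o in known_subjects]


if __name__ == "__main__":
    for predicate, obj in describe(BOOK, triples):
        print(predicate, "->", obj)
    print("links to:", linked_entities(BOOK, triples))
```

Because the creator is an identifier rather than a text string, following it leads to a second set of statements about the person, which is how separate triples join into a graph.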
And the data, all these entity-based records or statements, are in subject-relation-object triple style. Triples can be linked to form a graph, a network, and that is where we come to the very interesting part of metadata.

So to summarize linked data briefly: it's available on the web. Sometimes people also call it the web of data or the semantic web, so you may have heard these names. Linked data is the foundation for the semantic web. It's available as structured data, as I mentioned earlier, and it's readable by machine; linked data is developed primarily for consumption by computer systems. It's available in non-proprietary formats. All the formats you see, like JSON, RDF, XML, OWL, and triples, are non-proprietary. That means you have better interoperability: metadata from one system can talk to metadata in another system. It's expressed using open W3C standards, which is very important. And it's linked to other data on the web.

Okay, so there are a lot of challenges in this transition from traditional metadata records to linked data statements or triples, to the linked data semantic web. This is not an exhaustive list of issues; I only picked some important ones.

One I call the perpetually ambiguous author names. You have probably noticed that a lot of indexing databases and catalog systems usually do not provide the full names of authors. They have the last name and abbreviate the first name and middle name to initials. What if different people have identical last names and identical initials? Then you run into a problem. Can we tell who's who? Can we tell that this identical name actually represents 100 different people?
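
The collision described above can be shown with a small sketch. It assumes names are written as given names followed by the family name and reduces each one to the "family name plus initials" form that many indexes store. The author names are invented for illustration, and `index_key` is a hypothetical helper, not a function from any cataloging system.

```python
from collections import defaultdict


def index_key(full_name):
    """Reduce 'Given Middle Family' to 'Family GM' (family name plus initials)."""
    parts = full_name.split()
    family, given = parts[-1], parts[:-1]
    initials = "".join(name[0].upper() for name in given)
    return f"{family} {initials}"


# Invented names standing in for distinct real people.
authors = ["Wei Ming Zhang", "Wen Mei Zhang", "William M Zhang", "Jane A Smith"]

# Group the full names by the abbreviated form an index would show.
groups = defaultdict(list)
for name in authors:
    groups[index_key(name)].append(name)

# Keys shared by more than one person are the ambiguous ones.
ambiguous = {key: names for key, names in groups.items() if len(names) > 1}

if __name__ == "__main__":
    for key, names in ambiguous.items():
        print(key, "could be any of", names)
```

Once the full names are gone, nothing in the abbreviated key can separate the three people, which is why disambiguation has to draw on other metadata or on persistent author identifiers.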
So these ambiguities can cause serious problems in information search. And if you use this data to do research, it can cause all kinds of problems, because it will affect the reliability of your research conclusions, and so on. Name disambiguation has been a subject of research in computer science, information science, and information retrieval, and people have used all kinds of algorithms to detect and disambiguate author names. So that is a great area in metadata for research.

Also, metadata is never readily usable for analysis. For example, in one of my research projects we use the metadata from the GenBank repository. GenBank is an international data repository established in 1982 and running to this day. It stores DNA sequences from biological and biomedical research projects. It's very important, especially in the recent pandemic. The DNA sequences for the COVID virus have grown very fast in the last two years. And the reason we could have vaccines developed quickly can be attributed to DNA sequencing projects.

Anyway, our GenBank metadata analytics project collected the metadata about the DNA sequences: who submitted a sequence, when it was submitted, what the taxonomy lineage for the sequence is, and what references are associated with the submission. We are not biomedical researchers, but we study the metadata to try to see how scientific collaboration networks evolved from the early 1980s until the present day.
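
As a rough illustration of how submission metadata can feed a collaboration network study, the sketch below turns a few records into a weighted edge list of co-submitters. It is not the project's actual pipeline: the record layout, the accession numbers, and the submitter names are all invented, and `edge_list` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Invented records; real repository metadata would need parsing and cleaning first.
records = [
    {"accession": "X0001", "year": 1985, "submitters": ["Lee A", "Chen B"]},
    {"accession": "X0002", "year": 1985, "submitters": ["Lee A", "Chen B", "Diaz C"]},
    {"accession": "X0003", "year": 1990, "submitters": ["Diaz C"]},
]


def edge_list(submissions):
    """Count how often each pair of submitters appears on the same submission."""
    weights = Counter()
    for record in submissions:
        # Deduplicate and sort so each pair is always stored in the same order.
        names = sorted(set(record["submitters"]))
        for a, b in combinations(names, 2):
            weights[(a, b)] += 1
    return sorted((a, b, w) for (a, b), w in weights.items())


edges = edge_list(records)

if __name__ == "__main__":
    for source, target, weight in edges:
        print(source, target, weight)
```

Filtering the records by year before calling `edge_list` would give one network per period, which is one simple way to look at how collaboration changes over time.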
Of course, during this project we spent a lot of time parsing the data, cleaning the data, transforming the data, and generating the data sets that we need. For example, one of the data sets we need is an edge list. So the data is not ready. When you collect it, you have to go through a lot of cleaning in order to make the data analyzable. And the code is very difficult to reuse, so we often end up reinventing the wheel.

As for the limitations of current metadata structures, a lot of libraries, especially national libraries, are converting their legacy MARC cataloging data into linked data format. If you go to the Library of Congress catalog website, or other national libraries' websites, you will find that many of them have already been converted to linked data.

So now we understand what the landscape looks like for metadata, and we'll talk about the analytics. I cannot cover all aspects of the analytics, but I'll give you a couple of examples to demonstrate how metadata may be used for analysis, for research, as well as for learning. One of the main concepts in this process is the shift from display to analysis, and making sense of the data.

So let me go through this. In order to conduct metadata analytics, we want to have a data infrastructure, which we already have, and a semantic infrastructure, which we also already have. And then there is conducting the metadata analytics itself, which is another very broad area. For name disambiguation, you'll see ORCID, ResearcherID, and similar services. These are all solutions developed to deal with the name disambiguation problem, so I'll skip this one. But I want to spend two minutes to show this project, in case you want to test it.
This is called the press archive for the 20th century. It is a huge archive, in Germany. Basically, they digitized the whole archive collection, and then they made the data available on Wikidata. Wikidata is what we can call a knowledge base, and it basically follows that subject, predicate, object triple structure. So if you know how to write the queries, you can test it. I would highly encourage you to follow this link and play with it. Once you have the query in there, in the query builder, and you click the run button, you will get a visualization of the metadata results for the persons involved in this archival collection. I think this is a group of economists, and the places where they were born. Also, on the left, if you click at the top, you will see a list of different views: bar chart, pie chart, bubble chart, and other presentations of the data content. You can also see the code.

So this is the last slide. Entity management is an important transition going on right now in libraries, archives, and museums. The other one is the shift from display to analysis, helping researchers make sense of data by doing all kinds of text mining, network analysis, modeling of metadata, and a lot of other things. So I'll stop here. Questions?

Yeah, so there's a question about objectivity when it comes to descriptive metadata on a commerce platform, for example something being labeled as a best seller, and ensuring that the label wasn't just used for promotional purposes.

Oh, that's a good one. Yeah.
Yeah, this is something I think we information professionals will always want to keep in mind: whatever data analysis you do, you want to remain neutral. Sometimes you may accidentally, involuntarily, do some advertising for commercial entities. So that's why we really want to be careful when we use those commercial examples.

They asked, from a business perspective, what are other areas of application? In addition to the medical DNA case you were talking about several slides back, what are some other areas of application from a business perspective?

You mean metadata analytics? This is kind of hard to say, because first it depends on what kind of metadata you have, the volume of the data, and the format. In business settings, it's difficult to tell. But I think if you bring in the idea of linked data, that will make your data better structured, because you no longer focus on describing one particular type of product; you are describing entities and the relations between them. I think in the Amazon case, it does make that kind of connection: if you buy this one, you may also like to buy these kinds of products. That can really be modeled in the linked data style. In many cases, the data structure, the data model, and the nature of the data play an important role in whether it's feasible to conduct metadata analytics, or what kinds of analytics you need to do. So it's case by case, and it's hard to give an overall suggestion.

And then just one more. What is the major source of the metadata for the analytics?
They asked, are we using a data crawling technique, or purchasing it from a third party?

That's a good question. It really depends on what kinds of metadata analytics you want to do, because there are many freely available open access data repositories that you can scrape metadata from. The National Center for Biotechnology Information, called NCBI, is a center under the National Library of Medicine. NCBI hosts, I don't know, at least several tens of databases. All of these databases are open access, and they have an FTP site you can go to to get the data. There's also an open access data repository called data.gov, which you can go to if you want to use its data sources. The broad categories include agriculture, health care, and many others. And major libraries, for example the Library of Congress, have opened their data: the whole linked data set can be downloaded, and the name authority and subject authority databases can all be downloaded for free. So there are a lot of data and metadata resources that can be used for analytics projects.