Okay, it's 5:15, so I guess I will call the meeting to order. I didn't know if someone else was going to. But I'm Sandy Payette, and this is Muhammad Javed, and we are going to speak about a project at Cornell called Scholars at Cornell. The subtitle here is Visualizing the Scholarly Record, and the "Viz" is kind of our little thing; potentially a sticker could eventually come out of it. So this is the origin of this project. It had its roots in the Cornell VIVO initiative. Jon Corson-Rikert was the original founder of VIVO, and Cornell had the original full installation of VIVO. And over the years we learned a lot at Cornell. I'm new to Cornell University Library again this year. I had been there previously, as you know, in the Information Science program and department, but now I'm at Cornell University Library. I'm the director of research IT for scholarship, and this project was reconceived as: how can we take VIVO at Cornell into the next generation of what this space means? So basically, in a nutshell, what we're delivering here is data about people, publications, and organizations, based on the VIVO ontology, exposed as linked open data. But we're focusing on this notion of the graph of knowledge. VIVO has many facets, VIVO has been very successful, and it is now under the DuraSpace umbrella. But we're trying to take a different twist on this at Cornell, and I'll explain why that is. So again, we're really focusing on what the graph of knowledge of the scholarly record can tell us. What kinds of latent knowledge exist in that graph? Some of the common questions will be: What are the hot research areas at Cornell University? What are the patterns of scholarly collaboration that the scholarly record itself can reveal for us? Who are experts in what areas? Who are co-authors in which areas?
And the VIVO project has done a lot of this over the years, but we turned our attention to visualizations, where we really wanted to explore how the semantic knowledge base could meet the dynamic web. So that's really going to be the focus of this talk. And the reason that we have reframed and refocused our VIVO work at Cornell is that since VIVO first came out, the landscape has really exploded; everyone's getting a piece of the action. We have the whole Elsevier Pure product. We have Symplectic Elements, which we are actually using and partnering with. You have all these other players, ResearchGate, Academia.edu. Many, many players are entering this space of what you might call research information management systems. In fact, Lorcan Dempsey noted this several years ago, observing this landscape and reflecting on where the library's role is in it. So that was the question that Javed and I and the team really set out to answer: what is the special thing that we at Cornell University Library can do in this space? So again, we have all these players, and we have to figure out how we position ourselves in this landscape. The other thing I recognize is this notion that research profiles are everywhere. There's this proliferation of profiles. We started with VIVO profiles. There are ORCID records, not technically profiles, even though they're more data-oriented, but they're profiles. You have the Pure profiles, the Scopus profiles, the Mendeley profiles, the Symplectic Elements profiles, the OpenScholar websites, the ResearchGate profiles, the NIH profiles. And as a scholar, where do you begin? I wear two hats, because I have my scholarly hat in the Department of Communication at Cornell and my librarian hat, but it's like, where do you begin?
And so the question is, how do we make sense of this space that is getting a bit unwieldy? And yet we're talking about interoperability and linked data and knowledge emerging as this great knowledge network where everything flows and everything's interconnected. We have a lot of great work going on there, but we're still pretty siloed in some ways with these research information management systems. So again, how to make sense of all this? Is this just the messiness of an emerging knowledge infrastructure? Emerging infrastructure, as history will tell us, can be very messy as different players enter the game, things settle down, and there are redundancies, some winners, and some losers. So yes, I think it is an example of emerging scholarly knowledge infrastructure. Are these different players focusing more on an institutional perspective around faculty reporting and metrics? Or are they focused outward, on knowledge for the public good, knowledge for everyone, open knowledge? Proprietary versus open data? There are tensions there, and commercial versus open source versus hybrid technology, so in the technology dimension there are questions too. And there's a tension between isolated systems and interconnected networks. So where do we position ourselves? Well, definitely we're in the hybrid network kind of paradigm, again thinking of this as scholarly infrastructure, but really trying to think about the piece that we're doing, which is on the far right there: the Scholars at Cornell service, which we're conceiving of as more of a data and visualization service, unlike the traditional VIVO perspective. But we said, okay, we're in an ecosystem here, and there are upstream data sources that have citations to the scholarly record: Web of Science, Crossref, PubMed.
Then you have a player like Symplectic, with their Elements product, where they basically have feeds for all of these different upstream sources. Then we also have institutional data in play. So what we bring to the table with Scholars is what we call the Scholars feed machine, where basically, through this system of systems, we will eventually get an automated stream of the scholarly record, okay? But that's not enough, because there is a tremendous amount of inaccuracy in the data: missing data, incomplete data, even coming from the supposedly pure sources upstream, through to the intermediary layer with a product like Symplectic Elements that aggregates it all together. It's still not clean, and it's incomplete. There are all kinds of anomalies. We spent a tremendous amount of time working on code and algorithms that cleanse this data, and we try to push as much of the curation as possible through an automated stage, but there's always more to be done where manual curation is necessary. So the big point here is that data quality is key to this. And so as the library, we ask, what's our special sauce? Data quality, okay? This is where we can do something special: getting a really best-of-the-best version of the scholarly record at Cornell, and we use a VIVO-based internal engine, basically. But then the other big piece that we contributed was: how do you bring this data to life and answer real questions, and do it in an intuitive way? That's going to be the focus of our demo. So how do we motivate this project? Again, we're not denying the messiness. We're hanging in there and saying, no, we can contain the messiness. We're taking a public perspective on the data, unlike some of the Pure or Elsevier offerings that focus more on the institution's needs for reporting and faculty evaluation. Where's the library in that? That's not our role. Our role is knowledge.
So we're taking this public perspective of Cornell's output, which is knowledge. We're taking an open data perspective: all of the data that we have will be made open, through linked open data and in other ways. We're taking a hybrid technology approach, which means we're open to open source and commercial systems coming together to create the best architecture possible for what we want to achieve. And we assume interconnected networks are the world we're playing in. So in summary, we're starting with a VIVO-based backend, taking advantage of the excellent VIVO model, the ontology, and we're building a fresh front-end experience on it with these D3 visualizations. So basically the proposition is the best of both worlds: at the backend we have authority records, we can have inference, we have links to external knowledge sources, open access; and on the front end we're looking for, as much as possible, fresh looks and feels, data downloads, moving away from the kind of list views that you often see, directories of faculty members, directories of institutions, toward a more dynamic, interactive view. So with no further ado, I'm going to have Javed show you this, because I think the most fun part is actually seeing the demo. "I would found an institution where any person can find instruction in any study." That's what Ezra Cornell said 150 years ago, and that's what Cornell is right now. You can find domain experts in many fields. And if this is so, that means the scholarly contributions of these domain experts also vary significantly. They range from journal articles, books, and conference papers to newsletters, scripts, reviews, performances, plays, and so on. So Scholars at Cornell's supreme goal is to record most of these scholarly contributions, but for the start we focus on the easy ones, the publications. So I will give a small demo from our main website. Let's go to the main website.
So this is our main website, Scholars at Cornell. As Sandy mentioned, it's about research and scholarship across Cornell. If you look at current VIVO instances, you will see lists everywhere: persons lists, organizations lists, articles lists, grants lists, all kinds of lists. And lists are good to browse the data, to navigate to the data. But we want Scholars at Cornell to be used to discover data as well: finding a domain expert by a specific subject area or research interest; finding a specific article in a specific subject area with high global impact; discovering ongoing research grants on a specific topic, based on the investigators or based on the funding agencies. We are also using visualizations, which not only help to navigate through the linked data, but are also useful for surfacing the implicit knowledge which you can't see in the list views. And all the data behind these visualizations is downloadable. So we are working with specific pilot partners at Cornell right now. One of them is the College of Engineering, so I will just use this list view to go to the College of Engineering's page. And I'll select a specific department, biomedical engineering. So in the right-hand panel, you see the lists: the faculty list, the list of grants. But in the left-hand panel, you see some visualizations. These visualizations actually emerged from our continuous discussions with the pilot partners about what their needs are, what kind of information they want. One of them was to demonstrate the research interests of our faculty members. So we have this person-to-subject-area visualization for that specific department, if the network's okay. In the middle, you see the list of faculty members for that department. Across from it are the subject areas. There are two specific questions we have heard a number of times. What is the source of the subject areas?
And how did you identify this person-to-subject-area mapping? The answer to the first question is that these are Web of Science subject areas for now. And to find the faculty-to-subject-area mapping, there are three different ways I know you can do so. Option one is that the faculty member does it himself and says, okay, these are my research interests. Option two is that somebody curates this information for the faculty members. I know some people at Cornell are doing it: they go look at the websites of the department, they look at the profile pages, they look into the resumes of faculty members to find the research interests of a faculty member. Option three is to let the data speak, let the scholarly contribution data speak, and that's what we are doing here. So Web of Science categorizes journals into subject area categories, and all the articles published in a journal get the same classification. So what we say is: if the articles published in a journal get that classification, can't we infer that the authors of those articles also have some research interest in the same area? And of course these subject areas are quite broad: surgery, physics, physiology, pathology, and so on. So we can infer this information from the data itself rather than relying on a manual curation process. So if I hover over the faculty members, you can see in which areas they are publishing. You click on that, you see the specific faculty view. You can also see, okay, who else is working on the same subject area. You click on a subject area, you see the subject area view, and so on. The same visualization can also be seen as a subject-area-to-person mapping, okay. You can see who is publishing in surgery. Maybe they are potential collaborators. Maybe they have already worked together in collaboration with each other.
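The inference described here, propagating journal-level Web of Science categories up to authors, can be sketched in a few lines. The journal names, category labels, and record shapes below are illustrative placeholders, not the actual Scholars at Cornell data model:

```python
from collections import defaultdict

# Hypothetical journal -> Web of Science subject categories
JOURNAL_CATEGORIES = {
    "Annals of Biomedical Engineering": ["Engineering, Biomedical"],
    "Cancer Research": ["Oncology"],
}

# Hypothetical citation records: (author, journal the article appeared in)
ARTICLES = [
    ("King", "Annals of Biomedical Engineering"),
    ("King", "Cancer Research"),
    ("Shuler", "Annals of Biomedical Engineering"),
]

def infer_subject_areas(articles, journal_categories):
    """Propagate each journal's subject categories to the authors who publish there."""
    areas = defaultdict(set)
    for author, journal in articles:
        for category in journal_categories.get(journal, []):
            areas[author].add(category)
    return areas

areas = infer_subject_areas(ARTICLES, JOURNAL_CATEGORIES)
```

The same author-to-category table drives both directions of the visualization: read it author-first for research interests, category-first for potential collaborators.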
So if I select oncology here, I see, okay, there are five faculty members who have published in that area. And if you want to know a bit more about a faculty member, you can go further ahead; you take one step deeper here. There's a link to the faculty page. I click there, I go to the faculty member's page, okay. So Dr. King is a professor in the biomedical engineering department. Again, on the right-hand side we have the list of publications, but in the left panel we have this keyword cloud. Normally in articles there's a specific section called keywords. We have aggregated those keywords together and present them in the form of a keyword cloud. It's not only a fancy visualization; it helps in two ways. First, it can be seen as a fingerprint of a faculty member. You can see, okay, he's working on cancer, circulating tumor cells, breast cancer, and so on. Second, it can be used to filter the articles of a specific faculty member. For example, I can see what articles he has published on cancer. Click on that, and you see the list of the articles he has published on that specific topic. Okay, cell deformation, and so on. So now, if I select any one of them, I can go to the article's page. We're just moving from one page to another, navigating through the linked data. Okay, on this article's page, I would like to show three main things. We have the authors list and the journal in which it is published, the citation count, volume, published year, and so on. First, here is our traditional metric, the citation count. But we also have these Altmetric donuts here to show web-based impact. What we have noticed is that as soon as an article is published, citations don't come up very quickly. But these Altmetric donuts can show the web-based impact of that article straight away.
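The keyword cloud and its filtering behavior amount to one aggregation over the publication records; here is a minimal sketch, with hypothetical titles and keyword lists:

```python
from collections import Counter

# Hypothetical publication records with their author-supplied keyword lists
PUBLICATIONS = [
    {"title": "Capture of circulating tumor cells", "keywords": ["cancer", "circulating tumor cells"]},
    {"title": "Selectin-mediated cell adhesion", "keywords": ["cancer", "cell deformation"]},
]

def keyword_cloud(publications):
    """Aggregate keywords across one faculty member's publications; counts drive word sizes."""
    return Counter(kw for pub in publications for kw in pub["keywords"])

def filter_by_keyword(publications, keyword):
    """The cloud doubles as a filter: clicking a keyword lists the matching articles."""
    return [pub["title"] for pub in publications if keyword in pub["keywords"]]

cloud = keyword_cloud(PUBLICATIONS)
cancer_titles = filter_by_keyword(PUBLICATIONS, "cancer")
```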
I've seen some articles which were picked up by 13 news outlets but have a citation count of zero. So citation count is not the only metric you can apply to a journal article. We are part of the library, so wherever possible we also linked the journals to the library catalog page for the same journal, so a user can go and see what volumes and issues are available and can get to a specific issue. And wherever we have the DOI, we have a link here to the article page, the full text of the article, so somebody can go and get the full content of the article. Okay, so I'll go back now. I showed you a department page, a person's page, and an article's page. So let's go back to the department level. No, maybe I'll just go to the college level. Okay, another need we have heard a number of times from our partners is about collaborations. Who is collaborating with whom? How often do they collaborate? So we have these two visualizations here: interdepartmental collaborations, which shows how the departments within the same college are collaborating, and cross-unit collaborations. We are at the College of Engineering page, so Engineering goes in the middle, and we can see all the colleges around it here. So if I click on Arts and Sciences, now we're drilling down: Engineering to Arts and Sciences, and now we have the departments of Arts and Sciences. You can see, okay, who is publishing in that specific department. And then you can see the list of the faculty members who are publishing in collaboration with somebody outside the College of Engineering. Click on that, and you can see the output of that collaboration as well. So we can just move from one college to another, and you can see, okay, who is collaborating with whom, how often they collaborate, and what the output of that collaboration is. One more thing: again, the data is downloadable. All the data behind this visualization is downloadable.
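Counting collaborations like this boils down to tallying unit pairs over co-authored articles; a rough sketch, with hypothetical author and unit names:

```python
from collections import Counter
from itertools import combinations

# Hypothetical co-authored articles; each author is tagged with their unit
ARTICLES = [
    [("King", "Engineering"), ("Nelson", "Arts and Sciences")],
    [("King", "Engineering"), ("Shuler", "Engineering")],
]

def collaboration_counts(articles):
    """Count co-authored articles per pair of distinct units."""
    counts = Counter()
    for authors in articles:
        # Collapse duplicate units so each article counts a unit pair once
        units = sorted({unit for _, unit in authors})
        for pair in combinations(units, 2):  # yields nothing when only one unit is involved
            counts[pair] += 1
    return counts

counts = collaboration_counts(ARTICLES)
```

The same tally, keyed at the college level or the department level, would feed both the cross-unit and the interdepartmental views.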
It's in JSON format; you just take it and off you go. Okay, so let me go back to the home page. So we use these visualizations, again, for navigating through the linked data and for showing different views, some specialized views at the department level, college level, and personal level. But we also saw a need to show a Cornell-wide view, so we also have visualizations at the home page level. One of them is the research grants visualization that shows all the grants we have at Cornell. So it's big data. If you hover over it, you see the title of the grant; click on that, and you see the title, investigators, funding agency, and so on. But the main thing here is that you can filter it. You can just click here and say, I need to see the grants for a specific investigator. Okay, and here you go: you see the grants that person is working on. Similarly for the funding agencies. Okay, I want to see who is working in the cancer domain. Okay, here are the grants from the National Cancer Institute, NIH. And of course we can also filter them based on the activities and so on. Okay, so now let's go back to the slides. I would like to highlight two specific things from this demo. One was the VizVivo, the front end. The other was the curation channel, the backend thing, which you haven't seen in the UI. Okay, first about the VizVivo. If you look at current VIVO implementations, they're based on the VIVO ontology. And this ontology, this linked data, these are all technologies to model the data. They are backend things, but in current VIVO instances the front end and backend are coupled very tightly. In Scholars at Cornell, we are separating them, to a certain level, okay? And that's how we can use dynamic web technologies like Bootstrap, D3, Node.js, and so on.
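Decoupling the backend from the D3 front end in this way usually means serving each visualization's data as plain JSON. Below is a minimal sketch of shaping a person-to-subject-area mapping into the nodes-and-links payload a D3 bipartite view (and a download link) could consume; all names and the payload shape are assumptions for illustration, not the actual Scholars at Cornell API:

```python
import json

# Hypothetical person -> subject-area mapping pulled from the backend knowledge graph
MAPPING = {
    "King": ["Oncology", "Engineering, Biomedical"],
    "Shuler": ["Engineering, Biomedical"],
}

def to_d3_bipartite(mapping):
    """Shape the mapping as the nodes/links JSON a D3 bipartite view typically consumes."""
    people = sorted(mapping)
    subjects = sorted({s for subs in mapping.values() for s in subs})
    nodes = ([{"id": p, "type": "person"} for p in people]
             + [{"id": s, "type": "subject"} for s in subjects])
    links = [{"source": p, "target": s} for p in people for s in mapping[p]]
    return {"nodes": nodes, "links": links}

# This JSON document is what the visualization renders and what a download link serves
payload = json.dumps(to_d3_bipartite(MAPPING), indent=2)
```

Because the front end only sees this JSON, the rendering layer (Bootstrap, D3, Node.js) can evolve independently of the RDF store behind it.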
But again, one more thing: we're turning data into knowledge, okay? If you look at the citation data, you can understand, okay, one person, who is affiliated with a specific department or organization, has co-authored an article with somebody else, and the article is published in a specific journal. But when you aggregate that citation data together, you can surface some really good knowledge: who is collaborating with whom, how often they collaborate, the fingerprint of a faculty member, the domain expertise, the research interests, categorizing the research grants based on a specific subject area, and so on. So that's what we're doing here. And the second thing is the curation channel. As Sandy mentioned, we are using Elements as a backend to record all of this citation data. Elements goes out on the web, harvests the data from different upstream sources, and gets the data into Elements. And in Elements, you define, okay, one of them is my preferred source. For example, in this example, PubMed is my preferred source. So the next step is how you get the data out of Elements and convert it into linked open data. If you use the provided harvesting API from Elements, you get only one viewpoint: they give you the citation data entry for your preferred source, and that's it. That's not enough for us. So what we do is take this data and pass it through our curation channel, okay? All this data coming from Web of Science, PubMed, and whatever other sources we have for a specific article, we merge together into one single record called the Uber record. If that Uber record is clean and complete, it can go straight into Scholars at Cornell, converted to RDF. If it's not clean, then we have to make it clean and consistent, so we use the curation bin. Now this curation bin can be used to clean the data, and it can also be used to add additional information to that specific record.
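The Uber record merge can be pictured as a field-by-field preference over the source fragments. This is a simplified sketch with made-up field names, values, and preference order, not the actual Scholars at Cornell pipeline:

```python
# Hypothetical per-source fragments for one article (field names and values are illustrative)
FRAGMENTS = {
    "pubmed":         {"title": "Cell adhesion under flow", "authors": ["King, R.", "Nelson, A."], "issn": None},
    "web_of_science": {"title": "Cell adhesion under flow.", "authors": None, "issn": "1234-5678"},
}

# Per-field source preference: take the first source that supplies a value
FIELD_PREFERENCE = {
    "title": ["pubmed", "web_of_science"],
    "authors": ["pubmed", "web_of_science"],
    "issn": ["web_of_science", "pubmed"],
}

def build_uber_record(fragments):
    """Merge the source fragments field by field into one best-of record."""
    uber = {}
    for field, sources in FIELD_PREFERENCE.items():
        for source in sources:
            value = fragments.get(source, {}).get(field)
            if value:
                uber[field] = value
                break
    return uber

def is_complete(record, required=("title", "authors", "issn")):
    """Clean, complete records go straight to RDF; the rest are routed to the curation bin."""
    return all(record.get(field) for field in required)

uber = build_uber_record(FRAGMENTS)
```

The point of the sketch is the routing decision at the end: records passing the completeness check flow on automatically, everything else lands in the curation bin for human attention.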
For example, we can add the ORCID iDs for the authors, we can add the GRID IDs for the organizations, and so on. So in the end, in Scholars at Cornell we are giving the best of the best scholarly record. And that's where the value actually is. And in the future, we plan to push this best-of-the-best scholarly record back into Symplectic Elements. So now I'll hand back to Sandy, and Sandy will wrap up. So when we set out to do this, it was really in February that we started. And we had a number of really key questions. The biggest challenge we were facing at Cornell was that it is a highly decentralized university, particularly around things like research management. Basically, every department has the right to choose what they want to do to manage their research information. There are a bunch of departments using Activity Insight. There are a few of them that were actually using VIVO raw, using it for data entry as if it were a full research information management system. Pure has been on campus and has not gotten traction yet, but Symplectic Elements did. And so we are actually working with the College of Engineering and the Johnson School of Management, and there's a prospect of having the whole College of Business involved. So we have a set of pilot partners. We really had to approach this as a pilot with those that wanted to go down this path with us, to see how much we could achieve through this automated feed of the scholarly record. A lot of manual entry was going on in a lot of departments, and to tame this would be kind of an overwhelming university-wide initiative. The provost's office does not mandate a certain system of record for scholarly output. The Office of Research doesn't either.
So the question for us was: can the library nudge a university-wide, bottom-up, coordinated process around managing faculty data, managing faculty citation data, grant data, faculty profiles? And that's still a question. So we're trying to create a system that is compelling enough that when we unveil these pilots over the next six months, more people will be ready to come on board. The provost's office is now quite intrigued. A lot of people are getting the word about this demo and what could be possible here. So this is an attempt to have the library do something that the library is not empowered to do: to try to get a coordinated response around managing research information. But then you'd have to ask, well, what is the role of the library? What is the role of the academic units? What do the provost's office and the Office of Research think about this? These are open questions. I would say this is the classic socio-technical system as a project, where the socio part, in terms of the people, the organizations, the culture, the politics, is the big challenge here, okay? Less so the technology, because we've been able to do some really cool stuff in a quick amount of time. So data quality, this is the issue here. We took the stance of: how far can we push automation? How much quality can we infuse into this data through algorithmic means? And a fair amount. We really have done a lot of remediation of this data: missing ISSNs, journal names that are messed up, all kinds of authorship problems, all kinds of duplicates. So we've done a lot. There are still some records, for the College of Engineering, for instance, that need manual curation; we're in the hundreds of records there, which isn't bad, because it used to be in the thousands, the high thousands.
So we've been able to make real progress on this automated curation with the Uberization process, the Uber records. Again, this is the notion of taking the fragments from the various upstream sources and pulling the one that has the best title, the one that has the best journal metadata, the one that has the most accurate author list, mashing it all together, and then doing some additional remediation by calling out to authorities in the process. So we've made some really good progress there. Another one of the challenges here, and an open question, is: who is the user? We have many user stories here. Currently, we're engaging at the department chair and university administration level. There seems to be a real need there: the questions they'll ask are things like, how many people have collaborated across the sciences and the social sciences around, say, social entrepreneurship? Questions like this. And they have to go on a hunt and gather together all this data manually. So they are really interested in the kinds of things that we can pull out of this knowledge network, this semantic network, with the visualizations making some of their job really easy where it was formerly very manual. The university communications department seemed to love this. It seems to have this kind of outward-facing appeal: look at what Cornell is doing, and you can interactively explore it. So interestingly enough, the departments of communication and outreach seemed to like this. Not surprisingly, the faculty view this stuff as kind of a burden; with these research information management systems, they would just as soon have administrators put the data in. However, faculty would love to have their websites fed by a clean data source.
So as Javed said, for every visualization we have, you can get the data in all the standard formats: JSON, RDF, XML, even CSV, whatever. And we're going to open up the number of formats with the idea that faculty websites could use this as a trusted source. And again, there's this external view for prospective students and faculty, which has always been something VIVO has had an appeal for, people coming to see who's in what departments at the university. But interestingly, we're seeing the most traction right now at this higher administration level and in communications, for public views of what's happening at Cornell, what's hot at Cornell. And then the other main question is: what is the investment to sustain this? One of the big questions we have, and we're finally in a position to answer it because we just created a human curation tool, is: what does it take for a human to curate the bin? After all of the automated stuff is done, and there's still stuff that was kicked out as not of appropriate quality, who's going to curate the bin? The Symplectic Elements people have indicated that a lot of libraries are actually stepping up into that role of the data curator, but is that the department's role? Is it the library's role? Whose is it? How much effort is it? So this month we're going to do an experiment in terms of how much it takes to figure out what's wrong with a record, or why it's a duplicate, and to fix it and let it go into Scholars. These are the kinds of questions that we're working on. This is phase one. We just wrapped up phase one, which ran from February to the end of the year, and we're going to do phase two with the idea that we will launch this in July.
So, to wrap it up, this could not have been done without the work of an entire team, including our predecessors, Jon Corson-Rikert and Kathy Chiang and several other people. To me it's a great example of how many pieces of the puzzle need to be done really well: the backend, the front end, the human process, the coordinating with the vendor, the coordinating with the departments. It really took a whole team to pull this off. We have a Twitter account, but we also have kind of a virtual brochure: if you go to about.scholars.cornell.edu, there's a video and a few other things, and for a limited period of time we're opening up the demo instance; here are the credentials so you can play around with it. Since we're still evolving it, we're going to leave this open for only a limited period of time, but people can of course always contact us. So, I'll end there, and we can take any questions that you have.