Welcome everyone to the ANDS webinar on identifying researchers. We have two special guests today: Jim Blake and Chris Westling from Cornell University. Jim is the software release manager for VIVO at Cornell. He has been involved in the implementation of VIVO at Cornell University, and he has a great deal of experience in interoperability between research profiles at Cornell and other universities. Our other speaker today is Chris Westling from Cornell. Chris is a data analyst for the Cornell VIVO project, with a great deal of experience in author disambiguation, dealing with dirty data, and aggregating data about researchers from different sources across the university. Despite the time zone difference, Jim and Chris kindly accepted our invitation to join this webinar from Cornell's Ithaca campus. If I'm not mistaken, it is 8 p.m. now in Ithaca. So, a special thanks to Jim and Chris for accepting ANDS's invitation. Before we start the discussion, just a quick bit of housekeeping. Chris and Jim are connected to the webinar through one account, so they are both attending the session from the Chris Westling account; whichever of them speaks, that is the account that will be visible. They are also not the only people connecting from the United States: I am connected to this webinar through a New York account, although I'm actually in a different place. We did a test yesterday, and unfortunately the internet connection that I'm using is not that good, so if there are any unexpected drops in the connection, Alex is going to pick up the session. Now, let's get to the main discussion about identifying researchers. In the last few years, a lot of Australian research institutions went through the process of implementing research profile management systems and platforms. The common observation we had was the problem, and the challenge, of linking researchers to their publications, grants and research data.
It is a common problem across many of these projects. It is not uncommon to see researchers with similar names. Researchers also often move across different institutions, and in some cases a researcher has multiple records within a single institution. All of these inconsistencies and duplicate records pose a great challenge for creating a seamless profile management system for researchers, one that shows the outcomes of the research for individual researchers and also for the university in general. I initially met Jim at a VIVO conference in Melbourne, and we had an interesting discussion about the problems Cornell had while Jim was involved in implementing VIVO there. What was interesting to me was that, although Cornell had a different approach to creating a profile system for researchers, the similarity between the experiences Cornell had and our experiences in Australia was very close. So, with this introduction, I am handing this presentation over to Jim. Thanks, Jim, for joining us in this presentation. Okay, thank you, Amir. Let me see if I can get through the technical aspects of going into slide share mode. Yeah, I think we've got there. I'd like to say good morning to everyone, or as we in the States like to say, good evening. Amir mentioned how we met: he heard a talk that I gave about VIVO and about some of the issues that we had. When he suggested that it would be nice for me to talk to your group, I had to confess to him that much of my information had come from an interview I had done with Chris Westling to prepare for that very conference. So rather than have me try to tell you what Chris told me, I decided that we could both share this. I want to give you a little bit of context as to where VIVO came from and what our situation is, which, as Amir mentioned, is in some ways very similar to yours and in other ways may be very different.
So, a little bit about who we are and where we came from, and then we'll get into the question of dealing with issues in the data, and we'll save some time at the end for questions and answers. VIVO is a research networking application, and it uses semantic web technologies to store its data; it uses the semantic web technologies from the inside out. We keep all our data in what's called a triple store. We make all our data available to data requests using what is called linked open data, and the raw data is simply available for people, or for other applications, to query. In between, we also have a front end on the data, our web application, that displays it for people. We include an editor, and in speaking to some of the folks in Australia I found that the editor was of little interest to some people, since they ingest all the data and don't allow individuals to edit it. We have a mixture of both: some data which we ingest and don't allow others to edit, and other pieces of information, such as a person's preferred title, which we let them go in and change on their profile if they choose to. So we have the data itself, we have the editor, and we have the access point, what I've heard called a data publisher, so that other applications on campus or elsewhere can use the data, and that is what uses the linked open data protocol I mentioned earlier. One application that uses it is a multi-site search application, where the data from many VIVO sites is all collected into a single search index. A nice thing about that is that rather than your usual federated search, where you might come back with six results from Cornell, three results from Harvard, and so on, and then see the separate lists, because we have harvested all that data from the various VIVOs we're able to rank the results across the entire set of data rather than within each individual institution.
I mentioned Harvard, and that actually brings me to my next point: we have this shared open platform. Open, of course: you know about open source software, and VIVO is open source, but the other aspect is that we have this open data interchange format called the VIVO ontology. I know I've been working on this project too long when someone starts talking about cancer doctors and I think they're talking about ontologists, that is, people who deal with the structure of data. So perhaps it's time for me to take a vacation, but right now I'll continue here to say that the VIVO ontology is a way of structuring the data, a way of identifying relationships among the data, so that instead of saying that a person is the instructor of a class, or has an instructor role in a class, we would say the person teaches a class. Actually, that's probably imprecise, since I'm not an ontology specialist, but the point I'm trying to make is that we agree on what that relationship will be called. So then Harvard Profiles, which is not the VIVO software but is publishing its data using the VIVO ontology, will use that same terminology to express the same relationship, and so it can be accessed by any application that can access VIVO data. We're also seeing other applications taking up the VIVO ontology so they can be part of that shared data community. Oh gosh, VIVO started a long time ago, in a small village on a hill above Cayuga Lake in upstate New York. It's a little difficult for me to talk about the history, since the fellow who originally created VIVO is sitting here beside me, prepared to frown at anything I might say that is incorrect, including that, apparently. It was devoted strictly to the College of Agriculture, which became the College of Agriculture and Life Sciences here at Cornell, and as it expanded out to other disciplines, the people who were using and developing VIVO publicized it, went to conferences, presented papers, and that sort of thing.
I spoke with Simon Porter of the University of Melbourne, and he said: we found that we had to do research networking, and when we started to develop our own application we heard about VIVO and realized these people have done two thirds of our work for us. So the Melbourne folks are now using VIVO; if you go to their Find an Expert site, that's all VIVO under the bonnet. Bonnet, or hood, in Australia? I'm never sure. In 2009 we received a grant from the National Institutes of Health here in the States and ramped up the development team. We added features and functionality to the application, but one important thing we also did was do our best to turn it into a product. The people who adopted VIVO early on (the University of Melbourne, Griffith, also the University of Florida here, and Indiana University) were all leading-edge people willing to hack their way through the wilderness and figure out how to use it. In order for VIVO to gain a wider acceptance, we needed to package it more nicely. We needed to have instructions. We needed to have workshops. We needed to have a conference where people could come and talk about VIVO. We needed to have implementation instructions and mailing lists and all those outreach aspects that build a community of people using VIVO. We were very much aware, and so was the National Institutes of Health, that they weren't going to fund us forever, and so part of our goal became to create a self-sustaining VIVO project.
I suppose it remains to be seen whether we've done that successfully, but our next step, with the end of the NIH grant, has been to affiliate ourselves with DuraSpace, a not-for-profit corporation that handles software products such as the Fedora Commons repository software, also the DSpace repository software, and a couple of other pieces. They thought that we aligned very nicely with their market space; they were wanting to start an incubator program at DuraSpace and were looking for that first project. Well, how nice that was, because we were looking for someone to relieve us of the necessity of creating a not-for-profit corporation and figuring out all the legal aspects of that, and DuraSpace has been very helpful to us in that regard, and also in sharing the expertise they've developed over the last several years of working with open source products. I've just got a little map here pointing out the various sites of the universities and colleges that were involved in the National Institutes of Health grant, and this is our sort of bragging map of how many VIVO implementations there are worldwide. There are lies inherent in this map. One of them is that many of these implementations are not in production; they are people who are working to prepare VIVO for use at their institution. So you see five locations in Australia and two in New Zealand; I'm not sure how many of those are actually up and running, and of the three dozen or so in the States, I think we have about a dozen that have a public-facing persona. Others are very close to it; others still are just experimenting. You see the different colors, and this is the second lie. The red flag represents a Harvard Profiles system. As I mentioned earlier, they're using the VIVO data structures but not the VIVO software itself. The yellow flag represents a Loki implementation, and they are in a similar situation.
And I believe somewhere in the upper right corner of the States would be the tip of another red flag, to represent Harvard itself; it's a little cluttered over there. So VIVO does three things, or rather, it's used for three things. One is gathering data. One is providing an opportunity for us to review and edit the data. And finally, dispensing the data: making it available through the web application itself, and also making feeds available to other websites on campus and across the network. Just a little glance at what VIVO Cornell looks like, a couple of pages here. We have people in the system. We have faculty members, and I could break that down into emeritus faculty, current faculty and adjunct faculty. We have non-faculty members, librarians, et cetera. We have organizations within Cornell: each individual college within Cornell, and some departments and, what would you say, laboratories and research projects. And of course we have the research: pages and pages and pages of journal articles, books, grants, posters that were delivered at conferences, papers that were delivered at conferences. This is the meat of the system, and Amir assures me that this is the meat of yours as well. I hope I got that right. Yes, you did. So, some quick statistics. I'm reminded of a very excellent book that was published in the late 1950s entitled How to Lie with Statistics, and I'm about to do just that, because VIVO lies to me. This is where I'm leading into Chris's part of the presentation. Although I can say that Cornell VIVO holds 138,000 distinct information resources, I'm not sure that they actually should all be distinct. We have 93,000 persons in our VIVO system. Of those, about 11,000 are actual Cornell persons that we know about. The others may be co-authors on the papers that Cornell people have written, and so they appear in VIVO, or co-investigators on the grants that Cornell faculty are investigators on, and so they appear in VIVO as well.
And once again, although we can say that the 81,000 (I'm sorry, not the 11,000) are unknown persons, they may in fact represent fewer than 81,000 people, since we have trouble making sure that each person is represented only once. How am I doing here, Chris? That's just about right, Jim. I'll jump in and say, back up just a second: the takeaway here is that only about one in eight people in Cornell VIVO is a known Cornell person. We have quite a few of these unknown persons: co-authors that have come in and been generated with a unique ID, each representing a co-author out in the world somewhere who may be one of our Cornell researchers, but we're just not sure. This is the point at which our machine disambiguation kind of falls down. Briefly, I'll introduce myself. I'm a data analyst for VIVO at Cornell, and my day-to-day job is sort of the care and feeding of VIVO. I work as part of the implementation team; we've got about five or six people that we work with. My boss, you might say, is a VIVO evangelist: she takes VIVO out to the faculty and the administration on campus and makes sure that everyone understands what's going on as best they can. We have a full-time programmer, and the duties of myself and another part-time editor comprise the cleaning and the day-to-day maintenance of VIVO. We employ a number of student workers to help us with the heavy lifting, and we have part of an interface design specialist's and a web designer's time, so there are quite a few people that make VIVO go. This is how it looks. Today we're going to focus on a really basic example. This is our friend Anthony Ingraffea. He's an engineer, specifically a civil engineer, and if we had the time to get to know him better, we'd understand that he's interested in the way things break apart. He's interested in stress fractures and cracks in things; he's a materials scientist.
As we go down through this example, I'm going to show you the parts where that becomes kind of important. This is his VIVO profile page, as we like to call it. It's got information about his positions and some basic detail about his research areas, and down here along the bottom row we have tabs that take us out to his publications or the courses that he teaches, and give us more information about his service to the university, and so on and so forth. One of the things that's kind of interesting about VIVO is that we can click on this little link icon in Anthony's profile and see this URI, or Uniform Resource Identifier. This is Anthony's unique identifier in VIVO. This is how we refer to him, and it looks like a web address, and in fact it is: if we put this into a browser, it would resolve to this same page and show us Anthony's smiling face here. But one of the interesting things we can do is click on this "view profile in RDF format" link, and we get something that remarkably does not look like a web page. That's because this is RDF, the Resource Description Framework. This is the stuff that makes VIVO go behind the scenes, and we can take a look in here. This is not designed to be intimidating, but we can see different things, like that Anthony's middle initial is R, and that he has some research areas. He's also known as the Dwight C. Baum Professor of Engineering. So if we skip forward, we can talk about RDF triples. I'm going to give you the most basic of introductions here. RDF triples are made up of subject, predicate, and object. In this case, we see Anthony's unique identifier, his URI, and we can look over here to see that this predicate gives him a label, where the label is just a plain-English way to describe his name. We also have predicates for his last name, his first name, his middle initial, and a NetID, which we use as an internal identifier. The idea here is that the semantic web gives you the data structure behind the scenes.
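The subject/predicate/object structure just described can be sketched in a few lines of Python. This is only an illustration: the URI, the predicate names, and the person below are invented for the example, not actual Cornell VIVO data.

```python
# Illustrative sketch: RDF triples as (subject, predicate, object) tuples.
# The URI, predicates, and person here are invented for this example.
PERSON = "http://vivo.example.edu/individual/n1234"

triples = [
    (PERSON, "rdfs:label", "Doe, Jane R."),   # plain-English label
    (PERSON, "foaf:lastName", "Doe"),
    (PERSON, "foaf:firstName", "Jane"),
    (PERSON, "vivo:middleName", "R"),
    (PERSON, "ex:netId", "jrd99"),            # internal identifier (made up)
]

def facts_about(subject, store):
    """Collect every predicate/object pair asserted about one subject."""
    return {pred: obj for subj, pred, obj in store if subj == subject}

# If a person has no middle name, the triple is simply absent: RDF has no
# notion of a NULL column, which is part of the flexibility described above.
print(facts_about(PERSON, triples)["foaf:lastName"])
```

The point of the sketch is the shape of the data, not the storage: a real triple store indexes these statements, but every record is still just a bag of such three-part assertions.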
The idea is that instead of looking at an HTML webpage with Anthony's information, we can also get back these triples. We can look at RDF and do different things with it. The neat thing about RDF is that you can describe people and things, just about anything in VIVO, with however much information you need. If Anthony didn't have a middle name, for instance (there are plenty of people who don't), we'd omit this particular triple and it wouldn't exist in Anthony's record. If he had additional information that we wanted to put in, we'd have triples for those, and we could move on. Again, as Jim mentioned earlier, the ontology is really the way in which you describe your data, and you can customize the ontology in VIVO to fit your data needs. Again, this is not meant to be intimidating, but it kind of shows you the relationship between things in VIVO. Over here is a person, and it's related to an academic article. We might imagine that this is Anthony over here, and here's an article that he authored. We carry information about the article, specifically the pagination information and the title of the article. We also might continue to track information about the journal the article was published in. All of this is done through RDF and these relationships between things. In this particular case, it's all done through a central authorship that links both ways to the person and the article. Moving to data ingest, this is a diagram that basically shows all of the different inputs and outputs of VIVO, VIVO being this blue bar in the middle. Over here we see things like HR data coming from PeopleSoft, and OSP, our Office of Sponsored Programs, is the entity that records all of our grants information. And then there's the faculty reporting system, which is where we get much of our data. We also push data out of VIVO: we populate a number of websites across campus with the data that's in VIVO, and we do that via the same RDF that we use to generate our web pages.
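That central authorship pattern can be sketched the same way. The predicate names below are loosely modeled on the VIVO ontology's style but are simplified assumptions for illustration, and the URIs and article are invented.

```python
# Sketch of an authorship node linking a person and an article both ways.
# URIs, labels, and predicate names are illustrative assumptions.
person = "ex:person/n1"
article = "ex:article/n2"
authorship = "ex:authorship/n3"

triples = [
    (authorship, "vivo:relates", person),     # authorship points to the person
    (authorship, "vivo:relates", article),    # ...and to the article
    (person, "vivo:relatedBy", authorship),   # links run both ways
    (article, "vivo:relatedBy", authorship),
    (article, "rdfs:label", "Fracture toughness of welded connections"),
    (article, "bibo:pageStart", "101"),
    (article, "bibo:pageEnd", "115"),
]

def articles_of(p, store):
    """Follow person -> authorship -> everything else the authorship relates."""
    ships = [o for s, pr, o in store if s == p and pr == "vivo:relatedBy"]
    return [o for s, pr, o in store
            if s in ships and pr == "vivo:relates" and o != p]

print(articles_of(person, triples))  # prints ['ex:article/n2']
```

Keeping the authorship as its own node is what lets VIVO hang extra facts (author order, corresponding-author status, and so on) on the relationship itself rather than on the person or the article.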
The takeaway from this is that data ingest is complicated. We have put a lot of resources into this effort, and it's been a long road, but we're doing pretty well as far as the ingest is concerned. Just a couple of quick bullet points. One of the early ingest tools that most new VIVO implementers get introduced to is the Harvester. It was originally developed by the University of Florida as part of the National Institutes of Health grant. It does a really great job with PubMed and other flat-file data formats, and it's normally the first method of ingest that many people use when they're getting used to VIVO. Weill Cornell Medical College, which is affiliated with Cornell University in New York City, and the folks at the National Agricultural Library at the USDA collaborated to make an ingest process that utilizes Google Refine, now called OpenRefine. Here at Cornell we use several different custom processes, and most of those have been the result of four or five years' worth of trial and error in programming. In our particular case, we use a custom process for internal faculty reporting. We leverage a product called D2R to wrap flat files and CSV sheets so that we can query them in much the same way that we query VIVO. A lot of resources and energy go into creating the weekly data ingest for VIVO. We have weekly processes that almost all run automatically. Most notably, our HR process is in flux because of HR's recent change to a different system; we now have to go ahead and rebuild our HR ingest from scratch. Can I just ask whether "HR" is known? That's our term for human resources, or personnel. Thanks, Jim. Our programmer, Joe McInerney, has worked long and hard on algorithms that will take the ingested data and match it to things in VIVO. This process of disambiguation mostly relies on matching name parts: last name, first name, middle initial, and any other identifying material that we can find. We can also use researcher IDs if they're available.
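A name-parts matcher of the kind just described might look something like this. This is a hedged sketch of the general idea only, not Cornell's actual algorithm; the weights and the 0-to-1 scale are arbitrary assumptions.

```python
# Illustrative name-parts matching score. NOT the production algorithm:
# the weights and rules here are invented to show the idea.
def name_match_score(candidate, known):
    """Score how well an incoming author record fits a known person.
    Each dict may carry 'last', 'first', and 'middle' keys."""
    if candidate.get("last", "").lower() != known.get("last", "").lower():
        return 0.0                     # last names must agree
    score = 0.5                        # base credit for matching last name
    cf, kf = candidate.get("first", ""), known.get("first", "")
    if cf and kf:
        if cf.lower() == kf.lower():
            score += 0.4               # full first-name match
        elif cf[0].lower() == kf[0].lower():
            score += 0.2               # initial only: weaker evidence
        else:
            return 0.0                 # conflicting first names
    if candidate.get("middle") and \
            candidate["middle"][:1].lower() == known.get("middle", "")[:1].lower():
        score += 0.1                   # middle initial agrees
    return score

# A bare "A. Ingraffea" scores well but not perfectly against Anthony R.
print(name_match_score({"last": "Ingraffea", "first": "A"},
                       {"last": "Ingraffea", "first": "Anthony", "middle": "R"}))
```

The gap between an initial-only score and a full-name score is exactly where the human review described later comes in.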
Whatever can't be matched through that ingest process gets minted in VIVO with a brand-new identifier. If I'm unable to match a particular person, author, or co-author in the ingest process, I'll go ahead and create a new person in VIVO and give them a brand-new identifier. That, in a nutshell, is how we got up to 81,000 people in VIVO. We're then able to go through and clean that up; we've got some custom tools that we use to do that, and the relationships between the data allow us to go in there pretty quickly and help clean it up. The takeaway here is that human intervention is always part of this equation. At the end of the day, even the best disambiguators can really only hit about 80 to 85 percent, and there's still quite a lot left over for humans to get involved with. We'll jump back into Anthony's profile here, and I'll give you a real quick tour of a specific problem that came up. This is real-world stuff. Here's Anthony's RDF again. Again, we're describing his last name, his first name, his middle initial, and a NetID for him, which is our internal identifier. If we looked at Anthony's publications, we'd see evidence of his interests: keywords like fracture, crack formation, and fatigue show up a lot in his papers. These are academic articles that he's published, and as we bring more information in for Anthony, occasionally authors and co-authors get generated that may indeed be a match for him, but the machine, the ingest process that is, wasn't able to go ahead and fix that up for us. This is an excerpt from URI Tool, which is a custom process that we developed here at Cornell to help us sort these things out. More often than not, when we look at these cleanup strategies, the idea is that we want to present the information in a useful way that a human can understand and help clean up rather rapidly.
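The minting step described above can be sketched as a small loop: try every known person, and if no score clears a confidence threshold, create a fresh identifier. The URI scheme and the 0.9 threshold are assumptions for illustration, not VIVO's actual values.

```python
# Sketch of "minting" a new identifier for anyone the ingest cannot match.
# The URI scheme and threshold are invented for this illustration.
import uuid

people = {}  # known persons: uri -> name-part dict

def ingest_author(author, score_fn, threshold=0.9):
    """Return the URI of the best confident match, or mint a new person."""
    best_uri, best = None, 0.0
    for uri, known in people.items():
        s = score_fn(author, known)
        if s > best:
            best_uri, best = uri, s
    if best >= threshold:
        return best_uri  # confident match: reuse the existing identifier
    new_uri = "http://vivo.example.edu/individual/n" + uuid.uuid4().hex[:8]
    people[new_uri] = author  # unknown person enters the system
    return new_uri
```

Run this over a stream of author records and every sub-threshold name mints a fresh URI, which is exactly how an instance accumulates tens of thousands of "unknown persons" awaiting human review.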
In this case, what I did was search in VIVO for all person elements that matched by last name. So what we're looking at is all 93,000 people in VIVO, and of those we have these five that are grouped together by virtue of the fact that they have the last name Ingraffea, the same as Anthony. Up here at the very top, we see that Uniform Resource Identifier, the URI or unique identifier, for Anthony. So we know that this entry at the top is the real Anthony Ingraffea. If we look at the next one, we can see why the machine didn't match it: a last name and a first initial really aren't enough to match this record definitively with Anthony R. This could be Alice or Andrew, but what we want to do is figure out whether this article was indeed published by Anthony. We can throw the next two out pretty quickly: first initial J and first name Janet are not going to be a match for Anthony. In this particular case it's just an artifact of the way I've asked for this query; the machine was able to determine that these don't match Anthony, but in fact they may be a match for one another. And down here we have one more Ingraffea; we'll also take a look at that as we go forward. What's also interesting to note here is that we have that unique internal identifier, the NetID, for both of these folks, so that helps us figure things out. We'll look first at this Ingraffea, A. We can look down at the academic articles that generated this unknown person, and we see: material damage, welded beam column, moment connections, fracture toughness. These two publications, if we were to look a little deeper, do indeed belong to our friend Anthony, so we'll go ahead and plan to merge those in. If we go to the next one, however, this is the Ingraffea with first initial T that we saw a moment ago, and this conference paper doesn't really mention anything in Anthony's particular speciality. So what we need to do is go just a little deeper.
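The review screen walked through above amounts to a simple grouping query: pull every person record sharing a last name so a human can compare the candidates side by side, with identified people listed first. The records below are invented stand-ins for the five entries on the slide.

```python
# Sketch of the cleanup query: group person records by last name for human
# review. The records are illustrative, not real Cornell data.
people = [
    {"uri": "n1", "last": "Ingraffea", "first": "Anthony",
     "middle": "R", "netid": "ar1"},
    {"uri": "n2", "last": "Ingraffea", "first": "A"},
    {"uri": "n3", "last": "Ingraffea", "first": "J"},
    {"uri": "n4", "last": "Ingraffea", "first": "Janet"},
    {"uri": "n5", "last": "Ingraffea", "first": "T"},
]

def candidates_by_last_name(last, records):
    """List records for side-by-side review, identified people first."""
    hits = [r for r in records if r["last"].lower() == last.lower()]
    # Records with a NetID are known Cornell people; sort them to the top.
    return sorted(hits, key=lambda r: "netid" not in r)

for r in candidates_by_last_name("Ingraffea", people):
    print(r["uri"], r.get("first", "?"), "netid" in r)
```

The sort key is the whole trick: the reviewer always sees the authoritative record first, then the unknowns that might merge into it.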
The way we normally do that is to follow the publication reference. In this particular case, we're going to dig deeper into the paper. We're going to look at the title up at the top, and we can see all of the co-authors for this paper. And we know that Geri Gay is a researcher here at Cornell. We can tell that right in this view by the fact that Geri's first name is the only one of all the co-authors' that exists in full. We happen to know that Geri input the information for this paper, and she also went ahead and typed in the information for her co-authors here. But what we really want to know is: is this our author? Can we find out more about this? Most of the time, the thing that settles these questions is a manual Google search. We also have library access here and can reference a number of databases. In this particular case, the first hit for this title is the ACM. We'll also note that the next two hits are VIVO, another example of why VIVO is a really good search engine for the sorts of things that we're looking at. If we click here, it takes us to the ACM Digital Library, and we can see unequivocally that this paper was authored by Geri Gay and Anthony Ingraffea. What happened behind the scenes is that when Geri and Anthony get together socially, she calls him Tony, and so when she went to put in this reference, she typed in his first initial, T, pursuant to the particular citation guidelines she was using. So we can say unequivocally that this is the right author for the paper, and we'll pop back into our custom tool to merge these up. All we're doing here is picking the first, real Anthony Ingraffea, the Ingraffea, A. for which we've produced a good match, and the Ingraffea, T. We'll put all of those together. We might also decide to put Tony's nickname in here.
And when we merge those, what happens is that we get the same RDF back, but we've also extended it a little bit with an "also known as" ontology extension that we developed. In this particular case, the first name might also be T or Tony, and we might also use the initial A to describe Anthony at that point. This allows us to layer additional information on top of the authoritative information we already have, and it helps us with disambiguation next time. The next time we run across a T. Ingraffea in our ingest, we're going to take a look at this. It doesn't mean it's a match outright, but it allows us to say there's a probability that Anthony and this Tony are the same person. The challenge is that the problem never really goes away, and in this case we've got another record in the system. The problem here, of course, is that the last name isn't "Ingraffea", it's "Ingraffea." with a trailing period, and so dirty data sometimes gets in the way. We can easily merge this one, because we know that Anthony's involved in methane and shale gas hydrofracturing work, but with dirty data we never really escape the work that we need to do. And the merge goes back and relates all of those records as well. I'm just wondering, have the slides been advancing appropriately, Alex or Amir? Yes, the slides are fine, and we have very good audio with the slides. Okay, I'm sorry, a little technical glitch here made me wonder. Okay, so we'll pick up here and talk about how we deal with things when they don't go our way. These are the sort of flippant policy choices that we came up with, and the actions that we decided to take. The idea here is that in most of the cases where something is bad, we really want it to be fixed in the source data. We don't want to make edits in VIVO and then find out on our next ingest that the data coming in will overlay the changes we've made.
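The "also known as" idea can be sketched as aliases learned during manual merges feeding back into future matching as hints. The data structure below is an assumption for illustration, not the actual ontology extension.

```python
# Sketch: aliases recorded at merge time become hints for the next ingest.
# The structure and names are illustrative assumptions only.
known = {
    "n1": {"first": "Anthony", "last": "Ingraffea",
           "also_known_as": {"Tony", "T", "A"}},
}

def probable_match(author, known_people):
    """Return URIs whose name or learned aliases fit the incoming author.
    A hit is only a hint for human review, not an automatic merge."""
    hits = []
    for uri, p in known_people.items():
        if author.get("last", "").lower() != p["last"].lower():
            continue
        names = {p["first"].lower()}
        names |= {a.lower() for a in p.get("also_known_as", set())}
        if author.get("first", "").lower() in names:
            hits.append(uri)
    return hits

print(probable_match({"first": "Tony", "last": "Ingraffea"}, known))
```

As the transcript stresses, a hit here only raises a probability; the record still goes to a human before any merge happens.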
What we're really interested in doing is asking people nicely to go to those sources of record and change the data there, so that when we do our next ingest we'll get the right information into the right place. There are also varying levels of incorrectness, and some changes are more important than others, so we do often go ahead and make changes inside VIVO, but we flag those so that they're not automatically clobbered by an ingest process. So to sum up: we believe that VIVO is uniquely well suited to deal with this issue of disambiguation. Without taking you too far down the technical rabbit hole of disambiguation and ingest, what we found is that it's a lot faster and a lot easier to clean these sorts of things up and get good matches in VIVO than it would be in a relational database. We also recognize that, as we continue to clean up and manually match these things, the information that you get from VIVO may actually be cleaner than the source data, and what that means is that we can help the folks that provide us with information clean their information at the source and make the world a better place for data. There's also the idea of interconnectedness, which is a mouthful, but the practical point is that we have a satellite campus at Weill Cornell and a campus here in Ithaca, and we're working bidirectionally to keep our data in sync and make sure that we have the right information about one another. That's also a challenge, in the respect that it's not easy to do. Among the problems that we bump up against, as I mentioned earlier: we are always going to need a human eyeball on our data. Even the best of the disambiguation approaches we've come up with just doesn't do it all for us, and we don't expect that to change. What we've done is the best we can with our matching algorithms, and then we recognize that there's a certain amount of human input that needs to happen.
We have lots of dirty data it comes from all corners of the earth and we're constantly looking out for things that break our ingest process or things that throw a wrench in the work and one might argue that the original owner of the data the researcher themselves holds the key to all of this information if we could just go ahead and ask them whether or not that's their publication we find out a lot of information but that has its own set of challenges we recognize that moving forward as the user base sort of matures in the digital age we're finding different ways to reach out to our researcher community and involve them in the process one of those is researcher ideas like Orchid or the Scopus IDs or even PubMed ID when that becomes reality the idea that they will help us push people together but they don't really solve the problem for us all things being equal Vivo is really sort of a pleasure to work with and to help collate information in a novel and interesting way it's also allowing us to push this information back out into the world in the form of linked open data and one of the things that we strive to do is make sure that our information is as correct and complete as we can make it so that we're not putting disinformation back into the world and with that we have a few minutes left if we can answer any questions yes thanks Chris and thanks Jim that was a very interesting discussion the way that I know this works is that actually anyone can put the question into the question box if you do that I will unmute you so you can actually start the conversation and and the other option is that you can actually raise your hand while you're waiting for this Chris I personally have a question you mentioned about the whole process of basically this auto disambiguation to the tools that you have can you give us some information on how many working staff are involved in that part either from the library or from the research admin office the number of staff that are 
involved. Yes, basically, I'll try to figure out how much manual work is involved and how much effort you guys put into that part.

Hmm, it might be accurate to say that it depends on the task. What happens is our varying priorities for data cleanup and integrity have us moving around quite a bit. We tend to work with the individual colleges. Each of the colleges has a librarian, after a fashion, who is an advocate for Vivo in that particular field and understands the intricacies of working with those researchers. So there's a person at the researcher level who helps direct that information and those endeavors to get the cleanup done. There are two of us here at Cornell in our department who handle this sort of task full time, and as I said, there are a number of student workers who come and go to do the really heavy-duty tasks of going in and translating someone's CV into fielded publications with all the trimmings, and doing the same detective work to go out and make sure that we have the right title and the right authors. So I would say, offhand, there are probably seven or eight people at any time working on this task at Cornell. And at some level we're making this information more visible, so even though I said that it's sometimes difficult to involve the researcher, information is being disseminated about our people, and it's out there in the world. When an author reaches out to us via a help email, or gives us a phone call and says, hey, the information that you have for me in my Vivo profile is not correct, we jump on that immediately and clean it up. The idea is that we're committed to doing the best we can in representing our faculty and academic staff the best way we know how.

Yeah, I think you kind of touched my second question on this one: the feedback system that you have from the researchers. Is there any mechanism inside Vivo, or do you have any procedure at Cornell, for the researchers to approach you and say, look, the
information that you have is not correct?

Yeah, the channels in which changes get made depend on the type of researcher. There are a number of people who are part of our internal faculty reporting system, and as such we get almost every piece of information for them in Vivo from that system. So what we do is encourage them, if they want to make a change, to go back to that faculty reporting system and make the changes there, so that when we do the ingest, that information comes forward. For the other folks at Cornell, we do allow them to do a certain amount of their own data entry. We have single sign-on; you can go ahead and edit your own Vivo profile. It's not clear to us how many people take advantage of that, but we do know that there are quite a few people managing their Vivo identity, by virtue of the fact that it's relatively high profile. Search engines like us; they like RDF and the semantic web, and our information tends to rise to the top of a search. We're finding more and more that people are discovering where their data is in Vivo and asking for help.

If I can just touch on that also: Chris had a slide there about what we do when the data is wrong, and sometimes we change it both in Vivo and in the system of record as well. Part of the reason for that, again, talking to the folks at Melbourne: they're re-ingesting this from their systems of record every day. We're not able to keep that kind of schedule because of the amount of manual correction that has to go into this data. So rather than just say, go to the faculty reporting system and fix it there, sometimes we say, well, you should do that, but we will fix it in Vivo right now, so that if it's something important you don't have to wait for the next data cycle.

Yep, okay. Now that we have time, I'll go through the questions from the audience. Christian has a question. I
actually think the best way to approach this is to unmute each person and let them just ask the question. So, Christian, I'll unmute you and you can talk. Hi, Christian.

Hello, can you hear me?

Yes, yes, I can.

Okay, I just pasted two questions in there. One is, I noticed one of your slides mentioned hard-won algorithms. We've done a little bit of playing in that space as well, and I'd be very interested to look at your algorithms if you would be able to make them public.

Absolutely, yep. Our programmer Joe McInerney is working right now on a white paper that's going to be included at the Vivo conference this year, and it's a pretty extensive description of the process that he's created to bring in the information that we need from our faculty reporting system, via a series of XSLT transforms and some pretty interesting code. So that's going to be available very soon. The conference is in August, so we could probably go ahead and be sure to distribute that; it will also be online shortly after the conference. I know not everyone can make the conference, but I believe, Amir, we are expecting to see you there?

Yes, yes, that's right, hopefully.

At the conference, or at Cornell?

Oh, at the conference in August in St.
Louis.

I don't know about that. All right, well, we will try to keep in communication with you, and you can coordinate requests from people like Christian who are interested in this paper. Would that work?

Yes, it would. And the only thing I was going to suggest is that a lot of this, either the source code or the ideas around this kind of work, can be shared, and I think a very efficient way to approach these problems is in a collaborative manner. So if you already have a solution for a problem, the community would be able to benefit from it, and vice versa: if there is something that we learn as part of our own experience, we are more than happy to share it with Cornell. While we have Christian unmuted, why don't we take his second question?

Okay, yes. It wasn't quite clear, maybe I didn't understand it, but when do you automatically assign an author ID to a publication, compared to getting a human to look at the options for what it could be and manually confirm which author ID it is? Is anything done automatically, and when do you decide, yeah, this is definitely the person, 100% automatically behind the scenes, and assign that author ID to the publication?

Yeah, the best way to look at that, Christian, is that we have that internal NetID that I mentioned briefly. When that appears in the source data, that's our 100% lock on that particular person. Everything else is varying levels of conjecture. Even if we have a complete name-part match, for Ingraffea, Anthony R, that's a pretty good indication, especially within our closed data set, that we have a match for the guy we're looking for. If the name turns out to be Brown, John L, or Liu, H, the name match doesn't really mean quite as much. Joe uses a sliding scale, as it were, to match those up inside the algorithms that he uses. Basically, most of what we're doing right now is based on name-part matching. We have been experimenting with co-authorship connections.
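The sliding-scale matching Chris describes, a sure institutional ID at one end and a weak surname-plus-initial match at the other, could be sketched like this. The weights, thresholds, and the short list of common surnames are invented for illustration; the actual algorithms are Joe McInerney's and are described in the white paper mentioned above.

```python
# Sketch of sliding-scale author matching: an exact internal ID is a
# certain match, full name-part agreement is strong evidence, and a
# bare surname plus initial is weak, especially for common surnames.
# All weights below are illustrative, not Cornell's actual values.

COMMON_SURNAMES = {"liu", "brown", "smith", "wang", "lee"}

def match_confidence(source, candidate):
    """Return a score in [0, 1] that the `source` and `candidate`
    name records refer to the same researcher."""
    # An institutional identifier in the source data (e.g. a NetID)
    # settles the question outright.
    if source.get("net_id") and source["net_id"] == candidate.get("net_id"):
        return 1.0
    if source["surname"].lower() != candidate["surname"].lower():
        return 0.0
    score = 0.4  # surname agreement alone is weak evidence
    if source.get("given") and candidate.get("given"):
        if source["given"].lower() == candidate["given"].lower():
            score += 0.35  # full given-name match
        elif source["given"][0].lower() == candidate["given"][0].lower():
            score += 0.15  # initial-only match
    if source.get("middle_initial") and \
            source.get("middle_initial") == candidate.get("middle_initial"):
        score += 0.15
    # A very common surname makes a name-only match less convincing.
    if source["surname"].lower() in COMMON_SURNAMES:
        score *= 0.6
    return min(score, 1.0)
```

In practice a score above some threshold would still go to a human for confirmation rather than being asserted automatically, matching the reluctance Chris and Jim express below.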
The idea is that we can maybe have a higher degree of probability of a match if we see the same co-authors in certain relationships. We're also looking at being able to do keyword matching, based on the fact that there are keywords in the article titles or the abstract that may match up to the researcher's keywords in Vivo. But right now most of it is done with name-part matching and the algorithms that Joe has set up.

Do you think you'll get to a point where, without that NetID, using keywords and the co-authorship, you'll be able to confidently assign an author ID? Or will you always, even with those extra things, want a human to confirm them?

Our accomplishments have not really lived up to our expectations so far, so I think that we would be reluctant to say that we're going to reach that point. Yeah, and that brings up this concept of misambiguation. That's the idea that if I attribute publications to the wrong person, sometimes that's even worse than if I had not brought those publications in and attached them to the right person. We have, in truth, a case where a person with the last name Liu, L-I-U, first initial H, is in Vivo a polymath. He knows nuclear science. He's also a medical doctor. He knows all about engineering, and maybe has a couple of papers in the arts, and it's because he's a melange of several different people. So we really try hard not to lump those things together, to make as many discrete decisions as we can, and not let the algorithm run too far, as it were, and misattribute things.

Thanks.

Okay, thanks Christian. The next question is from Natasha Simon. Natasha, I'm unmuting you. Just a moment.

Thank you. You've touched a little bit on what I was going to ask. It's actually a few of us from Griffith here; there's a bunch of us sitting here eagerly listening to everything. We've been playing around with generating feature vectors for doing disambiguation.
Are there any approaches that you've been exploring, other than string metrics, that might be amenable to more machine-learning-oriented approaches to disambiguation?

There's been a lot of work within the community along those channels. Here at Cornell we've adopted this approach for right now. One of the things that Jim and I talked about when we decided to host the webinar was that, by the time we got Griffith on the line, you folks might have forged ahead of us in some respects. We're always eager to talk about the possibilities. In this particular case, our process here is still pretty much tied to moving data into Vivo and making those positive matches. We have done some experimentation, and we know that within the Vivo community there are some interesting things going on, but I personally wouldn't venture to say anything. Jim, do you want to comment?

I think the only thing I would mention is that there are entity recognition algorithms out there. I think the question is, we feel like we're still going to improve, and we're really trying to improve and make the best use of the information we have, so that we can make a better choice about where to go next. The other interesting thing we're working on a little bit is the concept of using vCard, which is another common ontology, similar to FOAF, designed to capture information about persons: their name parts, their nicknames, their email addresses, their communications, and other information about them, all as a cluster of data around an object that is not quite the same as a person.
So we're thinking that if we can create a vCard that can assemble everything we know about a person in connection with one publication, and we have 20 of those, we would be set up with a nice data structure for doing this kind of feature-vector work, and eventually we'll have it all in one place. And we're not overloading the use of the person object itself, which tends to imply that we're more certain than we are about the information we get.

And is that one of the mechanisms that you're using to not have to continuously re-make these assertions every time you do an ingest? How do you persist the mappings over time?

Maybe the best way to address that is to talk about the faculty reporting system ingest that we do. If I understand your question correctly, what you're wondering is how not to have to do the same work over and over again?

Yes.

Okay, great. The idea is that that's where the "also known as" ontology extension comes in really handy. Let's say that we have a name in the source data that's clearly a typographical error. We do the manual match; we know for a fact that this resource should be attributed to a specific person in Vivo. What we might do is retain that typographical error as an alternative name, an "also known as", such that next time we hit upon that same typo, we can look for it in Vivo and say, look, we've seen this typo before, and it matches to this specific person over here. That's one way around the problem, and it allows us to speed up our process quite a bit. And there are some other things that we're looking at. Does that answer the question?

Sorry, and you maintain that in the main knowledge base, so that it's persisted over time?

Yeah. One of the things that we do is divide our RDF up into specific named graphs, which allows us to remove and replace information in a more structured way. RDF is really good at that. In theory we can pile on more and more information without really bothering performance too much.
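The typo-retention trick described here could be sketched as follows. In Cornell's Vivo the alternative names live in the RDF store itself; in this illustration a JSON file stands in for that store, and the class name, file name, and URIs are all invented.

```python
import json
from pathlib import Path

# Sketch of the "also known as" idea: once a human matches a
# misspelled source name to a person, keep the misspelling as an
# alternative name so the same typo resolves automatically on the
# next ingest. A JSON file stands in for the RDF store here.

class AltNameIndex:
    def __init__(self, path="also_known_as.json"):
        self.path = Path(path)
        self.index = (json.loads(self.path.read_text())
                      if self.path.exists() else {})

    def record_match(self, source_name, person_uri):
        """A human confirmed that `source_name` (possibly a typo)
        refers to `person_uri`; remember it for future ingests."""
        self.index[source_name.strip().lower()] = person_uri
        self.path.write_text(json.dumps(self.index, indent=2))

    def resolve(self, source_name):
        """Return the known person URI for a previously seen
        (mis)spelling, or None if we must fall back to matching."""
        return self.index.get(source_name.strip().lower())
```

Because the mapping is consulted before any fuzzy matching, a typo only ever costs one round of human review.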
If it's not something that the display model uses, then it remains in Vivo, in the data store, until such time as it becomes handy for disambiguation.

Okay, thanks Mark, and thanks Jim and Chris. I think we have time for one last question, from Dominic. So, Dominic, I'm going to unmute your microphone. You should be able to talk.

Thank you. Sorry, this is a bit of a broader question, because I'm just not so familiar with RDF. I was just wondering, as you're collecting more data and you discover new things about the data you're collecting that need to be taken into consideration in the design, how difficult is modifying the ontology compared to, say, modifying relational structures? I'm just wondering how extensible it is.

I think it's really a lot easier, and you've touched on the essential difference between using a triple store and using a relational structure. Let me take that in two directions. One is, if we have new data, we can either add to the ontology in a non-compatible way, or, in a compatible way, we can extend the ontology. So if we have a class of individuals in the ontology that is faculty members, and we decide that we want to have honored faculty members as a subclass of that, we can do that. By declaring honored faculty member to be a subclass of faculty member, any rules that apply to faculty member, including how they are displayed, will carry through to individuals of that subclass. This is also nice because, although another Vivo installation or another application that uses Vivo data will not recognize our extension, and won't know what an honored faculty member is, it will be able to tell that this individual is also a faculty member, and so it can treat it accordingly. We've compatibly extended the ontology. Now, it's also possible for us to incompatibly extend the ontology, and various sites find good reason to want to do that.
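The compatible extension Jim describes might look like this in Turtle. The cornell: namespace and the class name are invented for illustration; the vivo: class IRI assumes the VIVO core ontology of that era.

```turtle
# Sketch: declaring a local class as a subclass of a standard VIVO
# class, so existing rules and displays carry through to it.
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix vivo:    <http://vivoweb.org/ontology/core#> .
@prefix cornell: <http://vivo.cornell.edu/ontology/local#> .

cornell:HonoredFacultyMember
    a owl:Class ;
    rdfs:subClassOf vivo:FacultyMember ;
    rdfs:label "Honored Faculty Member" .
```

Any consumer that knows only the core ontology can still infer, via rdfs:subClassOf, that an individual typed cornell:HonoredFacultyMember is a faculty member, which is exactly the compatibility property described.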
So they're adding information that's completely orthogonal to the existing information, and they just have to accept the fact that another application retrieving their data will ignore those pieces of information, because it cannot infer anything about them. Finally, when it comes to the question of modifying the ontology compared to a relational structure, let me just say that you can do it without stopping Vivo. You can go right into the ontology editor and add properties, change properties, change the parent property of a child property, and so restructure that ontology quite dramatically, without even stopping the application. And there again, this is really the essential difference between the triple store, which is so open to expansion or extension, and a relational database, where you would have to go in and redefine the nature of the tables. Have I answered your question?

Yes, thank you very much.

Okay. We are actually, amazingly, right on time at the sharp end of the one-hour discussion, so I think it's probably time for wrapping up. I want to personally thank Chris and Jim for giving us the opportunity of this discussion, and, on behalf of ANDS, for all the effort they put into preparing the presentation. Thanks also to all of you who attended this webinar, and particularly to Christian, Mark and Dominic for adding to the conversation with your questions. For any further discussion about Vivo, I believe there is a Vivo forum that people can get involved in. Chris, I forgot to put that address into the slides that we had initially, so if you want to share any forum or discussion panel, you can send it to me and I can forward it to the ANDS mailing list.

Yes, Amir, let me just say vivoweb.org, so www.vivoweb.org. That's our home site, and it will tell you where to find our mailing lists, forums, our wiki pages, our source code, and information about the conference. That's really the front door.

Okay, wonderful.
Thanks a lot, thanks everyone.