Go ahead and get started. Thanks for joining us today. I'm Cliff Lynch. I'm the director of the Coalition for Networked Information, and I'll be introducing this session. You've reached one of the project briefing sessions for week three of the Coalition for Networked Information's virtual fall 2020 member meeting. Note that in addition to all of the synchronous sessions, we have also released a number of pre-recorded videos this week. The theme this week is technology infrastructure and standards, and I think you'll find some interesting material among those recordings as well. As for this session, it is being recorded and will be subsequently available. Closed captioning is available; please make use of it if it's helpful. There is a chat box, and there's also a Q&A tool at the bottom of your screen. Feel free to pose questions or make comments at any point. Once our presentation is over, Diane Goldenberg-Hart will moderate a Q&A session, and we'll try to speak to all of the questions at that point. So with those few logistics, let me just turn to introduce this session briefly. It's my great pleasure to welcome a very old friend back to CNI, someone well known to our community: Nassib Nassar of Index Data. And he's got a very interesting topic for us today. We've all heard endlessly about analytics and how wonderful they're going to be, and we tend to speak very globally and casually about analytics. Nassib has actually tried to do serious analytics and knows that it is a big pain, and the more data sources you fuse, the bigger the pain. It really is not just one of these casual things. Nassib is going to talk about a tool that hopefully will ease some of the pain here, one that is being piloted in the library community at a number of sites. I'll leave it to him to fill us in in more detail on that. Thank you so much for joining us and for speaking to this issue, which I know is very much on the minds of many of our members. Over to you, Nassib.
Thank you so much, Cliff, and thanks also to everyone at CNI. I very much appreciate the opportunity to speak today. This session was originally imagined as a panel discussion, but because of time constraints, it's going to be more of an overview. I'd like to begin by introducing the open source MetaDB software, and then I'll talk about how the software is being used in libraries. If you've heard talks on high-performance computing, sometimes the speaker will show a picture of an early supercomputer. I was very fortunate growing up to have a summer job working in medical research labs, and one of the labs had a PDP-8 and some other early computers. In practice, a lot of the research data management still looked something like this: data were recorded on paper, data visualization was also done on paper, and data analysis often involved a calculator. At home I was taught how to use a slide rule, but they were already rapidly disappearing by that time. And this was usually the work of one person or a very small group. Of course, today we're transitioning to a world where data are frequently accessed by many, many people from many faraway sources and brought together for analysis, and we're really at the beginning of that transition. Many of the data integration and data management problems that scientists have been facing in recent years are problems that everyone is about to face as data science continues to transform almost every field. I think we've all heard the statistic that data scientists spend at least 90% of their time doing data cleaning and data integration. And at least in research, there's some recognition now that we'd like data to be more interoperable and reusable. So you might expect people to be embracing things like schemas and databases, or some similar or equivalent alternative, and trying to go further in that direction: schemas, for example, offer very basic descriptive metadata, and databases help people structure and automate data management.
So it's interesting that in fact the opposite has been happening. We've seen for many years now a move toward schemaless databases and toward data lakes. If you're trying to make data interoperable and reusable, the last thing you want is a data lake, where you basically park data in a shared file system. So why have things been going in that direction? I think data lakes are really a sign that data management technology is not meeting people's needs. Part of the problem is that some kinds of data need more scalability or specialized database models, and it takes years to engineer a new database system and to make it reliable. But there are also some missing data management functions for things like versioning, provenance, data curation, and data integration. These functions are still very manual tasks, and often they're not well integrated into data management systems. I've had the opportunity to work with a number of science institutes in many research domains, and I'm interested in general problems in science computing and data management. A few years ago I worked on a project run by the Coastal Studies Institute in North Carolina, studying wave power, which is a form of sustainable energy. We had geospatial time series data captured by sensors in the ocean, and we needed to transform them in a way that would make the data suitable for analysis. Roughly around the same time, I was involved with an NIH-funded project, which looked at whole exome sequencing for medical research and clinical use. We developed software and a database to take patient-level sequence data and annotations from genomic pipelines and, again, to transform them for data analysis. Now recently I've been working with libraries, and I was asked to make recommendations on building reporting and analytics capabilities for an open source library services platform. In this case, the data source is a transaction processing database, which provides the storage for library services.
So this is something that people build over and over again, and that's a problem we can try to help with. The idea of MetaDB is that you have one or more data sources, and you want to store the data in one or more databases for analysis, often just one database. Data are continuously streamed into MetaDB, and we have an opportunity to do all sorts of things to prepare the data for analysis. The most basic function of MetaDB is to synchronize an analytic database with a data source. As it does this, it can also transform data and support different database models. It performs automatic versioning and timestamping to allow date range queries, and by default, it never throws data away. Also on the to-do list is adding support for named versions, annotations, including provenance, and optionally adding persistent identifiers. And I'll touch on data integration, which we're working on now. This is a simple sketch of the stream processing. Kafka is used to stream change events from data sources into MetaDB. We ingest a sequence of these records, which are then parsed and translated into commands for the target database. In the library example, there's only one database on the right side. In many science applications, especially larger projects, and even in some small science, long-tail science, many of which do use databases, data are sometimes split: relational databases are frequently used to store the metadata, and then numeric data or array data may be stored in a different kind of container. So in this pipeline, we apply transformation rules to the data. This is also a good place to add annotations. Finally, the commands are executed in the database. This involves resolving schema changes that have occurred in the source and working out the details of data versioning.
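To make the stream processing stage just described a bit more concrete, here is a minimal sketch of translating a change event into an append-only command for an analytic database, assuming Debezium-style JSON change events. The event format, the `__history` table naming, and the bookkeeping column names here are illustrative assumptions, not MetaDB's actual internals.

```python
import json
from datetime import datetime, timezone

def change_event_to_sql(event_json):
    # Translate one change event into an append-only SQL command for the
    # analytic database. Rather than overwriting rows in place, every
    # change is appended with its operation type and a timestamp, so
    # historical states are preserved for date range queries.
    event = json.loads(event_json)
    table = event["table"]
    # For deletes, the payload carries the prior state of the row.
    row = event["before"] if event["op"] == "delete" else event["after"]
    recorded_at = datetime.now(timezone.utc).isoformat()
    columns = list(row) + ["__op", "__recorded_at"]
    values = list(row.values()) + [event["op"], recorded_at]
    placeholders = ", ".join(["%s"] * len(values))
    sql = (f"INSERT INTO {table}__history ({', '.join(columns)}) "
           f"VALUES ({placeholders})")
    return sql, values

# Simulated stream; in production these records would come from a Kafka
# consumer subscribed to the source database's change topics.
events = [
    '{"op": "insert", "table": "loans", "before": null,'
    ' "after": {"id": 1, "status": "open"}}',
    '{"op": "delete", "table": "loans", "after": null,'
    ' "before": {"id": 1, "status": "open"}}',
]
for event_json in events:
    sql, values = change_event_to_sql(event_json)
    print(sql, values)
```

Note that even the delete produces an insert on the analytic side: the old row state is retained with a delete marker, which is one simple way to get the "never throw data away" behavior described above.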
So what you end up with in the analytic database is a copy of the source data, but the data have been enhanced and automatically versioned as they have streamed through MetaDB. Next, just a few brief words about data integration. Data integration is about reusing and combining data. Ideally, you'd like this to be completely automated, and it's frequently spoken of in that context. I think something that may get lost in the discussions about data reuse is just how difficult it is to automate data integration as soon as you get outside of a well-defined relationship between the consumers and producers of data. So when data are created and shared, let's suppose they've been curated in the sense of attaching descriptive metadata. What this really means is creating a model of the semantics of the data: the curator is making decisions about what aspects of the data are salient or important. Then someone comes along to reuse the data, and let's call that person a data consumer. We can postulate a similar model, which represents what the consumer thinks is important for the consumer's project; in other words, what the consumer needs to know about the data. Now, the main concern, and it is a big concern, is what happens when there's been no data curation, there's no descriptive metadata, and the producer model doesn't exist at all. But this is only an instance of a more general problem, which is what happens anytime the semantics needed by the consumer are not available in the producer model, because the producer has not foreseen the needs of new consumers. In any area of science or technology, there are always new questions that you want to ask, and the consumer need is going to evolve over time in unpredictable ways. This is going to happen not only over time as ideas evolve, but also, in a sense, in space, as you get farther away from the community that produced the data.
So the real goal of making data reusable is to reach people outside of your immediate community, where you hope the value of the data can be extended. I don't think it's clear at all that this problem is going to be solved by better descriptions of the data, or even in a deterministic way. This kind of thing makes data integration a very difficult problem; I think data integration deserves to be recognized as a grand challenge problem. But it's also a wide open space with lots of room for ideas. In recent years, there have been many attempts to use machine learning to do data curation on the consumer side. There is something like this planned for MetaDB: trying to identify information that is latent in the data. So the idea is to fill in latent information that has not been included in the producer model but is needed by the consumer. And of course you want this to happen, to the extent possible, in real time as the data are streaming in. So now let me shift gears and talk about the first application, or test case, of MetaDB, which is in libraries. In this case, the data source on the left is a transaction processing database, which provides the data storage for user-facing library services. It uses Postgres as the database system. Now, for transaction processing, especially in support of a user interface, you want data access to be very responsive: you want very short running queries that read or write a small number of records at a time. The library services in this system are actually microservices, and so their data have been intentionally siloed, because that's what allows microservices to be scalable. You can't really do cross-domain queries directly on the database. So of course bringing the data into a single analytic database makes it possible to do the cross-domain queries. Postgres is a traditional relational database, which is more or less optimized for transaction processing.
For analytic queries, you do a lot of things like aggregate functions, which range over all or most of the records, and so you really want the data in a column store, or at least something optimized for reading data. MetaDB can also use Postgres as its analytic database; Postgres is not really designed for that, but it's open source and ubiquitous. MetaDB also currently supports Redshift, which is a very popular column-store database offered by AWS. Redshift is also MPP, or massively parallel processing, so it can handle very large amounts of data, for example if you want to do data science at present or in the future. Again, the transaction processing database needs to be very responsive, so it doesn't store more data than it needs: it stores the current state of the system and generally discards previous states. When a record is modified or deleted, the old data are removed from the database. In the analytic database, MetaDB keeps all of those historical data and automatically versions them to enable date range queries. And simply having an analytic database that is separate from the transaction processing database means that you can pull in data from other sources and bring them together for analysis. Another thing that MetaDB can do in this context is to transform JSON data to a relational database model. Many software engineers prefer to store their data in hierarchical models like JSON, because for them it's just another data structure. But for non-engineers, the relational model has proved to be simpler to query accurately, and so MetaDB is well positioned to do those kinds of transformations, and really any kind of automated data transformation. So the MetaDB software is being deployed and used by many university libraries of all sizes; right now it's mostly libraries that are implementing the FOLIO LSP.
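As a rough illustration of the JSON-to-relational transformation mentioned above, here is a minimal sketch: scalar fields become columns of a main table, nested objects become prefixed columns, and arrays become rows in a child table keyed by the parent id. The table naming and the handling of nested structures are illustrative assumptions, not MetaDB's actual behavior.

```python
import json

def flatten_record(table, record):
    # Flatten one JSON record into relational rows: scalars become columns
    # of the main table, nested objects become prefixed columns, and each
    # array becomes rows in a child table keyed by the parent id.
    main_row = {}
    child_rows = {}  # child table name -> list of rows
    for key, value in record.items():
        if isinstance(value, list):
            child = f"{table}__{key}"
            for i, item in enumerate(value):
                row = {"parent_id": record.get("id"), "ordinality": i + 1}
                row.update(item if isinstance(item, dict) else {"value": item})
                child_rows.setdefault(child, []).append(row)
        elif isinstance(value, dict):
            for subkey, subvalue in value.items():
                main_row[f"{key}_{subkey}"] = subvalue
        else:
            main_row[key] = value
    return main_row, child_rows

record = json.loads('{"id": "u1", "active": true,'
                    ' "personal": {"lastName": "Smith"},'
                    ' "tags": ["staff", "faculty"]}')
main, children = flatten_record("users", record)
print(main)      # {'id': 'u1', 'active': True, 'personal_lastName': 'Smith'}
print(children)  # rows for a users__tags child table
```

The appeal for report writers is that the resulting tables can be queried with ordinary SQL joins instead of JSON path expressions.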
The MetaDB software is also being adopted by ReShare, which is a new resource sharing platform. For the libraries, MetaDB will serve as their reporting and analytics platform. This application of MetaDB is known by the libraries as the Library Data Platform, a community-based project that uses MetaDB and also includes some library-specific analytics tools. There's a large and active user community now building reports on this system, and it's beginning to be deployed into production by some libraries. It has been very interesting for me to see what the library community has built on top of this technology. This is a prototype for a query builder app that the community has started building; some of the early engineering work on this was done by Roman Ruiz Esparza at Duke University Libraries. It will allow librarians to do reporting on an analytic database created by MetaDB without having to install a database tool, because the app runs directly within the user interface of the library services platform, right next to the other modules. Angela Zoss, also at Duke University Libraries, has created some very functional and practical visualizations using Tableau; again, this was created on top of an analytic database generated by MetaDB. A large group of librarians who work in reporting and assessment have been building a suite of pretty sophisticated reporting queries to enable themselves to do the reporting work that they need to do. They have essentially built their own reporting system, and about two dozen of them have significantly ramped up their SQL and data management skills. They've actually written and debugged all of the queries. Many of them have learned how to use GitHub, including forking the repository and creating pull requests, so they're now direct contributors to the query repository in GitHub. I gave them a two-hour tutorial on GitHub, and they simply ran with it.
You can see here that a few of them are currently working in their own forks of the repository. Some of them have also been looking at data privacy and GDPR compliance: they maintain a list of data fields that contain personal data, and MetaDB can be configured to anonymize those fields. So one of the most enjoyable aspects of this library application of MetaDB is how resourceful the librarians have been in contending with the new technology. I think their achievements have been very impressive, which brings me to my last point. As I mentioned earlier, because of data science, many of the problems of research data management are going to become everyone's data management problems. So I think there is probably an opportunity here for libraries to consider the data management of their own library data as an on-ramp for developing some of the skills and experience and intuition that are meaningful for research data management. Well, thank you everyone for listening. I'm very much interested in hearing your thoughts, as well as exploring possible collaborations with you. I should add that this project has been supported generously by my employer, Index Data, so my thanks to them as well. And again, thank you to CNI.

Thank you, Nassib. That's really an interesting project, and it's great to see what people are doing with that tool. Thank you for coming to CNI to share it with our community. The floor is now open for questions, and welcome to all of our attendees. Please share any questions that you might have in the Q&A tool, and Nassib will address those as they come in. And I believe Cliff has a few questions for you as well.

Yeah, let me jump in real quickly. I don't want to monopolize the Q&A period, but just while people are thinking.
So your last slide connected to something that I was thinking throughout your presentation, which is that, while this is real handy for library analytics, it looks to me like it is a superb addition to the toolkit for supporting research data management broadly, particularly as you get into applications that want to stream out data that need to be amassed someplace. Are people actually deploying this in that setting, not necessarily in your library pilots but more broadly out in, say, the scientific community?

I've been waiting until the software was far enough along to pursue that question with the people I know in the scientific community, and that's sort of next on my agenda. I'm really interested in potential collaborations in that direction, to test these ideas in the scientific area. A few years ago I worked as an assistant director of informatics at the National Evolutionary Synthesis Center, which was an NSF-funded organization that really spent a lot of time on data integration and data management. That's when I first came across a number of problems that seemed to me to be gaps between what scientists needed and the data management tools that were available. Ever since then, I've been thinking about this problem, or sort of a set of problems, and trying to figure out how to come at it from a software or technology point of view. And the really hard part has been trying to figure out a way to, and I think, Cliff, you alluded to this in an earlier talk that I heard maybe yesterday, that scientists don't have time to drop the tools they're using wholesale and adopt a completely new tool. So the really hard thing is: how do you do this in a way that can integrate with the tools that they're already using? Of course, an obvious way would be to take a popular tool and modify it and extend it, but that actually comes with a lot of challenges, and then it only works for that tool, though you could do that as a pilot project.
But in this case, there seemed to be a common need across many projects, and it seemed to be an opportunity to fill in that space and also to bring in ideas for those missing data management functions. I think that right now, one thing that makes it hard to talk about those parts of data management is that there's still a lot of work to be done to test ideas in real-world systems. It's one thing to build a research prototype, and it's something else to have, you know, maybe a group come together and think about the problem; but ultimately you need a system that people are actually using, and that's reasonably robust, to test your ideas.

There was a symposium this fall, put on by the National Academies Board on Research Data and Information, dealing specifically with this sort of data-in-motion kind of model, and you might want to go have a look at a few of the talks from it. This feels very much in that spirit. I see that we've got some other questions coming in, so I will leave it at that. But really interesting. Thank you.

Thank you. And as Cliff said, we do have a question from John Coons, who comments that it's a very nice tool and asks: could you provide an example of data integration with MetaDB?

So I try to resist talking about vaporware, which to me is any work that is not completed yet. We're really still at the beginning of that work. Thank you for the question, by the way, but if CNI will invite me back again next year, I'll talk about it then.

Yes, of course, but we'd love to hear the follow-on to this, and thanks, John, for the question. I'll point out that anyone who would like to contact Nassib about this project, and I believe there was a call for collaborations, can do so using the Sched tool. So please feel free to do that. I don't see any more questions in the Q&A box at this time, and I see that we are right at time.
So with that in mind, I will just thank Nassib one more time for coming to CNI. Thank you for sharing this very interesting development with us, and thank you to our attendees. I'm going to go ahead and stop the recording now, but please feel free to hang in there with us if you'd like to join the conversation. Just raise your hand, and I'll be happy to turn on your microphone. And with that, I'll just say farewell, everyone. We hope to see you back at CNI in the coming days and weeks. Bye bye.