I think we'll go ahead and get started. So this is a very interesting project that we're doing at Rutgers University, and I'm presenting on behalf of two of my colleagues, both of whom are in my department. I'm Grace Agnew, associate university librarian for digital library systems, and I'm reporting as part of a project team. Our typical project teams usually have a public services liaison who works with the relevant faculty on research data and interprets their needs to our metadata librarian, and then frequently, for a project this size, a programmer. I'm a little out of this field; I spend a lot of my time on budgets and staffing and things like that. But this is an enormous project, we were offered a reasonable amount of money from the NSF to do it, and I've worked a lot with large data, I consult a lot on large data, so I was asked by our research team if I would step in and be the project leader for this, which has really been a whole lot of fun. So what is the Ocean Observatories Initiative? It is literally a very ambitious project to document the two largest oceans in the world, the Atlantic and the Pacific. It's an American-led project funded by the National Science Foundation and run by the Consortium for Ocean Leadership, which is an organization of oceanographers and marine scientists. The participating organizations are the Woods Hole Oceanographic Institution, Oregon State University, the University of Washington, and Rutgers University. Rutgers' role was to build the cyberinfrastructure for this very complex project. This is a $300 million and counting project, and a lot of time and effort was spent on the question: these are pretty big bodies of water, what are we actually going to be studying? So they divided those bodies of water into three basic areas: global, regional, and coastal.
And then, of course, the cyberinfrastructure collects the data, and there's a lot of data. The data is versioned across a lifecycle. First there's the raw data, and that raw data itself comes in two flavors: data that is continuously streamed from the sensors to the cyberinfrastructure located at Rutgers, and data that's pulled every 15 minutes off the instruments and then compared to the streamed data to see if there's any discrepancy. Thus far, there hasn't been. Then there's what we call L0 data, which is data that's been compiled to remove some of the white noise, a giant wave hits a sensor, what have you, so some of it becomes just noise and unusable, and that's removed. L1 data, which has only occurred in tests so far, is data that's actually been edited by a researcher for various reasons. And then L2 data, which will come along as the project gets older, is data that's been analyzed, visualized, or changed and repurposed in some way. So lots of data that we're talking about here. This is a very impressive project. What they've done is they've basically divided the oceans. And I should note, if anybody knows anything about NSF large data projects: yes, there are four institutions doing this, but every institution that has an interest, a strong program, and researchers in marine sciences and oceanography will have had a say in how this is set up and organized. Many meetings were called to decide where are we going to set these platforms up and what are we going to do. If you even just think about how on Earth you would survey the two largest oceans in the world, lots of time and lots of effort went into this. So what they did is they created the concept of platforms.
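The lifecycle levels just described, raw through L2, form a simple forward-only progression. A minimal sketch of that progression in Python (the class and function names are invented for illustration, not OOI's actual code):

```python
from enum import Enum
from typing import Optional

class DataLevel(Enum):
    """OOI data lifecycle levels as described in the talk."""
    RAW = "streamed continuously, and also pulled every 15 minutes"
    L0 = "compiled/parsed, with unusable noise removed"
    L1 = "edited by a researcher to make it more usable"
    L2 = "derived: analyzed, visualized, or repurposed"

# Data only moves forward through the lifecycle, never backward.
ORDER = [DataLevel.RAW, DataLevel.L0, DataLevel.L1, DataLevel.L2]

def next_level(level: DataLevel) -> Optional[DataLevel]:
    """Return the next lifecycle level, or None once we reach L2."""
    i = ORDER.index(level)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None
```

Each transition between these levels is exactly the point where, as described later in the talk, a new version or related resource (and a new DOI) comes into play.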
And these platforms can be anything from a giant platform that you might sunbathe on in the middle of the ocean, to a cable that runs a considerable distance across the ocean, to floating devices that are loosely cabled together. So a platform is a fairly durable concept in this project: it is basically any stationary, geographically located environment where we're going to collect data about the world's oceans. The subsites are very interesting because those platforms can be a mile to 50 miles big, they can be enormous. So the subsites are where these platforms were broken into interesting areas where they're actually going to put instruments. They're in their early stages: they went live sort of in October, then really live in January, and they're going to go really, really live this summer, a typical large research project. They've done their splashdown, but they're not fully live yet, and they will be soon. The subsites are where, on the platforms, they've decided to locate these instruments. And they may decide within the first year or so, hey, I think we'd get better data if we moved it 25 miles across to another quadrant. So that has major implications for us as we do what we hope is durable metadata for the data they produce. Then there are 800-plus instruments, and a lot of thought and effort went into: what do we really want to know about the world's oceans, and what are the best instruments to capture it? Just in the first year, they think they're going to produce as much as 10,000 data products, which, if we were to do manual metadata, would certainly be job security for somebody. So for the arrays, platforms, and instruments, there are coastal scale nodes and global scale nodes, and then the regional scale nodes are actually virtual nodes that cut across the coastal and global scales, which is why there are seven nodes.
So you can see that each array has a number of sub-platforms and then instruments on those sub-platforms. Pioneer Array, which is what we've been working with, is the first array that went live. That was an array set up by the Woods Hole Oceanographic Institution. So this is what the Pioneer Coastal Array looks like. It actually is just a giant virtual field of interest, which might be as much as 50 miles in circumference. You can see the different kinds of instruments, and they're measuring at different levels. You can see up there in the corner that there's actually a little submersible that carries instruments and goes back and forth on a path to collect information. There's a submersible that goes up and down the water column looking at things like density and salinity changes. And there are stationary instruments. It's a very complex undertaking to have an array and then to have these sub-locations or sub-platforms on the array. So each of those 800-some instruments produces different types of data, and these are lifecycle versions of data. There's the unprocessed data, which is the data that streams continuously to the infrastructure but is also downloaded every 15 minutes and compared against the streamed data. So far, there's been no discrepancy, so we're actually treating those as different manifestations of the same version, much as you would say a JPEG and a PDF are different manifestations of the same data. Level zero data is data that's been processed, and that is parsed data. And I should also note that the raw data is a little illusory.
It is a little bit processed because it is actually exposed both as raw data and as a common data format. The level zero data, which they actually haven't started producing yet, will be compiled using different compilers. The level one data, as I say, will be when humans feel a need to edit the data to make it more usable. And then level two will be derived data. So they decided to collect 108 different types of information about the ocean. Some of these are things like temperature and salinity, broken up based on the sampling regimes: the air-to-sea interface, what happens when the sea meets the air, how does that change in temperature, rain rates, sea surface conductivity, et cetera; the seafloor crust, where they're looking at hydrogen concentration; HD video, based on motion detection, where they're trying to document any sea life that swims past the camera; sea floor pressure, et cetera; and then the water column, where they're looking at conductivity, echo intensity, signal absorption, all things that mean a whole lot to them, and probably a little less to me. So what they wanted from us, and they wanted this from us because in the last year of the project the NSF said: some of our data portals are starting to give us DOIs, and we think it would be a really good thing if the OOI gave us DOIs as well. So here you have these folks frantically trying to set up expeditions fraught with everything an expedition into the deep ocean has to face, trying to calibrate and put all these instruments down, decide, am I happy with which sub-platform it's on, et cetera. It's supposed to sort of go live in October, then really go live in January, and then really, really go live further on. They're juggling all of this, and then they're told that they need to do DOIs as well.
So they turned to us, and they said, is this something you could do for us? And of course, we have been minting DOIs; we're DataCite members, we've been minting DOIs for quite some time, and we really understand what DOIs are intended to do. What they're intended to do is fix data in an immutable form so that it can be cited in the scholarly record, so that if you find data cited in an article, or cited in a presentation, or just cited by itself, you can trust that the data that DOI takes you to is the exact same data that was cited. And this was a concept that was actually reasonably new to the marine scientists and oceanographers we were working with. So we're doing this within our RUcore repository, which is a very API-friendly repository. We were one of the early adopters of Fedora. Fedora is a preservation platform, but it lacked a lot of functionality, so we started immediately by building a suite of services that made it a lot more user-friendly for us. We are very supportive of APIs to other services, and we have a building-block approach that lets us reuse services and maintain flexibility. So when we started talking to the marine scientists, and also to the cyberinfrastructure specialists, and also to the subcontractor Raytheon, who's actually building what is the metadata infrastructure for OOI, it was clear that we were all talking a different language, that we weren't actually communicating on the same page. And Cecilia just did this, you know, miming Munch's The Scream, or whatever that thing is. So we knew the first thing we wanted to do was develop a data model. We wanted a shared representation of the data that we would all understand, and we wanted to be in agreement about what the data they were collecting was, how it would be used, and what would receive a DOI.
I also am a big believer in data models because a data model is one of the few places where you actually model relationships among entities, and relationship is one of the critical things you capture in metadata. Unfortunately, it tends to be the least captured information about data even though it's the most important. I really don't know how you identify relationships if you don't actually model your data. So first we wanted to do the research data model, and I led that effort; then develop a metadata application profile, which Johan Linn, our research metadata librarian, developed; and then we wanted to do an API that would help us create metadata for this complex project and also automatically generate it, since there is no way we're creating tens of thousands of metadata records by hand each year. So why do we do the data model? I think it bears repeating: we really want to understand the data. We want to be sure that the metadata is in service to the data, and the data is not in service to the metadata. Those of us who grew up with traditional cataloging actually grew up molding our data to fit into MARC and AACR2, to fit into our metadata. The data model abstracts you from that and helps free you from thinking, oh my God, how do I do this in MODS? and instead asking, what's important about this data? What matters about this data? It also starts a conversation with your data scientists where you can say, you know, what are you trying to accomplish with this data? What's the impact of this data? Why does it matter to you? So one thing that was very clear to us is that of the $300 million spent on this project, at least about $280 million was going to the platforms, the instruments, and the arrays. And all of the metadata being captured by Raytheon was essentially a giant instrument inventory. And it's actually a really good inventory.
They have events for anything that might happen to an instrument, such as calibrating it or replacing it or repairing it. They document the obvious things, make, model, et cetera, but they also document how it's used, when it was placed, things like that. So it was really clear to us immediately that the instruments themselves were really what mattered to everyone. What also matters is geographic location. As I say, we're talking about the two largest oceans in the world, and about seven arrays out there that were carefully selected to document literally the health of those two oceans. So location, location, location. The array is a permanent location; it's not going to change, that's been decided. But the platforms are sublocations within the array, and while they themselves won't change, the instruments they hold might. So these were two durable but really important concepts that we needed to capture in our metadata. So as you can see, we looked at their physical objects and then asked, what relationships does our metadata have to accommodate? The platform is essentially a collection of instruments; it may not even be a physical entity, it may just be a whole bunch of co-located instruments. The platform belongs to the array, which is a durable location essentially, and the array has many platforms, so it's a one-to-many relationship there. The platform has physical instruments and sensors, which is also a one-to-many relationship. So the platform itself has instruments, and those instruments matter. And what matters now? Now we've moved from location to: what kinds of information am I collecting? We've identified some really critical attributes here. Location is everything; we need to know where this data is being collected, and we need to know what kinds of information are being collected.
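The relationships just described, an array holding many platforms and a platform holding many instruments, each anchored to a durable location, could be sketched as plain data classes. This is a hypothetical illustration using invented class and field names, not the project's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instrument:
    name: str                 # make/model, e.g. a CTD or fluorometer
    measurements: List[str]   # one instrument may collect several types

@dataclass
class Platform:
    name: str                 # a sublocation within the array
    instruments: List[Instrument] = field(default_factory=list)  # one-to-many

@dataclass
class Array:
    name: str                 # the array is a permanent, durable location
    location: str             # geographic description
    platforms: List[Platform] = field(default_factory=list)      # one-to-many

# Example instance, loosely following the Pioneer Array examples in the talk:
seabird = Instrument("Seabird CTD", ["temperature", "salinity"])
mooring = Platform("High Power Surface Mooring", [seabird])
pioneer = Array("Pioneer Array", "coastal Atlantic", [mooring])
```

In the talk's model, every one of these entities, array, platform, and instrument, is a resource in its own right and receives its own DOI.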
Those physical instruments also produce data, and we started with the raw data. They produce two instances of raw data, really, but the raw data resource itself ends up being only one: essentially, what we learned from the scientists, as I said, is that we can count what's streamed and what's downloaded as just manifestations of the same resource. But the instrument's sensors may actually collect several types of information; the same instrument might collect temperature and salinity, for example. So this ended up being a basic data model that we could all agree on to help us understand what we needed to do with the data. Then of course we get into our next type of data, which is the L0, the compiled data, and what will matter there is we're now going to have to look at the parsers or the compilers; we're gonna have to version those, and we're going to have to catalog those. So what's important to realize is each of these is a resource to us. Everything gets a DOI: the array is a collection, it gets a DOI; the platform gets a DOI; the instruments get DOIs; the data gets a DOI; and then the parsers get DOIs. Why do we do that? We do that because we are talking about authenticity and immutability. And since data is not immutable, how do we ensure immutability in a very dynamic world? Well, every time it changes, we version it: we declare a victory, we say this one's done, and we move on to the next version, and that's how we can say that data is immutable. We're not just letting the data change; if it changes in a significant way, we're declaring a victory and versioning to the next version of the data. So here's the data model in action in the RUcore repository: we have the Pioneer Array, we have the high power surface mooring, which is our platform, and then we have an instrument such as the fluorometer triplet, and every one of those, as I say, has its own DOI.
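The "declare a victory and version" idea can be sketched as freezing each minted version so it is immutable, and chaining new versions back to their predecessors. The helper names and the DOI string below are invented placeholders, not RUcore's actual API or prefix:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen: once minted, a version cannot be mutated
class DataVersion:
    doi: str
    version: int
    predecessor: Optional["DataVersion"] = None  # link back preserves lineage

def declare_victory(current: DataVersion, new_doi: str) -> DataVersion:
    """Close out the current version and open the next one.

    The old version stays citable and unchanged; the new version points
    back to it, so a citation always resolves to exactly the data cited.
    """
    return DataVersion(doi=new_doi, version=current.version + 1,
                       predecessor=current)

v1 = DataVersion(doi="10.1234/ooi.raw.v1", version=1)
v2 = declare_victory(v1, "10.1234/ooi.raw.v2")
```

The design choice here mirrors the talk: immutability is achieved not by forbidding change, but by closing a version whenever significant change occurs.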
So what happens in the data citation workflow is somebody discovers this through Google, through DataCite, through RUcore, and the DOI resolver takes you to the RUcore metadata. Now this is probably going to shock you. They do have a major database for instruments. They had no plans to do metadata for the data. It did not actually occur to the marine scientists and the ocean scientists that they needed metadata for their data. Why do you suppose that is? Well, they live in a science bubble. They're in their bubble. They know that data. They know exactly how they're gonna use that data. There's no confusion about that data. What they confused, though, is discovery. They have a really cool georeferenced discovery application where you drill down to exactly the sensor you want; you can see it on this cool map of the ocean. You say, hey, this sensor is giving me both temperature and salinity, but I only want salinity, and I want it for November through December of 2015. So I'm good, I know exactly what I've got, so I'm not really needing metadata for this; I've got discovery. But what they really hadn't thought about is that there are two audiences they're not addressing. As I said before, they're addressing their core audience very well. That core audience may not have the grant, but they've been invited to many meetings where they got to help plan and develop the OOI. But what two audiences are they not addressing? So this is interactive. Who are they not addressing? Somebody tell me. This is how I keep you awake too. Come on, you can do this. There's two audiences they missed out on. Who are they? Seriously? Okay, I'll give you one. All right, one is: this is a 25-year project, but we hope the data will persist forever. What about the scientist of 25 years from now who wasn't part of all this, who wasn't in that bubble? Do they know what they got?
And who else today is not in that bubble and may not know exactly what they're getting? Absolutely, the interdisciplinary scientists. Everybody pays lip service to interdisciplinarity, but frankly it takes all the energy and smarts a research scientist has to serve their own community. So the interdisciplinary scientists who would love to get hold of this data, the atmospheric scientists, the urban planners for coastal cities, they want this data and they don't really understand where it came from, what its provenance is. They don't know how to use it. What is one of the prime reasons that we have to have metadata for data? Even if you have the coolest discovery in the world, you have to have metadata for what reason? What did I say earlier about data? And you guys talked about this in your presentation, so I'm pointing directly at you. Go ahead, go for it, yeah. It starts with a V. What does metadata let you do? It lets you version. It lets you say, hey, this object conforms to these specifications, I know exactly what I got, but did you change things a little bit? Because I kind of need to know if you changed things. So you have to declare victory in metadata. You have to say, this data is done now. Here's this data. You want the next data? It's related, it's attached, and it's slightly different because things have changed a little bit. And you kind of want to know that, because you're actually deciding the next Marine Fisheries International Treaty, and you really need to know that they got a much better camera with higher definition, and that one of the reasons they're capturing more fish is that it has a more sensitive motion detector. This is something that's actually neglected a lot in science. There are a lot of these sort of hidden variables: things have changed a little bit, but we're just going to keep on trucking. So what we didn't want these scientists to do was just that, keep on trucking.
We do want people to be able to do longitudinal studies, but we want them to know: you're stepping from version to version, like stepping stones; it's not an unbroken path. You can keep doing your longitudinal analysis, but things changed a little bit. We recalibrated this instrument, so here are the recalibration stats, and if you know your onions, you're going to know, I need to compensate for this; I need to realize that they have a much more sensitive salinity measure now. So that's the sort of thing: one reason you have metadata is to be able to do versioning. Fortunately, we already have a metadata application profile for research data that we simply had to modify to support what we discovered in our data model were the critical things that matter. Also, when you're generating metadata for large data sets that just keep on coming, and frankly keep on versioning themselves, what you need to know is: what's my fixed information, my fixed content that I can count on? And what changes the data so that I now have to say it's either a new version or related data? Once you know all of that, you can actually start pulling data from different places to automatically generate metadata, which is what our API does. So I just want to show you that this is there. They said they didn't have a data model, but Johan, our wonderful metadata librarian, actually found this on their very complex website. They actually pretty much had the same data model that we had in mind; they had already designed it, they just didn't call it that. So some of the challenges we've run into: this is a real-world project and they're just really focused on getting it going, and there are a lot of discrepancies right now. There is an inventory database. It was based on assumptions of what would happen when these ships went out and positioned these instruments. These ships went out, they made decisions.
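Automatic metadata generation, as just described, hinges on separating the fixed content you can count on from the per-version values that change. A toy sketch of that merge step, with invented field names (not the actual RUcore API or application profile):

```python
def generate_metadata(fixed: dict, variable: dict) -> dict:
    """Merge the fixed identity fields (array, platform, instrument)
    with per-version values (coverage dates, calibration) into one record."""
    record = dict(fixed)      # copy, so the fixed template is never mutated
    record.update(variable)   # per-version values layered on top
    return record

# Fixed content: where the data comes from never changes within a version.
FIXED = {
    "array": "Pioneer Array",
    "platform": "High Power Surface Mooring",
    "instrument": "fluorometer triplet",
}

# Variable content: pulled per request or per version.
record = generate_metadata(FIXED, {"coverage_start": "2015-11-01",
                                   "coverage_end": "2015-12-31"})
```

The point of the split is that the fixed half can be inherited automatically down the array-to-platform-to-instrument chain, so only the variable half has to be supplied per record.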
Hey, we're gonna move it to this sub-site instead. They recorded it in their ship notes, but it hasn't made it to the database yet. So we're still in a position of trying to get really solid metadata. We have a really good data map that lets us map any metadata into RUcore, and Johan has developed a really good map; we just don't have really good values to feed into it yet. So we're waiting right now for a database that's declared authoritative, and it was a shocker to them to realize how unauthoritative their database was; they're working on it. We also needed to reach agreement with them about when data would be versioned. We agreed that we would version whenever an instrument was recalibrated, and the intent is to actually do that every six months. So every six months, we're gonna have to end a version and start a new one. They actually confirmed that the calibration can really mean that we're capturing different data. They also will be pulling instruments and repairing them, and if they actually replace an instrument, then what we have is not a new version but related data, because the three fixities in our data are that it came from this platform, this sub-location, and this instrument. If the instrument is out of commission a significant length of time, then we're gonna declare a new version as well. And if they relocate an instrument on the same platform, it is not a new version; but if they relocate it to a different platform, which, as I say, could be 25 miles away from where it was originally located, then that's actually a related resource. That's no longer a version of the data, because you can't use it for longitudinal analysis; it's not in the same location. So we had really productive and interesting discussions with the data scientists, and they said something that was really nice. They said, why weren't you with us at the start of this project?
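The versioning rules we agreed on reduce to a small decision procedure. Sketched here with hypothetical event names, purely to make the rules from the preceding discussion explicit:

```python
def classify_change(event: str, same_platform: bool = True,
                    long_outage: bool = False) -> str:
    """Decide whether a lifecycle event yields a new version of the data,
    a merely related resource, or no break at all, per the agreed rules:

    - recalibration (intended every six months)   -> new version
    - instrument replaced                         -> related resource
    - relocated on the same platform              -> same version continues
    - relocated to a different platform           -> related resource
      (breaks longitudinal use: not the same location)
    - repair with a long outage                   -> new version
    """
    if event == "recalibration":
        return "new version"
    if event == "replacement":
        return "related resource"
    if event == "relocation":
        return "same version" if same_platform else "related resource"
    if event == "repair":
        return "new version" if long_outage else "same version"
    return "same version"
```

The three fixities (platform, sub-location, instrument) are exactly what this procedure is protecting: break any of them and the result is related data rather than a new version of the same data.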
And I said, well, memorize our address and we'll see you next time. So our next steps. We have actually got our API to work. As I say, the first array that was set up was the Pioneer Array, and we have taken its metadata, some of which may not be entirely accurate, dynamically generated metadata records, and dynamically assigned DOIs. They have created a location that we can automatically feed the DOI to, so that when a researcher requests data, they can see a DOI. We're still trying to get our heads around the L0, L1, and L2 relationships, although we think we've done a pretty good job with that. And then in the next phase, and this will be for an additional sum, we're looking at generating DOIs for the dynamic data that researchers actually request, because they don't want that enormous run of raw data; what they want is October to December 2015, et cetera. So this is how we did the metadata. Our collection is at the array level. The date of deployment really matters. The location really matters. You can see there are many related resources in the left-hand column. So what are those related resources? Well, all of the platforms or sublocations are related resources, and you can see that Pioneer has a lot of interesting things: it's got a bunch of moorings, but it also has a glider, so it's got different types of sublocations. We chose to look, in this case, at the high power surface mooring, which is a sublocation, and this is the metadata for that sublocation. You can see it's permanently and durably related to the array that it's on. And then we chose this instrument, the Seabird. And for the Seabird, we not only have inherited information automatically from the sublocation and from the array, but we've also captured a lifecycle event, the initial calibration.
The good news is that they capture events and we have event metadata, which is a MODS extension. So they capture events, and we can feed them right into our events. What we need, however, is to be automatically notified every time they do a recalibration, so that we can automatically generate a new version. That's the next step we're working on right now. So when you go to the Seabird, what you also get, when you click on the DOI, is the hyperlink to the raw data download. And this is what it currently looks like. This is really the only metadata they actually have, and it really is more part of their discovery than it is metadata that's captured and maintained any place. So this is, in a nutshell, what we're doing with this ocean of data, and I actually did take only half an hour to do it. It can frequently be frustrating and scary, and we're learning a lot. We do feel like we've been dumped in the deep end of the ocean without water wings. But we are learning a tremendous amount about what scientists care about with data. And we're also learning that even the best scientists have pretty messy processes. We like to think science is a very precise and exact thing; it's not. They're winging it, they're feeling their way as they go. They're lucky sometimes if they have a successful expedition way out into the deep ocean to set these arrays up, because you never know what's gonna come up, a hurricane or whatever. So I was talking to our lead representative, Mike Vardaro, who's in our Marine and Coastal Sciences department and is our science co-PI on the cyberinfrastructure. And I said, are you able to enjoy the ocean? I mean, when you go to the ocean, are you just thinking salinity, temperature, density, et cetera?
And he said, when I go with my two-year-old, I'm thinking about my two-year-old and his water wings and his bucket and how he's enjoying the ocean. He said, I'm not actually thinking about my research. But he said, when I'm doing my research, I'm actually thinking about my two-year-old. He said, I think about him all the time and I think about him as a grown adult taking his own two-year-old and I think about what I hope the ocean will be and how I might play a role in a healthier ocean. So I do the same. When I find this very frustrating, I think about my scuba diving days when I used to scuba dive on the Great Barrier Reef. I think about what a fragile, terrible condition it's in right now and I think about how lucky we are at the libraries to have a role in helping to create a decision-making tool for people to hopefully come together to design healthier oceans. So questions? See, how hard can data be if it doesn't take an hour to describe it, right?