So, I think we'll go ahead and get started. So welcome everybody. This is a section that's looking at research data management implications of the Carnegie Mellon Cloud Lab project. I'm Cliff Lynch and I'll be sort of moderating a conversation with Keith Webster, Dean of Libraries at Carnegie Mellon University. Just as a tiny bit of background, this is a follow-on to the closing plenary from the December 2021 in-person meeting. I hope many of you have seen that plenary or the video recording that we subsequently made available. That plenary focused on sort of the high-level strategic implications of the Cloud Lab effort for the research enterprise as a whole, at Carnegie Mellon and beyond. There are a lot of really interesting questions about what this means sort of on the ground for research data management and the role of libraries and scholarly communication that we just didn't have time to get into in the December session, and I certainly heard a number of comments in the evaluations expressing interest in delving into this more. Keith has very graciously agreed to come back and talk some of these questions over, and he's going to start with a very quick kind of recap and refresher on the project as a whole, and then we'll dive into questions and discussion and take some additional questions from the audience. Over to you, Keith. Thanks Cliff, and good afternoon everyone, it's great to be back at another in-person session. It feels like normality is on its way back. So I'm going to do a very quick summary of what we presented in December. For those of you who haven't had the chance to look at what we did, I was part of a three-person presentation team along with the Dean of our College of Science and one of the co-founders of the Emerald Cloud Lab, on whose work we are building.
So let me just give you a quick overview of what on Earth the Cloud Lab is, and then we can dig into what this means for libraries as we see these sorts of operations come to fruition. I think there are two big issues we need to deal with. The first is how do we advance open science and open data in the context of a commercially sensitive operation such as Emerald, and secondly (I've lost my thread already) how do we cope with the data deluge? If you've got a lab that is working 24/7, 365 days a year, what does this mean in terms of working through data sets and figuring out what you keep and what you don't keep? I stole this slide from Rebecca Doerge, our Dean of Science: the recognition that we are entering an unprecedented era of science, that we are seeing vast amounts of data, as I've already mentioned. The application... let me try a different mic, see if that's better. Okay, great. So we're seeing vast amounts of data, as I've already mentioned. We are seeing AI technologies being brought to bear on research and on data analysis, and we are seeing an increasing amount of research that is crossing disciplinary boundaries. A lot of interdisciplinary research, a lot of transdisciplinary research, as we see the formation of entirely new disciplines. My wife, who's sitting in the audience, made the point to me last week that if we think about the structure of library support for research, to a large extent we are still operating in the 20th-century mindset of the individual librarian and the individual researcher. But so much of the research we are trying to support is being conducted by teams of hundreds if not thousands of researchers. We've all been aware of these thousand-author-plus papers. The last data point I saw was more-than-a-thousand-author papers being generated per annum, typically with research teams from more than 60 different countries. How do we think about library support for research on that scale?
And how do we change the ways in which we think about how we organize our work? That's a large part of the narrative behind what we're thinking about. I've got a video coming up, so I'm not going to belabor the overview here. But the broad point about the Cloud Lab is that it's a remote-controlled facility largely powered by robots and remote sensors, with some technician support. We are building our Cloud Lab at CMU on the model developed at the Emerald Cloud Lab in San Francisco. The ECL was founded by a couple of CMU alums around 15 years ago. It is designed to support primarily research being conducted by Silicon Valley startups and biotech companies, and was designed therefore with a high degree of commercial confidentiality and a lot of data security around access to the materials contained in the Cloud Lab. The broad premise is that the researcher, rather than conducting hands-on research in the laboratory themselves, designs their experiments in an operating system. Those designs are communicated to the Cloud Lab, the research is conducted, and the resulting data sets are returned to the researcher for analysis. The Cloud Lab is structured very much on a racked system, where the lab is designed with racks that allow equipment to be added or withdrawn fairly quickly, and workflows are designed with optimisation in mind, to make the best use of machinery so that we're not holding one job up whilst another one is making use of the same bit of kit. The intent is to allow the experiments to be run as quickly as possible. Rather than me trying to explain to you how it works, let's take a couple of minutes to watch the Cloud Lab video. Apologies to those of you who saw this in December. As chemists and biologists, we've always been firmly bound to the laboratory.
For us, the scope and limitations of scientific exploration have been defined by the instruments we have, and the scale of our work has been metered by the long hours required at the bench. It's time to change that. Emerald Cloud Lab is a remote-controlled life science laboratory that allows scientists to execute their experiments without being anchored to a physical lab. In a Cloud Lab, experiments are driven by issuing commands over the Internet, which are then run in a vast, highly automated central facility. With an ECL account, you have full control over every aspect of how your experiments are conducted. Control the transfers of volumes from less than a microliter to 20 liters. Control the transfers of solids, with masses from micrograms to kilograms. There are over 200 different models of best-in-class instrumentation online at the ECL. ECL facilities run your experiments on demand 24 hours a day, 7 days a week, 365 days a year, leaving just hours between the moment you conceive your experiment and the moment you receive your results. It's not unusual for an ECL user to be orchestrating dozens of protocols simultaneously, far more than one could ever manage in a traditional laboratory. When you're ready, you can build scripts which automatically execute a series of experiments of arbitrary complexity, reproduce results, or process the data and generate reports for you to analyze. As chemists and biologists, our minds are capable of moving faster and further than the laboratory has ever allowed us. Take your seat in the command center. Transcend the lab. So we will be opening the Cloud Lab at Carnegie Mellon this summer. It's being developed and constructed in the library's off-site warehouse, which I think is an interesting coming together of the future of science and the record of the evolution of science in a single space. Our science colleagues are terribly excited about the efficiencies they anticipate in terms of equipment. 
No longer do we need to negotiate multimillion-dollar start-up packages for new science faculty who want to design their own lab. They will simply be given access to the Cloud Lab, where most of the instrumentation they're likely to require already exists; if we need to add a couple of new bits of kit, so be it. We are installing 223 different pieces of apparatus at the initial stage of the Cloud Lab's operation. So we are anticipating increased productivity because of the automated workflow part of the process. We are very much focused on ensuring that what has hitherto been a closed commercial operating system is re-engineered to support our institutional commitment to open science and to allow our researchers very quickly to comply with things like the forthcoming NIH data sharing mandate. Just before the pandemic, one of our then-graduate students, Dima, worked with Emerald Cloud Lab on creating some novel compounds. What we found, as you can see on the slide, is that the work that he was doing typically would allow him to synthesize three compounds in the course of a week. With the Cloud Lab, he was able to synthesize hundreds of compounds in that single week. So the acceleration of research very much was demonstrated by his time at the Cloud Lab. After completing his PhD, he was hired by Emerald, and he will be one of the key figures in helping our researchers work with the Carnegie Mellon Cloud Lab when it comes on stream. I've got a few screen dumps here of the notebook. I'm not going to go through them; I just put them in here so that when the slides are captured by CNI, anyone who wants to look at this may do so. The Emerald Cloud Lab website has a number of videos of the operating system in action for those who wish to look at this in greater detail.
As I go through this, I'll just mention that the operating system is structured on the basis of an electronic lab notebook and allows the addition or insertion of protocols, allows for the tracking of experiments as they are conducted in the laboratory, and then it manages the return of the data sets from the lab to the researcher for subsequent modeling and analysis. As I said, one of the things we are very much focused on is helping our researchers work with the relatively closed Cloud Lab operating system against the backdrop of the progress in open science that we have been making for a number of years. Our support for the university community is structured around the five themes of tools, training, events, collaboration, and outreach and assessment. And that is inspiring the ways in which we are working with Emerald to move forward. We don't believe that we are the only place doing this by any means, but we have tried to structure an end-to-end approach for open science, beginning with experimental design through to the reuse of data that have been generated during the course of research work. And you can see in the red boxes at the bottom of the slide the different tools and the different companies that we are working with. It's an interesting philosophical question about advocating for open science and open science workflows when, to an extent, we are working with commercial enterprises. And one thing we may well talk about in the conversation is the question of digital sovereignty, the Digital Services Act that is beginning to dominate a lot of conversations in this space in the EU, and what the opportunities and implications of that might be for those of us on this side of the Atlantic. For us, one of the big proofs of concept of the Cloud Lab is the ability to export data, to export FAIR data, to our repository. We have a comprehensive repository for publications, data, and anything else that is structured on bits and bytes in the FigShare platform.
So we've been working with Emerald to take some experimental data with the associated instrumentation metadata and port that automatically into FigShare. This was done last week and seems to work. So we are hopeful that as the Emerald facility comes on stream, we will have a direct conduit from the Cloud Lab operating system into KiltHub, our branded version of FigShare. That then takes us to the big question: what do we keep? Is it everything that comes out of the Cloud Lab, and how on earth do we accommodate that? How many boxes would you need to accommodate that amount of data? And if not, who decides, and how, which data are kept and which ones just languish, in a sense, in the Cloud Lab operating system? And that is one of the big conversations we are having with the research community at CMU at the moment: the persistence of data generated from the Cloud Lab. So that's my opening framing. I'm going to sit back down here, where hopefully that microphone also will work, and we'll have a conversation, and please be ready to shout out questions, comments, feedback. Thanks. So I have a couple of questions to start us out which I think will probably replicate many of yours. It seems like one of the questions here is almost philosophical: it's how much lives in the Cloud Lab space for how long, and how quickly you export things out into FigShare, or indeed into other repositories, particularly things like, I don't know, NIH specialist repositories that are associated with certain kinds of projects. What's your thinking on that? Are you really going to just export the things that you think are going to persist as soon as is feasible, sort of thing? What we are certainly positing at the moment is that our repository represents the ultimate version of record. We don't particularly use KiltHub as a repository for operational data or data that are still being analyzed. So we, to a large extent, are viewing our FigShare instance as an archive.
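[Editor's aside: for readers curious what such a conduit might look like in practice, FigShare exposes a public REST API for creating repository records. The sketch below is purely illustrative and is not the actual CMU/Emerald integration: the `run` record's field names, the payload mapping, and the token handling are all assumptions; only the general shape of FigShare's "create article" call is drawn from its public API.]

```python
import json

# Hypothetical Cloud Lab experiment record; these field names are
# illustrative, not Emerald's actual schema.
run = {
    "protocol": "HPLC purity check",
    "instrument": "Agilent 1260 Infinity II",
    "calibration": {"flow_rate_ml_min": 1.0, "column_temp_c": 30.0},
    "completed": "2022-03-21T14:05:00Z",
}

def build_figshare_payload(run):
    """Map an experiment record onto the kind of descriptive fields a
    FigShare 'create article' request accepts (title, description,
    tags, item type)."""
    return {
        "title": f"{run['protocol']} ({run['instrument']})",
        "defined_type": "dataset",
        # Preserve the instrument calibration metadata verbatim so the
        # deposit supports reproducibility.
        "description": json.dumps(run["calibration"]),
        "tags": ["cloud-lab", run["instrument"]],
    }

payload = build_figshare_payload(run)

# The actual deposit would be an authenticated POST to the FigShare
# API (e.g. POST https://api.figshare.com/v2/account/articles with an
# "Authorization: token ..." header); the multi-step file upload that
# follows is omitted here.
print(payload["title"])
```

The point of the sketch is only that the "conduit" is, at bottom, a metadata-mapping exercise: each Cloud Lab run already carries the descriptive information a repository record needs.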
And I think that has to be simply the practical application. You know, there are a lot of very nice visualization and analytical tools in the Cloud Lab operating system, and the feedback we are hearing from our scientific community is that they want to make use of those. They are very sophisticated. But let's export data that we know we want to keep as soon as we possibly can. So it's entirely possible that a data set could exist in the Cloud Lab for analytical purposes and, if it has already been deemed to be worthy of archiving, it will have been exported to our KiltHub instance as quickly as possible. There is a separate but related conversation with our local supercomputing centre about deeper storage, because clearly our FigShare instance isn't terribly scalable if we are getting into vast data sets. And I would presume also if you are getting into vast data sets and vast computation you may very well want the data close to the computation. Yes. That's really interesting. I assume that there is a cost for retaining data inside of the Cloud Lab space. There is. Storage is never free anywhere, and our friends at Emerald have been very generous in terms of providing access to the operating system, but we are working with the predictable data storage providers, so those operational data will sit in an AWS cloud instance. Okay. I wonder if you could reflect a little bit on how this looks in terms of being able to document data according to, say, the FAIR principles. I mean, in some sense the metadata coming out that's generated as a byproduct of using the Cloud Lab is just exquisitely detailed, but presumably needs mapping into some kind of external FAIR structures. Yes.
One of the advantages we see of the Cloud Lab is the metadata that support reproducibility. I like to describe those as exquisite because there is a tremendous amount of information about the calibration and settings of the instrumentation, much more so than we typically capture in a regular laboratory environment, and that absolutely is at the heart of why we are doing this. The experimental work we're doing with FigShare is the proof of concept: we are testing the FAIR data principles against what we are able to capture. It's very much a work in progress. My screen grab was from last week, but what I'm getting from my data services specialists is that they are very confident that they are going to have well-structured data and well-structured readme files, and by archiving these in our KiltHub instance they are automatically being exposed to Google Dataset Search and other discovery layers. Are you thinking in terms of integrating this with other pieces of the sort of research data management infrastructure? One that came to my mind was protocols.io. That is something that we are discussing with Emerald, and I think conversations are going well. We were, I think, the first university to have a contract with protocols.io, and our research community has become very passionate users and champions of protocols.io. Emerald has its own version of a protocols library, and what we are trying to move towards is ensuring that a protocol from whichever service a researcher wishes to use, either of those two or any other, can be brought into the Emerald operating system. Oh, that's lovely. And that's part of the open source programming that we are working on at the moment, to make these sorts of portability exercises as seamless as possible. I wonder if you have any particular thoughts on the new NIH data sharing policy that's going to go into effect in early '23, which basically says share early and often? Yeah.
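[Editor's aside: the "well-structured readme files" mentioned above are the kind of thing that can be generated automatically alongside each exported data set. The following is a minimal, hypothetical sketch, not CMU's actual tooling; the `make_readme` helper and its fields are illustrative assumptions about what a reuse-oriented README typically records.]

```python
from datetime import date

def make_readme(title, creators, description, license_name="CC BY 4.0"):
    """Render a minimal plain-text README covering the descriptive
    fields data reusers most often need: who made it, what it is,
    when it was deposited, and the terms of use."""
    lines = [
        f"Title: {title}",
        f"Creators: {', '.join(creators)}",
        f"Date deposited: {date.today().isoformat()}",
        f"License: {license_name}",
        "",
        "Description:",
        description,
    ]
    return "\n".join(lines)

# Illustrative usage with made-up values.
readme = make_readme(
    title="Synthesis run 42 (illustrative)",
    creators=["A. Researcher"],
    description="Raw instrument output plus calibration settings.",
)
print(readme.splitlines()[0])
```

Generating the README from the same metadata that drives the repository deposit is one way to keep the human-readable and machine-readable documentation from drifting apart.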
I think we have been on a journey over the past few years of dangling carrots in front of our colleagues, and many have accepted those carrots and have become accustomed to data sharing because it's a good thing to do for so many reasons. What we now have, for good or for bad, is a stick, and I'm honestly not aware of any particular problems at CMU where the stick is going to need to be used. I really hope that by this stage, because in terms of the global scientific enterprise other countries have been on board with this for more than a decade, the research community is at a point of doing things because it's the right approach. It's supporting data reuse. And just as an anecdote there: I was chatting with one of our faculty last year, who was telling me that during the lockdown in 2020, when many laboratories were closed, she was assigning graduate students work that was based upon them discovering data sets that had already been shared by researchers anywhere and running secondary analyses of these data sets, as a way of keeping them engaged when they couldn't get into the lab. That activity not only kept them occupied but led to a number of interesting journal articles from her lab. So these sorts of anecdotes, I think, help create an environment in which data sharing is seen as an important part of scientific research activity. Last question from me and then I'm going to open it up for questions from the audience. I just want to follow up that very intriguing brief reference you made to data sovereignty and the implications there, and I wonder if you'd be willing to expand on that just a little bit.
Sure. So I think what we're seeing in the European Union at the moment is something akin to a pushback against big tech. An early wave of this came in the GDPR development a couple of years ago, but what we're hearing is that the EU wants citizens and institutions like universities to have a much greater say in what happens to their data, to the point of beginning now to propose legislation that is likely to be fairly strong in regulating how companies can accommodate and use data that are shared by individuals. Now, to a large extent you can pinpoint companies like Facebook or Meta as the primary targets of this legislation, but what I'm hearing in conversations with publishers and other companies in the spaces that we inhabit is that they are concerned that the Digital Services Act as currently drafted might well have implications for them in terms of their ability to do business the way they have been doing. Not to say that that's a bad thing, but it's just something we need to be conscious of as we think about the landscape in which we are working. Perhaps the opportunistic bit is that part of the EU's rationale here is to create greater opportunities for smaller tech companies, primarily those based in the EU, to compete on a level playing field, so we may well see startups in the EU beginning to offer services that might offer healthy competition with some of the companies that we historically have done business with. We know that tech companies are very keen on working with universities because they are looking for large-scale partnerships to be successful. There are very few sectors that can do that: probably health, defence, government, higher education. So I could see collaborative potential as we see how things unfold; it's something we clearly need to keep an eye on when we are talking about data sharing and thinking about companies like FigShare. Interesting. Thank you for all of those answers. The floor is open.
You have a great opportunity to get more insight into what they are planning to do at Carnegie Mellon. Hi Keith. Thank you for that. It was wonderful. I have a question that's not about data because this whole thing is reminding me of the iLab experiment at MIT a long time ago which was really more focused on education than the research but one of the controversies that came up was that by using a facility like that you are really limiting the scientific options that researchers have because no lab can afford to buy all the equipment that anybody might want particularly really expensive new things so you are looking for economies of scale and so it was really forcing people into a type of scientific research because of the facility which was the opposite of what you really want. So I guess I am asking you if you thought about the kind of long term potential of this both for research and education. I am surprised that you are covering biology because that brings in animals and people and all kinds of things that don't really fit this model. So could you talk just a little bit about where this is going at Carnegie Mellon and its limitations? Sure. I will immediately put in a disclaimer that I am not the expert on this at all but some of the things that we have been talking about as we try to put a rough measure of how much research we expect to be done in the cloud lab as opposed to traditional wet labs it is roughly 50-50 in the early stage so there will still be wet labs. We may or may not still have an animal house because I am not going to talk about animal research but clearly absolutely a facility like the cloud lab is not going to have an animal house associated with it or a vivarium or anything else of that sort. What we know with the Emerald Facility is that it can conduct research using either compounds that are typically kept in the cloud lab or it can work with samples that are sent there. 
The cloud lab will be about half a mile from our main science buildings at CMU, so transport of samples will be fairly easy. That economy of scale is something we are very impressed by: the potential for sharing machines rather than every lab having its own mass spectrometer or whatever else might be required. One of the things that we are engaged with at the moment is helping research funders understand what this means: that we will not necessarily be bidding for the typical laboratory expenses in the traditional sense, but rather it will be help us pay for cycles in the cloud lab, and that is a conversation that is ongoing, absolutely. So those are just a few of the random conversations that I've been party to, but I suspect it won't really be until the cloud lab is fully up and running that we really understand at scale the potential. In the early days of the pandemic, a number of our researchers gained access to the Emerald facility and were conducting real, externally funded experiments there, partly to ensure research continued, partly as a proof of concept. We anticipate probably in a couple of years' time starting programs in automated science that will involve training the next generation of laboratory leaders in the operation and management of these facilities, and we also expect that our students at all levels will have opportunities to work with the cloud lab as well as developing the wet lab skills. In terms of training those who use the cloud lab, we have two training pathways: one is the cloud lab training program, which Cliff has undertaken, and the second is a curriculum that is being delivered by the open science team in the University Libraries. Either of those acts as the pathway to being able to then register as a cloud lab user.
I'll just add two comments. If you haven't looked at the video from December, the speakers there also touched on this issue of, you know, what portion of the kind of research activities might be served by something like the cloud lab, and as Keith says, they estimated it's going to vary quite a bit by sub-discipline, but probably not more than 50% max in most sub-disciplines. But that's still a lot. The other point that really emerged, which I found very interesting, is that the cloud lab also makes newish kinds of scientific inquiry very convenient in ways that are hard to do now. So, for example, you could couple the cloud lab control system with a machine learning system to do a series of adaptive experiments that optimize some parameter in, say, materials science or something like that, and that takes a bunch of work to do with existing lab gear. So it may be, I don't know, that this is used as much for things that people aren't doing very much of now as anything else. Further questions? Hi, I'm Curt Hillegas from Princeton University, I'm the associate CIO for research computing. Thank you, this is really interesting, and the answer to my question may be, you know, "watch the video, stupid," because I haven't seen the video from December, but I'm interested in the financial model. Is this going to be a full cost recovery? What are the thoughts about that? And really, with the idea: could the financial model be used to incentivize good data management practices, by incentivizing people to put things into KiltHub or other repositories?
Yeah, we're still working on the cost recovery approach, but there certainly will be an aspect of that, and I like your idea about incentivizing. We are currently developing the cloud lab based on philanthropic gifts from a number of foundations, and part of that is to cover the costs of the early years of operation. What we don't want to do is put obstacles in the way of people really embracing the opportunities of the cloud lab, but I've no doubt that probably after the first three years we will see cost recovery being used, either to ensure that research grants are covering costs or through the tuition model that we have for student course work. But incentivizing data sharing absolutely should be part of that, so thank you. Geneva Henry, George Washington University, and this sort of follows on from MacKenzie's question and this last question. You know, Keith, when I hear you describe this, what immediately pops to mind is research cores, and clearly you have a lot of research cores at CMU, and the idea behind that is to really bring together pieces of equipment that are very expensive into a shared research facility, and then there is a cost recovery model associated with that. So I have a couple of questions. One is, are you looking to replace any of your research cores with this new cloud lab? And then the other is, do you have researchers sort of signed up saying, yep, I'm there, so that you know that usage is going to be there? And are you taking it outside of CMU for that research as well? So there are multiple layers to the answer that I'll give. The first is our schools of biology and chemistry, and I'm glad that Roger's going to bail me out on research cores, but our biology and chemistry research labs are in the original Mellon Institute building, and the intent is that that facility, which is a huge science building, will remain.
We are constructing a second science building nearby, and the intent is largely that that building will not have lab facilities. What we want to do is optimize things, so the existing Mellon Institute facilities will be adjusted, and researchers in the new facility will share access to those facilities for the roughly 50% that will remain traditional lab research, and the other 50% will make use of the cloud lab environment. So specifically whether or not we are adjusting cores I think will evolve over time. The second part of your question, Geneva, could you remind me? Thanks. Absolutely, we have quite a number of our researchers who have been using the facility in San Francisco, as I mentioned, during the pandemic and beyond, and we have a couple of classes using the cloud lab already in San Francisco. They are desperate for our facility to open, and many others are lined up, so we are confident that demand is there. Whether demand is there for a 24/7/365 cycle we don't know at this stage; the answer to that will, in a sense, shape the response to the second part of your question: will other institutions be invited in to make use of the lab? We have a very strong collaboration with the University of Pittsburgh, which sits on an adjacent campus. I have no doubt that we will see colleagues from Pitt being invited to use the facility. It may well then scale to regional or broader usage, but once it's open we'll have a much better sense of what it looks like. Roger Schonfeld from Ithaka S+R. Keith, thanks for this presentation, in addition to the one at the last CNI. As I think you know, I see the potential here overall as being quite significant, in particular thinking about the future of scientific labor and scientific work, in a lot of really important ways.
My question is actually not about the cloud lab but about the Carnegie Mellon Libraries. I'm really struck by, thinking back to presentations I've heard you give at CNI going back some years, some of the directions here; this is sort of yet one more example of how you're positioning the library within the institution. I wonder if you could just say a little bit more about the library strategy and resource reallocations, and where this and other adjacent directions sort of fit into that, because I feel like you've positioned the library somewhat differently at Carnegie Mellon than some of our colleagues at other institutions have done, and I just was wondering if you could say a little bit about what you've done here. Sure, thanks, Roger. It's great not to talk about cost recovery of labs. So the way that I would often talk about our strategy relates to OCLC's evolution or expansion of the scholarly record, and I've seen that in a couple of trajectories. First, the institutional context: we are renowned for our strength in engineering and computer science and in the fine arts, and candidly, none of those fields are typically very heavy users of the traditional collections-based library. As I think about where libraries can make a difference in universities of the 21st century, I see those as going beyond the notion of a big building full of printed materials.
As I think about the horizontal expansion of the scholarly record from that print-based, library-based collection, it begins to move through the digital versions of those collections, PDFs of journal articles, ebooks, through to web-based scholarly materials, policy documents from governments and national banks, World Bank-type organizations, through to digitized materials and data sets, way out into the open web and potentially even into the dark web, but let's not go there for this conversation. But then there is the vertical expansion, where we begin to look at the objects that are generated during the research process: the data, code, protocols, all of these things that we will see coming out of a traditional lab, a social scientist's laptop or notebooks, a cloud lab. And then we have the various objects that are generated post-research: the reuse and reproducibility of experimental data, the community conversations and open peer review, the impact and evaluation of work. All of that, the horizontal expansion and the vertical expansion of the scholarly record, represents to me the scholarly record of the 21st century. The library has always been engaged in curating the scholarly record. Through the 20th century and earlier, of course, that really was represented in the materials we brought from outside into our campus, into our libraries: books, journals, and the indexes, reference materials, and other things that were required to make the most of that record. But as we look to the next 50 years, what we see is this vast expansion, which requires very different skills to manage, and that has been the journey that we have been on. So what we have done, past sustaining and growing the work of the library in the print environment, is recognize we've had to bring in different skill sets: people who are comfortable working with researchers as they instrument a cloud lab, people who have gone through scientific training of their own to help them understand the challenges of mapping protocols into experimental work and helping to navigate which data should be kept and which data shouldn't be kept. So we have really been focused on expanding our team and ensuring that we have a different skill set that can cover that expanded scholarly record. We've been fortunate to be funded by the university to grow the team, so it hasn't to a large extent been a substitutive approach at this stage, but rather an expansive one that reflects that the library of the 21st century will represent a very different community than the one that preceded it, but is one that is true to that fundamental value of the library as being responsible for curating the scholarly record. Thanks for the question. And with that, thank you for joining us, and please join me in thanking Keith for a very informative presentation and conversation, and one that I look forward to continuing as the cloud lab goes into operation. Thank you very much.