Welcome back to the closing plenary session for our December 2021 CNI meeting. Before introducing this session, I just want to take a minute or two to remind you that in this new and different world we're in now, closing the conference feels a little different, because the conference really spanned two interrelated events: a virtual event last week and an in-person event that took place over the last two days. And in that sense, where in the past I would close the conference, wish you all safe travels, and we would all leave behind what we've done here, a great deal of our conference is still going to be there waiting to be explored and examined. And I would invite you in the coming days and weeks, as your time and interests allow, to continue to explore the wealth of materials from the virtual meeting; you'll see a number of connections between what we've heard and talked about here and what we heard and talked about there. The two really were designed to complement each other, not for one to serve as a substitute. With that, I also want to take a minute to thank people. I want to thank all of our presenters, not just the presenters who've been here in person with us, but the presenters who contributed to our virtual meeting, and even a few virtual presenters who contributed to this meeting but couldn't be here in physical presence. So please join me in a round of applause for those folks. Obviously we can't do it without all of those presenters and contributors. And I also just want to take a minute to publicly thank the CNI team, who have worked incredibly hard and had to be incredibly flexible as we've navigated our way from in-person to virtual and back to a mixed in-person and virtual environment. It's been a lot of work and a lot of making it up as we go along, and I'm actually just delighted at how smoothly everything has gone. So I really think they deserve a big round of applause. And with that, let me turn to our closing plenary. This is really something quite extraordinary, I think, and I hope you'll agree after you see it. I started hearing about this project shortly after the pandemic hit, when we started doing a series of executive roundtables trying to understand research resilience and the effects of the pandemic on the research enterprise. And Keith pointed me at what at that point was a very early set of conversations between Carnegie Mellon and the Emerald Cloud Lab, which is a company out in the Bay Area that was founded by a couple of Carnegie Mellon graduates. And since then, the project has grown in ambition, I think, and really genuinely turned into something quite extraordinary, as you'll hear. Keith and his colleagues at Carnegie Mellon have been incredibly kind to allow me an opportunity to get somewhat familiar with the details of the project. And really, the more you learn about it, the more fascinating it gets, I'll just say. So we have three speakers today. Two of them are here in person, Keith Webster and Rebecca Doerge. And the other speaker is going to, through the magic of technology, join us from the Bay Area. He is Brian Frezza, and he is a co-founder of Emerald Cloud Lab. And with that, I'm going to go sit down and watch the presentation, and I'll come back up at the end. I have a couple of questions that we might chat about a little, and then I'll moderate a Q&A session with all the folks here. So welcome, and thank you so much for being here.
Hi, everyone, I'm Rebecca, and I'm super thrilled today to talk to you about how we are envisioning automated science at Carnegie Mellon. And I'm super thrilled to have the Dean of Libraries at Carnegie Mellon, Keith Webster, and an alumnus, Brian Frezza, co-founder of Emerald Cloud Lab, with me. It's just really fun to give these joint talks. And my understanding is we're going to have a panel for questions later. So let's have a lot of fun, because this is really exciting to think about. So I'm Rebecca, I'm Dean of the Mellon College of Science. I joined Carnegie Mellon five years ago from Purdue University. I always like a disclaimer upfront that I am a statistician who does quantitative genetics, and I get excited about reproducibility and experimental design, so thank you. And if I go off on a tangent, somebody reel me back in, because for a statistician, what I'm about to tell you is super exciting. So I don't need to tell this particular audience how massive data is everywhere around us. And AI technologies, as you've heard today, are essential to analyze and make sense of the data that we're collecting. And the disciplinary boundaries are lower than ever, and people are working on very, very complex problems across those boundaries. And really, there are two things to compare that will help you focus this: think about life science research, and think about the cell phone. Life science research hasn't changed in over 100 years. Yes, the instruments have gotten better, but we still need human beings to go into the lab and do things wrong ten times. And collect a bunch of bad data. We still can't reproduce it once it's published, right? The modern smartphone originated in 2007, and we're all walking around with this handheld computer. We have self-driving cars, yet the way we perform science hasn't changed in hundreds and hundreds of years. So this is what I want to tell you about: the Carnegie Mellon University Cloud Lab. We are building the first remote-controlled academic laboratory, so scientists do not need to be in the lab. You can be in your office, you can be at home, you can be on the beach. You design your experiments using code, and you submit that code to the cloud lab, where the experiments are done 24-7-365. Very much unlike graduate students and postdocs, the instruments don't need to sleep, they don't need to eat, and they don't need to go to the bathroom. And everything is traceable, which points to reproducible science. Everything about your experiment is captured. You have the code that created it. You have the environment, the humidity, the temperature. You know the life cycle of each instrument. You have the parameters of each instrument and when things are out of bounds. And so when you want to reproduce your experiment, you can reproduce it from the parameter file that goes with that particular experiment. Now I'm just going to get you thinking ahead, because this is a lot of fun to think about. Imagine a day when you submit a paper to a journal: you submit your manuscript for review. You submit your data because it's required. But you also submit the parameter file for your experiment and the code for your experiment, because it was done on the cloud lab. So anyone who questions your data or wants to reproduce it has everything they need. This, I think, will ultimately move science forward much faster, much more efficiently, and much more cheaply.
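To make the parameter-file idea concrete, here is a minimal Python sketch with entirely hypothetical names (CloudLabClient and its methods are illustrative, not the actual CMU or Emerald API) of what submitting an experiment as code and later reproducing it from its captured parameter file might look like:

```python
# Hypothetical sketch of "experiments as code": submit a run, then
# reproduce it from its captured parameter file. Names are invented.
import json

class CloudLabClient:
    """Toy stand-in for a cloud lab's job-submission API."""
    def __init__(self):
        self._jobs = {}

    def submit(self, protocol: str, parameters: dict) -> str:
        """Queue an experiment; the lab captures every parameter with the run."""
        job_id = f"job-{len(self._jobs) + 1}"
        # In a real cloud lab the stored record would also include instrument
        # state, humidity, temperature, calibration history, and so on.
        self._jobs[job_id] = {"protocol": protocol, "parameters": parameters}
        return job_id

    def parameter_file(self, job_id: str) -> str:
        """Export the complete record of a run as a shareable parameter file."""
        return json.dumps(self._jobs[job_id], indent=2)

lab = CloudLabClient()
job = lab.submit("AbsorbanceSpectroscopy",
                 {"sample": "lysozyme-batch-7", "wavelength_nm": [250, 350]})

# A reviewer (or you, a year later) reproduces the run from the file alone:
record = json.loads(lab.parameter_file(job))
rerun = lab.submit(record["protocol"], record["parameters"])
```

The point of the sketch is that the run record, not the researcher's memory, becomes the unit of reproducibility.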
So the Carnegie Mellon Cloud Lab is fashioned after Emerald Cloud Lab, which was founded by DJ Kleinbaum and Brian Frezza, who's joining us today. They are alumni of the Mellon College of Science and Carnegie Mellon. And this is a partnership between a for-profit and a university. So on the next slide we will actually start thinking about the differences between running a for-profit cloud lab and having a cloud lab in academia. If I could have the video, that would be great. And I see some of you on your cell phones already. [Video:] The scale of our work has been metered by the long hours required at the bench. It's time to change that. Emerald Cloud Lab is a remote-controlled life science laboratory that allows scientists to execute their experiments without being anchored to a physical lab. In a cloud lab, experiments are driven by issuing commands over the internet, which are then run in a vast, highly automated central facility. With an ECL account, you have full control over every aspect of how your experiments are conducted. Control the transfers of liquids, with volumes from less than a microliter to 20 liters, and the transfers of solids, with masses from micrograms to kilograms. There are over 150 models of best-in-class instrumentation online at the ECL. ECL facilities run your experiments on demand, 24 hours a day, seven days a week, 365 days a year, leaving just hours between the moment you conceive your experiment and the moment you receive your results. It's not unusual for an ECL user to be orchestrating dozens of protocols simultaneously, far more than one could ever manage in a traditional laboratory. When you're ready, you can build scripts which automatically execute a series of experiments of arbitrary complexity, reproduce results, or process the data and generate reports for you to analyze. As chemists and biologists, our minds are capable of moving faster and further than the laboratory has ever allowed us. Take your seat in the command center. Transcend the lab. [End of video.] Over the course of the last three years, I've talked to a lot of people about automating science, and they look at me blankly and shake their heads. But that video really helps you hone in on what we're talking about. At Carnegie Mellon, this will be the first academic cloud lab in the world. And to tell you the truth, I spent probably a good six months to a year asking, why do we need one of these in academia? Can we keep it busy, and is it worth the investment? So we had a working group of scientists across campus. This is a university initiative. It's about $40 million, so it's a serious commitment. And we had engineers, computer scientists, statisticians, and foundational scientists in our working group. And they made a wish list for me. The wish list has, I think, 211 instruments on it. That's about 11 instruments more than what Emerald Cloud Lab has in their facility in San Francisco. And we are building one of these. COVID actually helped us make our case to the university, because we taught undergrad and grad classes using the Emerald Cloud Lab facility in California. The faculty transitioned their research. We had trained faculty prior to COVID; those faculty were good to go and immediately transitioned. And then we kept training faculty, grad students, and postdocs during COVID. So during the three-month shutdown that most research labs in academia experienced, a large portion of our research kept moving forward because of the generosity of Emerald. So this particular cloud lab, it's academic.
You think about it in a very different way than you do a for-profit lab. In our business planning and our business model, we carved out, that's the right word, 20% of the total capacity for the Pittsburgh life sciences ecosystem. We very much want to give back to the community. And we feel that if startup companies don't have to set up their own wet labs and their own computing, it's fast to fail or it's fast to succeed. So ultimately this is going to support Pittsburgh, which leaves 80% for teaching, research, training, and so on. We have faculty who have put in grant applications already, and in all transparency, the federal agencies don't know what to do with budgets that budget for cloud labs. So we are now on a socialization project of going to Washington, or being on Zoom, and doing exactly this sort of talk. And we are already publishing papers; the publications have been easier than I expected, and we have more coming out. So under one roof, imagine that you have 200-plus instruments. Your faculty are not physically in the lab. I don't want them in the lab, actually. This facility will not live on campus. Looking at this diagram, each of these little gray boxes is one of those 211 instruments. And the cool thing about this particular setup is that we can parallelize science. What you're seeing in this picture, in blue, yellow, and red, are three different workflows, three completely different experiments, all running at the same time. Last week I was giving a talk and somebody asked, how many experiments can you run? And the statistician in me wanted to do the calculation for them: 211 choose 2, 211 choose 3, and so on. But capacity-wise, we think upwards of about 12 experiments at the same time. And realize that not every scientist is going to utilize the facility 24-7-365; some of these experiments will be fairly quick. So this is the concept for the cloud lab. As I mentioned before, I spent a lot of time asking, why do we need a cloud lab? Well, it increases reproducibility; it deals with reproducibility head on. It increases research productivity. Imagine not being limited by the grants that you write and the instruments that you can afford to buy. You can do your science where the only limitation is human ingenuity, not money, not the instruments available to you. ECL estimates that their customers have experienced up to a 7x increase in research productivity. We don't have this estimate for academia yet, because our lab is not up and running. We conservatively estimate that we will publish two times the research papers in the same amount of time; stay tuned, we'll update that. And then reproducibility, how exciting is this for statisticians? You can just have a field day with this. And then we have open science. So, improved collaboration: people don't have to come to your lab anymore to learn certain techniques. If you are a scientist who has bad hands instead of golden hands, you are no longer restricted by being in the lab. Knowledge sharing: you can share protocols, you can share experiments. Instead of reproducing or guessing what you've done, your collaborators can just go to the code, tweak it, and move forward, and the discovery science is that much faster. So I think this cloud lab really is a great example of open source. Educating the next generation of scientists, that's our business. That's where we think this is going to go.
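As a loose illustration of that parallelism, here is a small Python sketch (all names invented for this writeup, not any real cloud-lab API): three independent workflows, like the blue, yellow, and red paths in the diagram, submitted together and interleaved rather than run back to back:

```python
# Illustrative only: steps within one workflow run in order, but the
# three workflows themselves run concurrently, like jobs spread across
# servers in a data center.
from concurrent.futures import ThreadPoolExecutor
import time

def run_step(workflow: str, step: str) -> str:
    time.sleep(0.1)   # stand-in for hours of instrument time
    return f"{workflow}:{step} done"

workflows = {
    "blue":   ["synthesize", "purify", "assay"],
    "yellow": ["culture", "extract", "sequence"],
    "red":    ["formulate", "measure_density", "report"],
}

def run_workflow(name: str, steps: list) -> list:
    # Sequential within a workflow...
    return [run_step(name, s) for s in steps]

# ...but concurrent across workflows.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_workflow, n, s) for n, s in workflows.items()]
    for f in futures:
        print(f.result())
```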
Carnegie Mellon is a first mover in a lot of different things. This is a moment in time for science that we don't think we're going to get back, so we are capturing it and guiding it. Democratization of science: I have a lot of colleagues at under-resourced universities, and I know a lot of people who grew up in under-resourced communities. With an internet connection, you level the playing field in science. Everyone can do top-notch science, and so again, it's open source, right? I was just in the previous talk, where a woman from India sped up the code and made it more efficient. It's the exact same idea for a cloud lab. And then, I'm not interested in building a static facility at Carnegie Mellon. I'm interested in starting this facility and then engaging the scientists and the engineers so that we can build the next generation of robots and the next generation of instruments. And we can put active learning on top of that so that we can do our science more efficiently. We don't have to do every possible combination of things, because the algorithms learn as you go. And I have a couple of examples later on for you. So this is why we need an academic cloud lab. I know that we are going to create new disciplines that I can't even imagine. I know that things will happen that I can't even imagine yet, but it certainly is fun to set it up. Meet Dima. In 2019, when I couldn't get the board of trustees to listen to me fully (I went back to Keith, anyway, I'm being videoed), I went back to my office and said, I need a really smart PhD student who doesn't know how to code. And so we looked around, and here is Dima. Dima works on synthetic DNA. So DNA, the As, Ts, Cs and Gs, but you create it on your own, and you need to know what the order of the bases is because you're interested in how it folds. These synthetic strands are very difficult and very time consuming to make, and typically Dima, who is an expert at this point, could synthesize about three a week. We sent Dima to California for a month, trained him, and embedded him in Emerald. And he was able to synthesize hundreds of these PNAs a week. And when he came back, I said, so how long did the code take once you were trained? And he said, I really didn't need to go to California, Rebecca. It only took me 20 minutes to write the code. So a really smart person who didn't know how to code was trained in about four sessions. And it is fairly in-depth training, but once you're trained, you're good to go. And the good news about Dima is that he now works at Emerald Cloud Lab after receiving his PhD, so we also got him a job. And then more recently, this is a publication that just came out in 2021, maybe a month or so ago. This is Olys, and his work in this particular paper is on MRI contrast agents, the ones you're injected with before an MRI. Typically it takes a lot of time to find these different compositions. If you considered 50,000 different compositions, these scientists would just say, that's not worth the money, it's too much time, we're not doing it. Using AI and active learning, learning as you go from the experiments that you're doing and the results that you're getting, Olys and his colleagues found that they were able to test fewer than 400 polymers, because they're testing as they go and eliminating candidates, and to do it in a week's time. And this is just proof of concept. We're able to do science that before they wouldn't even consider doing.
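For flavor, here is a toy active-learning loop in the spirit of that polymer study, in Python with made-up numbers and a deliberately trivial surrogate model (a real study would use something like Bayesian optimization): instead of screening all 50,000 candidates, you test a small batch, refit, and let the model pick the next batch.

```python
# Toy active-learning loop: test < 400 of 50,000 candidates by learning
# as you go. The "model" here is a meaningless index-distance heuristic,
# purely a stand-in for a real surrogate model.
import random
random.seed(0)

# Pretend each candidate composition has a hidden performance score.
candidates = {i: random.random() for i in range(50_000)}
tested = {}

def fit_and_rank(untested):
    if not tested:                      # no data yet: explore at random
        shuffled = untested[:]
        random.shuffle(shuffled)
        return shuffled
    best = max(tested, key=tested.get)  # exploit around the best result
    return sorted(untested, key=lambda c: abs(c - best))

budget, batch_size = 400, 40
while len(tested) < budget:
    untested = [c for c in candidates if c not in tested]
    batch = fit_and_rank(untested)[:batch_size]
    for c in batch:                     # "run" the experiments
        tested[c] = candidates[c]

print(f"Tested {len(tested)} of {len(candidates)} candidates; "
      f"best score {max(tested.values()):.3f}")
```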
So how will the Cloud Lab transform science? Reproducibility, productivity, open science. If we had had one of these cloud labs in an academic, open environment during COVID, and we had them around the country and around the world talking to each other, we could have addressed things even faster. I think the vaccines for COVID have been spectacularly fast, but it could probably have been done even faster. And then active learning, learning from your experimental data as you go: this is the pinnacle of AI, I think, and I just think we're at the tip of the iceberg on that one. And how will this transform society? Cloud labs allow you to pivot literally in a moment's time. If you need more capacity, you plug in more instruments. If someone develops a new instrument and you've tested it and you want to give everyone else access, you plug it in and everyone else has access. So this environment is flexible, it's open. It is the workforce development of the future, because this is the way things are going, and it is democratizing science for sure. So where are we at Carnegie Mellon? We are doing this, first of all, which is really exciting: I'm able not just to lay it out, but to tell you that we're doing it. We will have our facility up and running this time next year, in the fall of 2022. It's an academic recharge center, so we are going to learn how to run one of these in academia. The home of the Cloud Lab is not on campus, as I mentioned before. It didn't need a fancy new expensive building, so it's housed in a building that Carnegie Mellon owns. If anyone is familiar with Pittsburgh, it's about 10 minutes from campus, in an area called Bakery Square, which is where Google's Pittsburgh office is. The architects are chosen, the management team is hired. I've walked the space, which is super exciting. We needed high-bay space, so this is terrific, and we know that the floors can take the weight of the equipment. We have designed the lab. This is what it looks like. It's a little bit bigger than the one in California, which is extra exciting for me. We have a little more than 200 instruments. And so we think about the future. What is this going to do for science? It's open, it's collaborative. We're training the next generation of scientists for sure. This platform is expandable in ways that, again, I can't imagine. And we know that we're creating new disciplines going forward, and we're streamlining the flow of technology and knowledge. The previous talk that I heard had a prediction for the next 10 years. My prediction for the next 10 years is that this won't be the only academic cloud lab. There will be other cloud labs, government and academic facilities, that are connected and talking to each other and sharing data in a very open way. And this is the moment in time for science that the cell phone had and the self-driving car had. Thank you. Oh, yes, sorry, it was Brian. Are you out there? There he is. Hey, Brian, how are you doing? Can you see me? I can. OK, great. Hi, you're on. Excellent. So, following up on Rebecca's wonderful introduction to the academic cloud lab, I'm going to take you through, we're live in the South San Francisco facility, so I'm going to take you through what this technology looks like in action, talk a little bit about the birth of it and how it came about, and then give you some background on where the statistics come from, for instance on the operational capabilities and all that.
So we're driving around what is on the order of a 13,000 to 14,000 square foot facility in San Francisco. And as Rebecca mentioned, there's a larger one going in in Pittsburgh, more toward 16,000 square feet. It's home to around 200-plus different types of devices, although there's redundancy on a lot of devices, so the total number is larger than that; those are unique types of devices. So that may be a specific model of mass spec, a specific model of liquid handler, an NMR; even things like the speed vacs are here. Every manner of small and large equipment is packed into the same facility. And all of it is put under the control of one single software operating system. So there's one software platform, which I'll briefly show you and go through a little in a moment, that's designed to control all of this. This technology developed over the better part of about 10 years. It started at a previous startup that my co-founder and I had worked on, and then spun out into Emerald Cloud Lab as a commercial platform more recently, about six years ago, I think. And really, the birth of it came from grappling with the reproducibility problem that everyone has been discussing. Rather than approach it in the classical manner, which is to say standing behind scientists, taking notes on what they're doing, and asking, how do we capture all the information on what they're doing and then try to repeat it exactly, we approached it very differently. My co-founder and I both have backgrounds in the classical sciences: we both did chemistry in graduate school, he at Stanford, I at Scripps. But in undergrad, in addition to doing chemistry and biology, we did computer science at Carnegie Mellon, back when that was a very new thing to do; now it's normal. So we thought about it very much from the history of the early days of computing, and said, well, there was a transition that had to happen back in the 1940s. Computers were human beings at one point: you would give calculations to a room full of people to go through by hand and try to calculate something, right? And there was a movement, of course, to turn that into an automated, at first mechanical and eventually electronic, environment to compute these things. And in trying to approach that problem, you wouldn't go about it by standing over the shoulder of the human computers doing those calculations and trying to record everything they're doing. The way you'd approach it is a little different: you would sit down and try to come up with a baseline instruction set for the machinery that says, these are all the actions, the sort of unit-level actions, one can take. And then a composition of those unit-level actions is going to be the actual end computation that you're doing. And as long as that instruction set is sufficient to cover the space of everything that's possible, then you can say with completeness that this is a way to get the electronic environment to recapitulate what would be happening with the humans, right? And that's very different, because you have a known closure condition right up front, where you can say, well, is the instruction set sufficient to carry out the task? You know whether that's a true statement before you even roll into the exercise.
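A minimal sketch of that instruction-set idea, in Python with hypothetical operation names (not Emerald's actual primitives): a small closed vocabulary of unit-level lab actions, with whole experiments expressed as compositions of them.

```python
# Sketch of a unit-level "instruction set" for a lab. If every physical
# action the lab can take is expressible with these operations, the set
# is complete by construction: you know up front, not after the fact,
# that a protocol can be captured exactly.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    params: dict

def transfer(source: str, destination: str, volume_ul: float) -> Op:
    return Op("Transfer", {"source": source, "destination": destination,
                           "volume_ul": volume_ul})

def incubate(sample: str, celsius: float, minutes: int) -> Op:
    return Op("Incubate", {"sample": sample, "celsius": celsius,
                           "minutes": minutes})

def measure(sample: str, instrument: str) -> Op:
    return Op("Measure", {"sample": sample, "instrument": instrument})

# An experiment is just a composition of unit operations:
protocol = [
    transfer("stock-A", "plate-1/A1", volume_ul=50),
    incubate("plate-1/A1", celsius=37, minutes=30),
    measure("plate-1/A1", instrument="plate-reader"),
]
for op in protocol:
    print(op)
```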
Whereas when you're standing behind someone trying to write down everything they're doing, you never really have an a priori way to sit down and say, is this enough information? You can't just look at the information set and say it's complete. We had this same level of frustration, where you try to be as careful as possible with all your note taking, but then you can't look at those notes and, from the notes themselves, say, oh, I definitely have enough to get this exact same thing to run. So when we first built this instruction set, it was about connecting all the different instruments you're seeing here. That is, as you can imagine, a massive software undertaking: getting every piece of software here talking to the same central platform. It also meant building a lab execution system which manages traffic throughout the facility. It knows about the many, many different experiments that might be going on simultaneously and makes sure that they're not going to collide looking for resources at the same time. So you can imagine there's a complex resource management system there. And then there's also the human element. We have to traffic samples between the stores and the different instruments here. We have to do a bit of sample preparation, some of which we do with the best-in-class automation that we're able to get access to, but some of which is going to be somewhat manual. Something like filtering a liter of solution is going to have to be a human hooking up a pump filter at the end of the day. But that stuff can still be proceduralized and controlled remotely, in the sense that you are issuing instructions on how it's going to happen, and then it's managed internally in the facility the same way you would any other sort of industrial process. And the collection of those things ends up building the cloud lab that you see here. And what's novel about this, I think more than anything, is what you don't see here. It's not necessarily the instrumentation; a lot of us have access to our own laboratories. What's novel here is that the people who are driving this facility are never in it. They're scattered across the globe, most of them in North America; there's a fair number in Europe that use it every day; and I think the farthest is a large group in Australia that will log on at very odd times for us and run a few of their experiments that way. The way people are getting the sort of productivity numbers that were reported there, that 7x number, is twofold. One is, of course, as Rebecca mentioned, parallelism. If you're operating all this stuff remotely by issuing instructions into the facility, you can of course issue parallel instructions. Just like spinning up more than one server in a data center to do more than one calculation, you can spin up many instruments simultaneously to get many different protocols running at once. So that's a very common use case for the users on the system, and that's a lot of where their efficiency numbers come from. The other part is interesting, and it's about time: once you take on the exercise of capturing the execution of these experiments out of space and out of time from where the experimenter is, you can keep them running 24/7. The experimenter can be thinking ahead about what they want to be doing in the days to come and can put that into their queue.
And then that stuff will continue running overnights and over weekends; the 24/7 environment will not stop, right? And so good experimenters on the system, who are making the most effective use of it, will enqueue experiments for many days in advance, which may then be sitting in their queue. And what will happen, say in this case, this account has 12 simultaneous experiments running, and you can see they have more than that in their queue. If they want to get those others to run, they can set the backlog order that they want, and even if any one of these finishes at four in the morning, it will roll on automatically into the next one. So part of where that efficiency comes from is that, as long as you're planning a little ahead on those experiments, they continue to run continuously. You also need a dynamic environment. A lot of the technology development was around this idea that you have to get things up and running very fast, and you have to have as close as possible to live control. Now, you're not going to be there when it actually runs, if that happens in the middle of the night or when you're not at your computer, or when it switches from one experiment to another. But you're also not going to be making decisions months in advance on what experiments you want, because a lot of that is based on day-to-day activity: you get the results from yesterday on how your purification went, and that's how you decide what to set up the next day. And so part of this technology, and the real challenge in setting it up, is that more than half of the jobs that come in on this facility are completely novel to us; we've never seen them before when they show up. We get them up and running on average within five hours; 10 hours is the point at which alarms start to fire that something has been sitting in the queue for too long. So it really moves at the pace of research, such that you're able to come in with a day or two's worth of experiments and get them into the queue; as they're running, the results come back, and the next day you come in, you have a whole new set of data to look at and a decision to make on what to do next, but you don't have to plan it way in advance. As Rebecca said, the other big part of the value of being able to run this out of space and out of time is that you can use the industrial might of that centralization to democratize the whole thing, because it's a shared-access facility. In a sense, it means that, as the users are all around the world, you can pack everything into the one space ahead of time and run it most efficiently in terms of all the maintenance and the GxP-level qualifications that we're doing with all the instruments, so that by the time you get to the machine, you're using something that's in pristine condition. But you also pack it with a diversity of equipment, which means that the researchers don't have to plan vastly in advance which equipment they want. We often refer to this facility as a sort of Noah's Ark: one of everything packed in here, so that you can try something first, see if it's going to work well for your application, and then either move on immediately to something else if it doesn't, or stick with it and scale up if you want to continue using that instrument.
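A minimal sketch of that queue behavior, in Python (LabScheduler and its methods are invented for illustration, not the real execution system): an account enqueues days of work, the facility runs a fixed number of slots at once, and a finished run automatically pulls the next experiment from the backlog, four in the morning or not.

```python
# Toy scheduler: N simultaneous slots, a backlog, and automatic roll-over
# when a run finishes, so the 24/7 environment never idles.
from collections import deque

class LabScheduler:
    def __init__(self, simultaneous_slots: int = 12):
        self.slots = simultaneous_slots
        self.backlog = deque()
        self.running = []

    def enqueue(self, experiment: str) -> None:
        self.backlog.append(experiment)
        self._fill_slots()

    def _fill_slots(self) -> None:
        while self.backlog and len(self.running) < self.slots:
            self.running.append(self.backlog.popleft())

    def on_finished(self, experiment: str) -> None:
        # No human needed: finishing a run frees a slot and immediately
        # pulls the next experiment from the backlog.
        self.running.remove(experiment)
        self._fill_slots()

sched = LabScheduler(simultaneous_slots=12)
for i in range(20):                       # a few days' worth of work
    sched.enqueue(f"experiment-{i}")
print(len(sched.running), "running,", len(sched.backlog), "queued")
sched.on_finished("experiment-0")
print(len(sched.running), "running,", len(sched.backlog), "queued")
```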
So shopping for instrumentation, then trying to get it online, then fighting to make a single piece of machinery work because you've sunk some very large percentage of your budget into it, doesn't become a major part of the process. In this case, you just try everything in parallel, see what works well, and continue along with that. That's one of the more intangible ways, I think, we've seen the research people are doing on the system speed up: that kind of fail fast and move on quickly. So if you're attempting a new purification, for example, you'll try four or five different techniques in parallel on the same day; maybe one or two of them look promising and the others don't, and you abandon the ones that don't and go with the ones that look promising, but at that point you haven't sunk budget and months into getting the equipment in the door before knowing you want to go that direction. So that's the basics of how it operates. As Rebecca mentioned as well, part of what took a long time in the technology development was getting to the point where this environment could be run by someone who has no computer science training whatsoever. Our bread-and-butter user, certainly in pharma, is a chemist or biologist who of course knows their experimentation very well but doesn't necessarily know anything about coding. And in some sense we're teaching them that, in that their experiments at the end of the day will get reduced to a set of code, but they work through a point-and-click, no-code type interface that we set up, which is really heavily based on AI, but don't get too excited: it's an expert system, so it's like 1980s AI. That's what we use to help design the experiments. You'll start to set up an experiment, maybe tell the system, for instance, what samples I'm purifying here by HPLC, and what it will do is take me through each of these options and give me some suggestions on what they could be. These suggestions are adaptable and changeable, and I can go through graphically and say, oh, I want a different gradient here, and just push buttons to change them, and I get my nice interface here. And you can see, as I'm messing around with this graphically, on the right there is the actual piece of code that's going to come out of it. So we're trying to make it easy to make that bridge, where you are effectively just designing your experiment from the point of view of what it would look like graphically to set this thing up, but everything you're doing always results in an end command which is composable. So you can take these things and build them into, for instance, larger scripts where you have a series of experiments running back to back to back. You can take parts of this command, say we have this gradient calculation here, and I just hard-coded something in here, but you could write a function, for instance, that calculates how this might go based on some property of the samples and substitute it in. So even though you're up and running, like we said, after the four training sessions using the whole graphical system, the more advanced users, very quickly, and especially in academia we've seen this, have been playing around with things like writing code that writes its own experimental execution: self-optimizing code that will try many different experiments, come back with a more optimal parameter set for what you want, and go from there.
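Here is a sketch of that composability in Python, with invented names (not Symbolic Lab Language): instead of hard-coding an HPLC gradient, compute it from a sample property and substitute it into the command, then wrap the command in a loop so the script steers its own next run.

```python
# Hypothetical example of a composable command: the gradient is computed
# by a function rather than hard-coded, and a naive loop searches one
# degree of freedom. The "ideal ramp" rule and scoring are made up.
def gradient_for(sample_logp: float) -> list:
    # Pretend rule of thumb: more hydrophobic samples (higher logP) get a
    # steeper organic-solvent ramp. Pairs are (time_min, percent_organic).
    ramp = min(90, int(40 + 10 * sample_logp))
    return [(0, 5), (20, ramp), (25, 95)]

def experiment_hplc(sample: str, gradient: list) -> float:
    # Stand-in for submitting the command and waiting for results;
    # returns a fake "peak resolution" score (70% ramp pretends to be ideal).
    midpoint = gradient[1][1]
    return 1.0 - abs(midpoint - 70) / 100

# Self-steering loop: each run's result informs the next run's parameters.
logp, best_score, best_logp = 0.0, -1.0, 0.0
for _ in range(6):
    score = experiment_hplc("compound-17", gradient_for(logp))
    if score > best_score:
        best_score, best_logp = score, logp
    logp += 0.5           # naive sweep over one degree of freedom
print(f"best ramp {gradient_for(best_logp)[1][1]}% at logP guess {best_logp}")
```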
So that's, in a nutshell, what the environment looks like, and I don't want to take up too much of the time, because I know Keith is going to follow up here, but I'm sure there are lots of questions that will come out of this. This is very new technology that the world is just getting used to, and we could not be more thrilled that Carnegie Mellon is opening it up. To date we've mostly been working with pharma companies behind closed doors, under a mountain of NDAs, not telling anyone what we were doing. But I think in the long run it's pretty clear to anyone who uses this technology how valuable this democratization of these techniques would be. When you can pass on experimental methods to people in software, where it really is just a series of commands that you push a button to reproduce on the same equipment, potentially even in the same facility, using the same materials, in exactly the same manner as the first time, that can be a major game changer for the way we put together scientific protocols and the way we deal with results. Importantly, too, as we build this database of results, and I think Keith is going to talk a bit more about this, I think everyone is at the point right now where we're all fighting to make sure all the data from publications is available and out there. But that's the most obvious first step; you can't do anything without that. I would argue that the very next step, which everyone is going to realize in the future becomes the most important thing, is this: any piece of data you see in a system is of limited use if it's not actionably tied to the methods used to generate it. By that I mean you don't have an easy way to go back and reproduce the experiment on your own without spending a month or so fighting to get access to the same equipment and dealing with the fact that there's ambiguity in the expression of how it was done, stuff that's just generally not captured by the way someone has communicated it in prose. If instead every piece of data in your database is tied to an electronic method that you can pull up, push a button, get to run in a facility, change one or two parameters, and do derivative experiments off of immediately, that same day, it's a very, very different environment. So a world where the data itself is always inextricably tied to the methodology that was used to generate it, from the point of view of, I'm less than a day away from pushing a button and getting that experiment back, is definitely a very new and exciting world, and something I think this academic environment is going to be able to take advantage of even more than the industrial environment has to date. So I think that's it for me. I'll be around for the final questions. I'm sure this leads to more questions than answers at the beginning, but hopefully it's the beginning of the discussion for a lot of people here wading into this new technology.

Thank you, Brian. Okay, we'll be back for questions with Brian in a few moments' time, if we could switch back to the slides please. Great, thanks. So, as you might imagine, and that was great seeing a Dalek's-eye view of a cloud lab, I don't know why the Doctor Who producers didn't capture that one years ago, if we think about the cloud lab from our perspective in the university libraries, where part of our charge is overseeing the end-to-end approach to open science at CMU and managing the products of research, clearly the sort of productivity gains that we anticipate from this facility
are going to have huge impacts on our roles in capturing, curating, and sharing the products of that research. Brian mentioned the cloud lab Command Center, which drives everything that you've seen. The challenge for us, in an open science environment, is how we interplay with the work we've already been doing, and how we meet our obligations as research funder mandates begin to pick up. For us it really is about pipelines: how do we connect things we are already working with into the ECL Command Center? This is very much work in progress, but just to give you an example of the sorts of things we are exploring: we were early adopters, the first university to adopt protocols.io, and many of our researchers have built their workflows around protocols.io. How can we feed that work into the ECL Command Center? At the other end, there is our elegantly named institutional repository, KiltHub (you get the riff on GitHub), which sits on top of a Figshare platform: how do we derive data and other records from the cloud lab Command Center into the repository environment? One thing we've already determined is that almost certainly there will be just too much stuff to capture everything. So how do we decide what comes across and what doesn't? How do we help researchers make that decision? It's not a decision we will make, but how do we provide the decision matrix to support it? Lorcan had to leave for a flight, so this is a slightly wasted slide, but I always have to throw in this one. Many of you will be familiar with the evolving scholarly record work from OCLC, and I think the cloud lab really exemplifies the focus here: traditionally we were really focusing on the outcomes of research, the end-of-project reports, journal articles, conference presentations, but in a digital research environment everything is susceptible to capture and curation, the products of the research process and the products and outputs of the aftermath of research, and we're going to be dealing with that on a substantial scale as the cloud lab comes to fruition. I'd like to say that we had anticipated this back in 2015, when the university strategic plan pointed to it. We knew the data was coming, we just hadn't anticipated in what shape or form, but it now very much is in front of us, and we recognise that as the scholarly record evolves, we need to be at the forefront of supporting the entire infrastructure that is at play. A couple of years ago we modelled out our perspective on open science at CMU. The graphic at the top is our end-to-end approach, from research design through to post-experimental reuse and reproducibility work, and it maps out the university libraries' approach to open science for the university community. You can see in the boxes underneath the different tools and platforms that we've put in place to support that end-to-end approach. We see that as particularly critical as funders begin to really articulate expectations for sharing and curation across not only publications and data but, I anticipate, code and other artefacts; we absolutely need to be ready for that. One of the things we've been careful to do is to shy away from a platform-by-platform approach, and rather to begin with the end-to-end mindset and then find the best partners and collaborators to meet different use cases across that workflow.
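As a sketch of the kind of pipeline meant here, this is roughly what depositing a cloud-lab run into a Figshare-backed repository like KiltHub could look like in Python, using the public Figshare v2 REST API; the deposit fields, token handling, and the idea of attaching the parameter file are illustrative assumptions, not a built service.

```python
# Hypothetical pipeline step: register one cloud-lab run as a draft
# repository record via the Figshare v2 API (the platform under KiltHub).
import requests

FIGSHARE_API = "https://api.figshare.com/v2"
TOKEN = "..."  # an institutional Figshare API token would go here


def deposit_run(title: str, description: str) -> str:
    """Create a draft article for a cloud-lab run; returns its API location."""
    resp = requests.post(
        f"{FIGSHARE_API}/account/articles",
        headers={"Authorization": f"token {TOKEN}"},
        json={"title": title, "description": description},
    )
    resp.raise_for_status()
    # The create call returns the URL of the new draft record.
    return resp.json()["location"]


location = deposit_run(
    "HPLC purification, compound-17, run 2021-12-06",
    "Cloud-lab run; parameter file and instrument metadata to be attached "
    "as files, so the record stays tied to its generating method.",
)
print("Draft record:", location)
```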
Open science is very much at the forefront at the moment. We've heard about the National Academies' recently released toolkit, which builds upon earlier work, I think in 2018, on realizing a vision for 21st-century research, but the pandemic undoubtedly has accelerated the drive towards open science. UNESCO, amongst others, has really talked about the importance of information sharing through open science, and the pandemic has illustrated the gains as we think about trans-border or cross-border sharing, and the sharing of research outcomes between industry and higher education. UNESCO recently, I think three weeks ago, released its final recommendation on open science; do look out for that. I'm not going to belabor the point here. I just want to spend a couple of moments talking in greater detail about our approach to open science at CMU, and ask you to think about what I'm saying in the context of what you've heard about our cloud lab facility. We have five pillars to our open science promise to the university. The first of those is a suite of tools. The slides will be available, so I'm going to go through these fairly quickly. We have very much adapted and adopted the Open Science Framework as a core part of our approach to open science at CMU, widely adopted with many hundreds, if not thousands, of users across the institution. And then we have carefully identified partners whose business models and products align with our approach, such as protocols.io and LabArchives; I've mentioned KiltHub already as our comprehensive repository, which so far seems able to accommodate anything that has bits and bytes, and hopefully could accommodate atoms in due course. At the tail end, we have been promoting open access agreements; I've talked about these in a number of fora, so I won't go into detail. Our second prong is training. We offer a number of Carpentries workshops every semester. These are always oversubscribed, and we could be running them, I suspect, every day and still find that we couldn't meet all the demand. We complement the Carpentries workshops, which typically are two or three days in duration, with one-, two-, or three-hour workshops delivered by the university libraries in computational tools such as Python, the shell, and Git. These satisfy people who need just a quick hit of how to do a particular activity rather than a broad-based introduction. And we are beginning now to work on series of workshops to suit a particular business case; so for reproducible research, we will pull together four or five different workshops into a curriculum that meets particular needs. And we are now delivering discipline-specific training workshops: we just completed a series in neuroimaging, and we have a series in genomics coming next semester. The third component is our broader outreach events. Artificial Intelligence for Data Discovery and Reuse started as a three-day conference just before the pandemic; we went online last year. We took a skip this year because we had core faculty members on leave, but we hope in 2022 to resume our open science symposium, which we run in partnership with the Mellon College of Science. It similarly is a broad-based event that attracts an international audience. Collaboration is something that we have found is very much at the heart of our open science approach.
Our data collaborations laboratory, pre-pandemic, was held in the library once a week, bringing together data producers and data scientists to share data and to tackle data questions. That moved to Zoom and has continued without missing a session throughout the pandemic. We know that the data collab will continue for some time, but we're seeing a lot of demand to introduce a coding collab as well, particularly for the reasons that Brian described with the Emerald toolkit. And we're seeing a lot of interest in broader community engagement, supporting citizen scientists around and about campus, and we're happy to add that to our collaboration framework. And finally, we are trying to wrestle with questions of outreach and assessment. How do we make sense of this? Recently the open science team has developed a logic model to try and weave a way through the question of what the long-term outcomes and impacts of this work are, rather than just spinning up a platform and hoping that people will use it. We're viewing this as a gradient of different practices, recognizing that one size doesn't fit all. Different disciplines have different approaches to the balance between private and open or public data, private and open science. And we recognize that different research projects will also bring with them different expectations. So, for example, we could imagine in our work with protocols.io different approaches depending on the circumstances in which the project is being conducted. And we're now turning our thoughts to metrics to help us understand how this is playing out. Are we achieving the institution's aspirations around open science? That work is beginning to get underway. This is something that we hope we can develop into a toolkit that we can share more broadly, leveraging and building upon the National Academies' work that I mentioned earlier, and really trying to answer questions such as: what difference do our services make to a researcher? How much time are they saving? How many grants are they able to attract based upon the models that we are putting in place? A lot more information is available on our website. We produce a monthly newsletter; if you go to the link on the bottom right, you can find back issues and sign up to receive it. My colleagues Melanie Gainey, who is a neuroscientist, and Huajin Wang, who is a cell biologist, are both faculty in the university libraries. They would be delighted to hear from anyone who wants to talk more about our service model. They did a great presentation at virtual CNI last week, and you can connect with them at the email address or the Twitter hashtag. With that, I am going to sit down and let Cliff lead the conversation. Thank you very much.

Thank you all, including you, Brian, I know you're out there somewhere, for a wonderful overview. There is so much we can talk about here, and I'm definitely going to leave some time for questions from the audience, but I thought I'd start with a couple of questions, since I've had a few months to mull on this. Obviously this is not a complete replacement for individual faculty labs. This fits beautifully with some workflows; there are other workflows that are very different. Can you give any kind of a sense of which disciplines are best covered by the cloud lab that you're envisioning putting in, and also some sense of the percentage of faculty that are really going to be affected by this?

Right. So, do I think that all of science is automatable? No.
Do I think that we can train students without ever stepping foot in a lab? No. But I do think it's very much like teaching statistics: you do the analysis of variance tables once or twice by hand, and then you appreciate the software. So I see this very much the same way. With respect to usage across campus, one of the things I talked to Brian and DJ about a lot was, I don't want to build this and have it sit idle. Can Carnegie Mellon University keep this thing busy? And they were very polite, and they said, Rebecca, between the PIs, the labs, the faculty, the courses, grad and undergrad, the postdocs, the graduate students, and, at Carnegie Mellon, the undergrads, yes, they're all doing independent research. So I think it's fair to say that in the first two years, upwards of 50% of the science that we do and the teaching that we're doing, and I'm including that in there, could use it. And then as this takes hold, people will be more creative, they'll trust it, and we'll go from there. So right now my hope is 40 to 50% in the first three years. Of everything. Brian, what do you think? Oh, and the second part of the question was which disciplines. So chemistry very much so, biology, biomedical research, chemical engineering, materials science; it goes on and on. At Carnegie Mellon we have a very broad umbrella of usage across the university. Brian, do you have anything to add to that?

No, that was pretty perfect. I think maybe the way to think about how it will complement existing technologies is similar to the movement we all made to data centers, where initially you had to run all of your own local servers, and then these larger data centers came about, and you saw them first being built institutionally, and then there are more global data centers today, run by huge commercial operations, that serve everyone. And that doesn't mean that your computers go away; even if you're using AWS or Google Cloud for something, you surely have PCs at your local site too, and you certainly have work that you're doing locally. But for a lot of your daily driver work, the commercial efficiency and the scale of getting access to that sort of advanced capability, being able to spin up many, many servers simultaneously and share that cost amongst a much broader community, is so compelling in the long run that it changes a lot of your daily driver activity.

Keith just whispered to me: start-up packages. So when we hire assistant professors, we have to do lab renovations, which are incredibly expensive, and we have to fulfill the wish list of each individual assistant professor, and for those of you in academia, there's this little competition that goes on among assistant professors over who got the better package from which university. We are now negotiating time on the cloud lab with our assistant professors, and this is an efficiency that's built in, which we will appreciate over the years, because the lab renovations will go down. There will be more shared facilities for when you do need a wet lab. Not everybody needs their own wet lab anymore that they move into and move out of 30 years later. And we'll be able to really maintain top-level instrumentation, because we don't have to buy one of everything for every faculty member on campus. Thank you for the reminder.

Yeah, I mean it must. I would think it would also reflect in productivity, because all of that startup package and renovation translates into time and lost productivity.
Literally, an assistant professor can come in on day one and they're good to go. They don't have to wait around for their lab to be completed. We often hire people and wait for them to finish their postdocs; we have postdocs coming in January who have been working on the cloud lab already, writing grants, and so we're already seeing really exciting things.

So, somewhat like Brian, I'm very struck drawing parallels between where this could lead and the evolution of computing. For example, one of the things we've seen in computing is institutions working out these balances between when they use the public cloud, when they use local facilities or local cloud, and when they use certain very high-end computational resources that are national-scale resources. And I can readily imagine this coalescing, particularly as you have private and public cloud labs, with some developments in central instrumentation, really high-priced instrumentation, in a very similar kind of way. Does that feel at all plausible?

The analogy is pretty straightforward in my mind. Brian, you and I have talked about this. Do you have anything to add?

Yeah, I think you've already seen that movement in some sense; the physicists are ahead of the life scientists here. They know how to share telescopes in a way that's effective across the community, where not everyone's going to have their own individual instrument; the Hubble Space Telescope obviously has to be shared by the global community. What's different about life sciences is that it's often considered a sort of hands-on thing, where you have to be doing the majority of your work locally, hands on. And so really the transition is that we're taking a lot of this daily-activity instrumentation and putting it on the cloud for remote use. And it means that it has to be very fast-paced, because you have to have that sort of daily access to it. But that same transition, and definitely the same analogy with what's happened with data center technology and global sharing elsewhere, should apply in life sciences too.

Last question from me before I open it up to the audience. I agree with you that this active learning coupled with experimentation is really looking like it's going to be a huge productivity multiplier. And this of course makes coupling in active learning really easy. How much of this are you actually seeing at this point, Brian? Are people really starting to do that now, or is that still something that's kind of nichey?

I would say pretty much everyone either has a mind on it or actively has projects going in that space. And it's been interesting to see; Carnegie Mellon especially has been at the forefront of publishing in some of the journals. There's a great paper that came out a while ago about, like I said, auto-optimizing experiments, which is very interesting, and I don't want to poorly summarize the work, you should go look at the paper, but it's like this: instead of telling the system, like we do right now, this is exactly how I want you to run the experiment, it says, okay, these are the degrees of freedom I can run on the experiment, and here's the outcome I'm looking for.
Let's write code that automatically runs a series of experiments, analyzes those results, comes back and says, all right, are we headed in the right direction, moving down some multidimensional surface, and then tries again with another set of experiments to get you closer and closer to the prescribed outcome, which might be, say, isolating a novel protein, or might be trying to get good mass spec resolution without knowing the details of how to handle all the settings there. And then you're also seeing, and I think this is even more cutting edge than the self-optimizing experiments, people writing learning algorithms that actively sit on the cloud. What's kind of cool about the degree of granularity in the database entries that are generated when someone runs an experiment on the cloud is that it's so detailed, because no one's doing any data entry; it's all captured automatically, all the metadata, all the stuff from the primary instrumentation is captured as well. And so people can set active learners on the database and say, whenever anybody releases a new piece of, say, chromatography data, I want to completely update the training on my model. Then you have an active model that is getting better every day: when anyone anywhere in the system releases new data, it automatically reruns the training. So you can imagine, just from a physical standpoint, if you take where we're at in the state of the art right now, where the publication goes into a journal and then you sort of manually try to munge that data into the right format that's going to be useful for some learning algorithm, it's a very different step than if the thing started and lived and breathed on the computer to begin with; if everything was computational to begin with, it's not hard to start automating and connecting all those wires to really get turbocharged on that learning initiative.

That is really cool. It kind of makes me wonder, not for all data, Brian, but there are certain synthetic data that maybe we don't have to store anymore, because we can just reproduce them cheaper than we can store them. Just a thought.

Yeah, no, I mean, it's an analogy again with computing. As computing has gotten cheaper, you sometimes say, I can recompute that cheaper than I can store it and take care of it in the long run, and you're going to be able to make exactly the same kind of judgments here.

Well, you have all the data you need to reproduce the data. And again, not all science, but I think there's a large portion of it.

Well, you can kind of see that in little optimizations too. There's stuff already in the lab, like, we could do computational predictions of density, but we just take a trip through the density meter the first time we see a new compound, and that's just another bit of code. The execution time of that is weighed against how long the quantum chemistry computation would take to figure it out, relative to just physically taking the sample through the machine in the facility and getting a result back. And then that, too, is executed code. So what, in the end, is the difference in that execution, if it's tied to the physical thing or if it's tied to the purely computational thing?

Let's open this up for questions from you. The floor is open. I bet you have a few.

Really liked your talk, thank you. Thank you, I liked your talk too. Oh, good, we have a fan club. I'm Rebecca Hyde.
Let's open this up for questions from you. The floor is open; I bet you have a few.

Really liked your talk, thank you.

Thank you, I liked your talk too.

Oh, good, we have a fan club.

I'm Rebecca Hyde. A couple of questions, but first I want to comment on something you said about grad students not needing to sleep or go to the bathroom: I wish you had told my advisor that. The two questions are both about going to scale; I don't have any doubt that you will have tons of demand for this. One is operational, one is more strategic and business-like. Back in the day, before we had access to computing the way we do through Amazon Web Services, we had batch queues, right? I'm completely behind the idea of the active learning, that's great, but can you imagine scaling by having batch queues of this sort, running jobs in slow batch mode? And the second: how do you make sure you don't get acquired by Amazon? Or, actually, is that okay?

Brian, I'm going to let you think about the second question. Yes, you can batch these for sure; you just have your prioritization. The batch question is the easy one, so I'm going to take that one, easy peasy. It's just like computing: prioritization. And Brian, I'm buying time so you can think about an answer to the second question. When an assistant professor buys threads, it's like cycles on a compute system; we sell threads to the assistant professors, and those are their dedicated threads. But in the facility, if other threads are available, we just want to keep them busy, right? The analogy to computing is solid for sure. All right, Brian, how did I do at buying you time?

Yeah, that was really perfect. Absolutely. And part of what we do in the resource manager is actually slip work in. You can imagine that a lot of the work in a facility is maintaining all these machines and qualifying all these machines, constantly running controls on them to test that they're working nominally. The traffic manager knows when there's a quiet moment and sneaks in a bit of maintenance, a bit of qualification, and that happens on the same execution system as all of the protocols. At one point we even discussed batching across users. Pharma didn't like this, but I think academia would be much more open to it. We were joking that it would be the Uber Pool version of an experiment: my parameters are similar enough to this other experiment that's running, so what if we batch them together into one run on the same instrument instead of two individual runs? You can imagine why a pharma company doesn't like that idea, but as a cost-saving measure it makes a whole lot of sense, especially if your end goal is to publish the thing anyway. Why not?

That makes total sense.
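The prioritization scheme they sketch maps naturally onto a priority queue: dedicated faculty threads run first, overflow batch work fills spare capacity, and maintenance or qualification slips in when an instrument would otherwise sit idle. Below is a minimal, hypothetical illustration of that policy; the actual resource manager is of course far more involved, and none of these job names are real.

```python
# Hedged sketch of priority scheduling with opportunistic maintenance.
import heapq

DEDICATED, BATCH, MAINTENANCE = 0, 1, 2   # lower number = higher priority

class Scheduler:
    def __init__(self):
        self.queue = []
        self.counter = 0                   # tie-breaker: FIFO within a priority

    def submit(self, priority, job):
        heapq.heappush(self.queue, (priority, self.counter, job))
        self.counter += 1

    def next_job(self):
        if self.queue:
            return heapq.heappop(self.queue)[2]
        return "run_instrument_qualification"  # quiet moment: sneak in checks

sched = Scheduler()
sched.submit(BATCH, "grad_student_overflow_run")
sched.submit(DEDICATED, "asst_prof_thread_run")    # dedicated threads jump ahead
sched.submit(MAINTENANCE, "weekly_calibration")

for _ in range(4):
    print(sched.next_job())
# -> asst_prof_thread_run, grad_student_overflow_run, weekly_calibration,
#    run_instrument_qualification
```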
Now, the question of being acquired by Amazon. I would hope that we remain independent and see this technology through. DJ and I have talked about this at length. Doing what we've done took ten years of technology development, and getting as far as we did was a huge investment; there are much easier ways to make money than the way we've gone about doing things. We're practitioners who wanted to use this technology, and we really built the lab we wanted to use. So it's most important to us to see this technology become a sort of new standard out in the world, that everyone gets access to it, and that we will have access to it in the future to do our own research. I very much imagine my own retirement as being able to go to a cabin in the woods and run experiments all day from my computer if I want to. So for me, what's most important is whatever gets this technology out into the world in the broadest way possible.

Yep, thanks. We have another question.

Hi there, I'm Cody Hansen from the University of Minnesota. Thank you so much, this is very exciting. This is probably mostly a question for Brian, and forgive me if I missed this, but what aspects of your platform and execution system are proprietary, and what aspects are open? Thinking in terms of the open-science and reproducibility aspects of this: is the notation for the experiments proprietary to an Emerald Cloud Lab or a similar lab, or is it something that is open for implementation anywhere?

All right, thank you, very nice to meet you. This is a legal partnership between Emerald and CMU; we talked to a lot of lawyers. The software that ties all the instruments together is proprietary. Brian, do you want to take it from there?

Yeah, it's an excellent question. It is, as you can imagine, an incredibly deep technology stack, because it goes all the way down to the bare metal, to what's happening on the actual instruments, and all the way back up to sharing and processing the data. A lot of what you saw on the screen here is actually just data visualization. We found that we had to write a single data visualization platform for everything, because purchasing the 200 different visualization packages that come from each of the instrument vendors would be cost-prohibitive and wouldn't work, so we centralized that. The way it works right now, the acquisition software tied to each instrument is still the manufacturer's. If, for instance, we're using an NMR from Bruker or an XRD from Rigaku, it runs their acquisition software, and if you build one of these, you have to purchase that commercially to get the thing up and running. Now, nearly all of them have data export, and very few have method import, and that's where we do most of our fighting as we write all the software interfaces. What we do is package everything into this big database called Constellation, and that ontology is very public: anyone who has access to the cloud lab can see exactly what the ontology is, and it is extremely sophisticated. We're up to, I think, 1,300 different data types, and something like ten times that many distinct fields, all indexed and controlled, because of the intricacy of the data you produce. But all of that structure of the data is public and out there. The actual application we run, a thing called Command Center, is a collaboration between ourselves and Wolfram Research, the folks who write Mathematica. Stephen Wolfram and I have actually worked quite a bit together on the language and the philosophy of it; we talk pretty regularly, and he's a partner with the company. So if you wanted access to the libraries of software we've put on top of that, this thing called Symbolic Lab Language, you might need to, for instance, purchase a copy of Mathematica, which gives you these lab notebooks, or get a copy of Command Center, our software that integrates all of that, which is not the major cost of getting access to the cloud lab anyway.
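To give a feel for what a typed, indexed entry in such an ontology might look like, here is a toy Python record for a chromatography result. The field names and types are invented purely for illustration; the actual ontology is vastly larger, with on the order of 1,300 object types and roughly ten times as many fields.

```python
# Hedged, toy illustration of a typed data record; not the real ontology.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChromatographyData:
    sample_id: str                       # which sample produced this trace
    instrument_id: str                   # which physical instrument ran it
    acquired_at: datetime                # acquisition timestamp
    column_temperature_celsius: float    # environmental metadata, auto-captured
    flow_rate_ml_per_min: float
    absorbance_trace: list[float] = field(default_factory=list)  # raw detector data

record = ChromatographyData(
    sample_id="sample-0001",
    instrument_id="hplc-07",
    acquired_at=datetime(2021, 12, 14, 3, 22),
    column_temperature_celsius=30.0,
    flow_rate_ml_per_min=1.0,
    absorbance_trace=[0.01, 0.02, 0.35, 0.80, 0.22],
)
print(record.sample_id, len(record.absorbance_trace))
```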
Further questions? Going once.

Thank you for the talk. Boyan Ken, University of Rhode Island. I have a question about the post-experiment side of the cloud lab; I'm curious about waste disposal after the experiment. In the wet-lab setting, I can see that the bench science is highly automated in the cloud lab, but in a physical lab waste disposal is also performed by staff. You mentioned that you don't expect the scientists to actually visit the lab, but I'm sure there are some people involved who come in. So I was curious whether waste disposal is also more automated in this setting, and how you are handling it.

There are, yeah, thank you. There are technicians in the facility who are trained; sometimes you do need humans to move things. So yes, there are trained technicians in the facility, working on, I've learned recently, three shifts, because it's 24/7, 365. But they are not faculty and students; these are folks who are trained to work in a facility like this. So yes, they do the waste management and things like that.

The waste handling itself looks pretty traditional. The one thing I'd mention that's kind of cool, that's new about the cloud lab, is that in advance of running an experiment you can do the sort of forensic accounting one might otherwise sit down to do by hand: where is every microliter of source material coming from? How much waste does this produce, and of what type? How much plastic do I use, exactly how many pipette tips, et cetera? Because that stuff has to be provisioned anyway by the resource manager, the big software system we write to execute all the experiments, you can get a picture of exactly what it all is, down to the penny, before you run an experiment. I haven't seen anyone do it yet, but I think it would be really cool to see someone write an AI that, instead of just optimizing for some experimental outcome, optimizes for the least waste or the least money spent on materials. Those are all things that are now computationally accessible and connected to the experiments you run.

This has real legs with the federal government, for sure, right? I mean, it's a whole discipline, optimizing your experiment, in every aspect: waste, finances. It's really interesting to think about.

It really is. We are at time. Brian, thank you so much for coming in remotely for this conversation and for showing us around the Emerald Cloud Lab. Keith, Rebecca, thank you for a seriously mind-expanding presentation; I can't wait to see this in operation next year. So please join me in thanking these fantastic presenters. And with that, we're done. I wish you safe travels home and a good new year, and I hope I will see many of you in San Diego in, I believe, very late March; the dates are on the website. We will be doing a virtual event as well, and we'll announce plans for all of that in the not-too-distant future. Enjoy your holidays, and thank you so much for coming and joining us for this meeting. It's wonderful to see you all.