Welcome, everyone. I'm Cliff Lynch, the director of the Coalition for Networked Information, and you have reached one of the project briefing sessions for the spring 2020 CNI virtual member meeting, which will run through the end of May. We have four speakers today, all from UC Berkeley, and we'll show you a slide in a minute giving everybody's title and identifying everybody. The four speakers will do their presentations, and then Diane Goldenberg-Hart from CNI will moderate Q&A at the end. There's a chat box, and we will be putting out a few URLs in there, so feel free to use that. There's also a Q&A tool down at the bottom of your screen, which is probably the best way to raise questions. Feel free to ask questions at any point during the presentation as they occur to you. We'll address them all at the end, but certainly don't hesitate to put them in as they come up.
Now let me introduce the topic here very briefly. One of the things that is extremely high on the agenda of just about every research institution that I talk to is developing a good institutional strategy to address research data science. That strategy is going to involve multiple players around the institution, and the particular assortment of players varies quite a bit from institution to institution. Berkeley has been one of the leaders in this area. They were one of the participants in the Moore-Sloan data science work, for example. They did some very innovative things very early on and continue to evolve their strategy. And so I'm really pleased that we can have this team from UC Berkeley to give you a set of perspectives on their pioneering work in this area, which I think is going to be of great interest. So with that introduction, let me thank you for joining us. Let me particularly thank the participants on the panel from Berkeley. And without using up any more time, Salwa, I will turn it over to you.
Great. Thank you, Cliff.
And hi, everyone, and a big Zoom welcome from sunny Berkeley, as you can see from our background, or as you will see as we continue to talk, though in reality it is drizzling here and it's a little cloudy. But that's what Zoom is, right? It's a virtual world; we can make it whatever we want it to be. So today our topic is infusing data with compute: developing and advancing an institution-wide strategy around research data science. The impetus for this work is that we have lots of players on campus who have come together. We talk about research data, but what are we doing as a campus? We wanted to share some of the work that we're doing and will continue to do, and what we expect our solutions to be.
Our speakers today will be Shawna Dark, who's our chief academic technology officer and the executive director of Research, Teaching, and Learning. And let me tell you, I have to give her big kudos, because while she's joining this presentation with us, she's simultaneously putting out fires around Canvas, academic learning, and all the instructional resilience that, as you can imagine, most of our universities have been thrown into. Then myself: I'm Salwa Ismail. I'm the associate university librarian for digital initiatives and information technology, and I'm also the associate CIO for the UC Berkeley libraries. And yes, like most libraries, we're also managing how we're going to provide instruction and research support to our faculty, staff, and students while we stay sheltered in place. We're still sheltered in place in our county and our city. And then we also have Anthony Suen, who's our director of programs at the Division of Computing, Data Science, and Society, CDSS for short, because we're not using that long name, Anthony.
Anthony's been with the program for about three years or even longer, since before I started at UC Berkeley, and he really has seen this program grow, but he's also seen how our research data, teaching, and other things have come together. And then finally we have Ken Lutz, who's our interim director for Research IT and the co-director of the Berkeley Wireless Research Center. One thing I'll say about Ken is that he and I started in our positions at UC Berkeley on the same day, though Ken's more familiar with UC Berkeley than I am.
So at this point, just to give you an agenda: we'll start off with the opportunities worldwide that have made us realize that there's something we can do. And while we call it a problem, it's really not a problem; it's more like an immediate opportunity that we need to act on. From there we'll discuss some of the key players in the landscape, the roles that they have played, and the roles that they've pivoted to playing to make sure that this opportunity gets realized on campus. And finally, we'll cover the requirements and concepts that are needed to make this opportunity come to life, and the solutions that we as a campus are exploring and looking to move forward with.
So, at this point, I wanted to talk about an article that was published in 2018 in The Atlantic. Since then it's been published in a lot of different places. It literally says the scientific paper is dead; what's next? And that's a big deal, because the more sophisticated science becomes, the harder it is to communicate the results. Yes, you can have your data files in one place, and yes, you can have the published paper. You can talk about it. But again, can your data be realized with just a paper? We like to think of it in a different way, however. We like to think of it as the scientific paper reemerging on steroids. But what can this reemergence be attributed to?
Our guess, and our money as a campus, based on the work we've done, is on interactive computing. So the steroid that's feeding this reemergence of the new scientific paper, and the way it should be, is interactive computing. But what, again, is this opportunity called interactive computing? Many of you might know this, but just to repeat: there are fields that might not think they have anything to do with interactive computing, such as digital humanities, or precision agriculture, or even a field like climate analysis and management, and then of course data science, the obvious one. We're having to learn how these fields can actually be supported by interactive computing, and how interactive computing can help them enter an era where large, complex, research-based organizations like ours need to scale interactive computing with data to the entire organization. It's not appropriate, it's not scalable, and it's not an enterprise solution if I'm doing something for one faculty member and I have to redo it all over again for another student. So how do we actually do that in a collaborative, secure, and human-centered way? We think the answer is in Jupyter. But just to take a step back here, I would like to put in a plug for Fernando Pérez, who's currently an associate professor of statistics at Georgetown... at Berkeley, I'm so sorry. I moved to Berkeley about nine months ago from Georgetown University, so I'm still adapting a little. He's at UC Berkeley, and he's also the co-founder of Project Jupyter. Basically, interactive computing, as we say, is a program that runs with a human in the loop.
The user is writing, running, and tweaking the code while, on the fly, interacting with the results, making changes to their code, bringing in their data, and changing their data. It is basically a back and forth between human interaction, human authorship, and the code that the computer is running. So, as you can see using a Jupyter notebook, and this is a simple example, you can tweak things, you can change your code, you can change your fixed points, and your results will actually be different. If, as a researcher, you realize that you're not importing the right libraries or you've made some errors, you can make the change on the fly, see what your result sets are, and see if that is what you had wanted to do. So Jupyter is a perfect example of interactive computing. Project Jupyter is actually a spin-off from IPython, and the Jupyter notebook is just a set of standards that uses open interactive computing in the back end.
But then what is JupyterHub? That is something that our university has really scaled toward, and we'll talk a little bit more about it in the next few slides. The good thing about JupyterHub is that it serves up several Jupyter notebooks at the same time to multiple users in a preconfigured environment. So as a user you don't have to worry about the versioning, the libraries that you've installed, or authentication. You can log in with your university's authentication (we call it CalNet ID here, but whatever your net ID and password are), spin up an environment, spin it down, and interact with your data. You can move your data around and make it portable, so you don't have to worry about, "Oh, I'm running this on this computer. Can I spin it up from another computer?"
"Oh, now I'm home and I left my laptop at work since we were sheltered in place, so how do I work on this?" JupyterHub, and projects around JupyterHub, have actually shown us how interactive computing can be delivered at a national scale using cloud computing technologies. Done the right way, that makes the user environment around research and compute richer in curated tools, more scalable, and more portable, but also more customizable. How we have actually done that, and how we are planning to continue to do that, will be explored in the next few slides. I'm now going to turn it over to my colleague Anthony. Oh, one quick thing before I turn it over to Anthony: what we're trying to show is how the paper, the physical handwritten paper, evolved to a PDF, which is closer to the publishing that we're seeing these days, and how interactive computing is making it the new scientific paper of today. And now on to Anthony to explore the next steps.
Thank you so much. Now that you've heard this amazing promise of interactive computing, I want to give you just some quick vignettes about how it's being utilized at Berkeley and in the surrounding ecosystem for higher education, research, and publishing. Next slide, please. So, our first example is the Foundations of Data Science class at UC Berkeley. It's quickly becoming the largest class on campus, and it serves well over 1500 students. You can see this auditorium, which can hold several thousand students, and the lectures are massively popular. Now, the question might be: how do you run this class and provide the computing resources for so many students? Our answer is our campus JupyterHub, which is called DataHub, and what allowed us to scale so quickly was essentially just instant login through CalNet authentication.
You don't have to do any setup at all, download any piece of software, or download the assignments. The assignments are just one click away, essentially. So all the students do their labs and their homework on this computing platform. The same goes for the grading and the sharing of information. And this has had profound impacts on the campus. It has become one of the bedrocks of undergraduate education, it has exposed so many more students to computing, statistics, and data science, and it has allowed an evolution in thinking about how you embed these tools in other types of courses, which I will talk to you about later. Next slide, please.
So, originally, Jupyter and JupyterHub were in some ways designed with researchers in mind. One of the really powerful use cases we've seen is at the Lawrence Berkeley National Lab, just up the hill from Berkeley. Jupyter notebooks have become kind of a standard for researchers across various domains at the lab, from biology and genomics to physics and energy, to manipulate, visualize, and collaborate with other researchers on data. Python has become a kind of common solution for all these scientists to work together on solving data science challenges large and small, with big data and small data sets, and it has really allowed for an explosion of innovation across the sphere. And it's not just the lab, of course; across campus, too, many researchers are using JupyterHub. It's also widely adopted in industry and even at nonprofit foundations such as the Chan Zuckerberg Initiative. Next slide, please.
Another really interesting thing that's been evolving is the opportunity in publishing. As you heard from Salwa, this is the next iteration of the paper format, or of PDFs.
At Berkeley we've developed tools with the Project Jupyter team, such as Binder and Jupyter Book, to really enhance the opportunities to publish interactive computing notebooks. One prominent example with the Binder project is the instant creation of these notebooks to disseminate the latest scientific research in an interactive format. The example here, which you also saw in some earlier slides, is the discovery of gravitational waves. Instead of just reading a paper, researchers across the world, and the public, can really see the data, play around with the data, and understand how the researchers were able to work out what was happening. The next example is Jupyter Book, which is a format that builds upon Binder. So you not only have interactive notebooks, but they are now embedded in a format that's publishing-worthy, if you will, like nice journals. The example on the right is the Jupyter Book for the Data 8 Foundations of Data Science class, and it allows a student to have a high-quality reading experience while also being able to interact with modern computational, interactive data sets. And a lot of other examples are just sprouting up; it's a very new field. In economics, for example, there are now interactive notebooks being published by a group called QuantEcon. So we're just at the early stages of this. Next slide, please.
Again, as Salwa already mentioned, interactive notebooks are really having a transformative effect across so many fields: data science, bioinformatics, precision medicine. The list goes on, even to digital humanities. So what are we trying to do? What is the issue here? It seems like all things are going well, right? Next slide, please. The problem, again, is this exploding demand.
And the question is really how we manage, as a campus, to deal with this explosion in demand. Frankly, based on what we've seen, very few campuses have figured this out; probably no campus has really figured out how to deal holistically with interactive computing as a whole. So I'll pass it on to Shawna to explain what Berkeley is thinking about in terms of meeting this challenge and opportunity.
Thank you, Anthony. So one of the challenges that we have on the Berkeley campus is that it's a highly decentralized campus and, as Anthony stated, the idea of interactive computing is exploding on our campus. As a result, there are a bunch of different units that have evolved and are providing different services around data and compute. We have CDSS, the Computing, Data Science, and Society division, which is a new interdisciplinary division at Berkeley that Anthony was speaking about. We have D-Lab, which is a center that provides services around data and analytics and training around software for social science. We have Research IT, which is embedded in Research, Teaching, and Learning, the unit that I oversee. We also have a really great partnership program around research data management with the library and other organizations across campus; the Berkeley library has lots of data guides and consultation services. And then we have One IT, which basically represents our campus IT units that provide administrative and business services across our campus. We're finding that they are actually really valuable as we start to pursue a more strategic vision around how we can support data and compute with regard to research and instruction on our campus. Next slide, please.
So creating a campus strategy is really critical for all of us, and the campus has made the leap in articulating an institution-wide strategy around the formation of data science and the new division on our campus.
Various campus units have seen an infusion of new knowledge, participants, discussion forums, and communities of practice. We're all learning from each other, but I think one of the key things that's been happening is that many of our units have new leadership, and so change is really being spurred by this shared purpose of reinventing IT. There are conversations happening now that I think weren't going on before, and we're super excited about it, as evidenced by this presentation and having all of us here working together. Next slide, please.
So the campus has created and identified values associated with data and compute. There are three of those values that are really critical, I think, to this discussion as well as to the strategy that we're building on our campus. The first is that we recognize that data and compute is happening across domains and across different disciplines on campus, and we really need to embrace the diversity of applications and needs across these different domains. We also have a really clear collaborative vision, the idea being that we work as a team to help identify what the next steps are for our campus and how we share resources to reach those steps. Finally, we're very interested in and focused on supporting scalability. When we say scalability, we're really referring to the very normal practice at UC Berkeley where we have people who are innovating in absolutely incredible ways, and the challenge for us has always been, and this is true everywhere: how do you take these unique innovations and scale them out so that they're beneficial to our campus as a whole? All three of these are values that we come together on, that we agree upon, and that we focus on as we start having discussions about a larger strategy for our campus. Next slide.
So Research, Teaching, and Learning is a unit that I oversee. It is a really expansive unit and includes our Center for Teaching and Learning, which is professional development for faculty, and Digital Learning Services, which includes online course conversion. We have a development team, and we have Research IT, which includes our high-performance computing and secure research computing as well as data consultation. One of the things that's happened on our campus over the past, I would say, five to ten years is the recognition that we really need to pivot and provide resources and services to faculty in a new and interactive way, and in a way that recognizes those three values that I was just discussing. There's a lot of work there. RTL is a newish unit; we really came together in our complete form, I would say, about a year and a half ago, and we're reimagining the research and instructional support experience for our campus. Our services, again as I said, have had to pivot to support programs not just for instructors but also for the overlapping needs of researchers. This is one of the things that Ken and I were on a call yesterday talking about: how we're beginning to see the convergence between instruction and research in a lot of areas, but particularly relative to the fields of data and compute and around data science. We also want to continue to support opportunities and areas for innovation, so you can see how those three core values are really critical in helping to drive the way in which our units are thinking about how to support our campus. Next slide. And I forget who comes next. Anthony?
Yes. Well, you already heard about some of our work with the foundations class and how it's been scaling.
What we have done to make that happen is really pioneer technologies, such as utilizing Kubernetes to scale our hub to support not only hundreds if not thousands of students, but even students across the world who use our Jupyter notebooks across courses. And by getting to the scale where thousands of students can instantly access notebooks, we are also percolating this at the curriculum level, injecting data science into so many other courses, as you can see for neuroscience, psychology, and geology. These can be full-blown courses in majors, but also specific modules that are embedded within a general course that might not traditionally be considered data science. We have many examples of that in digital humanities, for example. We are also working with folks at RTL and other departments to better integrate the core JupyterHub service. For example, as we'll be mentioning further on, bCourses, which is our campus Canvas instance, is to be fully integrated with JupyterHub, so that from the student learning management system these interactive notebooks can be instantly and seamlessly accessed going forward. So these are some of the things on the education side specifically that the division has been working on.
So next up is the library, and this is interesting. What has the library been doing? When I first started here about nine months ago, and I started meeting with Anthony and Ken and Shawna, they were like, "We're going to set up a Kubernetes environment, we'll Dockerize all these things, and let's embed the library's repository." And I'm like, "Oh, great." But the library has also had to pivot. Libraries in general have always been the repository for data access and data deposit. We buy data sets and we make them available.
When our faculty, staff, and students are publishing their data, we try to provide them with data repositories, either in their subject areas or hosted by the libraries through institutional repositories or other specific data repositories. And so while the demands have stayed the same, we've had to pivot, because the questions being asked are: "Great, you have this data set. It's linked in your library catalog, and it's downloadable through a Box instance, or through your Dryad instance, or through your ICPSR instance. Now, can this be available to us in a normalized way? Can my server just call this data when I'm running in this computation environment, in a way that's ubiquitous and seamless? Can I just call this, and it'll give me the normalized data I need?" And I'm like, "Great, let's work on this. Can there be a data lake?"
So the question that I ask here is: what will the data repository be for the library? Is it the same for our published data versus our purchased data? And where does the storage come in? We're talking about data sets that are not just in gigabytes, not just in terabytes, but probably in exabytes of data that then need to be normalized and made available in a very seamless, portable environment. How do I work with RTL, our colleagues in Research, Teaching, and Learning, with Research IT, and with CDSS to see if there are integrations possible? Can there be a research data lake that the campus is thinking about? We often hear of the concept of data lakes with business data: BI data that campuses collect around patron information, or other vendor and financial information. But what does this data lake look like in light of education and research-based data? And for us, taking it a step further and adding another complexity, the library is not the only entity on campus buying data.
We have labs that are purchasing data. We have D-Lab, our data lab, which is also purchasing specific data; they have the federal census data housed at their institution. We also have other entities that negotiate data licenses for campus use, and then the question comes up: "Library, how do we house them? Help us house these so that everyone can use them." We work very closely with our California Digital Library, and last September we, or rather they, as the University of California system, migrated Dash, which was our data repository, to Dryad. So the question here is also what role the Dryad repository plays, in light of Dryad working closely with Zenodo on what they're calling project DJ DZ, where they're combining the repositories so that they can house not just data but also the software and the algorithms that support this data, or that support the analysis that goes into this data. One of the first questions I was asked came when I spoke with our dean for data, that is, our dean of the I School and vice provost for data, who leads our CDSS division. When I met her in January, after she joined us, she asked, "So where do the algorithms go? What is the library doing in helping us provide access to these so that they can be reusable?" These are questions we're answering, but they are also questions that we need to develop solutions for, and we are developing them. The library can't do it alone. We have to partner, and we are partnering, with all our campus units. And finally, as Shawna mentioned, and as Ken will talk about in a little bit, we have a very strong partnership in the research data management program, which is a joint program of the library and Research, Teaching, and Learning, where we actually have shared positions with dual reporting.
We do a lot of outreach and consulting around this, where we help faculty members and researchers with everything from "help me upload these millions of files sitting on my computer to a Box environment or to another environment" to "help me with data management," to you name it around data. Ken will explore that a little bit more.
Thanks. So, yes, just as the library has had to pivot in response to the challenges and opportunities that interactive computing has presented us with, the data services and research data management operations have had to adapt as well. Research data management needs to change from individual researchers managing their own data sets to a new world where researchers are drawing on shared and dynamic data sets that may be ubiquitously available and that support extensive collaborations, both within research groups and with outside research groups doing similar work. So just as the Jupyter notebook can be shared, so can the data that underlies the notebook. An additional responsibility for the research data management people, therefore, is that verifying and maintaining data provenance and integrity in this new dynamic world will become part of the data management process and life cycle. Next slide.
In Research IT, once again, we've had to pivot, and the landscape is now changing, provoked by this really exciting new thing called interactive computing. It used to be that a data set was very tightly associated with a particular computer; data and compute were essentially location-bound. If you wanted to use some data, you had to ask yourself the question: where is the computer, or the IP address, that has the data that I want to compute on? You either had to go to that computer, to where the data was, and compute there, or you had to explicitly move the data to the computer that you wanted to run on. So in that world, data had a location and computers had state. Researchers worked in labs with their peers, and only their peers.
But now, with the advent of interactive computing, the world is really changing and becoming much more dynamic and collaborative, and so we need to support this. Next slide. So the pivot for Research IT is first to realize that we need to factor data and computation apart. Interactive computing can map onto a varied portfolio of computational resources, depending on the work that you want to do. We have high-performance computing clusters, both on premises and in the cloud. We have secure computational resources for sensitive data, and we have virtual machine environments. The computers in these worlds are both stateless and fungible. Data sets, on the other hand, are ubiquitously available and map in a similarly agile manner to support this interactive computing. So we want data to become locationless and yet readily available. Next.
Thanks, Ken. So as we think about this new model, the ubiquity of data, and the capacity for our faculty and researchers to access data in a way that is dynamic and allows them to do so from home during COVID-19, among other things, we thought a lot about how we come together as a team. Over the past year we've actually had two different data and compute workshops. We've got our participants listed here, but there are folks from all over campus, from that list of different units that I discussed earlier, as well as folks from our Academic Senate and our IS&T office. What we were really doing was a SWOT analysis, identifying the opportunities and threats and how we come together to build the future for our campus. Next slide, please. Again, for our campus, data and compute really comes down to universal interactive computing, hybrid scalable computing infrastructure, and integrated consulting services. We see these as the three foundations of our strategy moving forward on our campus. And that's it.
Thank you very much, and we are eager to take questions.
Thank you so much to all of our panelists. Very interesting model. I think it probably raised a lot of really good questions. In fact, we had one question come in during the presentation that Shawna weighed in on. That was from Jesse, who asked if the department is part of the library. The department I believe he was referring to is your team, Research, Teaching, and Learning.
Yeah, I think that's what you were referring to, Jesse. I report directly to the Vice Chancellor of Undergraduate Education, and I have a dotted line into IS&T.
Okay. All right, great. Thank you so much. Thanks for that question, Jesse. And Shawna, thank you so much for clarifying. Just to remind everyone, you're welcome to use the chat box, and there's also a Q&A box if you want to type in your questions there, and a member of the panel will be happy to respond. We have plenty of time for questions. While we're waiting for folks to type in their questions, I just want to remind everyone that this webinar is part of CNI's ongoing spring 2020 virtual meeting, which still has a couple of weeks left to go, so there are lots of great offerings still on the horizon, and I just shared with you there the link to the schedule for the upcoming webinars. We hope that we'll see you back here for some of those offerings. I also want to point out to everyone that just early this week we added a wrap-up closing plenary with Cliff Lynch that will take place Friday, May 29. So if you haven't registered for that, please join us for that final session. I was curious to ask... oh, before I ask my question, let me switch over to MacKenzie Smith. MacKenzie has a question. Hi, MacKenzie. Her question is: has the need to deal with 100% remote teaching affected your plans for this integrated service next year, and how about the shift to remote research?
Shawna, would you like to start with that, or do you want me to take that?
You know what, I think that's such a difficult question to answer at this time. It's funny, because I just sent out an email to all of my staff sharing with them the challenges of being in a leadership role during this time, when people are asking us questions about what's next and we don't really have an answer for that, right? And this is definitely one of those. The great thing is that when we think about the idea of data becoming locationless, it certainly helps to build a more resilient structure and framework for our faculty. So this really falls in line with thinking about all the impacts that we have had on our campus with regard to power outages as well as COVID-19. This really is the way of the future for us in terms of being able to access data and allowing our faculty to continue to do research during this time period.
So we're in Berkeley, California, in Northern California, and we had a few power outages last year due to fire alerts; I'm still learning the terms California uses here. This research data and compute strategy was actually given a boost then, because we realized that if we moved on it faster, it would provide us with research resilience. And we actually did move on it, from January through almost the end of February. Shawna, myself, Ken, Anthony, Cathryn Carson, who's our associate dean for strategy at the CDSS division, and our colleagues for infrastructure at IS&T, which is our main campus IT, had started to come together. We had started to put plans into action around how we ensure that we have this Kubernetes, Dockerized environment that can actually scale up and scale down. That work started at the enterprise level, which campuses need for BI or business needs, and the question was how we transition it to research and academic environments.
And then I guess COVID had other plans for us, because in March we had to take a little bit of a pause to restructure and pivot to providing research and instructional resilience as is. We did use some components of our strategy to answer that need: our high-performance computing center, Savio, and AEoD, Analytics Environments on Demand, and Ken can elaborate on that further. Researchers are using those. However, the entire strategy has of course taken a slight pause, but it's still moving forward. Given that we don't know financially where things will go for the University of California, Berkeley campus, other plans may have to be altered or pivoted, because that's been the theme. However, this is a priority for campus. It's in our strategic plan. And our hope is that if we're not able to deliver on it in the next year, it might be a year and a half, but we'll continue to work at it slowly. Ken and I actually just had a meeting yesterday to talk about how we bring this consulting and outreach together, which just goes to show that the different pieces are continuing to exist. The collaborative approach is also still there; it might just have slowed down.
And I'll just add real quick that two things have been really interesting. One is that we're now seeing higher-level discussions on our campus around integrating some of these tools into the learning management system, in response to COVID and remote instruction. And the second thing, which maybe Ken can speak to a little bit, is that the virtual consultations that we've had around our research data management and consultation services have actually increased. It's really fascinating, because that pivot has actually been something really positive, and I think it will allow us to be more flexible and helpful to our faculty and researchers on our campus.
And Diane, I'm not monitoring questions.
Is that okay? Because my screen is being shared.
No, that's fine. I will definitely be moderating those, but I just wanted to make sure: did Ken want to weigh in there?
Just anecdotally, I can speak to the move to virtual consultations and trainings, the interactions where we would normally convene people in a room or hold office hours and have people come and visit us. We were forced to move very quickly to virtual, and previously we might have viewed that as a somewhat risky step to take, that perhaps our customers would not get as much out of it. It turns out that it's even more successful. Our levels of attendance have exceeded any for in-person trainings, and consultations have gone up. So we very quickly learned the lesson that taking this leap forward is actually tremendously beneficial, and we will not abandon it; it will become part of our tool chest.
That's tremendous. Wow. Well, thank you for that great question, MacKenzie, and thank you all for that very thorough and enlightening answer. We have lots more questions coming in here. Jesse has another question, about the data management slides: can you please provide an example, real or hypothetical, of shared dynamic data sets that are ubiquitous and collaborative?
Sure. One of the ideas that comes to mind is researchers working with a base data set, say a large data set that in one case may be used as a training data set for developing machine learning models. A researcher is essentially going to take that base data set, do some processing, do some computation, and produce a derivative data set, which in the interactive computing world can be put out for other researchers to look at: both the computation that was performed and the output that was derived from it. They can rerun the computation themselves; they can do variations on it.
So it's this world of being able to take a data set, process it, and put out the results for people to look at. They can then try variations on that processing; they can take the machine learning model and look at it. It's a much more dynamic, back-and-forth kind of world, where the dynamic nature, in one case, is in the form of these derivative data sets, the result of your processing, that other people can then pick up and take forward, try a different algorithm on, or take further. That's kind of the world that we're thinking about.
Interesting. Great. Thank you. We have a question from Haipeng Li, who asks: is Data Carpentry a part of your program? If so, which unit is managing that?
I can speak to that a little bit. It's not a direct part of our program, but our research data management program, which is again a joint program between the library and Research, Teaching, and Learning that Shawna and Ken oversee, and Research IT, does support it. We do have coordinators and instructors who have supported Library Carpentry. It's run through CDL, so the California Digital Library provides a lot of support, and we have instructors who've gone in. We do push and publicize and do a lot of outreach around Data Carpentry workshops when they're hosted locally. One of the things that was done right before I started at Berkeley was that we had written up a data plan around data savviness for librarians outside of the librarians who are actually part of the RDM program: how do we bring about not just a basic but an intermediate level of data savviness across all our librarians? Data Carpentry was part of that. As to who runs it, it's again a collaborative effort between our D-Lab, our RDM program, and the library. All of them have instructors, and it's really a small team of librarians and folks from D-Lab.
They work together and they communicate the information. It's almost as though not only is data ubiquitous but our services are ubiquitous. It's seamless: the faculty members or students don't really see who's running it behind the scenes; they just see a service that's coming to them, courtesy of these different organizations.
Oh, that's terrific. Very transparent to them, just in terms of getting what they need and not having to think about where to go. That's great.
Exactly.
All right. Thank you, Haipeng, that was a great question. And thanks for that response, Salwa. We have a question now from Cliff Lynch. Cliff is asking: the idea of data, especially large data resources, becoming locationless is very desirable but hard to do. Can you talk about how you're approaching this technically?
I'll take that one. I was being somewhat purposeful in inciting that question by putting that out there. Locationless data is really the dual of the ubiquitous computing that we used to talk about back in the day. So the way that I think about some of the elements that make this technically possible is, for example, having data be assigned globally unique IDs to identify a data set, and then providing overlay networks to route within, for example, a data lake or some sort of data environment where you have an object store, looking for that globally unique ID. In addition to that, especially for large data sets that you'd like to make ubiquitously available, the data should be available by its globally unique ID, but you also want to start to provide facilities like caching, so that everybody's not copying the same data set down to their local computer. Perhaps you're only pulling in the portion of the data set that you need.
Either you use it or your cache gets another hit. These are ways that you can let data reside in a data lake that doesn't have any particular location and yet hide the latency that that would otherwise imply. So it's a multifaceted solution, and this is not a done deal. This is something that researchers have been working on at Berkeley and the RISELab, and companies have been started around this idea. It's a work in progress, but we have made some interesting inroads in overlay networks to let you locate data by globally unique identifier, similar in flavor to what's done with content-centric networking but without a bunch of the caching. Content-centric networking has flavors of what we imagine, but this is definitely a work in progress. It's aspirational, but we think it's the right direction to go.
And Ken, if I may add, we're thinking about this as the data lake plus this orchestration layer, which has these networks and these facilities that sit on top of the data lake, or I shouldn't say sit on top, but that are then available to the user. So the user doesn't really ever see where the big data is being moved back and forth from; it's the orchestration layer that's doing the heavy lifting.
Yep.
That's so interesting. And I was just thinking, the next time you give one of these talks, maybe your background should be a big data lake.
And the matching sweaters, clearly, Diane.
That's true. Yeah. So, CNI December, the Berkeley team again. Right. Okay, so we have another question from Jesse now: on our campus, interactive notebooks are not widely used, and we've experienced some resistance to learning "another tool" from students and faculty. Do any of you have any advice for how to approach teaching these platforms and "selling" them as learnable? Thank you. We're trying to do something similar using health and medical data on our campus, so that description is very helpful.
Anthony, is this something you'd like to take?
Yeah, happy to take that on. It really helped Berkeley that we had the founder of Jupyter here and also some of the key developers of the technology, which is unique and maybe hard to duplicate. But I think the key impetus, at least from my perspective, has been this foundational course. You essentially ensure that all incoming students who need to meet computing or statistical requirements are taking this foundation course, and that the course is approachable and accessible to every student, because it's not only CS students or data science students taking this course. It's students from every major, actually. And if you make it that broad and ubiquitous, then when a student moves on to the next class that has a data or statistical component, they would love to use the same technology and toolkit and not some proprietary software that's been traditionally used in certain departments. So I think the bigger vision is that Python and R on Jupyter notebooks and JupyterHubs are in many ways going to displace a lot of these proprietary tools that are used by certain departments. I think for certain people that's kind of upsetting, but that is a trend in industry. And I feel like if we want to prepare students as researchers, or as scientists and analysts going forward, they have to use these sorts of technologies. But I think the best way is to start from the bottom up, to make sure that every student is aware of these tools. Then it just doesn't make sense to train them on a new tool that potentially has fewer features than these open-source toolkits. So for me, I think that's a key mobilizing message. It is potentially difficult to create such a course, but I think that's really been the big momentum shifter in at least the last five years for Berkeley.
Yeah. Okay, thanks.
Salwa, is there anything in particular from the library side about liaising with departments that has helped in that process?
As Anthony mentioned, there's the Data 8 course, the foundations course, which is the second largest and a growing course on campus. I'm giving a little bit of context before I answer your question directly, Diane. As we said, we have students who come in from the social sciences; we have students who come in from all different departments, not just data science or engineering, or electrical engineering as we call it. They're across the board: biology, health, non-health courses. So where the library comes in is that we partner with Data Peers. Data Peers is a student-to-student data support service for students doing data analysis or asking data questions, anything related to data. Because the course has already laid the foundational work around Python and using Jupyter notebooks, that's where people come in with their questions: how do I do this analysis? This Data Peers consulting is actually run out of the library, supported by CDSS and their students, and they run the program. We provide the support. Then there are our librarians and our consultants, meaning the consultants from Research IT and research data management and the librarians who are part of this outreach and consulting group. When we get questions around data support and data analysis, we actually do promote the use of these interactive computing tools, JupyterHub being one of those, because we have access to it. We have the tools and the configurations that have already been set up. Again, that was before I started; I have to keep reminding myself I've only been here nine months.
A few of our librarians, our data librarians, were actually working with Python and JupyterHub and providing that interactive support to our users. One of the things that we've realized is that where the library is now putting in its effort is at the beginning of the data life cycle: the acquisition, the finding, the "let's get you this normalized data," which is still something we're working on. How does a researcher just go in from Python? Here is a perfect example. A faculty member once asked me: I know you have this data set, and I'm running this Jupyter notebook. I just want to make a call from my Jupyter program that says, hey, go to the library's catalog and bring me this data. Can the library provide it to me in a normalized way? And I was like, well, I've only been here three months at this point, but probably in about two years we can do that. Absolutely, we should be able to do that. So it's one of these things where the library also promotes the tools that the campus is using, and hence the enterprise-scalable solution in the form of our JupyterHub being integrated with Canvas, which is bCourses, as we fondly call it. Berkeley is very "b": bMail, so bCourses. And that's where the library has also played a role. So thank you, Diane; I know that was a long way of answering that question.
No, no, that was quite interesting, thank you. Okay, and Jesse also thanks you for that answer. So we do have a little more time for questions. Feel free to type those into the Q&A or into the chat. If you have comments, or if you're working on a similar model or aspiring to a similar model at your campus, this would be a great opportunity to chat with folks who have a lot of experience doing this. I'd also like to point out that we have the option in this environment to unmute attendees, if you'd like to make a comment live or ask a question live.
Please just raise your virtual hand, and that will signal to me that you would like me to unmute you. Feel free to do that. All right, just giving folks a minute or two more to make sure there are no more questions out there.
So while folks are typing in or pondering what to type in, I'll share a funny or interesting anecdote. The first data and compute workshop that Shawna mentioned, the one we held on campus, had higher-level members from the Vice Chancellor for Research's office and our campus IT, our campus computing deans, and lots of different areas, including Berkeley Lab, LBNL. It was the first day that Shawna started, and Ken and I actually planned it that way. We were like, all right, great, this will be a great way to welcome Shawna: let's throw you into this workshop where we're all brainstorming how to make data and compute happen within the first year of you being here. I still remember that date so vividly, Shawna, because that was the first day. And I had only been there three months before you started.
Yes, your first meeting.
Oh my gosh. We do have a question from MacKenzie Smith: is there a steering committee or some type of oversight for the program's planning and resourcing?
I will say an informal steering committee; does that sound right, panelists? It's mostly, as we said, a purely collaborative effort. It's Shawna, myself, Ken, Anthony, Cathryn Carson, who's our associate dean for strategy at the Division of Computing, Data Science, and Society, Jenn Stringer, who's our deputy CIO, and a few other colleagues from across campus. We meet and we move forward. For the most part, the driving force so far has been Shawna, Ken, myself, Anthony, and Cathryn. I'd say we're the core team leading this forward, talking about what kind of budgetary decisions we need to make. This was, again, before COVID, so we had plans, but we were looking at questions like: what role do we play for campus, because we know this is a campus priority?
How do we ask for budget, how do we ask for money from Cal Hall, which is our California Hall? How do we move this forward and then bring in our deans: my university librarian, Jeff MacKie-Mason; Jennifer Chayes, who's our vice provost for the division; then Shawna and Cathy Koshland; and Larry Conrad, who is our CIO, and Jenn Stringer, who's our deputy CIO. Did I get that right?
Okay, great. Thanks. And another question from Cliff now: one point you made early on in the presentation is that we might think of Jupyter notebooks as the modern version of the scientific paper. What kind of uptake are you seeing in the scholarly publishing world for this idea, and what infrastructure is the library developing to support these developments?
Oh, that's a tough one. I should have known that would come from Cliff. So, the uptake that we're seeing at Berkeley, and please, Anthony and Ken and Shawna, jump in as appropriate, is that we're actually seeing faculty ask for this. There's this amazing little video that talks about how someone wants to date, and it's a funny video, and maybe later on I'll share it with some of the members of this group. For us, our faculty are asking for this. We're seeing our postdoctoral researchers and our PhD students actually presenting their research through these data notebooks as opposed to long written papers. Personally, in the research in computational social science and informatics that I do, internally within our department we're sharing the research within notebooks, and then my colleagues can go in and actually change my code and bring in different data and do things. As for what the library is doing, the first thing the library is looking at goes back to where the library fits into this ecosystem.
Ideally I see the library fitting in at the front, where the data is being acquired, and then where the data is being published. I actually see this publishing happening at both ends. I see it at the beginning, where we have this environment available for researchers, where the data can be accessed by users and reused in their scientific papers without coming to the library, or coming to the library website, downloading it, and moving it somewhere else, and then automatically publishing it right back into the library. In terms of where we are, we're trying to make the strategy work. We're in conversations with CDL around how we can expand this, or see what integrations we can do with their EZID, their unique identifiers, as Ken was talking about, for this data lake. It's not the Dryad repository that we already have access to, but something just for our campus. We've started thinking about what a repository could be if the repository is a repository in name only and the reality is that it is actually just a data lake, and the user doesn't care what the name is, whether the name is Dryad or ICPSR, just throwing out names right now, or whether it's a data lake and the user just knows this is where they can go. So what the library is doing is trying to work very closely in partnership. We know we can't do this alone, so I call in Ken all the time, and Ken's like, yeah, let's think through this, and Shawna and other members of our campus, Walter and others who are my campus partners at IS&T. How do we think through this? This actually happens a lot in that informal group when we're thinking about moving this strategy forward.
I know I don't have exact answers, because we've really only had about two months to work through this, in January and February, before we were pushed back a little bit. But we're hoping that when we come back, with matching sweaters and the data lake in the background, we'll have more concrete examples to share, Cliff.
All right, you heard it here, folks. Back to CNI in December 2020 for data lake hats and more. All right, that was terrific, thank you so much. Seeing as we are now beyond the time allotted for the webinar, I'm going to thank our panelists one more time for this fabulous presentation, thank all of our attendees for joining us today, and wish everyone well.