To introduce our speakers: Mike Holloway is a data scientist at UKCEH in Lancaster, where he works on the Data Science of the Natural Environment project, a collaboration with Lancaster University. Michael is also a data scientist at UKCEH and has worked heavily with the Data Labs platform, including on the UK Environmental Change Network. And finally, Tom. Tom works in the Biological Record Centre, the BRC, and he works with statistical experts to develop methods for analysing species occurrence data, and works to make these methods available to other academics and researchers. So, with all that out of the way, let's make a start to the webinar. Mike H, well, Mike Holloway, is going to lead off, and he's a brave man because he's going to be doing a live demo to start off with. So over to you please, Mike. Morning everyone. Yes, like Mike said, I'm going to be quite brave today and try to show you the Data Labs project live, and show you how it was used to tackle a scientific challenge within a project that I'm heavily involved with, joint with Lancaster University. So just to put a bit of context, the project we're working on is called Data Science of the Natural Environment. And it's pretty much what it says on the tin.
So the idea is to bring computational scientists, environmental scientists, data scientists and, yeah, statisticians together to try and bring a new way of looking at environmental data, because often people just resort to the same sort of methods, like fitting a regression model. So we want to sort of see if we can use new methods. So in this particular case, we were challenged with coming up with a new way of evaluating a numerical model, aside from just comparing some time series and looking at the mean or something like that and how well it performs. So, like I said, like Mike said, the Data Labs platform was a great thing here, because we thought it can break down the barriers to collaborative working, you know, sharing code, sharing data, ensuring everyone works on a consistent version of the data. So here we are with Data Labs. So when you log in, like we said, it uses the power of the cloud, it's hosted on JASMIN, so to access it you don't need any special computational skills, you don't need anything like that, you just need a web browser. So you come into this web browser here and you see an information page. I did actually just update this information two minutes ago; it doesn't seem to have come through. It explains what DSNE means: like I said, Data Science of the Natural Environment. So you come into this area and this is something everyone on your project can see and share. So, first and foremost, we have some data. I'm not going to go into details because I can refer you to the previous seminar, or webinar, where they dealt with this aspect of the system, but we do have a storage volume here where everyone's data is stored. Everyone works on the same version of the data, and this is where sort of our site data and model data are stored. One thing I will say is Lancaster University had some data stored on their system there.
So we actually used a mirror, and we use API access to their THREDDS server to read that data directly off it. So whenever Lancaster University updated their data, it fed straight back into the Data Labs environment. So, first and foremost, I'm actually going to go through a notebook. Most people will have used Jupyter notebooks before, and that's sort of the backbone of the collaborative aspect of Data Labs: using these great environments to work together on a particular analytical piece of code. So, like Mike said, we have Jupyter and we have RStudio. So I'm going to show you one I've preloaded. So you come into this environment here and you have a notebook fired up, explaining what's going on, and it has a nice combination of code, data sets and also a list of all the files that you've got working in the environment here. And the other thing to show is that when people are sharing code and data, you often have to get the environment working on your own system, and you sometimes have to share the data. Like I said already, we've catered for that in this project: we've got the numerical model data sitting at Lancaster University, which we're reading into the labs, and we've also got the observational data we're comparing our model to sitting here live in Data Labs. So, again, everyone's working off the same version of the code. It's linked through the notebook through Git, so you can attach to a Git repo here. This particular one isn't, because it's a demo notebook, but the actual notebook normally is linked through to Git, so everyone's working from the same version of the code. And everyone can edit the code as well, and it's version controlled, so it's backed up. But more importantly, we utilise nice things called conda environments. I'm not going into the technical details, but basically it's package management software, so that everyone has the same environment available off the bat.
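Reading data off a THREDDS server usually means opening an OPeNDAP endpoint rather than downloading files. As a rough sketch of that idea (the server address and dataset path below are invented placeholders, not the project's real ones), the endpoint is just a composed URL that a data client can then open directly:

```python
def opendap_url(server: str, dataset_path: str) -> str:
    """Compose the OPeNDAP (dodsC) endpoint a THREDDS server exposes for a dataset."""
    return f"{server.rstrip('/')}/thredds/dodsC/{dataset_path.lstrip('/')}"

# Placeholder server and path, for illustration only.
url = opendap_url("https://thredds.example.ac.uk", "model/rcm_daily.nc")
print(url)  # → https://thredds.example.ac.uk/thredds/dodsC/model/rcm_daily.nc
```

A client such as xarray can then open such a URL lazily, e.g. `xarray.open_dataset(url)`, which is why updates on the server side are picked up on the next read.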
You don't have to install a load of software to get the code working. This code that's working for me, if I was to share it with you or my colleagues at Lancaster, straight away they would have access to the same code and it would run; if it broke for them, it would be broken for me. So that's a bit of a catch-22 in the sense that if something breaks, it's broken for everyone, but at the same time it's a good thing: you don't have to spend ages getting things set up, because you've got the data there, you've got the notebook there to tell you what's going on, and you've also got the environment, so you don't have to set that up. So you can concentrate on the science. So speaking of the science, let me show you how we actually used Data Labs to do the science in this particular case. So as you see here, we've got a notebook environment up and running, and it's a combination of code and text to explain what's going on. And also we have this thing up here, this is the kernel, so this is going to ensure you're working off the same software environment. So to put into a little bit of context what we are actually doing for this particular project: we're using a method called change point detection to see how well our regional climate models, in this case, are capturing the dynamics seen in weather station data. Now normally, like I said, people will look at global statistics across a time series, and I will actually scroll down quickly to show you an example of what I mean. So you have a time series here, and if people were evaluating their model and their data side by side, they would look at the mean or variance across the entire thing and see how well it compares. But what about these local-scale events, say these peaks and troughs, how well are they captured? So this is where we use change point detection, because that looks for these changes.
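To make the idea concrete, here is a minimal single-change-point search in Python; this is not the project's actual implementation (which uses R packages), just a sketch of the principle: scan every split of the series and keep the one that most reduces the total squared error relative to a single global mean.

```python
def detect_changepoint(xs):
    """Return the index that best splits xs into two segments with different
    means, scored by reduction in total squared error (one-changepoint search)."""
    n = len(xs)
    total_mean = sum(xs) / n
    base_cost = sum((x - total_mean) ** 2 for x in xs)
    best_idx, best_gain = None, 0.0
    for k in range(2, n - 1):
        left, right = xs[:k], xs[k:]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((x - ml) ** 2 for x in left)
                + sum((x - mr) ** 2 for x in right))
        gain = base_cost - cost
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx

# A series with an obvious regime shift at index 5.
series = [0.1, 0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1, 5.0]
print(detect_changepoint(series))  # → 5
```

Real change point methods (binary segmentation, PELT and friends) apply this kind of cost comparison recursively and penalise the number of change points, but the core comparison is the same.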
So in the notebook we just load up the packages. So for people who know R, it tells you what packages you'll be using in your environment, so other people can come along and understand what sort of software environment you've used. Then there are some warning messages, or package loading messages, to tell you what's going on with the analysis. And then you come into the section of the code where you read in your observational data. So like I said, this is located locally on the system, so you read it in from the file store, and then you can plot up a map to visualise the data. Now what you can do here is bring in another expert, and they could say, well, actually, maybe you want to look at another station; they may have better knowledge of the Greenland ice sheet in this case. So they can actually bring in their data set and add that to your data set on the map. And with the notebook section of the code, they could actually edit this particular code, so you can see it's a completely live coding environment. They could edit this code here to bring in their data set as well, if they've got something extra or something they want to share with you and say, I'll collaborate on this. And then you can automatically view theirs side by side with yours on this map, so you can see what's going on. Then you extract the model data. This is read in from Lancaster University through a THREDDS server. So it looks like it's loaded into a data frame here, but actually it's being directly read from the university's side. So like I said, if they were to update their data at that side, this would pick up the new data. And then, again, you can visualise the data and you can critique it together to understand what's going on. And finally, you can then plot the two time series side by side, which is what this code is doing.
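The observational read-in step is essentially loading station records from the shared file store and grouping them by site. A tiny stand-in for that step (the file layout and column names here are assumptions, not the project's real schema):

```python
import csv
import io

# Hypothetical excerpt of a station observations file.
raw = """station,date,temp_c
summit,2012-07-01,0.2
summit,2012-07-02,1.1
nuuk,2012-07-01,8.4
"""

# Group temperature readings by station, as you would before plotting a map
# or extracting one site's time series.
by_station = {}
for row in csv.DictReader(io.StringIO(raw)):
    by_station.setdefault(row["station"], []).append(float(row["temp_c"]))

print(by_station["summit"])  # → [0.2, 1.1]
```

In the shared Data Labs setup the point is that every collaborator reads the same file from the same store, so there is no question of whose local copy is current.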
And you can visualise it and see how your model and your observations compare. So by eye, they look pretty good. But finally, I'm not going to go into the detail of the change points now, because I think Michael touched on some change point things in his talk. So I will leave it as, like I said, we detect the change points in the time series and we see how well they compare to each other. So as you can see from this rather big plot here, the dashed lines are the model change points and the red dashed vertical lines are the observed ones. And you can see that in some cases you've got an observed change point the model hasn't picked up; in another case you can see they're slightly offset. So we came up with a new method that used the overlap of confidence intervals to see how well they were captured, and it gave a score out of one. Basically, the closer to one, the better the model evaluated. So that's essentially what this table down here will show in a second. But you can also, like I said, focus on specific, localised events, if you've got an event you're particularly interested in. Say your environmental scientist looked at this time series here with you and says, oh, I want to focus on a specific event. Let's code that up now and see what we can see. You can edit the date range here and you can look at that live. So you could change a particular date range here and focus on a specific year. That works quite well. So you can work together live to edit and, say, focus on specific events. And then finally, this large block of code here is actually producing the analytical method. You can get to the bottom and you can look at the overlap of confidence intervals. Like I say, there's a lot of code here; if I were to get into all the details, it would be too much. Actually, there's a table there somewhere. There should be a table that gives you the scores of each overlap of confidence intervals.
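The talk doesn't spell out the scoring formula, so the following is only one plausible reading of "overlap of confidence intervals giving a score out of one": score a matched pair of change points by the length of the intersection of their confidence intervals over the length of their union, so identical intervals score 1 and disjoint intervals score 0.

```python
def interval_overlap_score(model_ci, obs_ci):
    """Score in [0, 1]: intersection length over union length of two
    confidence intervals (lo, hi). A hypothetical stand-in for the talk's
    overlap metric, not the published method."""
    lo = max(model_ci[0], obs_ci[0])
    hi = min(model_ci[1], obs_ci[1])
    intersection = max(0.0, hi - lo)
    union = max(model_ci[1], obs_ci[1]) - min(model_ci[0], obs_ci[0])
    return intersection / union if union > 0 else 1.0

print(interval_overlap_score((10, 20), (15, 25)))  # ≈ 0.333
print(interval_overlap_score((0, 1), (2, 3)))      # → 0.0 (no overlap)
```

Averaging such scores over all matched change points would give a single model-versus-observations figure of the "closer to one is better" kind described.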
And like I said, the closer to one, the better the evaluation. Now, like I said, if this method is to be used with something else, you can take the code notebook, you can bring in your own data set and you can apply the method to that. Or another scientist comes along and says, actually, have you thought about trying extreme value theory here, to coincide with the change points? What about if we try running that? You could get that expert in: they could add the code to your particular notebook, or they could run their script and bring it into your notebook. And you can run the two methods side by side and then look at how different methods compare at evaluating different localised events. So it's quite handy to bring everything together and explain what's going on at the same time, with the description alongside the code and the data sets. But also, we appreciate that not everyone's a coder. Not everyone wants to look at reams and reams of code. They want to explore the method, understand what's going on, but not exactly see reams of code. So one thing we can do is also use things like an R Shiny app, which is essentially a dashboard; the equivalent for Python is coming to Data Labs but isn't quite there yet. And this is also public facing, so you could put the URL in now and bring this up. It's the same method that I've described in code, except it's behind a nice graphical user interface. And this is great for sharing: like I say, if you've got an environmental expert who isn't a coder but can bring a lot of valuable expertise to the analysis, and you want them to explore using the method and then give you an environmental angle, because all you're looking at is the statistical side of things, they can use this app to go through the method, execute it, and give you some insight on things you might want to focus in on, or things they want to look at themselves.
And, like I said, it's using the same code base because it's all in the same environment. So I'll just quickly run the one site now. Bear with it, it can take a while, but you can see it executing the code I just showed you live in the background. It runs through it and then it prints out some analytics at the end for you. So you can see your map there; you've got locations of different sites. You can run your method, look at your localised change points, look at your evaluation, and then also get a summary of the scores and the evaluation statistics for that particular model versus the observation data. If they want to focus on a different site, rather than have to edit the code to do that, they can just pick a new site from the drop-down menu. So we'll go with that one: it extracts the data again, bringing it in from the local store or the remote store, and analyses it just like that. And you can see it running away in the background, and in a second it will produce the new time series. So this enabled us to share with our environmental expert on the project, and they looked at it and said, oh, actually, 2012 seems to be an important year. How well does the model perform in 2012? So you can zoom out, focus on just the year 2012, zoom in on the peak of the year, and actually focus on a particular event and see how well the model coped. And that was a great way of breaking down the barriers to getting complex data science methods applied to an environmental challenge. And the final thing to say is, if you've got a big data set you want to work with, you can utilise the power of the cloud, like Mike said. We've got, not HPC options, but Dask and Spark, which are native to R and Python, and which let you work with highly parallelised data sets. If you've got millions of points, you can parallelise your method to execute it and speed everything up.
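The Dask and Spark pattern described here is essentially a parallel map over independent chunks of work (stations, years, grid cells). A toy version of that pattern using only the Python standard library (Dask scales the same idea out across a cluster; the per-station analysis below is just a placeholder mean):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_station(series):
    """Stand-in per-station analysis: here, simply the series mean."""
    return sum(series) / len(series)

# Toy per-station series; in practice each would be a long time series.
stations = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [10.0, 20.0],
    "site_c": [5.0, 5.0, 5.0, 5.0],
}

# Map the analysis over stations in parallel; results stay in station order.
with ThreadPoolExecutor() as pool:
    scores = dict(zip(stations, pool.map(evaluate_station, stations.values())))

print(scores)  # → {'site_a': 2.0, 'site_b': 15.0, 'site_c': 5.0}
```

Because each station is independent, the same code structure scales from a thread pool on one machine to a Dask cluster with essentially no change to the analysis function.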
And I think what I'll do now is stop there and then hand over to Michael to go into some more of the actual science and some projects he's been working on, if that's OK. How do I stop sharing? There you go. Great. Thanks so much, Mike. Can everyone see my screen? Yes. Everyone, I'm Michael, from UKCEH. I've worked heavily with the Data Labs platform, and today I'm just going to show you one example use case for Data Labs, which is the UK Environmental Change Network, the ECN. Right, so I will talk about what the ECN is, then I'll go into three example challenges and how Data Labs can help tackle these challenges, and then I'll also present a little bit of research findings at the end of it. So what is the ECN? It is a network of 12 sites across the UK that collect long-term data. It's all about a whole-system approach: integrating data and models of various subsystems to study ecology. So it collects everything from physical and chemical to biological driving and response variables. However, it is quite difficult to take full advantage of this; whole-system approaches often require a certain methodological, cultural and infrastructure shift towards, you know, this kind of collaboration. So I'm going to present some of the progress we have made using Data Labs in the past year on how we can tackle some of these challenges, and I'll summarise them at the very end. So the first example I'm going to show is one of using third-party data. So we're trying to understand what affects the ECN rainfall chemistry. Even though the ECN collects a wide variety of data, it's only doing so at its sites, so we have to bring in lots of other third-party data sets. And that's often difficult for individual researchers just doing it on their own on a local machine; there'll be a lot of copying and sending files around. But with Data Labs, we can bring all the third-party data into the same data store and everybody can analyse the data together.
So we brought in the Lamb weather type classification data and also the ERA5 global weather reanalysis data, to facilitate air mass and air mass trajectory analysis. And all this is done on Data Labs, so that everybody can look at the same data together. Yep, so that's our solution: the data comes from those services into Data Labs directly. And then we also produce R Markdown reports and notebooks, so that everybody can understand what exactly the analysis is and have a better narrative of what's being done. And that also helps keep the discussion going, to understand the data a little bit better. And this is some of the results we got by the end of the project. We found that the cyclonic weather type frequency has a negative correlation with pollutant concentrations, but this effect is not as strong as that for the actual rainfall. And there are other things that we will continue to look at, such as the frequency of sampling, or how we can be more adaptive in the way that we sample for the network. But this is a good example of how to add value to the ECN data, where it may not show very much by itself, but as we bring in more third-party data, we have a better contextual understanding of what the data is telling us, and it presents a more complete story in our study. And one thing we're trying to do in the Data Labs project is also to help lower the barrier for others to adopt some of these methods, sharing methods more readily with others. So one way we've tried is to convert some of our research notebooks into this kind of live document, with app-type code boxes that people can run themselves to generate the same figures that you've seen in the paper. So this is a screenshot of the kind of app that we have. It's a notebook by itself, but we just turned some of the cells into a box where users can come in and edit.
So it's a very nice way to let users customise the output they want to see, maybe test their own alternative hypothesis, or just have an easier way to learn a new method and adopt it in their own research. We have submitted this work for publication recently. Right. So the second example I'm going to talk about is one of methodological development. Mike has already mentioned change point methods. So a change point is a location in a time series where the statistical properties before and after are different. There are statistical methods to detect them, but the ones that are used traditionally have not been very useful for environmental science, because sometimes, especially for manual sampling, we don't sample at very regular intervals, or we try to use data with very different sampling rates. So basically we need a method that suits our environmental science needs. So using Data Labs, we collaborated with statisticians to develop a new mixed sampling rate change point method, which relaxes the assumption that all the time series are at the same frequency. And it was really nice to be able to use Data Labs to do that, as we had this project during the pandemic and we couldn't meet; but using the Data Labs platform, we could have online workshops together, we could actually sit down and look at the data together, we could make live changes to the notebooks on Data Labs, actually working with environmental data, and the statisticians could understand our requirements better. So it's a really powerful way to work together, and it demonstrates the use of Data Labs for the collaborative aspect of it. And finally, the third example I'll just touch on briefly is how to bring generic data science in for environmental use. So, as I said, the ECN has a very wide variety of data. There are different needs, and sometimes it's very hard to do quality assurance. So how can we do better than what we have been doing until now? For example, how can we do better than just a simple range check?
So an idea from data science is to use a clustering method. We developed this idea to detect the state of the system and then do a check around that state. So what you see here: we use some of the variables to classify what state a site is in on a given date, and then we use this state to apply prediction intervals to the data. So here's an example with moth counts. Basically, what that allows us to do is have a different expected value and prediction interval for the different states, and that gives us a more nuanced way to apply this type of quality check to our data. So this is a much more straightforward way than going back and developing a quality assurance method for each of the different types of data that we are collecting at the ECN. So to conclude, I have this quite busy summary slide on some of the challenges I've touched on earlier. Some of the details I didn't have time to go through, but you can see that data science can help with a lot of the challenges that we often find around the science. Some of the challenges are not just scientific ones; they're also about the culture in which we collaborate, and some of them are actually more of a computing, software or infrastructure challenge. But Data Labs can contribute to solving some of these challenges, and it helps us collaborate better. And I think it is a very powerful tool to help us going forward with that. Thank you very much. I'm happy to answer any questions that you've got. Thank you, Michael. Now we're going to go to Tom. Thanks, Mike. Let me share my screen. Great, so my name is Tom August. I work at the UK Centre for Ecology and Hydrology, and I want to give an honest appraisal of what it's been like using Data Labs from, er, an environmental scientist's point of view. I don't have a computer science background, but myself and others in the group that I'm in have been using Data Labs for a year, or a year and a half, maybe longer than that now.
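The state-based quality check described above can be sketched in a few lines. The ECN's actual state classification and interval choice aren't given in the talk, so the grouping and the mean ± 2 standard deviations interval below are illustrative assumptions only:

```python
from statistics import mean, stdev

def state_intervals(records, k=2.0):
    """records: (state, value) pairs. Return per-state (lo, hi) prediction
    intervals as mean ± k standard deviations — an illustrative choice."""
    by_state = {}
    for state, value in records:
        by_state.setdefault(state, []).append(value)
    return {s: (mean(v) - k * stdev(v), mean(v) + k * stdev(v))
            for s, v in by_state.items()}

def flag_outlier(state, value, intervals):
    """True if a new observation falls outside its state's interval."""
    lo, hi = intervals[state]
    return not (lo <= value <= hi)

# Toy moth counts: warm nights have much higher expected counts than cold ones.
history = [("warm", 40), ("warm", 50), ("warm", 45),
           ("cold", 2), ("cold", 4), ("cold", 3)]
iv = state_intervals(history)
print(flag_outlier("cold", 45, iv))  # → True  (45 moths on a cold night is suspect)
print(flag_outlier("warm", 45, iv))  # → False (the same count is normal when warm)
```

The same count is flagged or accepted depending on the state, which is exactly the "more nuanced than a single range check" behaviour the talk describes.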
Time's gone a bit funny with the pandemic, hasn't it? For an extended period of time, we've been using Data Labs, and this is hopefully a bit of an honest appraisal of our experiences. I also noticed, just as I was in the waiting room, that quite a lot of people from the group are actually on the call, so it would be great if later on they can feed in as well with their thoughts; hopefully I haven't misrepresented anything in my slides. So I'll give a little bit of background about our science, but I kind of want most of this to be more generalisable across science domains. So I'll give you a little bit of background on what we're doing, and I'll talk a bit about the areas of our science where I think Data Labs has helped. I'll touch on a few case studies, though I won't go into a lot of detail. And I'm going to close with some pros and cons I see with using Data Labs. So first of all, within the UK Centre for Ecology and Hydrology, myself and the others who have been adopting Data Labs work primarily within the Biological Record Centre. So we work with data collected by citizen scientists. And this data is of species occurrences: so, I saw this butterfly species on this day in this location. And we use this data for analysing how these species and species groups are changing over time, and that's important for understanding the state of our environment. We also help with the collection of that data, developing mobile phone apps and databases where that data lives. And we analyse that data for assessing trends. And then we also develop the tools and things for doing that, for doing the analyses and for disseminating the results. So one of the highlights of the things that we produce are these long-term indicators of biodiversity change.
So you might often hear in the news about a certain group being in decline over the past sort of 40 years; quite often that has either come through us directly, or we've helped indirectly by curating the data that's gone into those sorts of analyses. And these are really quite important for holding policy makers to account in terms of reversing some of the declines that we've seen over recent decades. So that's one type of output. Another type of output is creating tools for citizen scientists to collect these data. I thought about putting icons in for the various tools that we've created, and then I thought I would offend someone by not having their icon up there. So I didn't; I put in this generic infographic about the Biological Record Centre, but there are lots of smartphone apps and citizen science schemes that we either run, or develop ourselves, or have a big role in developing. And you can find all of those if you Google the Biological Record Centre. So in terms of our workflow, I don't think there's anything special about the way we do our work; hopefully it's really generalisable to what a lot of people on this call do day to day in the projects they work on or the projects that they manage. OK, we have data. Our data is often in databases or in big CSV sort of files. We would typically have a local copy of this, perhaps, when we're doing some analysis. Otherwise it might live on a shared folder somewhere on our institutional drives, or we might call it directly from a database; it's quite variable. The outputs that we create, data outputs, data products, might live, again, in all these different places. We have code: we're all programmers, and we version control our code to better or worse degrees, but most of us use GitHub. And in terms of tools, we might share our code on GitHub for other people to use. Maybe, if we have some time, we'll write some tutorials on how to use those.
And we might develop R Shiny apps as well, which we would deploy on a CEH Shiny server. Our analyses of these species trends can be quite large. So we have variously run these on a local machine, if it's a really small species group; on a local cluster at CEH in Wallingford; or on the LOTUS cluster that sits on JASMIN. That's actually where we tend to do most of our work now. So hopefully that kind of rings bells with a lot of people who are on the call. So we moved from doing all that on our local systems to doing it primarily on Data Labs. There's always going to be a little bit of work that we do locally on our machines, but most of the big stuff we're doing on Data Labs. So why did we do that? Well, there's a lot of different reasons. A desire to centralise: getting all our data and our code into one place, because in our group there's a lot of people who work together at different times, and that changes over time, so having consistency of access to code and data is really desirable. To kind of force us to work together more closely: if we're sharing these project spaces on Data Labs, sharing that kind of data storage, it brings us all closer together as a team. It makes things more reproducible, because it's easier for others within the team to get access to the data; it also makes it easier to share and open it up to others, as you've already seen from the previous two talks. It makes it more shareable, as they've discussed, and helps with bringing in external collaborators. However, if I'm being honest, the main precipitator of the change was COVID. We all started working remotely, and being able to have this central online place where we could go and work together really made this happen a lot faster than we'd originally planned. So how does this work in terms of meeting those objectives?
So in terms of centralisation and working closer together, we now have all of our big data sets, both the ones that we use as inputs and our outputs, living on the object store, a very large storage space that's accessible from Data Labs. That's great, because it's a much more formalised archive of our data than we had previously, and it's got very large capacity. Everything that's on it is also accessible to all of our projects within Data Labs, so that makes it very easy to access for people on the team working on different projects. It's now the default storage place for all of our big data sets, both inputs and outputs. We can also make it accessible to others who are in Data Labs if they want to access it. We have projects, so you're seeing one project here on the right-hand side, and this project has a whole bunch of different RStudio notebooks in it. So we work in that environment, and some of these might be shared, some of them might be private to users. We do a little bit in JupyterLab, but not so much, because our users are primarily familiar with RStudio; we don't want to stray too far from our home turf. And that works well. In terms of working with others, I see some real great strengths in things we've done. So what you can see here are the projects that I have access to. Anna is a master's student who came and worked with us, and it was really great to be able to create this project for her. We could set up the environment so it had everything she needed: she had access, the data was there, and when she had problems, members of the team could just drop into that project, have a look at her workbook, and you're effectively working on her computer. You can see the bugs she's having and you can debug. So it's really useful for students, and I'd recommend anyone who works with students to consider using it. We use it with UK partners.
So this notebook here is on the DECIDE project, which is funded through CDE. We have collaborators at Warwick University who work on data visualisations, and they are members of this project. So they've come in, they've got their own notebooks, and they're doing visualisations on the data that we are creating from our notebooks. It's a nice place to work together and share that data, and it works really well for getting other people involved. And international partners: this KFD project was working with Indian collaborators, and it was useful to have this sort of third-party, kind of cloud space where they can meet with us. They can bring in their data, and we can bring in our coding expertise, and we can collaborate in this kind of open way. They wanted to have eyes on everything that was happening with their data, so it was an open way of working and they felt comfortable in that sort of environment. We produce Shiny apps, as I mentioned. You can produce Shiny apps in the projects so they sit alongside the code and the data, and that's great, having everything in one place. On that Indian project I mentioned, this is a screenshot from the app there: it's showing incidences of this disease and a predictive map of where that disease is, and that was accessible to the stakeholders in India to come and look at, because you can expose it on the web directly from Data Labs. It also meant that our Indian collaborators could upload files directly into the app using this button, the ones who were less comfortable with using a programmatic interface, although they had that available to them as well. And this is, Simon Wolff put this together, a really quick demonstration of a prototype tool which could be used as part of the DECIDE project. Again, sitting on top of all of that data, displaying some big raster data here, it's all in that one place, running live.
He was able to really rapidly prototype this to get feedback from stakeholders. I'm also really excited that, using Plumber, you can create APIs there. Maybe that's a niche application, but it's something we're really interested in because we develop smartphone apps for some things. It's not something I've played about with, but it would be cool to see how that could be taken further.

Right, so just the last couple of slides here. The green and pleasant hills: in terms of what data labs can do, it's great for centralising data and big data storage, and for sharing working environments with collaborators within your organisation and within the team; that's certainly where it's given us the most, I would say. But it also opens up collaborations to others across the UK and to international partners. You can have Shiny apps close to your data and close to your code. I haven't touched on this, but it's also a big sell: there's a great opportunity for compute, and the guys earlier talked about being able to create Dask and Spark clusters. To be honest, we haven't really used those very much because we do our big processing on LOTUS, but you can get big chunks of memory from time to time as needed, so it can speed things up a little bit.

I'm going to end on the challenges; maybe I should have ended on the green and pleasant hills, but I'm going to end on the challenges. The first, and I think people in the group that I work in would fully agree, is that there's a big cultural change. Moving to working in the data labs means moving away from what you're comfortable with doing on your desktop, and that can be quite a big hurdle for some people.
I would say, at least for the R user community: on the right-hand side you can see my RStudio console as it appears on my desktop at the top, and my RStudio console as it appears on data labs below, and they're basically the same, so it's actually not too scary a change. But we do need to recognise that it is a bit of a cultural shift, and that's going to take time before people are comfortable making that move.

It can be a bit of a headache getting loads of data on and off, if you've got big data. Obviously, once you've got everything on, you can do all your work there and you might quite rarely need to take stuff off, but it can be a bit of a headache, and I think any way we can make that easier for people who are not tech savvy and who aren't going to do stuff from the Linux command line would be great. The other main drawback for us is that because we use LOTUS a lot for our big processing, it doesn't link seamlessly into data labs, so I can't execute jobs and things directly from data labs, which would be the cherry on the cake. Instead, what I should say is that LOTUS can access the object store. Both these systems access the object store, so I can go to LOTUS and pull my data down from there, do my processing and big parallelisation on LOTUS, then save the data back up to the object store and do any downstream analysis I want on data labs. That system does work, it's just a bit clunky.

I guess overall we've really enjoyed using data labs. For me, I think it's the future of the way that we do our research, and I'd strongly encourage other people to consider it. If anyone wants to have an informal chat about it, then I'm happy to do so whenever needed. Thanks, Mike.

Thank you very much, Tom. Thanks to all the rest of the speakers as well. Please, if you have any open questions, post them in the chat. I'll pick a few to go through to start off with, but I just wanted to make a comment before we went into the questions.
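As an aside, the pull-process-push round trip Tom describes (LOTUS pulls inputs from the object store, writes results back, and the downstream analysis happens in data labs) can be sketched in Python. Assuming the object store exposes an S3-style HTTP interface, the helpers below only build object URLs and an input/output key convention; real transfers would go through a client such as s3cmd or boto3, and every endpoint and bucket name here is a hypothetical placeholder.

```python
def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Build the HTTP URL for an object in an S3-style store.

    The endpoint and bucket here are placeholders, not real
    data labs values.
    """
    return f"{endpoint.rstrip('/')}/{bucket.strip('/')}/{key.lstrip('/')}"


def round_trip_keys(dataset: str) -> tuple[str, str]:
    """Key convention for the pull-process-push pattern: raw data is
    read from 'inputs/', processed results go back under 'outputs/'."""
    return f"inputs/{dataset}", f"outputs/{dataset}"


if __name__ == "__main__":
    in_key, out_key = round_trip_keys("era5_2020.nc")
    # LOTUS would GET the first URL, process the file, then PUT results
    # to the second; data labs reads the second for downstream analysis.
    print(object_url("https://objectstore.example", "project-data", in_key))
    print(object_url("https://objectstore.example", "project-data", out_key))
```

Keeping both systems pointed at the same bucket is what makes the clunky-but-working handoff between LOTUS and data labs possible.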
Primarily, the three presenters have been talking about data labs on Jasmine. What we've done as the development team is decouple data labs from the underlying infrastructure, so actually you could deploy data labs on any public cloud, Amazon or Azure, et cetera, or you could deploy it into your own infrastructure, and it's all open source. I just wanted to make that statement.

Right, now I've got a lot of questions here. I'm going to pick up one from Andy T, who asks: does reproducibility metadata, the versions of all the packages you use, the computational requirements, get captured neatly in some of the metadata? I presume Andy T is talking about the packages. Who wants to pick that one up, if anybody? Mike, Michael?

I can take that. If you want to know which packages are used in a given environment, and which versions, we use conda environments, and with a conda environment you get an environment file that you can spit out. It's just a text file that tells you the exact version of everything that was used, so with it you can instantiate that same environment elsewhere when you set up your own code. If you want to set that common environment up on a different platform, or locally, you can do that. It's not strictly recorded in metadata, but it's certainly recorded in that file. On the other side, conda does deal with R as well, but if you're a native R user like Tom, there's something called packrat, or renv, which has an equivalent setup where it will record the particular versions of the packages you use. That's a little bit more tricky, because it requires the user to run the analysis, look at the requirements used, and then store them somewhere, either in a text file or exported at the end of the analysis. I don't know if Tom or Michael want to expand on that; that's certainly my experience of things.

Nothing to add to that. We'll move on to a question from Carl Watson at BGS.
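The environment file Michael describes comes from conda itself (via `conda env export`), but the underlying idea, snapshotting exact package versions so an analysis can be re-run elsewhere, can be illustrated with nothing but the Python standard library. This is an illustrative sketch, not the data labs mechanism:

```python
from importlib import metadata


def snapshot_environment() -> list[str]:
    """Return sorted 'name==version' pins for every installed package:
    a minimal stand-in for a conda environment file or renv lockfile."""
    pins = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip malformed distributions that report no name
            pins[name] = dist.version
    return sorted(f"{name}=={version}" for name, version in pins.items())


if __name__ == "__main__":
    # Writing this to a text file alongside the analysis is the simplest
    # way to record exactly which versions produced a given result.
    print("\n".join(snapshot_environment()))
```

Storing such a snapshot with the analysis outputs is what lets someone instantiate the same environment on another platform later.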
If I had a project proposal idea that included the use of data labs, what is the process for investigating the feasibility and associated costs, and perhaps securing a supporting statement from data labs for inclusion in the case for support? Does anybody want to pick that up, or do I have to? I'll take silence as... oh, Michael, you can pick it up.

I would say that the answer is: contact us. It's related to another question that came in about how do I access it, how do I use it. Data labs is free for NERC research; the cost of running data labs for NERC is basically zero. That includes NERC-led projects where you've got partners: it is still zero in terms of cost on the Jasmine infrastructure, because remember, NERC pays for Jasmine and we get it for free. We've been involved with many, many project proposals that have gone in that include the use of data labs, so we have statements ready if you want to include one in the case for support; I think the answer to that is yes. And to add on the feasibility question: feel free to schedule a chat with us and we can talk through what may work for you.

Yes. I've got so many questions; it'll take me a while to go through them. Another question from Andy T, sorry, another technical question: are project accounts linked to the use of resources, and can you set limits on how much computational resource to use for a processing task? I think I'm going to have to pick that one up again, to be honest; I see smiles from the presenters. Yes and no is the answer. Data labs sits within Jasmine, where we get what's called a tenancy, and within that tenancy we have a set of resources that everybody in the tenancy can make use of. Because of the issue of people taking all the resources, we have implemented some software that lets users monitor how much of the resources they are using and then release them.
That's, to be honest, all that we can do at the moment, because the resource constraints are set by Jasmine, not by us within the tenancy. I'm not sure whether that answers the question, but that's where we are at the moment. Of course, if you move to a public cloud like Amazon, then those resources can be limited, depending on who the admin of the tenancy is, but we can't do that currently within the Jasmine infrastructure.

There are lots of questions coming in now, so I'm going down to the bottom. Another question from Andy T; that's a lot of questions from Andy T. Do you find the computational resource is sufficient for your current use cases? Are there things that you would like to be doing but currently can't because of data or processing constraints? What is more limiting, data storage or data processing power? Does anybody want to pick that up? I mean, we touched on that in the last question. Tom?

Yeah, I'd just say that I don't feel like we are limited by any of these resources. The object storage provides us plenty of storage space for our needs, and having LOTUS nearby, if not quite as close as I'd like, gives us all the processing power that we need for our sorts of studies. But then, what's big to one person isn't big to everyone; it's certainly big enough for what we need to do.

Yeah, I could add to that. Tom pointed out in his presentation that one of the sticking points is direct access to LOTUS from data labs; we're working with the Jasmine team to break that barrier down, and it's more about security at this stage. Also, in terms of storage, the current object store that's available to data labs is 15 petabytes, so it's quite a vast resource, to be honest. But if you just want block storage, we are more limited, because that falls under the tenancy storage.

Right, moving on through the questions. This one is quite interesting.
Lots of case studies and examples in UKCEH, but are there plans to make this more widely available to other research groups, or is it mainly a CEH facility? Again, I should pick this up. This is a NERC-funded project, so it's open to all NERC-funded researchers and their collaborators. The reason it's very CEH-centric at the moment is that it's easy to pick on people within CEH, because I work with CEH; that's the only reason it's not wider. But there are currently approximately 40 projects on data labs, and they're not all CEH, so there's a cross-section across NERC and lots of collaboration with the HEIs. I'm just checking the time; I've still got plenty of time.

Now, there is a question in the chat box. As I understand it: are the data shown in the presentations available for anybody to access, in order to inform their own secondary data research? Who wants to pick that up? So, all the stuff that you've been showing today, is it available openly?

I think so, certainly the stuff I presented. I was actually using ERA5 data alongside my numerical model data, and that is available publicly anyway from the ECMWF, and you can use API access to get it into data labs; that's one option. Obviously, sometimes you may be working with restricted datasets, for security reasons, or because they contain health and personal data, for example, so some of the data might not be able to be made openly available. But some of it might be able to be made available in a condensed form. And there is a thing within CEH, sorry, within data labs, where you can mount certain assets. So you can mount a shareable version of the data, so to speak, to allow others to use it, if you think it's a critical dataset that might give some information in its condensed form.
But also, like we said, if you've got API access to data and it's freely available elsewhere, there's no reason why you can't make a version available to other projects within data labs, if you have an asset tool or something like that. It all basically comes down to the licensing of your data, what you can and can't do with it. So I hope that gives some insight; I don't know if Michael or Tom has anything further to add.

Yeah, I guess just to add that, for us, there is data that we want to share within a project, and that sits within the project and isn't accessible to anyone outside it. There's data we want to share across projects, which is in the object store and available across projects with access. And then making datasets externally accessible from the data lab, well, we haven't wanted to do that; I guess we would just publish those datasets. But I know that there's a lot of flexibility within data labs for all sorts of permissions, basically whatever you need.

Thanks, Tom. Mike, another question about data issues. One, can I upload, excuse me, my own research data and share it with others? And two, can I access wider EDS data for my research through the data store? Anybody want to pick that one up?

Yes, of course you can upload any data you would like to your data store, and basically you decide who to share it with within the project. I guess the idea of data access is to build in transparency, and maybe ultimately you want to share research outputs with others, but while you are working on it, you control entirely who you would like to share the data with, and you can upload your own data to your data store. There are two main ways to upload your data. One is through a browser tool that interacts with the data labs data store, but as that's slow, it may not work for very large data. There is also a command-line tool where you can move very large files, so for example, if you want to move files from HPC to the data store, you can do that, and it is very fast.
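The split Michael describes, a browser upload for small files and a command-line route for very large ones, usually comes down to streaming: big files are moved in fixed-size chunks rather than loaded into memory whole. Below is a minimal standard-library sketch of that pattern; it is not the actual data labs tool, and the chunk size is an arbitrary choice.

```python
import hashlib


def iter_chunks(path, chunk_size=8 * 1024 * 1024):
    """Yield a file in fixed-size chunks so it can be streamed to an
    object store without ever holding the whole file in memory."""
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            yield chunk


def file_checksum(path):
    """MD5 of the file contents, handy for verifying a large transfer
    end to end after upload."""
    digest = hashlib.md5()
    for chunk in iter_chunks(path):
        digest.update(chunk)
    return digest.hexdigest()
```

Each chunk from `iter_chunks` would be handed to whatever transfer client is in use; comparing `file_checksum` on both sides of the move catches truncated uploads.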
So, just a quick answer. Thanks, Michael. The second part of the question is: can I access wider environmental data services data from the data centres? The answer is yes. It's probably easier to do so if there's an API to pull it in; like Michael said, if there's not an API you can't pull it in directly, and you have to go and find the data and then upload it yourself. But it is available.

Right, I think we're moving towards the end of the webinar, so I'm going to take a final question. I'm just checking; there is another one I can finish on. Some great examples and case studies have been presented. Is there a webpage or something somewhere that gathers all these together, perhaps with some hints and tips on planning to use data labs? That's actually a really good question, and it's something we've kind of half done. Charlotte has posted the data labs link into the chat; once you have registered for data labs, there are lots and lots of resources and documentation on how to get started. We spent an awful lot of time on doing that. And here's a promotion that Kate will be happy with: we are very soon due to launch an environmental data service brochure site, and on that we will be advertising things like data labs. So there will be explanations there about how to use it, which will then point you to the data labs website itself, where, as I said, there's lots of documentation and resources, and things like Slack channels for you to ask questions amongst the current users.

That actually neatly takes us to 11:59, so we're nearly at the end of the webinar. I don't have any further questions, so two things: I'd like to thank you all for coming, but I'd really like to thank the presenters for giving some fantastic presentations. And Steve will shoot me if I don't yet again give a big plug to the next series of Constructing a Digital Environment.
Series five is on data digitisation and repurposing, and Steve tells me that there are five fantastic speakers taking part, so please subscribe, and also subscribe to the YouTube channel. Thank you all very much for attending. Cheers.