Welcome to everyone, and many thanks indeed for joining us. I'm Professor Steve Hallett of Cranfield University, one of NERC's Digital Environment Champions, and I'm hosting this webinar, concluding our Data for Decision Making series, supported by NERC and the Constructing a Digital Environment programme, which is all about aiming to develop a digitally enabled environment to benefit researchers, policy makers, businesses, communities, and individuals. It's all about looking at advances in digital technology and how they have given rise to the volumes of data being created and curated, and alongside that, new technologies enabling step changes in analysing and understanding that data, looking at emergent patterns in it: a step change in the global capacity for integrated monitoring, analysis, modelling, and of course visualisation. We're running a series of webinars, and I would invite the audience to subscribe to our YouTube channel; if you haven't yet done that, you can see the rest of the talks that we've been running through this series. The webinar today is entitled "Virtual Labs and Digital Environments: Can Virtual Lab Technology Support a Paradigm Shift Towards a More Open, Collaborative and Integrative Environmental Science?" It's presented by the eminent scientist Professor Gordon Blair, Distinguished Professor of Distributed Systems in the School of Computing and Communications at Lancaster University, and Dr. Michael Hollaway, who's a Senior Data Scientist at UKCEH. Thank you both very much for giving this talk. Gordon is co-director of the Centre of Excellence in Environmental Data Science, or CEEDS, a joint initiative between Lancaster and UKCEH. Mike is going to give us a live demonstration of a virtual lab in action, DataLabs. If you'd like to continue following our webinars, for your information and your diaries, our next event is the 11th of June.
There we'll commence a new series of webinars focusing on sensing the environment, and I'll mention more about that at the end. During the webinar, please feel free to post your questions into the Zoom chat; I'll collate those after the talk and we'll have plenty of time for a good discussion afterwards. So at this point I'll hand over. Thank you very much, Gordon and Michael, over to you.

Thank you for that introduction, Steve. I'm just going to get my slides shared. Thank you for that, Charlotte. We've got a few technical problems, so we're doing this remotely, so you might hear me giving instructions to Charlotte at various points. Thank you very much to NERC and the CDE programme for inviting us both to present this topic. We're very excited about this; it's always great to talk about virtual labs. What we're going to try to convince you of today is that these are the real deal, that they can make a big difference to you and what you want to achieve, which is often a better, bigger, more integrative and hopefully open and collaborative environmental science. That's the topic we're going to discuss today, and as we heard in the intro, we're going to do a live demo in the middle, so fingers crossed that that works. I'm going to call in my colleague, Michael, halfway through to lead that live demo. So I'll move on to the next slide. Thank you, Charlotte.

First of all, what are virtual labs? This is our own working definition that we use in a project we have at Lancaster, Data Science of the Natural Environment. You can read that yourself; I'm going to pick out a few phrases to unpick this in a little more detail. I think the most crucial set of phrases here is that a virtual lab is a transdisciplinary collaboration space. The reason I like that definition is that it says nothing about technology. It talks about what you get.
What you get from using virtual labs is a place to work together, where you can work together across disciplines, and with other stakeholders, doing your science in a single integrated place. And it's that which is really important and hopefully transformative. Going into more detail — go back a slide, Charlotte, not finished yet — these take advantage of the cloud, the set of innovations in cloud computing, and I'm going to talk a little bit more about that shortly. They provide a way of accessing a whole range of environmental assets, whether it's your data or your favourite methods or assessment tools or visualisation tools. And what the cloud offers is a very large capacity to do things at a scale that you previously might not have imagined. That capacity is on demand, so you get the capacity that you need at that point in time; this is known technically in my field as elastic computing. Next slide.

So why virtual labs? Well, I've got four key reasons why I think virtual labs are potentially transformative. The first is all to do with openness and transparency of science. You're probably all aware of the need for, and the move towards, being more open and accountable about the science we carry out. There's a really good report from the Royal Society, which I often quote and which I show on the right-hand side of this slide, Science as an Open Enterprise. I think it's increasingly important, as we do science that has important repercussions for society, that we're fully open and transparent about what we're doing, so people can see the data you're using and the methods you're using, and perhaps the work is even repeatable. So openness and transparency is one of the starting points of why I think virtual lab technology is potentially very important. Moving on.
Another of my favourite readings of recent years is the other book that I show on the right-hand side, The Fourth Paradigm. I'm sure many of you in the audience are environmental scientists and are becoming aware of a real shift in the ground under your feet as more and more data becomes available to support your science. We're moving towards the fourth paradigm of data-intensive science, whether that data is coming from satellites, from novel sensors in the ground or in whatever you're trying to measure, from citizen science, or from mining data from the web. I'm sure you've all used combinations of these data forms recently in your science, but actually the risk is that we drown in that data. There's just too much data out there, and what we need is help and new tools to manage that data and to move us towards a position where we can really exploit data-intensive science. Next point.

You've probably noted my surname is Blair, and for those of you old enough to remember a prior prime minister, you'll know that we Blairs communicate in triples. So the next point is collaboration, collaboration, collaboration. What do I mean by that? Well, I already said at the start that collaboration is right up front in my definition; it's really important in my mindset. It means collaborating within the science, with other science disciplines: biologists, biogeochemists, physicists all working together. But then the second collaboration is collaborating beyond that: we're increasingly having to collaborate with data scientists, and, who knows, if you're really pushed, perhaps even with computer scientists. So the need to collaborate is becoming more and more important; hence we need platforms that support collaboration as an intrinsic feature of what they offer. Final bullet. And of course, another triplet: integration, integration, integration.
Most of you sitting out there are probably faced with the need to pull data together. Very easy to say, but how hard is that to do? How much of a project's time is taken up organising licences for different data sources and actually getting them in one place so that you can then start to do your science? So the first big challenge is integrating data in a single place, so that all the data you need is available for your subsequent analyses. The second level of integration, which I think is even more difficult, is integration of models. Where do your models run? Perhaps they run on someone's laptop in some department in the bowels of your university, and he or she is the only person who can run that model. Perhaps you're lucky and it's more readily available, but you're in a very lucky position if you have all the models you need in one place so that you can start to combine them in different ways. So: integration of data, integration of models. And what that starts to lead up to is what I think is most important, integrative science, which is what we're all being asked to do these days. We're increasingly being asked to do big science, answering big science questions, and that requires us to pull together maybe data on air quality and data on human health outcomes: integrative science. And again, we need tools to support that significant paradigm shift we're witnessing. So that's why I think that virtual labs are potentially important going forward. Next slide.

I'm just going to pause here and reflect on what data centres offer. We all use environmental data centres and there's a marvellous legacy of environmental data centres out there, including a whole series run by NERC. I'm not saying we're going to replace that legacy; in fact, I think that legacy is fantastic. There's real community support there in terms of wonderful assets, the existing environmental data centres. They are there now and they will be there in the future.
So in terms of virtual labs, what we're trying to do is build on that legacy, building something within and on top of data centre infrastructure, but going significantly further. I'll give you a feel for what going significantly further means as we unpack this, and you'll also see it in the demo. I just want to give an advert at this point, because I'm part of a NERC Strategic Needs Advisory Group, known as SNAG — wonderful name — which is looking at the future of environmental data services. This is ongoing at the moment, and we're wrestling with big questions: what is the future of environmental data? How do we manage this increasing diversity of data? How do we bring in data science methods? Is there a role for virtual labs? If that speaks to you and you're interested in the topic, the working group is looking for input, and in particular there's a survey open at the moment. Please, please, please consider responding to that survey — Charlotte's just posted a link to it in the chat. That would be really useful for our ongoing work. So data centres are there. They're not standing still; they're evolving, and they're important going forward. What we're talking about with virtual labs is something that we believe is complementary and adds value to the wonderful work in data centres. Next slide.

Now, the reason I get excited about this — I'm a computer scientist, so this is my world — is because it builds on real innovation. It takes advantage of some of the wonderful world of innovation that exists around cloud computing and big data. I'm often really sad when I see the innovation in this area going towards the big questions of our day, like what should I be buying for my friend at Christmas. Isn't it so much better if this innovation is used to help us understand the natural environment and answer very profound questions around climate change and flooding and biodiversity loss?
Is that not a better use of this technology? And this technology is really powerful. I've used a wonderful quote here from Isaac Newton, which is all about standing on the shoulders of giants. What environmental science can do here is stand on the shoulders of a massive amount of innovation that's out there, building directly on what's there already, ready for reuse. It doesn't have to be invented; it's there, in terms of supporting better science. That includes massive infrastructure. In the cloud computing world, scaling is a solved problem: we can scale up to massive data sets and we can scale up to massive computations. Beyond that, it's a case of dreaming big in terms of environmental science, because cloud computing can support the big. So dream bigger. There's another philosophy I think is very important, which is that cloud computing supports what's known as everything as a service. Everything that exists in the cloud, whether it's a visualisation service or something to run your favourite data science method, is organised as a service, and it's remarkably easy to compose these together. You don't have to redo that work; the packages are there, and the big trend in this world is to compose them, which means you can knock together rapid prototypes. I get told off by my postdocs and PhD students when I say you can knock these together in an afternoon — that's an exaggeration — but you can knock them together very, very rapidly and build things that are really meaningful in a short space of time, building on the services that are already there. On portability and interoperability, massive strides have been taken, for example with container technology in the cloud. If you know what that is, great; if not, I'm happy to discuss it later. Containers allow you to move sophisticated computations around and execute them anywhere in the cloud.
And then the other big advantage of the cloud, for me, is that someone else deals with all the messy stuff: the management, the security, the backups, all the stuff that you don't want to bother about just happens behind the scenes. So there are many reasons why cloud computing is a powerful vehicle to build on in terms of virtual labs. Some of you who are older might be thinking, has this all not been done before, at the time of eScience? I think eScience was a fantastic programme for its time, but cloud computing hadn't happened at that point. Rolling forward to today, all these innovations are there, and it's time to do eScience properly. Next slide.

So, moving rapidly towards the live demo with my fingers firmly crossed for Mike, I just want to talk a little bit about some real-life experiences we have here, because I've been talking about virtual labs, the concept. What I want to talk about now is DataLabs, the specific project. DataLabs is an instantiation of the ideas of virtual labs — perhaps a minimal instantiation, as you'll see when we get on to some of the research challenges that I still feel need to be addressed — but it is an instantiation, and it's real and it exists and it can be used. Steve Hallett tells me it's been fun having a play around with the DataLabs infrastructure in advance of this talk. We'll give information later about what to do if you're interested in DataLabs. This is implemented on JASMIN, so it builds on what I was saying earlier: the investments are already there in the NERC community. It's tailorable, so you can have not one virtual lab instance, but something that's tailored to what you need, whether that's in terms of the data science methods, the visualisation services, whether you need distributed compute, or whatever you want to do.
One thing that we're very keen on is what we call methodologically enhanced virtual labs: putting data science at the centre of virtual labs and making data science methods as accessible as the data themselves. So within the virtual lab you have your data all in one place, but you also have the methods readily available and ready to execute, so you can do your data analysis in that one place. That's something that's very important for us. The other thing that's very important is end-to-end analysis, from the data entering the lab, through the analyses, the visualisations and the presentations, and eventually to the outcome — the scientific outcome or the policy decision or whatever it might be. We're trying to support that end-to-end pathway. The diagram on the right shows you some of the components. I like this picture — I think this was down to Mike — because it shows it from a user perspective: the user coming in, discovering data, discovering methods, applying the analytical tools, executing that in the cloud infrastructure, perhaps using parallel computation via services like Dask, for example, which allow you to distribute work over a cluster of machines. And then eventually, when you've concluded, you have something that's worthwhile, something that's actually publishable. One thing we're particularly interested in is making the science publishable in a form that can actually execute in the cloud. We use notebooks, with some added value around notebook technology, to allow results to be publishable. Perhaps one day we can take that to our favoured journal and say, here you are, here's our paper, and by the way, this will execute — something we really, really want to do in the work we're doing in CEEDS. Next slide.

So what Mike's going to show you is DataLabs in action, and in particular one project we've been working on, which is funded under the Constructing a Digital Environment programme.
So thank you, NERC, thank you, Steve. This is funded as a feasibility study, and it's a project called — I'll give you the short version of the title because the long version would make us run out of time — Change Points for a Changing Planet. What we're working on here is having our data, which is the data from the UK Environmental Change Network, all ingested into the lab, along with our methods, our data science methods, and then using notebook technology as a means of "publishing", in inverted commas, the results. We've developed some experimental methods here, the main one being a multivariate change point detection algorithm. That's quite novel because, as the statisticians will know, most change point detection algorithms work on a single time series; this works across time series, and allows things to drift a little in time, so there doesn't have to be complete alignment — as you get in the real world, you don't get perfect behaviour. It will try to discover points of interest where there may be a change in signal that indicates something's going on. We've also used more traditional clustering and machine learning techniques within the lab. And it's basically transdisciplinary collaboration in action: the team that's worked on this involves computer scientists, data scientists, and environmental scientists of different flavours, and we've collaborated and done everything through the lab. It's our prime mechanism for delivery, and that's been quite profound. Over to Mike — and it's all going to work.

Thanks, Gordon. Fingers crossed this all works. I will just attempt to share my screen. Hopefully everyone can see that — someone give me a thumbs up to let me know that it's all clear. Perfect. Well, yeah, thank you very much for the opportunity to present DataLabs to you today. So, like Gordon said, I'm going to give you a live demo of the system.
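To give a flavour of what change point detection is doing under the hood, here is a toy least-squares sketch in Python. This is not the project's actual algorithm (which handles multiple change points and drift between series); it just finds the single split that best separates a multivariate series into two homogeneous segments:

```python
import random
import statistics

def sse(rows):
    """Within-segment sum of squared deviations, summed over all variables."""
    return sum(
        sum((x - statistics.fmean(col)) ** 2 for x in col)
        for col in zip(*rows)
    )

def single_changepoint(rows, min_size=5):
    """Return the split index minimising total within-segment error."""
    best_t, best_cost = None, float("inf")
    for t in range(min_size, len(rows) - min_size):
        cost = sse(rows[:t]) + sse(rows[t:])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Two synthetic variables whose mean shifts together at index 50
random.seed(42)
series = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
series += [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(50)]
print(single_changepoint(series))  # close to the true change at index 50
```

Production methods (binary segmentation, PELT, and multivariate variants of the kind developed in the project) extend this basic cost-minimisation idea to many change points and penalised model complexity.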
But I think one of the first things Steve asked me to show you is how you access it. First and foremost, one of the things that breaks down the barriers to collaboration is that to use DataLabs you don't need a high-powered computer; you don't need anything special. All you need is a web browser. If you go to this URL here, datalab.datalabs.ceh.ac.uk, you will be presented with this landing screen, and all you need to do is log in. Or, if you haven't got an account for DataLabs yet, you can sign up to the system as well: if you hit the sign-up button, you can just put in an email address, set your own password, you'll be asked to validate your email address, and then you can log into the system. So I will go ahead and log in now to show you that it is actually a live system — and this is where things can go wrong, but hopefully not. I just enter the password, log in, and come to this landing page. Now, if you're a new user, this page will be blank, because by default it's filtered by projects that you have access to. But if you untick this button here, you can see the whole range of projects that are available in DataLabs. Like Gordon said, it's bringing lots of different people together, so you can have lots of different projects focused on different things. For example, you've got the British Antarctic Survey using it for artificial intelligence; there's an air quality project. If you want to focus on change points, you can filter the projects to see what's going on — and this is actually the project that Gordon mentioned previously. One example of this is that you can bring different collaborators into a project: if you search for a project and find you like it, this info bar here shows who runs it, so you can ask to be added to it, ask to view.
And that's also a level of security, because if you've got some protected data that you can't release publicly, you can use this project system to expose the data to different people at different levels. So I will actually open up this project now. When you come into the main project page, you're presented with a nice little bit of information here: the name of the project and a little bit about what it's about. So this is the full name that Gordon didn't want to present earlier: the SPF Digital Environment for Change Points. There can be a collaboration link — this one takes you to a wiki site where you can read all about the project and some of the things that have been going on, more on the admin side of things. It also tells you the different users, who's available in the project, who runs it and so on: the people to speak to if you want access to particular parts. So I'm going to briefly go over the various components of DataLabs and discuss, in relation to what Gordon said, how they enable access to things and how they enable collaboration and sharing — the triplet of collaboration, so to speak. First and foremost, we're talking about data here, so you have storage capacity in the labs, of two types. You have the small-scale storage for local data, which is also where all the analytics sit, which I'll discuss in a second. And then there are data stores: you can create as many of these as you want within the lab infrastructure, and this is where you store mid-level data, up to a few tens of gigabytes. But the important thing is that anyone who has access to that data store within a project has access to the same data.
So everyone is working from a consistent environment, and any change to the data made by one person will be picked up by the other users. Now, I don't have admin rights to this particular project, but if I did, there would be a little box here where I could control who has access to the data and what data stores there are, and I'd also be able to create a new data store if I wanted a different version of the dataset. So that is the storage aspect of the labs. The other side of things, where the interesting stuff happens, is the analytics aspect. I'm sure a lot of people may have come across or heard of Jupyter notebooks before, or RStudio; the labs serve both. These are environments, which I'll share in a second, where you can have a mixture of code and narrative. So you can bring people together, explain what's going on, and share methods and analytics of data. And, as Gordon said earlier, if you've got a process where it's taking days to bring lots of different datasets into your lab, what you can do is create a notebook that shows how that procedure works. Then, when your collaborators come along, they don't have to go through the whole setup process themselves — it's already done for you, as long as you've got access to that particular data store. So I will open up a notebook now. As I said, this is the methodologically enhanced aspect of the lab: this notebook is available to anyone who has access to this particular data lab. As you can see here, it's a mixture of narrative and code cells. This particular one uses the R language, but the labs work with Python as well — in fact this particular notebook calls a Python package.
So if you've got expertise in Python and expertise in R, you can have those collaborators working together, at one level, in a seamless environment. As you can see, there are lots of different code cells here doing various things, but there's also some nice narrative that explains what's going on, what the method does, and a demo of how to use it. This allows you to share with your collaborators. The hardcore coders, like Gordon and myself, will delve down into the code and want to know what's going on and how it works; some people might just want to know what's going on without actually seeing the code, or just hit run and get some results at the other end. You can do all that with the same consistent code base. More importantly, the way the labs are set up, these all use package-managed environments: if one person has set it up and it works, it will work for everyone in the lab. So gone are the days where someone has to come in and spend hours installing all the dependencies and making sure the data's there. If you pick up this notebook within the lab, and it works for me, you'll be able to run it yourself straight away. Again, you're working from the same environment, the same coherence. So, like I said, this is a demo notebook, and as you can see, the execution has already been done. We've got some time series here, and the data has been brought in. Like Gordon said, this is the formatting of the data into data frames for the analytics, preparing them for the change point method that Gordon discussed. You keep going down, and here is a description of how you actually use the change point package itself — a step-by-step guide to running it. But again, if you don't want to touch the code, you can just hit the run button and get some results at the end.
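The "if it works for me, it works for you" guarantee rests on those pinned, package-managed environments. As a toy illustration of the underlying idea (the package names and versions here are invented, not DataLabs' actual configuration), an environment check simply compares what's installed against the shared pins:

```python
def check_environment(lock, installed):
    """Return packages whose installed version differs from the pinned one."""
    return {
        name: {"pinned": want, "installed": installed.get(name)}
        for name, want in lock.items()
        if installed.get(name) != want
    }

# Hypothetical pins shared by every user of a lab
lock = {"pandas": "1.2.0", "changepoint-tools": "0.3.1"}
installed = {"pandas": "1.2.0", "changepoint-tools": "0.2.0"}

print(check_environment(lock, installed))
# flags changepoint-tools as out of step with the shared environment
```

In practice this bookkeeping is done by the package manager (conda, renv, pip and friends) rather than by hand; the point is that the lab, not each individual, owns the pins.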
And then finally, when you've run it — so this is the explanation of all the different arguments to the package — you get some results at the end. I'll just go right through. These are the tabulated results: there are summaries of the locations of change points, et cetera, et cetera. And then finally, at the end, you can plot your multiple time series side by side and compare the change points. And the idea of this — well, we actually ran a very successful workshop doing this — is that, for example, Gordon came in and said, what happens if I bring my own dataset into this, or if I change something? It's quite quick: because this is live code, you can just go up to the top — I'm not going to do it now because someone may be working off it — change a parameter in this code cell, rerun it, and look at the results live. And then someone else might come along, and Steve might say, "but Mike, have you thought about this particular method?" and I go, oh, let's try that; I can add the code and compare the two methods side by side. And if it's a good method, we can publish it to a notebook within the lab and make it available to all the other users. So that is the methodologically enhanced aspect, and again the collaborative aspect: you can bring statisticians and environmental scientists together; a statistician can run a method and say, well, does this mean anything? And the environmental scientist goes, oh, actually, yeah, have you tried this? You can do that iteratively, and, like I said, we had a very successful workshop where we went through a particular dataset and actually got some interesting science done from it. So that's the methodological bit of the labs. Now, the final bit that Gordon also mentioned was high-powered compute.
We have that available, and like I said, it's elastic — not fully elastic like the commercial cloud, but we have Dask available, and that takes away the complexity of making an analysis parallel. You spin up a cluster through a graphical interface, you say how much compute power you want, and then the Dask library does the rest for you, with minimal code changes. You fire it off, you do your big data computation, you come back and you get your result. That works with Python, and we also have a similar facility through Spark for R. One other thing: if you're sharing with people who aren't coders at all — policy makers, say, or scientists who don't use code — and you want to share your results, you want a way to do that. Gordon mentioned before the different levels of abstraction. One of these is that we use R Shiny quite a lot, and there's a similar facility for Python; I've actually changed to a different project to show you this. We use R Shiny to give access to methods without using code. So, for example, this is using change points, but it's not the same project — this is actually from a paper I had published earlier this year. This is an R Shiny app that allows you to compare numerical model time series with ice station data over the continent of Greenland. And as you can see, it's loading something at the moment — loading a map. So if you want to compare these two models, you can — this is live code, and it's actually running one of the Jupyter notebooks that I showed you previously, so it's a consistent environment again. I can bring in the data — it does it all for me, brings it in like that. That's the site currently highlighted in blue. This is just a setting to determine the number of samples used to generate confidence levels; I'll set it low just for speed. You can see the app is whirring away, telling you that it's calculating something.
And then, when it's done, it shows you a nice little time series down here. You've got your two time series, red and blue; the one with the gaps is the observations, obviously with missing data. These dashed vertical lines are an estimate of the change point locations, and down here is an evaluation of how well the model is capturing the timing of change points in the observations. Like I said, this one is univariate, so it's a different type of thing. But this method, again, you can access through the code, or if you want to run and share the method with other people and critique results, you can access it through the notebook side of things as well. At the bottom here is just a summary of some results. And if you want to look at the code or download the results, you can access them in different notebook formats or download a spreadsheet. And finally, if you say, oh, what about this site up here? Well, you just pick it from the list, re-extract the data from the file store, rerun the method, and off it goes. So you can have a quick exploration. And like I said, for, say, a domain expert who hasn't got coding knowledge, or for a policy maker, if it's more of a decision-based tool, you can just share the graphical part with them and let them explore. Then, if they have questions, they can come back to you through the labs and you can show them the code if they want. So that is the full span of DataLabs. I think — Gordon, is there anything I've missed that you wanted me to go over, or is that a reasonable quick overview? I think that's about it of the system, but the idea is that it serves different levels: you don't have to be an expert coder to use the system; you can see what's going on. You don't have to be a statistician or a computer scientist or a domain expert. Everyone can come in, and everyone can share the methods.
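The parallel pattern Mike describes — spin up a cluster, hand it chunks of work, gather the results — can be sketched in plain Python, since Dask's distributed scheduler deliberately mirrors the standard concurrent.futures interface. This local sketch uses a thread pool stand-in; the chunk size and the toy per-chunk statistic are illustrative, not from the project:

```python
from concurrent.futures import ThreadPoolExecutor

def segment_mean(chunk):
    """Toy per-chunk statistic; in practice this would be the heavy analysis."""
    return sum(chunk) / len(chunk)

data = list(range(1000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

# Dask's distributed Client exposes a very similar map/submit interface,
# but schedules the work across a cluster instead of local threads.
with ThreadPoolExecutor() as pool:
    means = list(pool.map(segment_mean, chunks))

print(means)  # one mean per 100-sample chunk
```

The appeal of this style is exactly what Mike says: the analysis code barely changes between a laptop run and a cluster run; only the executor behind the map call does.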
And we are working on ways of sharing what we call assets, so you can share notebooks and data not just within your project but with other projects. If you find something you want to publish, you can say, I want this to be available to everyone on the system, and they can then mount it as an analytic or a dataset within their own project if they wish. So I think with that I've done about 10 minutes; I will hand back to Gordon for the remainder of the presentation. Thank you very much. Yeah, thank you for that, Mike, it all went swimmingly. So, Charlotte, if you can share my screen again and advance to the first research challenges slide. That's it. Mike, it might help if you look at the chat, because there are some very interesting discussions around, effectively, access control and privacy and sharing, so you might want to look at that and we can come back to it at the end. What I'm going to do for the remainder of the time is talk about our research landscape: what we're working on, and also what we would like to be working on. Virtual labs are not something that SEEDS alone is working on or can deliver; they will only really take off as a community effort. Some of the issues we'll look at are way too big for one or two organisations to work on. So I'm going to paint a picture of some of the things I think are important. The first area I want to focus on is the data architecture that underpins virtual labs, because at the moment it's relatively primitive. You can store your data, you can manipulate the data, but if you start to think about what you really want to do, it's not there yet. Suppose, to go back to the previous example, you're looking at human health data of various kinds, various health outcomes, and comparing it with air quality data. These data sources are stored in different data centres. Some of them may have very strict constraints on access. How do you discover them?
How do you manoeuvre your way through that web of data? There are real areas of innovation here. You've probably heard of the semantic web; you've probably heard of linked data. And I think what we need is more experience of using these in the very challenging domain of linked data for the environmental sciences. What would it look like if we had a linked data mesh that sits over all the existing data centres and allows you to browse the complete set using semantic web concepts and ontologies, so you can use search terms that are meaningful to you and your science? What would that look like? Everybody these days is talking about FAIR, and I think we need to move towards FAIR data so that it's intrinsically part of the underlying data architecture. I would go beyond that, and I've used the term FAIR assets, because you want findable resources: you don't just want findable data, you want findable methods. Mike has this fantastic machine learning algorithm which worked really well on Antarctic ice sheets. How do I find that? How can I then apply it in my context? How can I find a great notebook like the one he just showed us, which is really interesting? Maybe we could apply that in a different setting. So, findable assets, and then also rolling out to the other letters in FAIR, applying FAIR to all environmental assets. What we're really talking about here is something quite big, and some people use the term data commons. I'm slightly nervous about introducing that term because it's not very well defined, but let's use it for now. What would it look like to have an environmental data commons that allows you to really step through all the rich environmental data? Steve will be able to tell you some stories from the work in the Constructing a Digital Environment programme of how people have been running workshops to find data and apply it. Finding it is not as easy as it should be. Can we do better? So, data architectures: yes, I think we can do better.
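In miniature, the cross-centre discovery described here amounts to querying one shared metadata layer rather than each data centre in turn. The records below are invented, DCAT-flavoured placeholders, purely to illustrate searching by theme terms that are meaningful to the scientist:

```python
# Invented, DCAT-flavoured metadata records standing in for a
# catalogue that spans several data centres.
CATALOGUE = [
    {"title": "River flow gauges", "publisher": "Data Centre A",
     "themes": {"hydrology", "water"}},
    {"title": "Urban air quality monitors", "publisher": "Data Centre B",
     "themes": {"atmosphere", "air quality", "health"}},
    {"title": "Hospital admissions", "publisher": "Data Centre C",
     "themes": {"health"}},
]

def find_datasets(term):
    """Return titles of datasets tagged with a theme, regardless of
    which data centre happens to hold them."""
    return [d["title"] for d in CATALOGUE if term in d["themes"]]
```

Here `find_datasets("health")` surfaces both the air quality and the hospital admissions datasets, which is exactly the health-plus-air-quality join in the example above; a real linked data mesh would do this over live catalogues with ontologies rather than a hard-coded list.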
Next bullet. The next one is obvious given one of our core interests in SEEDS: to have libraries and sets of methods for employing data science for the natural environment. Very easy to say, but what does it mean for the natural environment? Unlike many areas of big data and data science, environmental data, I think, is uniquely complex. In some fields it's just big data, high-volume data that's coming at you and needs to be processed fast. That's easy. What's difficult is when you've got different data, with different levels of uncertainty, at different scales, and you have to combine it all to help you understand the problem. But then, as well as extracting meaning from data, you've got this wonderful legacy of process understanding and process models that exists and almost defines the environmental science community. So there's the added challenge of combining data understanding, through machine learning or advanced data science techniques, with process understanding. Some people will say to me, well, isn't that just data assimilation? And it is, kind of, but what I'm imagining goes way beyond the existing library of techniques that you would call data assimilation. Data science for the natural environment. The other thing to say here is that we urgently need more experience in this field. We need more case studies. We need success stories of where AI or data science has really worked, and actually we need some case studies of where it hasn't worked. We desperately need more experience. Next slide. Now this is one of my favourite ones, because it's something we've been working on within the senior fellowship that I currently hold. It's one of my bugbears about my subject, computer science, that we make things difficult, and then perhaps we're needed precisely because you need a computer scientist to deal with the complexity of the platforms we offer you. Just think of Linux and all its complexities. Actually, what we want is platforms that anybody can use.
What we want is scientists on platforms doing science, not hacking really bizarre Linux scripts and digging out dusty old textbooks trying to remember what this does. It worked, it really worked, but it's not working now. What was I trying to do there? No, I shouldn't be doing Linux scripts, I should be doing science. So: raising the level of abstraction, making platforms usable. And there's actually a whole wonderful world out there in, for example, software engineering. Again, some of these terms might be familiar to you and some might not, and that is perhaps the nature of the problem: we need cross-disciplinary collaboration to get people who understand containers, serverless execution, software frameworks and domain-specific languages working with environmental scientists, to bring them together. And I think that actually means you also need humble computer scientists, because we're very bad at coming along and saying we have the solution, I'm sure it fits, and then trying to hammer it into the problem when it doesn't really fit. So raising the level of abstraction sounds interesting but is actually very demanding, very demanding to get right. Next bullet. Just to broaden the cross-disciplinary credentials even further, many of the core issues we're coming across in applying virtual labs are actually sociotechnical. It shouldn't be a big surprise that people are important, and the end users and stakeholders are important. I would extend the web of cross-disciplinary skills to include those who study sociotechnical systems. And I'll give three examples: two things we're actually working on and one we'd like to work on. How can we use notebook technology and similar virtual labs technology to enhance trust? If trust is a very important property, what does it mean from a sociotechnical perspective?
And how do we actually enhance trust, so that when it comes to the eventual decision there is trustworthiness in the end-to-end pathway that was carried out? How do we support decision-making given all this uncertainty, remembering that those making the decisions are people, not machines, most of the time? It becomes a really difficult sociotechnical problem. And then the final one is the one we're not working on yet but would love to: how do you build communities of practice? How do you draw people in to build a sufficient community that this becomes vibrant, and becomes the way of doing things in different sectors of the NERC community? How do we build these communities? I think that benefits as well from a sociotechnical lens. Next slide. I'm running out of time, so I'm just going to go through this quickly. The next one, which I think is very important, is migrating our models to the cloud. This is a massive stumbling block for environmental science: the fact that legacy models run in rather baroque architectures and can't take advantage of the power of the cloud. We need to move the wonderful legacy of environmental models so they are available as services in the cloud and they just run. You don't have to download a model and work your way, again, through some curious Linux scripts. You try to get it to run, it doesn't work. You've got the wrong compiler version, so you try again, and it still doesn't work. Three months into your master's project you're starting to panic because you still haven't got this code to work, and then suddenly it works and you've got a week left to do the science. A familiar tale; I'm sure you recognise it. So let's get these models in the cloud so they just run. And then you can start to think about integrated models, pulling these models together.
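Pulling models together in that way means, at a minimum, getting one model's output onto another model's grid. As a toy illustration (everything here is invented for the example), the linear interpolation below stands in for a real downscaling scheme, which is exactly where much of the hidden complexity lives:

```python
def downscale(coarse, factor):
    """Linearly interpolate a coarse-grid 1-D field onto a grid
    `factor` times finer: a toy stand-in for real downscaling."""
    fine = []
    for left, right in zip(coarse, coarse[1:]):
        for i in range(factor):
            fine.append(left + (right - left) * i / factor)
    fine.append(coarse[-1])
    return fine

def couple(climate_rainfall, hydrology_model, factor=4):
    """One coupling step: regrid a climate model's rainfall onto a
    hydrology model's finer grid, then drive that model with it."""
    return hydrology_model(downscale(climate_rainfall, factor))
```

Real couplers must also handle conservation, differing time steps and feedbacks in both directions, so even this single step is a gross simplification of what a coupling framework has to reason about.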
And then one of my favourite topics, which I could actually speak on for a whole hour but I'm not going to: don't forget the model coupling, because one of my mantras is that the model coupling may be as complex as the model itself. We need some way of reasoning about how you couple models, and all the complexities, downscaling and so on, that have to go into model coupling. So, just to leave you with something final: it's compulsory these days to mention digital twins, and I'm going to leave that as a grand challenge for you, if you can click on to the next bullet, Charlotte. What would it mean to build a digital twin of the natural environment within a virtual lab setting? I would pose the statement that virtual labs are the right environment because of the integrated data capabilities and the potential integrated modelling capabilities. For me, this is a true grand challenge for the CDE programme: to really start building digital twins of the natural environment. This is not just running a machine learning algorithm on top of rich data sets. It means really wrestling with the big questions of how you get knowledge from data, how you get knowledge from process models, and how they work together. I don't have time to go into this, but I see a lot of parallels with, you know, one of the people who's responsible for me being in this world, Keith Beven, and his vision of models of everywhere, which I think is a really profound concept that's been under-realised in this community. With that, I'm going to stop, because I want to leave time for questions. Thank you very much. Right, well, that's fantastic. Gordon, thank you very much, and Michael, thank you very much, for a most stimulating presentation. That was really interesting. I love the grand challenge that you've posed to everyone, thinking about digital twins.
And of course that does remind me that one of the defining features of digital twins is the processing of near or real-time data, and that of course comes from sensors, which is a shameless plug for our next webinar series, which will be all about environmental sensors: how do you get the data that you need for these tools? Now, goodness me, I've got lots and lots of questions to ask you, but it's only fair that I ask the ones that colleagues in the audience have as well. And Justin, thank you for your queries. If I may just start, though, with the question that Charlie has just popped up here, which I think is quite an important one, and maybe one for Michael actually: if you're using DataLabs projects with external people who are not actually registered in DataLabs, what is the best way to share the models with those colleagues, either as read-only or run-only demonstrations? That's the question, for both of you really. Oh, well, one thing I should have said in my demo, sorry, that I forgot to mention, is that the R Shiny app I showed you is publicly available. When you create those R Shiny applications you can create them with three tiers: you can make them so only you can see them, you can make them so your project can see them, or you can make them publicly available, so anyone who puts the URL in can access them. So that is one way to do the run-only aspect. And we are working on getting a similar capability in for Python, probably using something like Panel or Dash, so a similar approach. On the notebooks, I also see in the chat that someone's mentioned using Binder. We are exploring that as one possible option, but like Gordon said, we want to find a way to make notebooks publishable so people can actually have a runnable version. Binder is an option.
And there's also a nice little package for R called learnr, I think, with which you can create sandboxed applications; it effectively publishes a Shiny app, but you can make sections of the code editable, so someone can come along, change the code, run it, and it won't break anything else in the notebook. And you can make other sections read-only, so they can't break the key code, but when you want to iterate on a plot or something, you can do that. So yes, those are the options; we haven't really got an equivalent for Python yet, apart from maybe Binder. But that is how I would say you share with external people who aren't registered in the lab: using an interface like that. Thanks, Michael. Yeah, so there are a few options there, aren't there? Justin, turning to the questions that you've posted in the Q&A, thank you for those. The first one, Gordon, really, is this: Justin's saying that one of the potential uses for these VREs is to enable communication of underlying research results from large, complex data sets, to provide transparency for policy, such as that underpinning the IPCC reports. And you're making the point, Justin, that it would be amazing to see the press interacting with the data and communicating with it. But of course, Gordon, you also mentioned it would be nice to publish some of these models in their native form, in notebook form. So, how do these endpoints and access services used by the research communities meet the needs of non-specialist users? That's the sort of challenge coming back to the environmental data scientist, Gordon. And does that help enable trust in the science community? Yeah, that's a really, really good set of questions. I mean, in theory, yes, but that's easy to say. Notebooks can be publishable, but it's down to whoever develops a notebook to make sure they're usable, right?
I mean, Mike in his demo showed how you can see as much or as little of the information as you want. You can dig in and see the messy code, or you can just see the Shiny visualisation. So that's one tool. And just to repeat what Mike said, I think you can give more controlled access through Shiny apps and equivalents. So if, for example, there are reasons why you shouldn't see all the workings, you can just see the top end. I think there's a challenge here for the community, though, because it's one thing saying, yes, it can visualise the data, but there's a challenge to visualise data effectively. One thing you get in the cloud is a suite of services that enable you to do quite exciting visualisations, but they've got to be done well. They've got to speak to the end users; they've got to be meaningful. And that leads me on to one of the things I'm very passionate about, which is working with people who know how to communicate. In my fellowship, for example, I work with artists, and I work with people who design good communication. I think that's as important as the ability to call on some appropriate visualisation package in Python. Brilliant. Thanks, Gordon. That's great. The other question, Justin, you're asking is quite specific in some senses, but there's a wider aspect to it as well. The ENVRI-FAIR project that's going on is all about linked data and the linked data web across European research infrastructure data. How has that been looked at in the context of the NERC work, and the linked data DCAT ontology mappings and their links with schema.org? Has that been considered? I think that's a very interesting question, because in my wider work I'm also involved with the DAFNI modelling system, which uses DCAT, too, and schema.org, and that's a question I've been wondering about for DataLabs as well. So, maybe a question for you both: how has that been considered here for DataLabs?
Yeah, I think that's almost the wrong question, if you'll excuse me, because I think the real question is, how do we have a community effort that builds on all these great building blocks? I mean, if you dig down to the number of people working on DataLabs, it's relatively small. We have particular skill sets, and those might not include linked data, for example. The number of research challenges I had on those last three slides is massive; it's a massive programme, not something we can do all of. So, I mean, I'm very aware of that work and it's terrific, and there's great linked data work in other communities as well that we can link to, for example Carole Goble's work. So it's a case of bringing all these things together and building a community, so that the people who are good at doing linked data and discoverable science get together with those who understand how to do data science of the natural environment, and with all the other skill sets that we need. Okay. Well, Gordon and Michael, thank you very much for your answers. Looking at the time, we perhaps ought to bring things to a close now, and I'm sure you'll both be prepared to receive emails from anyone who wants to follow up on any other points. So, if I may, thank you both, Gordon and Michael, for your presentation and discussion. Just to remind everyone, and Joe, we did record this, and we will be making it available on our website; you can subscribe to the YouTube channel, or there will be a link to the YouTube video on our website. That also leads me to note that our next webinar will actually be held at a new date and time. So, a note for your diaries: Friday the 11th of June at 11 o'clock, presented by Dr. Lia Chatzidiakou on sensor networks and smart analysis. Her work is on quantifying air quality and health impacts.
And that pretty much follows on from Gordon's comments about digital twins. That is a new series of webinars, all about the availability of ubiquitous, low-cost sensors and microprocessor controllers for planning and undertaking environmental sensing research. So please make a note of that in your diaries and book your place on our website: Friday the 11th of June at 11 o'clock. And with that, thank you to our speakers, thank you to all of you for attending, and thank you to colleagues for administering the whole affair. Let's draw things to a conclusion now. Thank you very much. Thank you very much, Gordon and Mike. That was really interesting. It's great to have a demo as well.