A very warm welcome to Robert McGreevy, who many of us have known in all sorts of different capacities. Up until relatively recently, I think you're currently working as a consultant for STFC, and you're involved in the formation of the Ada Lovelace Centre, related to data and computing and so on. Before that you were well known as the director of ISIS, the ISIS facility, which you directed from 2012, I think; prior to that you were deputy director at the Oak Ridge laboratory, from which Paul has come; before that, back at ISIS as a division head; before that Studsvik, in Sweden; and then before that Oxford. I tried to stack that up in my head, so hopefully I haven't made any errors. A warm welcome to you, and we look forward to your talk about how we get more value out of all that data; it's a very topical subject, and a nice alliterative title too.

Exactly, the alliteration is good. Okay, it's a great pleasure to be here. So, ah, my goodness; no, that's the beam, and then the arrows left and right should work; it's the dongle. So, I've been giving this talk in various different guises for decades now, I'm afraid. This is the oldest PowerPoint I could find on my computer, from 2001, on basically the same topic; that was the ESS general meeting in 2001, while I was still working in Sweden. So it's a long-standing problem, but also a long-standing interest of mine, and I think my PowerPoint skills have improved; neither of us uses Comic Sans any longer, not like Colin, who still uses Comic Sans.

Okay, so we have here MAX IV and the ESS: big neutron and synchrotron facilities. These things either will produce, or are producing, very large amounts of data. What are we going to do with all that data? You're spending a lot of electricity to produce that data, and quite often a lot of electricity to store it and analyse it, so you've really got to think about how you get the most value out of it, both from an economic point of view and, now, from an energy point of view; and those two things are related. This talk is split into three; it's really three different talks about different things. I'll talk about data, then separately about software, in a rather different context, and then about strategy, and how, at least, we are looking at taking this forward. But I'm going to start with data.

At the facilities I've just talked about, people nowadays talk about terabytes and petabytes of data. So here's a question for you: when I started doing neutron scattering, what was the relevant unit of data? It's actually the kilogram. This is what we used to take our data home on: punched cards or paper tape. How much data in cards can you get into a 20-kilogram suitcase? I ask because the only time I ever completely lost a suitcase flying, it had my data
in it, from an experiment at the ILL. So guess: how much data did we get in a 20-kilogram suitcase? It's about half a megabyte. Twenty kilograms of suitcase, half a megabyte. So things have moved on; they've changed quite a lot.

Looking at some big data challenges, and how the neutron and synchrotron sources differ from them: the LHC essentially collects the same type of data over and over and over again. Every single shot is basically the same, the same data, and you build massive statistics on it. But there is so much data that you have to start at the beginning by saying, we've got to get rid of, whatever it is, 97% of this, because we can't store it all. So they have this stuff called triggering software, which looks at each particular event, or set of events, and says, okay, we think this one is a good event, and that one isn't. They're probably missing all sorts of really important discoveries, but nevertheless they have to try to work out which events are good. Basically you're collecting the same data over and over again, and then you can analyse it for the next 10 or 20 years, whatever it is, because it's all about the statistics of the events you collect. That's one sort of big data challenge, and it's really about the sheer size of the data you're collecting.

If you look at something like the SKA, which is probably the biggest scientific data challenge around at the moment, that's somewhat different, because you have all of these radio telescopes, half of them in Australia and half of them in South Africa, simultaneously collecting radio data that you want to correlate. In the past you would have said, I can store that data and do the correlations later; that is a different sort of challenge. But nowadays you want to do this thing they call multi-messenger astronomy, which has really been built up since the gravitational waves: when an event happens out in the universe somewhere, they want to be able to look at it with lots of different instruments immediately. So you need to be able to analyse your data within a very, very short time period, and to tie it up with all sorts of other data. That becomes a very different sort of challenge, but it's still basically one type of data you're measuring with each of these techniques.

When you look at facilities like ours, the data volumes are increasing. They're not actually at the scale of the SKA, but certainly if you compare with the LHC and so on, we are getting into the situation, particularly with techniques like imaging, where you're going to have to decide which of those images you want to keep and which you're going to get rid of, because you can't keep them all. So you're going to have to have some equivalent of the triggering, which nowadays will quite often involve machine learning or something similar to actually take that stuff out.
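As a minimal illustration of that trigger-style filtering, here is a sketch in Python; the 'event scores', the exponential toy distribution, and the threshold are all invented for illustration, and real triggers, whether cut-based or machine-learned, are of course far more sophisticated:

    import numpy as np

    # Toy stand-in for a stream of events: one "interestingness score" each.
    rng = np.random.default_rng(0)
    scores = rng.exponential(scale=1.0, size=100_000)

    # Threshold chosen (illustratively) so only a few percent of events survive.
    THRESHOLD = 3.5

    kept = scores[scores > THRESHOLD]
    print(f"kept {kept.size} of {scores.size} events "
          f"({100 * kept.size / scores.size:.1f}%)")

With this threshold only about 3% of the toy events survive, which is the spirit of throwing away 97% because you cannot store it all.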
But the data are very diverse. Facilities like this might have 20, 30, 40 instruments, producing different types of data, and even for the same instrument, if you're using small angle scattering to do biology compared with doing magnetism, what you want to do with that data is different for those two things. So: very, very diverse data. It's also certainly becoming more 'instant', in that you want to be able to look at the data analysis in some way while you're doing the experiment. For a lot of the imaging experiments you don't want to analyse the images three months later, because you want to change something in the experiment. That's what I call instant, and one of my colleagues called this data 'inconvenient', because it's not just lots of data in raw volume terms; there are lots of different data files, and you may be changing something in your system as it goes along, so it's about the number of different data files, which are in themselves different: different timings, different states of the sample, and so on. It's a very different data challenge from either the LHC or the SKA.

And multi-messenger is actually intrinsic to us too: pretty much all of the people using our facilities will be doing other things with their samples, collecting other data with other techniques, and these need to be brought together. Here's a graphic we produced back at the ESS around 2000, and if I had a euro for every person who has used the one on the bottom right, then, well, we had a great lunch today, but I could take you to the most expensive restaurant in London and still give you a really nice lunch. This was all about the multi-messenger thing: putting together different sorts of data is an intrinsic part of what we do in this sort of science, although we're not tending to do it instantaneously. And, as we discussed over lunch, in a lot of areas, like materials science, the use of databases is actually very limited, and that's something we have to look at.

This is another graph I produced some time ago for a different purpose, but the idea is to point out, again on the multi-messenger theme, that we have lots of different sorts of data coming in. Where I've written 'research' there are lots of different ways you can measure data, but that's essentially all we're doing: we're measuring data, and it's the analysis of the data that leads to the value. None of our scientific value comes out directly; we have to analyse the data to get there, and the rate at which you can analyse that data either allows you to do better experiments or, what might be called time to market, is really very important in terms of the scientific impact it has. If you can't analyse your data for five years you're going to have less impact than if you can analyse it in five minutes. Then there are the two routes: the scientific route, where you produce publications and so on, and again the speed of publication is quite important if you want to make an impact, or the route to economic impact, if you want to get the value out that way. But the analysis and the synthesis, collecting the data, analysing it, and synthesising it together: that's the core of how we produce scientific value.

So what's the problem, and what are the opportunities? We're measuring more data, no doubt about that, but proportionally we're ending up using less of it. Now that might be fine, but if we're not careful about which bit of the 'less' we use, then of course we may be throwing away the valuable stuff, or never looking at it. We're actually extracting very little value from most of the data we measure. Most of the data measured at the facilities we're talking about, with the exception of particular things like protein crystallography,
is essentially only ever used by one person, or one group of people: the people who measured it. The collective use of data is very, very small, and we can't carry on doing that, because we are extracting so little value out of it. As I said earlier, we need to maximise the benefit relative to both the economic cost and, nowadays, the carbon cost, and that's a quantitative assessment we're going to need to make, because it will become increasingly important.

The opportunity, of course, is that we measure a lot of data, and there is a lot of value in it, so if we get ourselves together we can certainly add value by aggregating data into databases. In some fields that's pretty standard: in crystallography you wouldn't think of doing a crystal structure without looking at a database. How many people would do a crystal structure and say, I'm going to work out where all the atoms are myself and forget about the rest of the world? That's how we do crystal structures. But I started off studying liquids and glasses, and there are no general databases of the structures of liquids and glasses; pretty much everyone starts from 'I'll just work out for myself where I think all the atoms are'. So there's an enormous opportunity there, but somebody has to fund it, and it takes a lot of resources to do it well; I'll come back to that. And of course the great opportunity, once we put those things together, is using artificial intelligence and machine learning, either for filtering data in the first place, deciding what's the good data, what's the valuable data, or, once it's all together, for getting more value out of the collection of data. Just collecting data by itself is fine, but unless you aggregate that data in an effective way, you don't get so much value out of it.

This is my favourite database, just to throw one of these things in: the spirit collection at the Natural History Museum in London. If you're ever there, go and see it. They have something like 20 million specimens preserved like this, which is quite a fire hazard, and a chemical hazard too, so some years ago they put together a special building for it. The biggest thing they have is a giant squid; what is it, eight metres long or something? That's quite a challenge to preserve, a giant squid. And at the top left, if you have very good eyesight, you can read a label that says 'Beagle': that's one of Charles Darwin's original specimens, and they have about half a dozen of those. Anyway, I'm just advertising for my friends at the Natural History Museum. It's a wonderful collection, and it shows the value of collecting data, because in the past you just had all these things and catalogued them very carefully; now, with genomics, they're doing genomics on the whole lot and getting even more information out. So: the value of data.

To manage data you do need a policy. Data policies might sound very dull, but they are actually very important. We're kind of used to this stuff nowadays; all the facilities have data policies, I think. But when we wrote the first data policy for ISIS, in 2005, the
user community were very definitely of the view that you cannot do that, you cannot make our data public. Ten years later it had turned around, and public data policies were quite the norm. But there is still an attitude in some areas that my data is my data, and it will always remain my data, and I'm not going to tell you what the metadata is, because otherwise I lose the intellectual property. We've got to get over that, and it's still around in some areas.

FAIR data: FAIR is the fashion, but does it actually help? FAIR, whatever it is: findable, accessible, interoperable, reusable. Lots of organisations, I don't know about Vetenskapsrådet, but others certainly, will say that the results of your funding should be made FAIR data. Is that right? Yes? But you probably don't pay people the money to allow them to do that, and making data FAIR takes quite a lot of resources. And, as I sort of said earlier, not all data can or should be kept. First of all you've got to decide whether the data is worth keeping; you don't keep it just for the sake of keeping it. And actually not all data can, or even should, be FAIR. I think data should be findable and accessible, because at least then you can do something with it. Whether it necessarily needs to be interoperable and reusable is a value judgment, because a lot more effort has to go into the latter two letters of that acronym than into the first two. They're different sorts of effort, but it does require resources.

Most of you will be familiar with the fact that to make data interoperable and reusable you need the metadata that describes what that data is. You can do the first two relatively easily, in a relatively automated way, but the second two are all about the metadata: what is this thing I'm actually collecting data on? In some areas of science that's relatively straightforward, although I don't think it's ever really straightforward; in some of the experiments we do, you're actually working out what the sample was during the experiment, so how do you define it? There's a kind of circularity there. And in a lot of materials science, when we have these discussions, people say it's really difficult to describe what the thing I was measuring actually was. But if you can't describe it, you can't make it interoperable and reusable.

We do have the publication as a sort of paradigm: in a way, the publication tells you what the thing you measured was, how you measured it, what data came out, and so on, so in some sense the publication could be regarded as a sort of metadata in itself. But publications are always partial; we all know that. They tell you a bit about what you did, the selective bits that people decided to release to the wider world; they don't actually tell you enough to make the data interoperable and reusable. There are people now who do mining on publications, extract metadata from them, and produce databases out of that, but that only gets you part of the way; it gives you a partial story, not a complete one.

When you're trying to get people to produce FAIR data, the thing you have to tell them first of all is that the greatest
beneficiary of FAIR data is the people who measured it. How many of us have had students who left and hadn't written down what it was they did, or where they put the data? Just being able to find your own data and use it again in five years' time is a benefit. If you're trying to persuade people to put resources into doing something, then telling them it's a benefit to them, and not just to the wider world, of course helps quite a lot. But as I said, metadata is absolutely key.

I'm not knocking funding agencies, but the UK funding agencies say that the results of all their funding have to produce FAIR data; mostly, though, what it actually amounts to is not FAIR data but 'data has to go into a database somewhere', and we get what I call data dustbins. Universities will have repositories where people put their data to satisfy the funding requirements, but it's just like a big dustbin: you put it in the dustbin and it's not actually reused by anyone. We're just storing stuff up for no real benefit. And the idea that the data is usable from there is wrong, too: the fact that different data sets were measured at a particular university gives them no particular correlation beyond the fact that the people who measured them work for the same employer. You want to get your data into databases that connect similar sorts of data together in a logical way, and that structure is not there in most areas of science. So we certainly should not be spending resources on data dustbins; we should try to do a little bit better than that.

Giving credit where credit is due is very important. If you're going to get people to put data into databases, they have to recognise that there's some value to them in having done it, and that means we need ways of citing data. We're very used to citing the publication; that's the paradigm for saying what people did. But the data, in this context, should be a bigger thing, because you might publish about a section of the data while putting all the data into the database. We started doing DOIs for data at ISIS in the late 2000s, around 2009-2010, so all the data we measure has a DOI, with a minimal set of metadata associated with it. How many people cite that data, even the people who measured it, in their own publications? It's now, I guess, about 30%, maybe, and that's after more than 15 years; it's very difficult to get people to cite it. But we need data citation to be a standard part of what we do, to give credit to people, because otherwise the incentive isn't there: they won't do what they need to do because they don't see how it helps their career. Publication citations help your career; your data citations should also help your career, not just be a link from a publication. So there's a whole series of things about data. I think there are enormous opportunities, but we have to put resources into enabling those opportunities; it's not going to happen for free.
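To make the metadata point concrete, here is a minimal sketch of the kind of machine-readable record that makes a dataset findable and citable, in the spirit of the ISIS data DOIs just described. The field names loosely follow DataCite-style metadata, and the DOI and all the values are invented placeholders, not a real record:

    # A minimal, hypothetical dataset record; every value is a placeholder.
    dataset_record = {
        "doi": "10.0000/EXAMPLE.0000001",
        "title": "Neutron diffraction from an example glass sample",
        "creators": ["A. Scientist", "B. Student"],
        "facility": "ISIS",
        "instrument": "example-diffractometer",
        "date_collected": "2010-05-01",
        "sample": {"description": "example glass", "temperature_K": 300},
    }

    def cite(rec):
        # Render a human-readable citation from the record.
        year = rec["date_collected"][:4]
        names = ", ".join(rec["creators"])
        return f'{names} ({year}). {rec["title"]}. {rec["facility"]}. ' \
               f'https://doi.org/{rec["doi"]}'

    print(cite(dataset_record))

The point is that even such a small, automatically captured record is enough to make data findable and citable; the hard, expensive part is the richer metadata that would make it interoperable and reusable.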
Now I'm going to take a completely different look, at software; it's not directly connected, just things that are particularly interesting at the moment. The first one: you might wonder why I'm showing you an ultrasound picture of a baby, apart from the fact that it's very cute, and that we've had two small grandchildren this year, so to me it's really cute.

Think about the use of the word 'imaging' here. You look at the picture on the right and say, that looks like somebody had an old-fashioned camera or a video camera, and that's what they took. Then think about how this is actually measured. If you know how the ultrasound measurement works, there is an enormous amount of very complicated software between the measurement and the picture that comes out, even for the one on the left, because most of what you actually measure is garbage: to get the picture of the baby out, you have to get through all the garbage that's in the way. And the image on the right actually comes from a wireframe, a wireframe rendering just like you have in the movies, deliberately rendered so that it looks like a sort of grainy white picture of a baby, because nobody wants their baby coming out purple or something; it wouldn't look natural. It's not an image of a baby; it's a wireframe rendering of the baby, just like in a movie. But you're quite prepared to accept this image because it looks like an image, it looks like what you're used to. The point is that going from a measurement, which this is, to a result, which the image is, can involve a lot of software, and for a lot of medical imaging far more investment goes into the software than into the thing you see, whether it's a CAT scanner or an MRI machine or an ultrasound machine like this.

What I'm really talking about here is what I regard as the disappearing boundary between the experiment, the measurement, and the analysis, modelling and simulation, because in a sense that image is partly a simulation. An example I've been using for the last couple of years is the electron microscopy of protein structures, which has taken off in the last five years or so; again, a lot of people don't understand what goes on between the actual thing you measure in the first place and the result you get out. This comes from one of my colleagues, Tom Burnley. In the UK we have these things called Collaborative Computational Projects, some of which have been running for a very long time; you'll be familiar with CCP4, which does protein crystallography, for example. We now have CCP-EM doing electron microscopy, and we have the imaging one, CCPi. These are actually very good examples of very long-running software development on behalf of a community, both developing the software and making it sustainable, and that's a really important point: CCP4 has been running for probably 30 years, producing sustainable software, which is very important.

I'm probably talking to people here who know more about this than I do, so please forgive me. What you're doing in the electron microscopy is measuring images like the things on the top right: two-dimensional sections, images within which you have lots of little images of
proteins. So it's a two-dimensional slice, and somehow, out of that, you want to reconstruct a three-dimensional object, like the one on the bottom right, and then you essentially map into that three-dimensional object what you know about the more detailed atomic structure. What you're not doing is directly measuring the thing on the bottom. A lot of people think you are, but you're not; you're measuring the thing on the top right. And if you look at this, which is just part of the framework of the data analysis software used within CCP-EM for doing it, it is not a simple linear process where one program does this and you go from here to here. Most people think data analysis at our facilities is: I measure the data, I run this program, I run the next program, out comes the answer; and you start to realise there are all sorts of different things you may need to do for different data sets, putting together lots of different ways of constructing bits of the model, to actually come out with the model at the end.

So, some of the ways of looking at it: on the top left is a typical image coming out of the microscope. You're trying to take the things you can very vaguely see, the two-dimensional images of the proteins, and, as in the little cartoon on the bottom left with the stars, reconstruct the three-dimensional object. Within that you're trying to produce a real-space model, like the thing on the top left, and you use all sorts of techniques, like backwards and forwards Fourier transforms; this has been done in electron microscopy for years. It's essentially tidying up your data, getting rid of the noise, and finding the real thing. There are a lot of complicated steps to end up with that nice clean image you get at the end of the day, because the clean image is not the thing you measured; that's very important. The nice protein structure you see is not what you're actually measuring; it's a combination of a lot of software that goes from two-dimensional images to a three-dimensional map, and from the three-dimensional map to the atomic-resolution structure you get at the bottom, and an enormous amount of additional information goes into producing that final answer.
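As a small illustration of those backwards and forwards Fourier transforms, here is a sketch of low-pass filtering a noisy two-dimensional image with NumPy; the cutoff and the random 'image' are arbitrary assumptions, and real cryo-EM processing is far more sophisticated than this:

    import numpy as np

    def lowpass(image, cutoff=0.2):
        # Forward transform, with the zero frequency shifted to the centre.
        f = np.fft.fftshift(np.fft.fft2(image))
        ny, nx = image.shape
        y, x = np.ogrid[-ny // 2:ny - ny // 2, -nx // 2:nx - nx // 2]
        # Radial frequency as a fraction of the Nyquist frequency.
        radius = np.hypot(y / (ny / 2), x / (nx / 2))
        # Zero out the high-frequency, noise-dominated components.
        f[radius > cutoff] = 0
        # Backwards transform to get the tidied-up image.
        return np.fft.ifft2(np.fft.ifftshift(f)).real

    noisy = np.random.default_rng(1).normal(size=(128, 128))
    cleaned = lowpass(noisy)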
Now I'm going to connect this to something I was involved in developing a long time ago, a technique called Reverse Monte Carlo modelling, for modelling the structures of disordered materials. This is a GIF we produced of doing it in two dimensions, back when animated GIFs were really new; we were very happy about that, it was real modern technology. It looks like nothing much, but the idea is that you're fitting the model to the data: on the left we're looking at a liquid structure, and we're fitting a model of the liquid structure to the data, starting from the crystal structure. That's a very simple example.

Then, going back to the multi-messenger idea, you start doing this on real systems. This is one of the systems we did, a fast-ion conducting glass. The picture on the left is the atomic structure we derived, and mapped into it is the conducting pathway of the silver ions, which is the thing you actually want to know, and all of that comes from the data. In that case, and this is still relatively unique, we used x-ray diffraction, neutron diffraction, two sorts of EXAFS, eventually anomalous x-ray scattering, and somebody did NMR on top of that. So there were six different techniques going into producing a single model; that's still pretty rare, even nowadays.

And I really like showing this one, as a sideline. Those of you who are research scientists will know there are times in your career when you do an experiment that's really exciting. We produced a model like this based on the data on the top left, which is neutron diffraction data, and the sharp peak on the far left: all of the literature at the time said that peak was due to silver clusters in the material. We produced our model, and the model said no, it's not; it's due to correlations between the phosphate ions. So we had something that was completely the opposite of what was in the literature, and the test was to go and measure the x-ray diffraction, because silver scatters x-rays strongly: if the literature was right, that peak should be enormous, and if we were right, it should be small. So we did the experiment on the top right. I'm afraid you already know the answer, you can see how big the peak is, but imagine a scanning detector: we start the experiment, you see the peak coming up, and the question is, does it shoot up or does it stay small? And of course it just stayed small, so we went down the pub to celebrate and forgot about the rest of the experiment. That was really exciting, and again you end up with something completely opposite to the literature. It illustrates the very important thing about the relationship between modelling and experiment: the combination of the two tells you things you couldn't get just by looking at the data.
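Since Reverse Monte Carlo keeps coming up, here is a minimal sketch of the idea: move atoms at random, recompute the model's pattern, and keep the moves that bring it closer to the measured data, occasionally accepting worse fits so the model does not get stuck. Everything here, the one-dimensional toy 'pattern', the assumed uncertainty, and all the parameters, is invented for illustration; real RMC works with thousands of atoms in three dimensions and proper structure factors:

    import numpy as np

    rng = np.random.default_rng(2)
    q = np.linspace(0.5, 10.0, 200)    # "momentum transfer" axis
    SIGMA = 0.01                       # assumed measurement uncertainty

    def model_curve(positions):
        # Toy 1-D "pattern" from all pair distances (not a real S(Q)).
        d = np.abs(positions[:, None] - positions[None, :])
        d = d[d > 0]
        return np.sin(np.outer(q, d)).sum(axis=1) / d.size

    # Fake a "measurement" from a hidden true configuration, plus noise.
    data = model_curve(rng.uniform(0, 10, 20)) + rng.normal(0, SIGMA, q.size)

    pos = rng.uniform(0, 10, 20)       # random starting configuration
    chi2 = np.sum(((model_curve(pos) - data) / SIGMA) ** 2)
    for _ in range(2000):
        trial = pos.copy()
        trial[rng.integers(pos.size)] += rng.normal(0, 0.1)  # move one atom
        new_chi2 = np.sum(((model_curve(trial) - data) / SIGMA) ** 2)
        # RMC/Metropolis acceptance: keep improvements, sometimes keep worse fits.
        if new_chi2 < chi2 or rng.random() < np.exp(-(new_chi2 - chi2) / 2):
            pos, chi2 = trial, new_chi2
    print(f"chi^2 per point after fitting: {chi2 / q.size:.2f}")

Fitting several data sets at once, as in the six-technique glass model above, just means summing a chi-squared term like this for each technique before deciding whether to accept the move.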
Now, some of you might be more familiar with this: anybody doing small angle scattering, with either x-rays or neutrons, will know that Dmitri Svergun at the EMBL has been producing software that is very widely used for deriving the shapes of biological molecules in solution, which I'll come to in a minute. And this is where I can complain about Vetenskapsrådet, because Johan is here; well, he wasn't there at the time, so I'm sorry, I'll keep on getting at you. In the mid-1990s, before Dmitri developed this software, I wrote a proposal to Vetenskapsrådet to develop exactly this software, because it is actually what we call reverse Monte Carlo on a lattice, using exactly the same techniques. They turned it down, so we never developed it, and Dmitri did; I don't feel sorry about that. But you shouldn't have come to the talk, should you? It wasn't your fault; you weren't to know. This is now a widely used technique in the small angle scattering area for biological molecules. And do you know who the first person to write a lattice reverse Monte Carlo program was? Kurt Clausen, who is the vice chair of the ESS Council, and for a completely different application, not biology. Anyway, it's now used like this: this is based on neutron scattering data, Steve Perkins at UCL, studying immunoglobulins. The low-resolution shapes there are produced using this sort of software, based on different sets of neutron scattering data, and then the higher-resolution structure is mapped into them, in exactly the same way as is done in the electron microscopy. The difference is that here you're starting with data from diffraction, as opposed to two-dimensional data from imaging, but otherwise, in many senses, the principle of what's being done is the same. And on the next slide, this is some of the more complicated software they're using, with a combination of x-ray diffraction data and neutron diffraction data, but it's the same principle as the electron microscopy.

So, going back to this picture: the Fourier-space data on the top right, if you took the one-dimensional average of it, is small angle scattering data. It strikes me as slightly peculiar that people are quite happy to call the thing on the left imaging, because you start off with two-dimensional data and produce a nice image, while everyone tells me the thing on the bottom right is modelling, because you start off with one-dimensional reciprocal-space data and produce a three-dimensional thing, so it has to be a model, it can't be real. But they're basically using the same techniques.

There's one more thing in here. Back in the early 2000s, before I left Sweden, we put these techniques together so that we did an experiment where we measured the data, did the analytical corrections of the data, and produced a model, with one button press. The whole thing happened completely automatically, and at the time the guys said to me, why are we doing this? I said, because that's the future. It's not quite the future yet, but that's where you want to get to: towards that one button press. I mean, that happens now in protein crystallography: people have the one button press, the robot takes the measurement, you press the button, you get the picture of the structure out; it's even easier now. But that was possible 20 years ago in a tiny laboratory with very few staff, and still that sort of thing is not really implemented in many places. It could be, if we had the resources to do it.
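A minimal sketch of what such a one-button chain can look like, with the reduction, correction and modelling steps as hypothetical stand-ins rather than any real facility's software:

    import numpy as np

    def reduce_raw(counts, monitor):
        # Normalise raw detector counts to the incident beam monitor.
        return counts / monitor

    def correct(pattern, background=0.02):
        # Subtract an (assumed constant) background; real corrections would
        # include absorption, multiple scattering, and so on.
        return pattern - background

    def fit_model(pattern):
        # Stand-in "model": fit a straight line and return its coefficients.
        x = np.arange(pattern.size)
        return np.polyfit(x, pattern, deg=1)

    def one_button(counts, monitor):
        # The whole chain runs with no human intervention between the steps.
        return fit_model(correct(reduce_raw(counts, monitor)))

    raw = np.random.default_rng(3).poisson(100, size=50).astype(float)
    print("fitted model parameters:", one_button(raw, monitor=1000.0))

The hard part is not the chaining itself but making each step robust enough to run unattended on whatever comes off the instrument.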
I think this is another example where you can do similar things: residual stress measurement. The one on the top left is measured on the basis of diffraction, so it's a reciprocal-space measurement, and you produce a map; the bottom right one is a different object, but that map is produced on the basis of imaging directly. These are essentially equivalent; they're the mirror image of each other in terms of data collection, because what you're not measuring in the imaging, what's taken out of the beam, is exactly the stuff that goes into the diffraction. Essentially it's the same data, just the inverse of it. In the imaging you get the picture directly, and what you'd really like is to be able to do the diffraction measurement and also get the picture directly, because you usually have to do both: with the imaging you can get better spatial resolution than you can with the diffraction data. So these techniques could be used in very different areas.

We can also go on to look at dynamics. These are things we developed a long time in the past, using similar techniques, looking at ion conducting materials. This one is caesium hydrogen sulfate; I mentioned the story about caesium hydrogen sulfate at lunchtime. Again, this is based on modelling, but it's not molecular dynamics modelling; it comes from diffraction data. The interesting thing here: at the top we're showing three sulfate molecules, and a hydrogen atom being transferred from one to the other. The thing people didn't understand at the time is that spectroscopy tells you very clearly that you have HSO4- ions, that's the entity that's there, and yet the hydrogen ions are zipping all over the place like mad. The diffraction data on the bottom right might suggest you have pathways with the hydrogens just wandering around, but they're not: you can see in that little picture that the hydrogen basically gets transferred straight onto the next molecule, it's not wandering around in the middle of nowhere; they're actually transferring coherently from one molecule to another. So again, the combination of modelling and experiment gives you much more insight than doing one or the other.

We then went through the process of trying to combine molecular dynamics modelling with experimental data. This is a really simple thing we did some years ago, actually modelling liquid argon, because that was the easiest thing we could do. But there is now a project trying to develop this molecular dynamics approach; Thomas Holm Rod from the ESS Data Management and Software Centre is involved, people at Chalmers, and so on. They always want to do water. I never know why the first thing people want to study is water, and my advice is: never study water. Water is a very dangerous liquid, because whatever you say about water, there'll be another scientist who absolutely disagrees with you. I've never studied water in my entire career; that's why I'm still alive.

So that's been a wander through a number of things that interest me, really trying to make the point about this disappearing boundary between simulation and experiment. It's not
just about doing a simulation and doing an experiment and looking at the two together; it's about combining those techniques, so that you're very sure they are consistent with each other, and getting a lot more out. I'm trying to make the point that you need to be much more imaginative about the way you think about experiment and simulation; most people have a very linear view of how you do things. I've had this argument so many times in my career: people say, but that can't be correct, because it's a model. And I don't care whether it's correct or not; what I care about is whether it's useful to me in understanding what I need to understand. It's about understanding. Going back to the picture of the baby: you're quite happy to accept that, because it looks like a baby, so you understand the baby. In science it's actually no different; it's just that we haven't seen the thing, in a sense, with our own eyes.

A slightly separate point at the bottom: developing research software for any of these things needs to be done professionally, by properly paid staff, and it's more than that nowadays. To put together all these different things you need a combination of maths experts, science experts, the domain people, and research software engineers who can produce efficient code, and it needs to be sustainable. In many of our facilities we still have the one person who developed this great bit of software, and it's fantastic, and everybody uses it, and 20 years later they retire, and we can think of examples of that right now, and it disappears. We have to stop that: it's a massive amount of resource going into producing something that becomes a single point of failure. We've got to get used to developing software the way the Collaborative Computational Projects do, where you have teams of people and each individual has their own thing, but it's not one person doing it. And somebody needs to pay for that; it doesn't get to be sustainable unless somebody's paying for it, and I'm not looking at you, Johan.

Okay. In that context, now coming to the strategic side, let me tell you what we're doing about this, and why. I'm involved in setting up a centre called the Ada Lovelace Centre. There are far too many things called Ada Lovelace; we didn't realise that when we chose the name, you can even buy Ada Lovelace stocks and the like. Anyway, we called it the Ada Lovelace Centre, and the idea is a centre for research data management and research software engineering, deliberately very generic, for the benefit of our large facilities. It's very clearly about not just developing things but developing and maintaining them, sustainably, for the facilities we have: the Diamond Light Source, the ISIS neutron source, our Central Laser Facility, and I include computing in that, because they're the people doing the simulations, and simulations are themselves data producers; they may or may not be connected to the experiments, but that's a big area of data production as well. So these are all data producers. We started the idea in 2016 and got a very little bit of funding, and we've had a project portfolio of about 20 FTE and about 30 projects, but short-term projects, with short-term funding; not really what we wanted, which was something long-term and sustainable out of
that. But now we've got funding; it's taken five or six years to get the idea through. Everybody understands there's a problem; nobody quite understands how to solve the problem, and the answer is essentially that you've got to put funding into it. So we now have funding to significantly ramp up that activity from those 20 FTE. We will not own or operate any of the hardware; somebody else does the hardware. We are about people: the people doing the software and the data management. We then have to work with the people providing the hardware, obviously, that still has to be done, but specifically our budget does not deal with any of that. And the bottom line is that at the moment we're small: we will operate as a large programme inside an existing department, and eventually you might expect this to spin out into its own proper entity.

Why do we need this? We've already been through the fact that there are large volumes of data, in many areas. At the moment within STFC we are also investing in developments of these facilities: we have the Diamond-II upgrade, taking it towards something like MAX IV; EPAC, the Extreme Photonics Applications Centre, a new laser facility; and two other projects besides, including Endeavour, the ISIS instrument upgrade programme. Together that's more than half a billion of investment in those upgrades. Those are not small, and they are going to produce more data, so somehow we ought to be doing something about the data; it really is not effective to generate all that data and do nothing with it. But those budgets, that half a billion, by and large do not include anything to deal with the data, and that's always been the case: we say let's build the hardware, let's fund the hardware, and we leave the data for later.

I've already talked about energy requirements: we cannot afford to be energy-inefficient any longer, so we really have to think this through. And I've talked about the downstream use of aggregated data, which is also extremely small. Just as a sideline: one of the major software projects we had at ISIS was a data analysis platform called Mantid, which has now been used in quite a lot of facilities around the world. The only reason we could start it was that we were building the ISIS second target station, and when I moved to ISIS in 2002 it was one of my requirements that we could use five percent of the cost of the instruments, just five percent of the cost of the instruments, not the whole facility, on software. That's actually where Mantid started from: a tiny little thing that grew into a much bigger thing later. But you have to build that in from the beginning.

This is not a short-term project. This problem is going to be with us forever. The Large Hadron Collider people recognise that, and the astronomers recognise that; they're in a very different position. We're still thinking, oh, I'll get a guy to write a program and then all my data will be sorted. This is a long-term problem, and therefore it requires scale, and it requires investment. It now requires a lot of different skills; it's not your one person developing software. You want people who can write special code to
use HPC very effectively, or nowadays the graphics cards, and mathematicians; a whole range of skills goes into this. And if you're going to get those skills, particularly working for a government organisation, with the problems with salaries which we all have, then you need something of scale. You need to be able to attract people; it needs to be a working environment where people can think, I can go there, I can make a career of it, because you can't do this when people only come to you for two years and then disappear again. And of course the facilities themselves will benefit from the synergies, because if each individual facility were to do this entirely separately, each would still need something of similar scale, and they can't afford that; putting it together gives you that opportunity.

Just thinking about the scale: the example I use is the European Bioinformatics Institute. That's been around for 30-odd years now, and it really is a world leader in data and software for biological applications. As far as I understand, at the moment it has about 500 staff who deal with the data and software side, and about 250 staff who do research on the data and software. That's the scale, and they are only dealing with two types of data: crystallography and genomics. They're starting to get into imaging, because of the electron microscopy, and they themselves say that imaging is a hell of a bigger challenge than the crystallography and the genomics, because these images are two-dimensional data and you've got millions of them. Our facilities have a much wider range of data for a start, including imaging, and mostly it's much less well structured than crystallography or genomics. So if 500 people are needed for this one bit of biology data, you can work out for yourselves the scale of the problem for our type of facilities.

As I said, recruitment and retention are really difficult, so scale, visibility and reputation matter: you have to build a reputation so that people want to come and work for you, and stay with you, and you can't do that in little pockets and corners all over the place. You're going to have to centralise it into something people can look at and see what the staff are doing. We hope to maintain a critical mass. Critical mass is really important, because people will come and people will go, and you don't want things to fall to bits because you've lost one particular person; you have to maintain critical mass so you can keep the expertise.

Mostly we will have a project portfolio of software development, but we want long-term programmes in particular areas. Not everything has to be a project used by all facilities, as people have thought in the past; there are some things for synchrotrons, some things for lasers, and lasers do some very different things. The second point there is slightly subtle, but it depends on how places work: most of us who work at facilities are used to the idea that facility scientists support the users, but also do their own thing for part of the time. There are lots of other types of departments and organisations where you basically just do whatever projects you have funding for; Paul will know this:
if you go and talk to a department like that and say, I'd like to have a meeting and talk about something, they'll say, what cost centre do I book this to? If you're going to attract people, and you can't pay them as much money as the market does, one of the ways you attract them is by allowing them to devote some of their time to their own things as part of what they do, so that's a very, very important part of our structure. We will also look for external grants, but the main thing is that this work is to benefit the facilities; that's got to be very, very clear. The bottom bits are about the attractiveness of the organisation to work in.

Our priority areas at the moment: imaging reconstruction, which is a big thing across all the facilities in different ways, including the one I can never say, ptychography, which is a big hot thing. Then simulation and modelling, because on our side of the computing department we have very big strengths in simulation of all sorts, a lot of it through these Collaborative Computational Projects that have been running for a long time. Then techniques and applications of machine learning and artificial intelligence, which go from running your facilities and instruments better through to the scientific applications. Then data and metadata management, where you also need professionalism, people who can help you understand how to define the metadata, how to structure it, ontologies and so on; that's a professional area of expertise, it's not just about people putting things into a data dustbin. And then services, because these things will have to be provided as services: workflows, whatever it is, that people can come along and use, what we call data acquisition and analysis as a service, something like that. Providing services needs a very different set of people again.

And what we're going to try to do, when somebody comes along and says, I'd like to develop a bit of software to do this particular task in my experiments, is to say: okay, that's good, but what we want to know is, where is the data coming from, what are you doing with the data long-term, downstream, where is that data going to go? We'll support you to do this, but we'll also support you to do that, which you didn't come and ask us to do, but which we need to do to get the bigger value out of it. That's a very different approach from the very simple, single-project, small-scale approach of just doing the things people ask you to do.

So: simulation and modelling, services, data and metadata. Where are we starting from? Well, nobody has solved this problem. I can't look around and say this country has got it sorted; we all have this problem, so there isn't necessarily a place to copy. I think the EBI is great, but they do a particular thing, and we'll be learning a lot from them. We're not actually very far behind. We have some very well developed, long-term things we've been doing in our organisation for quite a long time, both in the facilities and on the computing side, but there's a long way to go; there's a lot to do here. We are very well established in the computing department, but they have mostly been working on funded grants: they get grants, they do work, they get grants, they do work; they don't actually
do their own thing. We've got the track record I mentioned on data policies, for example, and the Mantid maintenance. We have quite a good machine learning and AI group now; we started that four or five years ago, when there were only individual people with bits of expertise. Again this came from the ISIS side: because this side of the computing department just works on grant-funded work, I said, we will pay you to recruit four people in machine learning, and then they got bits of money from elsewhere. I said, I don't care what they do; they can do ISIS work, they can do other work. What I want you to do is develop expertise in machine learning, so that later on we can come back and use it. Now there are 15 people in the group, supported from all sorts of areas, so that has actually worked really, really well. But somebody has to start and say, I'm going to put some money into this, and that's something the large facilities can do, because as a director, okay, you don't have much flexibility in how you use money, but you have some, and so you can take initiatives without having to ask somebody else for permission. So we started that, and we now have several years of existing projects which we're now going to round up into the centre.

There are lots of collaboration opportunities. I've already mentioned the DMSC down here, and LINXS is another collaboration opportunity. Collaboration will certainly be very important, because there are lots of other people doing this, so I don't have to go through it all in detail, apart from the side fact that we're no longer in the EU, so we're no longer part of EOSC, as we discussed earlier; we're a bit out of some of those collaboration opportunities, but we will get in there somehow. And things like LEAPS and LENS, and institutes like LINXS, are good, because then we don't have to go through the EU; we can still work with Europe, if not the EU.

So what are we doing next? We have funding, and from next year our funding starts to ramp up. Anybody who has run anything will tell you that not having money is a problem, but your problems really start when you do have money, because you have to spend it. So we're going to start recruiting people, quite an aggressive recruiting campaign, which in the current circumstances is really hard, but everything we want to do depends on having people. And the really important thing is that STFC, which is our funding organisation, has understood that this is not short-term funding. Strictly speaking we have formal funding for the next two years, but they have bought into this long-term. This is not a thing that's going to go away; it has to be established and has to run in just the same way that we run ISIS, and we run Diamond, and we take part in the ESS, and that really is a sea change in the thinking: this is not a short-term activity. So we have a governance structure, we have all those good things being put in place, and we are aiming at 100 FTE in five years. That's quite aggressive, 20 people a year, which is not so easy in current circumstances, and 300 FTE in 10 years. Okay, I'm optimistic, and I won't be responsible for it, because I'll be putting my feet up by then. But if we go
back to the EBI example, that's where you need to be if you're really going to use the data from these facilities, and that's not including the people in the facilities who are already working in this area; these are additional people. So this is really a step change in what we'll be able to do.

The strategic lessons I've learned from this: certainly, the scale of the problem is the same scale as a facility. It's hundreds of people. It's not a little problem, it's not a little bit on the side; that is the scale of the problem, and you can't do it within a single facility; it's too big. Nor can you take a facility's budget and incrementally increase it; nobody's going to give a facility the budget to do that. Say, I'm going to double the budget of ISIS? No, they won't do that. So you have to take it outside and make it a thing of its own. And again, it's not a short-term project; we've got to get over that idea. This is a long-term commitment.

On collaboration and sharing: really, really important, but you still need resources. Software is one thing that's very transferable between different organisations, but you still need people to actually implement it, to put it on some machine, to tell people how to use it, to make sure it carries on working. So even when the software, or the data, or whatever, already exists, you still need resources to implement it; sharing is not free, and it's usually underestimated how much resource it requires to take somebody else's systems, implement them in your place, and make them usable as a user service.

I've mentioned professional and sustainable: absolutely critical. You cannot afford any longer to be doing stuff that stops when the grant finishes and things disappear. And I've mentioned skills, and critical mass, and the career structure; you need to build all that up, and as I said, to do that we recognised that we had to take it outside of the facilities and make it its own entity. It's taken five or six years to make the argument, but now it's been bought into, and we have to make the difference. And I think, when I look at other facilities, the lesson is actually the same: you're going to have to do this in a different way than you did before. So it's taken a while; as I said at the start, I've been giving this talk for 20 years. There is still a long way to go, but at least as far as we're concerned we've now made the case. Now the only problem is that we have to do what we said we were going to do. Thank you.