Yeah, welcome everybody. Thanks for joining us. My name is Matt Fry from UKCEH, hosting this, which is the second webinar in our AI for environmental science series, supported by NERC and the Constructing a Digital Environment program. The program as a whole is aiming to develop a digitally enabled environment, benefiting researchers, policymakers, businesses, communities and individuals. This is the seventh in the series of webinars that the CDE program has been running, which you can see on the website. The videos are there for most of the previous webinars, so they're worth catching up with. In this one we'll be considering the roles and opportunities, as well as some of the pitfalls, in the use of AI in environmental science. The format of the webinars is to invite presentations from leading experts in the field, followed by a chance for a Q&A afterwards. So, advances in digital technology have led to a rapid increase in the volume of data being captured, curated and managed on a daily basis. Alongside this, new technologies have enabled a step change in capacity for integrated monitoring, analysis, modeling and visualization of the natural environment, at potentially transformative spatial and temporal scales. By harnessing these advances in technology and the UK's leading position in both environmental observation and computer and data sciences, there's an opportunity to create a digitally enabled environment. This is something that can be achieved through approaches such as integrated networks of in situ and remote sensors, together with methodologies and tools for assessing, analyzing, monitoring and forecasting the state of the natural environment at higher spatial resolutions and finer temporal scales than previously possible. So that captures the aim of the program. It'd be worth mentioning that I'd ask you to subscribe to the YouTube channel for the webinars if you haven't done so already; there's a link to the channel going in the chat. Thanks, Cameron. Today's presentation is a seminar from Sari Giering of the National Oceanography Centre and Rob Blackwell from the Centre for Environment, Fisheries and Aquaculture Science, and they'll be talking about "Are plankton nets a thing of the past? How can we use AI for real-time, high-resolution plankton monitoring?". Sari is a world-leading scientist in ocean carbon cycling, recognized for key discoveries critical to understanding how microscopic marine organisms influence the capacity of the ocean to store carbon. She leads a group of ocean biogeochemists and computer vision scientists who together aim to advance and apply imaging technologies for measuring ocean carbon storage. She's involved in several national and international programs for converting in situ particle and plankton measurements made by optical devices to global carbon flux estimates. Rob's an environmental data scientist at the Centre for Environment, Fisheries and Aquaculture Science and a visiting research fellow at the Alan Turing Institute. Rob helps marine scientists to implement data-intensive scientific computing solutions using bespoke software development, computer vision, machine learning and artificial intelligence, and cloud computing. And I'll hand over to Sari for the start of the talk. So welcome everyone, thank you so much for coming and finding the time this morning; Rob and I are very excited to talk to you about our project.
So, I'm Sari; you've just heard the introduction to me. The way we're going to do this talk is that I'm going to talk a little bit about the scientific background and some of the caveats we have when we want to bring new, exciting technologies to scientists, and then Rob will talk about the more AI-geeky bit. So it will hopefully be fun, with something for all of you. So, yeah, let's get started. Okay, so plankton: why are plankton important? We're going to start up here; in the top right corner is our phytoplankton, so these are our little floating plants in the ocean. They take up CO2 from the atmosphere, using sunlight to convert it into biomass. This process is really important in taking up carbon, converting CO2 into basically locked-away carbon, and they take up about 25% of CO2 emissions annually. They also produce, in this process, about 50% of the oxygen on the planet. So we have our phytoplankton, and we have the animals that eat the phytoplankton, the plants, just like in a terrestrial system: that is our zooplankton. Together these produce particles in the upper ocean, and some of them sink; it's a little bit like the compost in your garden. When they die, they sink down into the deep ocean and take all the carbon they've assimilated with them. It's taken into the deep sea, and that is actually a massive store of carbon. And the deeper these particles sink, the longer the carbon is stored away; you can roughly say that if some plankton detritus makes it to about 1,000 meters depth, it's going to be locked away for 1,000 years. So really, plankton and marine particles play a critical role in controlling atmospheric CO2 concentrations and hence our climate. It's not just that; that's my favorite topic, and we're going to talk about detritus, but obviously phytoplankton, and plankton generally, are also the base of the marine food web. This is an example where we can see the zooplankton is eaten by small fish, then by squid and large fish, and it goes all the way up to birds and finally humans. So really, they play a critical role in our food security and also in ecosystem health overall, and that has led to plankton being officially considered an Essential Ocean Variable. So measuring plankton, understanding what they're doing, where they're doing it and how it's changing, is really important. Traditionally, how do we sample plankton? Traditionally we use nets, and I just compare here my experience when I did my PhD, about 10 years ago, to what we did about 100 years ago. You can see on the left-hand side those are plankton nets: basically ginormous stockings that we drag through the water. That's how we collect our little critters. Technology has moved on very little in terms of how we deploy these nets; health and safety has, and that's a good thing. On the right-hand side, typically we look at these critters in the lab using microscopes. Again, the microscopes have moved on a little bit in technology, but it's effectively the same as what we used 100 years ago, which is on one hand a good thing, but on the other hand it also means that to deploy one of these nets we need a ship, and we need a person to do it. And it's a tiny sample: we get maybe a few cubic meters of water in one of those nets, and then we kind of extrapolate that to the entire ocean. It's a bit tricky.
I really need to mention here that there is an automated way of doing this, which has been going on since the 1950s: the Continuous Plankton Recorder, which tried to get around this problem by basically having a spool of plankton netting that's dragged behind a ship. So now we're taking the human somewhat out, at least at this stage: we can deploy it off a ship, it goes across the Atlantic Ocean collecting plankton, and then we bring the samples back. We have a map here of where we have these Continuous Plankton Recorder samples, and it looks quite good in the North Atlantic, but remember these are single lines. Again, it's actually quite a small volume, only over very distinct times, and there are really large parts of the ocean where we don't have any of this continuous data. So there is a massive knowledge gap about what our plankton is up to. In the last few decades, things have moved on rapidly, as we all know, with the massive advances in technology and especially camera systems. You've got camera systems on your phone now; some of us have three or four cameras on one phone. Amazing. So we can use these, and there are various imaging systems now on the market that target different size ranges of our plankton community. You can use them off ships or even autonomous platforms, and you can see a range of example photos here at the bottom; they're really quite fantastic in their quality. The hope here is that we can use such camera systems and integrate them onto autonomous platforms. For example, we can have them on our ships, but we can also tow them, we can use them for vertical profiles, we can put them onto gliders and floats, or we can put them on a mooring. We can even use some of this technology in our lab to help us get around the microscopy stage. Overall, this is really exciting. The problem is that, well, the data is awesome, but there is too much of it. We're now at a moment where we're actually getting too much data for us to handle. At the moment, the most prominent way of using all of these images that are coming back is still manual classification, and it still takes a lot of time. Basically, we image the plankton and then we use some sort of sorting platform; EcoTaxa, on the right-hand side, is an example where we can use AI to pre-sort them. So we have a pre-trained model that sorts them into rough categories, but then a human still has to effectively click on every single image and say "yes, I agree with you, computer" or "no, I don't agree with you, it is actually something else". So if you collect a million images and think "wow, I've got an amazing data set", you still have to click through a million images and say whether the classification is correct or not. And it still needs a lot of taxonomic expertise. So the big question now is: can artificial intelligence help further than just pre-sorting? Well, whenever we have a new technology, we need people to trust it, so we were interested in what the current state of trust is among plankton researchers. We carried out a survey at the end of 2021 and beginning of 2022, where we asked the plankton community about their experience of and trust in two aspects of this technology: one is the imaging and the other is the AI. We got 179 complete responses, so a very good turnout, and you can see where people work.
So there's a very heavy distribution of participation from Europe, the US and Canada, though we tried to really get good coverage globally. And I want to talk you through some of the results, because I think they're really important here. We basically asked six questions on each of the two topics, the imaging and then the AI, and we asked respondents to answer whether they agree or strongly agree (those are the green ones), neither agree nor disagree (that's gray), or, on the left-hand side, disagree or strongly disagree. So, just as a quick overview: if it's more green, then they generally agree with the statement. We asked four positive questions about the imaging technology and two negative questions; those are the A and B ones. So, in terms of imaging plankton: there was definitely strong agreement that images can provide meaningful information, that such imaging has clear advantages, and even that time series can be continued with images. Interestingly, when we then asked whether images give the same information as physical samples, there was strong disagreement, and that makes sense, because something like genetic information we cannot get from an image. There was also clear consensus that we will always require physical samples, and, interestingly, that physical samples are preferable to images. So overall, imaging seems to be a trusted tool. It cannot replace physical samples, but it can definitely help our science; the community is very happy about using all of these images for their science. Now, what about using artificial intelligence for plankton identification? There was a very, very strong consensus that AI and machine learning can help to analyze images faster, a little bit as I just explained before. But when we asked whether AI and machine learning will be as good as humans, there was no real consensus, and, maybe quite surprisingly, there was actually strong disagreement with the statement that AI and machine learning are unbiased and more reliable. So the reaction was, "oh no, we don't really like AI, we don't quite trust it yet". And there was a very strong consensus that taxonomists will be required in the future. Looking at the negative questions again just reiterates the same thing: AI is perceived to be limited in its abilities, and it will never be as accurate as humans. So, to summarize, AI can help to analyze data, but there's little trust in its current outputs, and there's a very strong consensus that we will always need human taxonomists. Now I want to have a quick dive into these two statements: "AI/ML will be as good as humans", where people were not quite sure, and "AI will never be as accurate as humans", where there was agreement. We need to dive a little bit into human psychology here, because we also asked our survey respondents what they thought the perceived human identification accuracy is. Here on the left-hand side we asked: how good do you think humans are, less than 50% accurate, 50 to 70%, 70 to 80%, 80 to 90%, or 90 to 100% accurate, if I show you an image, or show you a plankton specimen, and you tell me what species or what type it is? The majority thought that humans are over 80% accurate in their identification. However, Phil Culverhouse has done some repeat studies where he asked the same analyst to look at the same sample.
Now, he looked at two things. First, counting: from one day to the next, looking at the same sample, the same person showed an 8% difference in counts. That is quite surprising; we're not as good at counting as we think we are. And then, in terms of classification: the same person looking at the same sample gave, on average, only about 75% classification agreement. So even the same person only classifies things the same way in about 75% of cases. We humans really aren't as good as we think. Now, if we ask the same people, and this is the plot here on the left, how good do you think a human is versus how good an AI is? The predominant color here is green, which means they believe the human is better than the AI; yet, as I just showed you, we're really not as good as we think we are. There are only a few optimistic researchers who think that AI might actually be better than humans. So the summary is that even though AI is a really promising tool, at the moment we don't have community trust in using it. So there is something we really need to do. And one of the reasons why the community might not trust it is garbage in, garbage out. If we give an algorithm a data set which is maybe only 70-75% accurate, because that's what the human labels suggest, how well can we expect a computer to classify the same images? Well, it's probably going to come back with about 75% accuracy, and then we'll be really disappointed. Well, that's no surprise, so we really need to think about how good our training sets are, and about inter-peer consistency: do we agree, within the community, about what the computer is actually looking at? And we need to do quality assurance, so explainable AI would be one of the options; getting people involved in understanding how the algorithm works, that's the transparency; and really having better dialogue and interdisciplinary expertise. When we asked our zooplankton community, about 60% said they're novices in AI. So again, to build trust, we need to also build understanding. So the project that we're going to talk about in a moment really started in December 2021 with an Alan Turing Institute Data Study Group, where Rob and I first met. It was very exciting; here we tried to build a classifier, and this is an output. It was only a one-week event, I think, so very short, and we were very pleased with what we managed to put together. So we tested a classifier, and we looked at different groupings; the one I'm pulling out here has only three main groups. This is detritus, that's the type of particle I'm interested in, and then two types of zooplankton: the copepods and the non-copepod zooplankton. We have here on the left the true label, and the predicted label here at the bottom. You can see that for the true label of detritus, our classifier actually identified it correctly 94% of the time. The copepods also did really well, at 80%. So again, that is pretty much similar to a human, as we just discussed, and for the non-copepod zooplankton it was a bit more varied, which is not that surprising: there's a lot more shape variation, and there were only 660 samples. Generally, I mean, that was just a one-week study, but it was very promising. So that's where we started. And we then thought, okay, let's take this further.
Let's see whether we can use the algorithm and demonstrate, for the first time, a complete end-to-end workflow for real-time zooplankton monitoring: can we put our cool, quite promising algorithm onto a ship and get the data straight back? And Rob, that's where you take over. So thanks for that, Sari. Really interesting to see what you had to say about Sir Alister Hardy, because he actually worked at the Lowestoft lab where I work, and he joined the Lowestoft lab, I think, almost 100 years to the day before I joined, so there was a bit of fun there. Anyway, I'd like to talk about this data-intensive plankton research platform that we're putting together. As always, I'm just one person in a wider team, so thanks particularly to Sophie and James at Cefas, Sari and colleagues at NOC, and also Plankton Analytics, Phil and Julian, who are supporting us in this work. So, Cefas is the Centre for Environment, Fisheries and Aquaculture Science. It's a government organization, and we undertake a broad range of science, with projects covering everything from environmental change, biodiversity loss, food security, etc. So everything from testing that the oysters and mussels you eat are safe, to fisheries stock assessment and marine litter monitoring. I've been here for about 18 months and I'm having a great time; every day's a school day, really, and there's really interesting stuff going on. We've got a new chief scientist who's just taken over and an executive who's now really pushing this whole data and innovation agenda, so we're hoping to hire some more data scientists soon; just a plug here, if you're interested in that, please ping me. So, the Plankton Imager. We're really proud to have the first one of these devices on our research vessel, the Endeavour. As you can see from here, it's installed on the ship and we pump water in; you can see this tube on the right here coming into the instrument, and then there's a high-speed camera system that photographs the particles as they go by. So I guess, from a plankton's point of view, I imagine it to be like one of those water rides in a Florida theme park: you get sucked in, have a pretty bumpy ride and then get spat out of the other side. You can get remarkable image resolution out of this; these are photographs from the prototype. The resolution is actually getting better, and the color definition is getting better, as the camera systems have improved. But you can see this broad range of animal shapes and sizes, and we're really hoping that this instrument might be the key to unlocking the mysteries of plankton patchiness and biodiversity and some of the things that Sari talked about. This instrument really brings with it a whole load of new challenges. It does something like 100 images a second, and that results in over two terabytes of data a day. We'd really like to be able to process those data in real time, to know something about the distribution, abundance and behavior of these plankton. I appreciate that two terabytes a day is not in the same league as the Large Hadron Collider, but when we're bobbing around on a ship in the North Sea, potentially in a gale, that brings its own challenges. It's too many images, obviously, to observe, classify and count by hand, and it's too much for a scientist to process on a regular laptop on board the ship. We have a lot of problems with the performance of USB-connected portable hard drives, and it becomes a real pain having to keep swapping hard drives about.
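For a rough sense of scale, here is a quick back-of-envelope sketch using the figures quoted above (roughly 100 images a second and about two terabytes a day); the numbers are rounded and the real rate varies with how many particles are in the water:

```python
# Back-of-envelope check of the data rates quoted in the talk (rounded figures;
# the real instrument rate varies with particle density in the water).
IMAGES_PER_SECOND = 100
BYTES_PER_DAY = 2e12          # ~2 TB/day quoted in the talk

SECONDS_PER_DAY = 24 * 60 * 60
images_per_day = IMAGES_PER_SECOND * SECONDS_PER_DAY
mean_image_size_kb = BYTES_PER_DAY / images_per_day / 1024

print(f"{images_per_day:,.0f} images per day")   # ~8,640,000 images/day
print(f"~{mean_image_size_kb:.0f} kB per image")  # ~226 kB average per image
```

So at those rates you are looking at several million images and a couple of terabytes every single day at sea, which is what drives everything that follows.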
And it's obviously too much data to send back via the ship's internet. Ship's internet is usually satellite communications, which means it's very expensive, so you can't really send the results back, and you certainly can't send all these images back. And even if you can capture all these images, let's say you're out on a five-day cruise: 10 terabytes is actually quite a lot just to upload into cloud storage. So these are all kind of fun problems for data scientists like me to work on. And this is the architecture of the data-intensive plankton research platform that we're putting together; it's really a test bed for ideas for solving these kinds of problems. On the left here we've got the Plankton Imager instrument. The top half of this diagram is all about trying to classify the particles and count the copepod abundance, and the bottom part is all about addressing the storage problems. I want to dive into this architecture in a bit more detail and show you some of the pieces. This is very much a work in progress, so if anybody's got any comments in the discussion at the end, I'd be very happy to discuss those. But essentially what we're doing is capturing the images; at the top here, we're putting them through an Nvidia Jetson box, and we'll come on and talk about that. That's actually running the machine learning algorithm and doing the counting. Then we summarize those data and send them back via a satellite link, so we're using that bandwidth carefully and within budget. That gives scientists an opportunity to see the results on a digital dashboard, potentially allowing a kind of adaptive sampling where we might change the algorithm as we go along, or indeed retarget the mission or the ship. And at the bottom here, we're capturing data onto large disk storage boxes and then pushing them up into cloud computing, where we can process these things. So we'll dive into the pieces in a bit more detail now. The first part of this is: how do you actually get the images from the instrument onto the various compute pieces? To do that, we're using a local area network with an Ethernet backbone, and we're running something called the User Datagram Protocol (UDP) over it. We're all probably quite familiar with TCP/IP; we use it every day for the worldwide web, but I think UDP/IP is probably a lesser-known protocol. Actually, we ought to know about it, because we're probably using it now: these pictures that you're seeing on your screen are probably tunneled over UDP. The reason we use UDP for video is that we want the video to update very quickly; we don't want there to be a lag between the recording and it coming through onto your screen. It's also used for things like Domain Name Service lookups: when you type a web address into your web browser, that's UDP as well. But we're using it here because it's really fast, it's very low latency, and it allows multicast; I'll talk about that in a moment. The downside is really that it's an unreliable protocol. If you send a packet onto a UDP-based network, you've got no guarantee that the packet is actually going to get through, so packets can get dropped and packets can arrive out of order. It's entirely possible to receive the back half of an image, or the bottom of an image, before you receive the top half of the image.
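As an aside, here is a minimal sketch of what a multicast UDP listener on this kind of network might look like. The group address and port are made up for illustration, and this is not the actual Plankton Imager protocol, just the standard Python socket recipe:

```python
import socket
import struct

GROUP = "239.0.0.1"   # example multicast group address (hypothetical)
PORT = 5004           # example port (hypothetical)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Ask the kernel to join the multicast group so we receive the same datagrams
# as every other listener on the network (edge AI box, storage box, ...).
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    # Datagrams may be lost or arrive out of order, so a real receiver has to
    # reassemble image fragments (for example into a ring buffer) and be
    # prepared to drop incomplete images rather than stall the stream.
    data, addr = sock.recvfrom(65535)
    print(f"received {len(data)} bytes from {addr}")
```

The point is that each listener simply joins the group and keeps up with whatever arrives; the sender never waits for anybody.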
There's also no congestion control, so as a listener on this network you can't say "slow down, I can't process the data fast enough"; you've literally got to keep up with the data flow. As I said, we use it because it's really fast: the Plankton Imager can just throw the images out onto the network. And it can do that in a multicast way, where we can have a number of listeners; we don't have to have separate communications between the Plankton Imager and the edge AI system, and then the Plankton Imager and the storage system. The other two just listen, pull in those data and start using them. There have been some interesting challenges in building that: we have to use things like ring buffers to store the images, we have to be prepared to throw images away if not all of the data comes through, and so on. You can't guarantee that an image is going to come through at all. But this is the only way you can really cope with the volume. Moving on and thinking about plankton classification, and how we're going to run it, I wanted to just stand back a bit and think about machine learning, and what the difference really is between machine learning and conventional programming. Well, if we were going to write a conventional program to classify an image, we'd have to start thinking about codifying what it is to be a copepod: is it about the length of the legs, is it about the appendages, is it about the general size of the organism and the color, and so on? The truth is it's lots of those things, and the program would be really complicated. So when we think about conventional programs, we think about writing functions or procedures, maybe in R or Python, or even Excel. In a conventional programming system you build a function, let's call it F. It takes some input data, let's call that X, and it gives you Y, which is the output; any kind of program we normally think about is like that. With machine learning, we don't do that at all. What we have is a machine learning algorithm, so maybe that's a deep learning system, or one of various regression learning systems. We call that G, we give G some input data, that's X, and the output data Y, and we expect it to come up with the algorithm. So the output of a machine learning algorithm is itself another algorithm, which we call f-hat here, the estimator. We can then take f-hat and run it on unseen samples, and it gives us those labels back (there's a short illustrative sketch of this below). And as Sari said, the Turing Data Study Group came up with an algorithm that was able to differentiate between copepod, non-copepod and detritus with something like 95% accuracy overall, and we've managed to get further improvement since the Data Study Group as well. But one of the things we would like it to do is full taxonomic resolution if we could: if we determine that it's a copepod, we'd like to know what kind of copepod, what the taxonomic classification for that animal is. And at the moment we're not very good at that; we're only in the sort of 70% region.
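To make the F / G / f-hat distinction above concrete, here is a toy sketch in Python. It assumes scikit-learn and entirely made-up feature vectors; it is not the actual plankton classifier, which is a deep network trained on the images themselves:

```python
from sklearn.ensemble import RandomForestClassifier

# X are hand-made "features" (say, length and width) and Y are the labels a
# taxonomist assigned. All values here are invented for illustration.
X = [[2.1, 0.5], [1.9, 0.4], [0.3, 0.3], [0.2, 0.2]]
Y = ["copepod", "copepod", "detritus", "detritus"]

# Conventional programming: we write the rule F ourselves.
def f(x):
    return "copepod" if x[0] > 1.0 else "detritus"

# Machine learning: the algorithm G looks at (X, Y) and returns an estimator f_hat.
g = RandomForestClassifier(random_state=0)
f_hat = g.fit(X, Y)

# Both can now be applied to an unseen sample.
x_new = [1.8, 0.45]
print(f(x_new))                    # output of the hand-written rule
print(f_hat.predict([x_new])[0])   # output of the learned rule
```

The deep learning case is the same idea, except that X is the raw pixels and f-hat is a convolutional network rather than a hand-specified set of features and rules.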
So we want to do some more work on that taxonomic resolution, and going forward there are two things we're really keen on doing. One is rolling out the existing algorithm that we have, because it's good enough to go and count copepods and give us estimates of density and abundance in real time on the ship; hence the architecture that we have. But we're also working with taxonomists to develop a much larger, more curated, hopefully higher-accuracy data set, to really start to improve that taxonomic resolution. And because we want to do real-time processing, we need a fast processor to run it on. We talk about edge AI, by which we mean that we're running the AI in situ on the ship. A lot of AI is actually cloud-based, where you bring the data, put it in the cloud and run it there, but because we want more instantaneous results, we want to use edge AI. We're using these Nvidia Jetson boxes. These are effectively a graphics card, if you will, in a box. They're made to go really fast; they claim to do something like 200-plus trillion operations per second. It's a graphics card not because we're doing graphics, although arguably we are, but mostly because what we're using the graphics card for is fast linear algebra: a lot of deep learning is manipulating matrices and doing large matrix multiplications, and this is an architecture ideally designed for that. We're at the early stages of testing; those boxes arrived recently. They are really fast, really cool bits of kit, but I don't actually have the numbers yet on what the throughput is. That's all coming down to Lowestoft in a couple of weeks' time, and we'll be doing some of that integration testing and actually putting the end-to-end system together, so I think that's really exciting for us. The other thing we talked about was the storage problem. If you're on a ship in rough seas, do you really want a spinning disk? Is that a good idea? What do you do about disk storage? Do you have to buy large disks? Some of the larger SATA disks now are 22 terabytes, so that's okay, we could fit a survey on there. But is it better to have a large number of small disks or a small number of large disks? If you have a large number of small disks and something goes wrong with one, okay, you've only lost part of the data set. But if you get a mechanical fault on a 22-terabyte disk, you can actually lose a whole survey's worth. Should we be using RAID arrays? Should we be introducing redundancy here, having multiple disks, and how do you manage that, etc.? We've come across a lot of problems with using Windows and NTFS file systems; a lot of science is still running on Windows, unfortunately, and NTFS is not really designed for large numbers of small files, so you get real bottlenecks while the system is trying to update, essentially, the file allocation table or file catalog on the disk. That can be a real problem. And ultimately we've got this problem that if we capture large amounts of data, we need to get those data up into the cloud, which is really where we do most of the training for machine learning and the heavy compute stuff. So how are we actually going to transfer these large amounts of data up into the cloud?
So we don't have all the answers on this yet. Our initial solution, for the cruises that we're running in the summer, is that we've implemented a custom file streaming system on Linux. We're going to use regular disks, but instead of storing all the images as individual files, we're streaming them into larger containers on the disk, in an archive format, so that we end up with a smaller number of large files; we're seeing better write performance by doing that. The other thing we're doing is making use of these things called Azure Data Box Disks, which I'll talk about in a moment, to actually get the data up into the cloud. I wanted to just talk a bit about big data and the difference between bandwidth and latency. Suppose I want to move 100 terabytes from my home office to our London data center. My broadband connection is about 100 megabits per second. Most broadband connections are asymmetric: they usually have more download speed than upload speed, so upload can be a real pain. But let's say 100 megabits a second; I reckon that would take on the order of 90 days to do the upload, if my calculations are correct. However, if I were to throw those disks in the back of the car and drive down to the data center, that would take me about 90 minutes, so that's more like 150,000 megabits per second. This observation was made famously back in 1981 by a guy called Andrew Tanenbaum, who said "never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway". Well, that's kind of American language, "station wagon", and of course we don't particularly use tapes these days, we tend to use disks, but the sentiment I think is the same. And just as an interesting point of history, Andrew Tanenbaum is the guy who created Minix, the operating system, and Minix was the thing that inspired Linus Torvalds to go off and build Linux, so we're indebted to him. So we talked about cloud disk shipping: taking the data box that's on the ship and, when we're finished with it, actually getting it up into the cloud, couriering it, driving it literally to the data center. Well, there are services around for doing this, and I like this particular diagram; we're actually not using Amazon Web Services Snowcone, but they have better diagrams than our cloud vendor, so I'm using it here. Essentially you go into the cloud portal and order one of these data boxes, which gets prepared for you with a whole load of security around it. When the device arrives you unlock it, you put the data on it, and when you're done you relock it and ship it up to the cloud, where the cloud vendor then makes it available in your storage system. That's the plan. We're actually using these Azure Data Box Disks. We're probably going to be using the low-end option here for our initial tests, but I think we have aspirations to start to use the Data Box itself, which is more like 100 terabytes, in due course, because we're actually thinking about consolidating all of the data from all of the experiments on the ship and getting that shipped up. But if you do have large amounts of offline data, the sky's the limit: you can buy a petabyte uploader here if you want. So that's quite cool stuff.
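To illustrate the "fewer, larger files" idea mentioned above, here is a minimal sketch using Python's standard tarfile module. It is not the actual Cefas streaming code, just the shape of the approach: append each captured frame into one large archive container instead of creating millions of tiny files that the filesystem has to track individually.

```python
import io
import tarfile
import time

def append_image(archive: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Append one image payload to an open archive container."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    info.mtime = int(time.time())
    archive.addfile(info, io.BytesIO(payload))

# One large container file instead of 100 small ones (names are illustrative).
with tarfile.open("survey_block_0001.tar", "w") as archive:
    for i in range(100):                 # pretend these are captured frames
        fake_image = bytes(200 * 1024)   # ~200 kB placeholder payload
        append_image(archive, f"frame_{i:08d}.tif", fake_image)
```

The write pattern then becomes a small number of large, mostly sequential writes, which is much kinder to both the disk and the file catalog than millions of small-file creations.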
So I wanted to talk a bit about how we're training these deep learning algorithms. These learning algorithms are very asymmetric, in the sense that they take a long time to train, but the trained model can then be run quickly on lower-end hardware. In the past, what we've mostly been using is desktop GPUs, things that are really designed for gaming PCs, things that sit on our desks, to do some of the development work. But we've been using Nvidia Tesla GPUs on Azure for training the algorithm. These are infrastructure-as-a-service GPUs, which means we have quite a lot to do to look after them in terms of managing the compute environment: every time we get a kernel update, we have to make sure we reinstall the right Nvidia bits and pieces and that the whole thing still works. So what we're trying to do is move towards using Azure Machine Learning. We have an experimental cluster set up that we've started to use, and that gives us access to a much wider range of more cost-effective GPUs. It potentially allows us to schedule the training, so we can say "actually, we don't need this result immediately, we're prepared to put this in a queue and have it run when there's more availability of GPUs". That gives us GPUs at a lower price point and more effective utilization of those GPUs. We're hoping that's going to give us faster iteration and allow us to try more algorithms more quickly. And we're gearing up and investing in this because we're fully expecting to be dealing with much larger training data sets as we start to capture more data and as we have more taxonomists involved; the size of the problem is growing all the time. So the other part of the jigsaw here is the real-time results. A lot of people are talking about digital twins these days; I think, according to John Siddorn's taxonomy of digital twins, ours is more of a digital shadow. But we really do want a terrestrial digital dashboard, because not all scientists are on the vessel; those back at base want to see what's going on. And they can be a useful part of the feedback loop: if you're seeing what the ship is seeing in close to real time, then you're able to call people on the ship and potentially get things changed, say you've seen some really interesting stuff, maybe change the model, etc. In order to get these data back, we've got to be really careful: we've got a very low bandwidth budget, because that bandwidth is expensive and it's used by lots of other people. So presently we're essentially doing summarization of the data. We've got a custom protocol based on something called MessagePack, which is really trying to keep the signaling overhead down to a minimum. In terms of summarizing data, I thought it was worth just mentioning really quickly something about lossless compression, because I think this is a really interesting subject. Why is it that some files compress better than others? If you take something like the King James version of the Bible and you compress it, you can get it down to something like 32% of its original size. But if you take an image of a copepod, you can only zip it down to something like 61% of its original size. So that's curious, I think, isn't it? And this is the subject of information theory, which is a really deep and interesting space.
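You can reproduce the effect on synthetic data with a few lines of Python; this deliberately exaggerates it by comparing very repetitive text with purely random bytes (a stand-in for hard-to-predict pixel data), whereas the 32% and 61% figures above came from real files:

```python
import os
import zlib

# Highly redundant text compresses a lot; noise-like data barely compresses.
text = ("In the beginning God created the heaven and the earth. " * 200).encode()
noise = os.urandom(len(text))   # stand-in for unpredictable image pixels

for label, data in [("repetitive text", text), ("random bytes", noise)]:
    compressed = zlib.compress(data, level=9)
    print(f"{label}: compressed to {len(compressed) / len(data):.0%} of original size")
```

The repetitive text shrinks to a tiny fraction of its original size, while the random bytes stay essentially the same size; real English prose and real plankton images sit somewhere in between, which is what the Bible and copepod numbers show.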
But I think you can get some insight into this when you think about predictive text systems on phones. When you're typing a message with predictive text, often it knows the next word that you want, or it gives you a choice of two or three words. Often the text that we send is quite guessable, and that suggests there's quite a lot of redundancy in text, which is why I think text compresses really well. But with copepod images, if you make a very tiny change to a few appendages on a copepod, change a few pixels, it can change the whole nature of the image. There's some really interesting work done years ago by a guy called Kolmogorov to really look at this, and he came up with the idea that the information content of a string can be modeled like this: the Kolmogorov complexity of a file is the length of the shortest program that can generate that file. So let's think about this. Think about a string containing 10 A's and 20 B's. We could write a program to print that out literally, that's the first line there as a Python program, but that's not much good, because it's actually longer than the original sequence. But if we were to codify it in Python as print("A" * 10 + "B" * 20), you get the same string delivered by a computer program that's actually much shorter. So you can start to think about some really interesting computer programs and algorithmic ways of generating things, and this is really the essence of modern compression. This is all really exciting stuff, and you think, well, that's a really key idea: if we could write programs to generate data sets, that would be really cool, wouldn't it? The interesting part of all this is that Kolmogorov complexity turns out to be an undecidable problem, and therefore it can't be computed. So, to cut a long story short, we're back to the original problem. We need lossy compression; we need to be prepared to actually throw some of the data away. And of course lossy compression is the same thing as summarization. So let's wrap up now and think about some conclusions. Why is it that Cefas is interested in this work? Well, a lot of environmental science now is about sorting stuff, sizing stuff and sexing stuff; we've got lots of people looking at things, whether it's plankton, fish, litter or bacteria. These are really good opportunities for computer vision to provide systems that can be complementary to human expertise. So we want to provide more computer vision systems for decision support. We've got some really good taxonomists, but their time is valuable; we'd like them to be working on the hard problems, the interesting problems, and a lot of the run-of-the-mill stuff we hope we could deal with with the computer, perhaps having the computer do the routine work and them just spot-checking some of it. And there's the observation as well that marine science generally is increasingly data-intensive: we need to find better ways of collecting and managing data, storing and summarizing this big data. Most of the instruments coming through now, and we just bought a new flow cytometer, generate huge amounts of data. So we've got to get to grips with the big data problem.
And we've got to do that in these sort of unique, hostile environments that we have, which is either on a ship, on an autonomous vehicle or sometimes even on the seafloor. How are we going to build the systems to deal with those kinds of problems? Of course, edge AI and summarization are key to that, particularly when you start to think about large autonomous sensor networks, which is very much where our work is going. Another underlying theme here is the increasing need for real-time and right-time processing: real-time is, you know, right now, and right-time is getting data processed so that it's ready for decision-makers in a timely manner. And even to think about that: a lot of the work that we're doing is essentially documenting environmental change; we're writing papers that say, you know, three years ago we observed your ecosystem collapsing. If we're going to see large changes in ecosystems in the coming years, we need to be reporting on them much more quickly, much more rapidly, so that we actually give people the opportunity to make real changes. So I hope that's useful. With that, I'll turn it over to questions, if that's okay, please. Thank you very much, both of you, that was really interesting; great to see a mixture of the science and the dealing with the practical problems, and great that you've got that Turing biodiversity project testing these things in a real-world environment now, to deal with the tricky issues of computing and bandwidth and stuff like that. There are lots of questions coming in on the Q&A, probably more than we can go through in the time, but fortunately, given the interactive nature of today, Sari's done a great job of answering lots of those already, so I'd ask people to look through those and find the questions of interest, but we can run through the things that haven't been answered. Some of these get straight into the detail. The first one, from Simon, relates to Sari's presentation and the practicality of using these in the field: how do you use information about what species have been detected along a linear transect from ship sampling to understand the kind of 2D surface of plankton across a whole area? Well, that's a typical thing, isn't it: we take a tiny water sample and extrapolate it to the entire ocean. We do that all the time with pretty much most of our measurements, so it's pretty much the same thing. I guess, if you think large, what you want to do is get enough data so that you can then link it back with other information you can get on a broader scale. You know, if we can say we get this sort of plankton community when we have this sort of spectral signature coming from the ocean that we can detect by satellite, combined with this temperature and salinity, then we can use satellite data on a larger scale to extrapolate it. There are people working on these sorts of approaches using machine learning as well, and I think that's probably the hope we're going towards; otherwise, at the moment, it's just, you know, plot a map and make the line bigger. I guess at least your images have got a spatial location, you know, every image has got a spatial location for itself, rather than, at the moment, your nets just being an aggregation of the whole transect, I guess.
Yes, we get some information; it's like if you have one of those spools, you can still chop it up into little sections and you know where your little section came from, so we have that, but it's the same challenge that we have with most of our point data. I suppose that might be where the autonomous vehicles come in, in that they can map an area rather than just do a transect. Here's not a question, but somebody said this is the best webinar they've seen in a long time, so that's nice. Alex Bush has asked: if human users are able to identify, or believe they're able to identify, plankton from images, and are therefore working with the same resolution, why do you think it is that more accurate and taxonomically resolved classifiers are difficult to train to a higher level, i.e. more than 90%? So this comes back to the garbage in, garbage out question. People don't like me saying this, but quite often what happens is you maybe have a PhD student who has a really nice data set, and they go in and look at the plankton images, and they quite often have, you know, limited skill in taxonomy; there are not that many really highly skilled taxonomists out there anymore, unfortunately. So that's often who looks at these images, and a lot of these images, well, we showed some example ones, but we normally pick out the ones that are really, really pretty. I could show you a whole range of images, and a lot of them are just sort of gray pixels, a little bit randomly distributed over the screen, and then you say, "well, I think this is probably a copepod". So I, as a taxonomist, actually have some uncertainty, but I don't report that back; I then give the classifier a label that is basically a zero or a one, you know, it's either this or it's not, even though for me it's "well, I'm 60% sure this is a copepod". So we don't have this level of uncertainty that we then feed into the training models. And, as I just said, a lot of these images have very high ambiguity, and we normally accept that as scientists, but then we feed it to the algorithm; we can't expect an algorithm to get 100% right what we only know with 70% certainty. Yeah, so Sari is completely right, there's no magic here, but one of the interesting things about deep learning algorithms is that, obviously, every training set that you provide has got some signal in it and some noise in it. But as the number of samples in the training set grows, as it starts to tend to infinity, the deep learning systems do have mechanisms for learning their way through this noise; they do start to find the most discriminating features, so in the limit you start getting better performance. The problem is you need very large data sets to do that, so we're working on two aspects really regarding the training data sets. One is trying to improve their quality, and that's done by checking, double-checking and consensus among taxonomists. And the other is just trying to get them bigger. But most of the work that we're doing is around data engineering; it's about making the training data better. There are diminishing returns on trying to make the machine learning algorithms better; it's better to concentrate on the training data.
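One way the "I'm only 60% sure this is a copepod" idea could, in principle, be carried into training is to give the loss soft targets (class probabilities) instead of a hard 0/1 label. This is purely an illustrative sketch, not the Data Study Group classifier; the class list and numbers are made up, and it assumes PyTorch 1.10 or later, where cross_entropy accepts probabilistic targets:

```python
import torch
import torch.nn.functional as F

classes = ["copepod", "non-copepod", "detritus"]

logits = torch.tensor([[1.2, 0.1, -0.3]])        # raw model outputs for one image
hard_target = torch.tensor([0])                   # conventional label: "copepod"
soft_target = torch.tensor([[0.6, 0.3, 0.1]])     # taxonomist's stated uncertainty

loss_hard = F.cross_entropy(logits, hard_target)
loss_soft = F.cross_entropy(logits, soft_target)  # soft targets supported in PyTorch >= 1.10

print("predicted:", classes[logits.argmax(dim=1).item()])
print(f"hard-label loss: {loss_hard.item():.3f}")
print(f"soft-label loss: {loss_soft.item():.3f}")
```

With soft targets, an ambiguous image pulls the model less hard towards a single class than a confidently labelled one does, which is one possible way of representing the annotator uncertainty discussed above.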
Yeah, and another thing that we're really interested in at the moment is working on sensor-agnostic training data sets; that's something we have a couple of other projects working on. At the moment we have almost a separate training data set for every single instrument, and sometimes, even where we have the same instruments, they've been deployed one in the Arctic Ocean and one in the Indian Ocean, and then we have different training data sets again for those, so our data is scattered all over, basically, the globe. Compare that to Google, who have an amazing data set on, you know, what are cats and dogs. We don't have that at the moment, and we really work on very small sets, so going forward we need to combine our information and make it sensor-agnostic. So I was going to ask a question about that. The first one was about the Turing DSG: you said you've got training data of 60,000 training images, so how did you get that labeled? Is that kind of unprecedented in this world, is that what's helped you? So we were really fortunate to have a guy on the team, James Scott, I don't know if he's on the call, but he was a PhD student when a lot of this was kicking off, and he just spent hours and hours and hours classifying plankton, and he spent most of his spare time on cruises doing it; he was on quite a lot of science cruises testing various equipment, and yeah, we're really grateful to him for getting that amazing data set together. Is there maybe a lesson there for what's needed? I was thinking, how do we build these training data sets in the future? You talked about instrument-agnostic ones, but there are other aspects of how we can support data sets for research. Do we just need more labeled data sets? Interesting, so one of the options is also to image cultures: you know exactly what it is, so you can image that culture, and even if you have a fuzzy image of an individual, you know what you've imaged. It's different to the exploratory imaging that we normally do, where we image something and then we're guessing what it is, so it's the reverse. Another one which we're exploring, and I'm quite excited about it, is using synthetic models. So we have 3D models of plankton, and we can render them and produce as many training images as we want that way. So there are different ways we're trying to explore what we can do. And then, is that training data set available for others to use, and can it be added to or refined by the community? That's a good question. I don't think it's particularly out there in the public domain yet, but there's no reason for it not to be, so yeah, feel free to get in contact and we can talk. Quickly running through a few more then: is EcoTaxa the only system used for sorting images at the moment, or are there other promising alternatives? EcoTaxa is the one that's been specifically designed for AI-assisted annotation; it's run out of France. The guys have spent a lot of time really making it human-friendly, easy to use and quick. I do like it; I would recommend it. Promising alternatives, I'm not quite sure exactly. Some of the instruments come with their own sorting system; I believe FlowCam has one, for example, VisualSpreadsheet, which works very similarly. And the one that I'm quite keen on, though it's still sort of in development, is MorphoCluster.
So rather than having to look at every single individual image, it basically looks at the feature space, finds clusters and gives you representative images of those clusters. So basically, if it finds a lot of ginger cats, it's only going to show you one ginger cat and say "are you happy that this is a ginger cat?", and you say yes. Okay, so now it doesn't have to show you the other 100,000 images of exactly the same thing. That is probably what I would go for if I had a large data set; I'll type it in here, it's called MorphoCluster. Let's see, somebody else has mentioned another one in there as well. So here's a question that's partly from an attendee, and then I'll add my own question on to it. If ML algorithms can identify what they've seen before, how good are they at detecting new information, such as invasive species? And I was going to ask, in relation to the edge processing, are you considering a kind of mixed model, maybe in the future, where you do the classification on the edge and you throw away the images that are very certain, so you don't have to store those, and you just store the stuff that you're not so clear about, or which makes you want to make sure you save it for later analysis, I guess? So, starting simple: we're deploying two things really on the edge AI box. We're deploying the classifier that we have, and we're also doing some morphological analysis as well, and summarizing that. But absolutely, the goal here is for the edge AI system to say "here's a classification I'm not really sure of" and put those to one side so they can be looked at later. The thing we can do in the first case here is that we'll be getting the real-time results from the edge AI system during the cruise, but because we'll be keeping all of the data, and looking at it and hopefully using it to test future algorithms, we'll actually be able to say how well we did in terms of the edge AI in real time. So yeah, as I said, it's a kind of test bed; we want to experiment with lots of these things, and we would see that edge AI system becoming more sophisticated over time. There are lots of interesting questions; I don't know, Cameron, do we have to stop at exactly 11 or can we carry on with a few more questions? I'd say carry on for a few more questions. So there's a couple about what tools from outside of marine or plankton processing are useful, so transferable tools that are useful. Well, I think the general deep learning stuff that we're using, which is based on Python, PyTorch, torchvision and those kinds of libraries, is all completely reusable; there's nothing specific about a ResNet-50 algorithm for doing this, these are just generalized computer vision classification systems. And there's lots of material out there on doing classification of all kinds of other things; we don't claim, particularly, to be machine learning researchers coming up with new deep learning algorithms, we're much more about applying existing stuff. So a lot of what we're doing is reusable. This might be a tall order, but somebody is saying, basically, that they're a biologist who works with plankton images and they just don't understand how these tools work; do you have a 30-second summary of how these AI tools do what they do with an image that might help them?
Okay, so coming from a biology background myself: using AI is like using a tool, so I really, really recommend that people at least get a basic understanding of how it works and what it does, because we need to understand the tools and we need to understand the limitations of what we're using. So that's number one; I think the dialogue really needs to be there. And, you know, when we were thinking about trust, we were really thinking that in future, because this is now becoming such an important tool, any zooplankton biologist should have a basic understanding of it. It makes things more interesting as well; it's a very exciting field, and we're moving away from just sitting in a dusty lab somewhere in front of a microscope to really going into the 21st century with what we're doing. So just a basic understanding and the basic limitations is probably all you need, but you should really try and invest some time in that, and I guess what would be really great is if universities taught something like that: when you learn plankton biology, you learn a simple module on how machine learning works. Likewise, maybe some dedicated workshops would be really good too. I think the important thing for people coming to this fresh is to not be too overwhelmed by it. It's a very hot area, there's lots being talked about, and these deep neural networks particularly can seem very complicated; I go to conferences where people put up these hugely complicated diagrams. And yes, it's useful to understand what a convolutional neural network is, and you can find that information in a book fairly easily. It's also important to know that a lot of the time we're not into that level of detail; what we're taking is these off-the-shelf algorithms in Python, using GPUs and so on, and then running those algorithms. And actually, you know, that machine learning research is very hot and very interesting, but a lot of this is about applications; it's about biologists taking off-the-shelf software packages, things like Azure ML and so on, being able to write some code, put these things together and use them. And I think what we'll see over the next four or five years is that these things become much easier to use; that's already happened in the last few years, and it will continue apace as well. That is a very nice place to finish, I think, on a general point: encouraging everyone to get a bit of a background in AI, but don't be overwhelmed by the details. Sorry, people, if we didn't get onto your questions, but I'd encourage you to look at the Q&A.