So, good morning everybody, and welcome to our talk, "GRASS GIS in the sky". GRASS GIS is a high-performance remote sensing toolbox. I'm Markus Neteler, part of the GRASS development team, together with Moritz Lennert and Markus Metz. We are presenting the latest and greatest here, with a focus on remote sensing.

GRASS is an OSGeo project, and by the way, exactly today is the 12th birthday of OSGeo, in case you didn't know. Twelve years ago we founded OSGeo in Chicago over a long weekend, and it all evolved into what you know today, with more than 30,000 unique subscribers on the various mailing lists and conferences every month somewhere in the world.

Here we focus on GRASS GIS. In this very first slide I just want to show the last roughly 35 years in one slide. You can imagine that the history is tremendous and GRASS has come a long way: developed since 1984, available under the GNU GPL license since 1999, and we try to renew the project continuously. And of course it is an open development community. What you can see here are links to the various related software packages; I don't want to mention them all. Since GRASS 7.0, basically, there is time series processing, with raster, vector and volumetric data available, and you can do a lot of things with time series, even reconstruct them along the time dimension. This is the graphical user interface. You can process LiDAR data, you can process 3D water flows (we had something about this in the morning), and naturally the topological vector processing engine is there. We also provide a Dockerfile; we are using Docker in HPC, for example, and we will see more about that later on.

GRASS naturally comes with a Python interface, and this interface has been simplified recently with something called grass_session, which you can use to initiate your session. In a few lines, with a "with Session(...)" statement, you create the session and then just go ahead and do your analysis (a minimal sketch follows below). So nothing complicated, and as in other software projects, you can go through and load your dataset. Internally it uses the GRASS GIS database as before, but if you work like this you do not really have to think much about it. What you do have to know is, of course, what the projection is and similar details, but the session manager does the rest for you.

Recently, which means a few days ago, we published GRASS 7.4.0. It comes with a lot of features which I cannot all list here; the web page is up there, you can check it, and the presentation has been uploaded. Just to highlight a few things in case you are already using GRASS GIS: this is an example of processing Sentinel data for a wildfire, a wildfire somewhere in Australia. You see the RGB composite there, which is basically showing smoke, and you can see more or less where it is burning. This is a feature of Sentinel, of course, but you have all the possibilities to process multispectral data here. If you are new to GRASS, when you start it you can now download a demo dataset with one click, the one we have been using for a long time, and then go through all the examples in the manual. We have also been working on orthorectification. It used to be in GRASS 6, and it took us a few years to bring it to GRASS 7 because of different, updated concepts there. So that's back.
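To illustrate the simplified Python interface mentioned above, here is a minimal sketch along the lines of the grass-session package's documented usage; the database path, location name and EPSG code are placeholders you would adapt to your own setup.

```python
# Minimal sketch: start a GRASS GIS session from plain Python
# using the grass_session package, then call tools via grass.script.
from grass_session import Session
import grass.script as gscript

# gisdb and location are placeholders; create_opts creates a new
# location with the given projection if it does not exist yet.
with Session(gisdb="/data/grassdata", location="latlong",
             create_opts="EPSG:4326"):
    # Inside the 'with' block a full GRASS session is active:
    # print the current session settings as a quick sanity check.
    print(gscript.parse_command("g.gisenv", flags="s"))
```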
r.in.gdal and r.external are the ways to import, or even just register, data in the GRASS database. Register means you just tell GRASS where the original GeoTIFF or other file is, and then you can use it in the session itself, so there is no need to duplicate data (the sketch below shows this together with an index calculation). We fixed the handling of data which exceed the 90 and 180 degree bounds of the latitude-longitude definition, which happens with global data; that is quite essential if you work with such data. By the way, you can also read Cloud Optimized GeoTIFF, because the import is based on GDAL, so that's fairly easy. Export has been improved, clipping for vector data has been made easy, and then there are tons of other fixes.

Just to drive you now more towards the topic here, remote sensing: we have been adding more satellites to the atmospheric correction, which is based on 6S; 6S is also used in other packages. And we have added a tool to process quality information more easily: modern satellite products come with additional quality layers, so for each pixel you get information about the quality. That is extremely useful, but it can be a bit tricky to process these bit patterns, and this tool helps you with that. So I switch over to Moritz, who will now present remote sensing in GRASS GIS, a long history. Just go ahead, have a seat if you want to come in.

So yes, as Markus said, we are kind of zooming in now on the main topic, which is remote sensing, before Markus Metz then goes on to show you an example of how to do this in a high-performance computing environment. GRASS GIS as such has a very long history, and remote sensing in GRASS GIS has almost just as long a history. You already had the first sub-modules available in 1986 for GRASS 1.1, and then from GRASS 3.0 in 1988 you had a whole suite of modules integrated into core GRASS, the i.* modules, the whole imagery family. Ever since, there have been steady improvements and additions, really constant evolution, also following, obviously, the changes in the types of satellite data that we can use. And we have moved from what you can see at the top, a very simple text-based console in the 1980s, to a much more complex and modern GUI system to handle all that data.

What is quite important to understand, and somewhat a product of the long history of GRASS and of the fact that when GRASS was first written computers did not have the capacities they have now, is that a lot of the GRASS internal libraries and modules are very memory efficient. So you can work with a lot of very heavy data; we will come back to that in the high-performance computing part.

So what can you do in terms of remote sensing with GRASS? The motivation for this talk also came from a colleague who told me: "Oh, I've known about GRASS GIS for a long time, but I thought it was a GIS and didn't even know it can do any remote sensing." So here is an overview of a series of pixel-based techniques that we have, starting from pre-processing, where you can do atmospheric correction, pansharpening, a whole series of different techniques. You have transformations: different modules that allow you to do principal component analysis or wavelet transformations. You can calculate a whole series of vegetation indices. There are edge detection modules. So a whole series of transformation possibilities and indices you can create.
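As an illustration of the two points above, registering external data without duplication and computing a vegetation index, here is a minimal sketch; the file paths, map names and band layout are placeholder assumptions for the example.

```python
# Minimal sketch: register external GeoTIFF bands (no duplication)
# and compute NDVI with the raster map calculator.
import grass.script as gscript

# Register red and near-infrared bands in the current mapset;
# the paths are placeholders for your own data.
gscript.run_command("r.external", input="/data/scene_red.tif", output="red")
gscript.run_command("r.external", input="/data/scene_nir.tif", output="nir")

# Match the computational region to one of the inputs.
gscript.run_command("g.region", raster="red")

# NDVI = (NIR - red) / (NIR + red); float() forces floating-point math.
gscript.mapcalc("ndvi = float(nir - red) / float(nir + red)")
```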
You have a whole pixel-based classification suite where you can do maximum likelihood, but there is also the i.smap module, which is probably one of the first, let's say, hierarchical segmentation-based classification modules: even though the output is pixels, it is based on different levels of hierarchical segmentation. You have access to modern machine learning techniques, where we normally don't redevelop the wheel in GRASS but use outside tools, either SciPy-based Python tools or R tools that already exist. We also have a whole host of specialized modules which look at specific topics. You have a whole suite of modules for evapotranspiration. You have energy balance modules that use satellite data to calculate energy balance information, and biomass modules. We have a module that works on gravity measurements, even one which uses data from a moon mission, so even extraterrestrial planetary science can be done with GRASS. So a whole series of pixel-based modules is available. Maybe just as an addition: there is obviously also a series of generic raster tools which are very useful for remote sensing. We have a map calculator which allows a very wide range of operations, there is, as Markus said, time series management, and you can do a whole series of types of analysis on these time series. These are all available.

The second part, which is more recent, is the object-based image analysis techniques, also known as OBIA, that have been developed. We have developed a module which does segmentation, dividing your image into objects, and a whole tool chain has been built around it, going from automatic segmentation parameter optimization all the way to machine-learning-based classification of the resulting objects. So you now have the full pipeline available in GRASS, with, wherever possible, the idea to parallelize and make it possible to work on very, very large images in this pipeline (a minimal segmentation sketch follows below). Very recently a module was also created for superpixels, which have been kind of a fashion in the last year, and which allow you to reduce the complexity of an image by grouping pixels in a certain way.

And we are not stopping; many more elements are being constantly developed. There is a whole suite of LiDAR tools that allows you to import and treat LiDAR data. Markus already mentioned the suite for the creation of orthophotos, and there are current developments going on: we have several students working on neural networks and the integration of neural network techniques into GRASS. There is also work going on to create what we call semantic cut lines, which is very helpful when you do tiling in order to parallel-process your images. One of the problems when you use classical straight cut lines is that you get edge effects, so what we are trying to create are cut lines that go through the image and cut your tiles, irregular tiles, according to characteristics of the image, where you say: this is a good place to cut, along a road or things like that. There is a whole host of extensions: a few years ago, especially with the Python API evolving, more and more extensions started being created by researchers and contributors all over the world, and extensions in remote sensing as well are being created constantly. And there is permanent ongoing work on increasing performance.
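To make the OBIA pipeline described above concrete, here is a minimal segmentation sketch; the group name, band names and parameter values are placeholder assumptions, not settings from the talk.

```python
# Minimal sketch: group input bands and segment the image into objects.
import grass.script as gscript

# Put the input bands into an imagery group (names are placeholders).
gscript.run_command("i.group", group="scene", input="red,nir")

# Region-growing segmentation; threshold controls how similar pixels
# must be to merge, minsize suppresses tiny objects. The values here
# are examples only and would normally come out of the parameter
# optimization step mentioned above.
gscript.run_command("i.segment", group="scene", output="segments",
                    threshold=0.3, minsize=10, memory=2048)

# Turn the segments into vector objects, ready for feature extraction
# and machine-learning classification further down the pipeline.
gscript.run_command("r.to.vect", input="segments", output="objects",
                    type="area")
```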
We had a debate not too long ago because we ran into issues when we got above two billion objects being created in segmentation, so really, really large images; these are the issues we are faced with now and are trying to work on. So I will give the floor now to Markus Metz, who will give you an example of high-performance computing with GRASS.

Okay. So I will continue and provide an example of how to use GRASS with high-performance computing. First of all, I want to explain why GRASS is particularly suitable for HPC processing. All the GRASS modules and libraries have a very low memory footprint. They never load the whole dataset into memory; they only load the part that is actually being processed at the moment. So even if you work with a raster map that is 50 gigabytes or so in size, the actual memory footprint can be as low as one megabyte, for example. That makes it particularly suitable to parallelize heavily and run on a high-performance computing system. Second, GRASS is not a single application; you can rather regard it as a toolbox, and GRASS provides a few hundred tools that you can use independently of each other, or as a processing environment.

The example here is the reconstruction of NDVI values from MODIS. I did this globally; just to show an example, here is South America, the northern part, the Amazon rainforest. There are lots of gaps and I want to fill them. Here I use harmonic analysis of time series (HANTS), just as an example (a minimal invocation is sketched below). On the lower right is a short example of how this looks in time: you see the gaps, you see one big outlier, and that is the result I want to get to. So I want to reconstruct in space and in time. With this harmonic analysis of time series, the whole processing is actually done only in time, so I am not doing any spatial interpolation or anything like that.

My understanding of high-performance computing, as I have used it in this example, is: I have a master node, which can also be a compute node, and any number of compute nodes. Compute nodes are optional; there can be zero compute nodes or a thousand compute nodes, depending on what you have. The components, then, are this master with a job or queue manager. There are lots of different job and queue managers, job schedulers, out there, so I will not go into detail; a few common examples that you find on university clusters are Torque and Slurm. There are also talks in the HPC devroom session that go into more detail about this.

A consideration when you are running GRASS on a high-performance computing system is the actual hardware resources available on each compute node. You need this to tailor and design your workflow, to know how much load you can actually put on each compute node; otherwise you will get all sorts of problems, and I will come to these problems shortly. The general idea of parallelization here is that we run several GRASS commands or modules at the same time, on the same compute node or on different compute nodes. Before we actually start, we need to create chunks for parallel processing. One option is to create temporal chunks and the other option is to create spatial chunks; that depends a little bit on whether we are doing temporal or spatial processing.
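As a rough illustration of the gap-filling step described above, here is a sketch using the r.hants addon (harmonic analysis of time series); the map name pattern and the number of frequencies are placeholder assumptions, so consult the addon's manual before using it on real data.

```python
# Minimal sketch: gap-fill an NDVI time series with the r.hants addon
# (Harmonic ANalysis of Time Series); processing happens per pixel,
# purely along the time axis.
import grass.script as gscript

# Collect the NDVI time series maps (the name pattern is a placeholder).
maps = gscript.list_strings(type="raster", pattern="ndvi_2017*")

# nf = number of frequencies of the harmonic model; 5 is illustrative
# only and must be tuned to the data. Reconstructed maps are written
# alongside the inputs with a name suffix.
gscript.run_command("r.hants", input=maps, nf=5)
```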
In the example here, the solid arrows show the temporal slices that I want to process: I create two slices, one and two, and I add a little bit of overlap to avoid temporal discontinuities. The overlapping part of slice one I throw away, and the overlapping part of slice two I throw away (the dotted lines), and I only keep the solid lines, to get a seamless reconstructed time stack. You can also create spatial chunks; that was quickly mentioned before. With GDAL, for example, you can create virtual raster datasets that chop up a large raster into tiles. An important concept in GRASS here is the computational region, which is defined by the north, south, west and east extents and the number of rows and columns. Using these, you can chop up a raster into any number of desired tiles. You can also create predefined regions: instead of chopping up the raster physically, you simply pre-define these computational regions. For spatial processing, that is, when I want to do some spatial interpolation or kriging or something like this, I recommend using each time step as one chunk for the parallel processing, because spatial chunks might suffer from spatial discontinuities which are very difficult to fix later on.

Now a little bit about the inner workings of this HPC workflow. First, I create a script with the actual GRASS commands that do the actual processing within a given GRASS session. This I can run by hand on a laptop, just to figure out an optimized workflow. Then I create a second script that creates a unique GRASS session. Markus mentioned this before; it can also be done, for example, with the new Python interface grass_session. Here I prefer to do it more or less manually, because I am fine-tuning the settings a little bit for HPC processing (see the sketch below). This script number two then actually runs script number one, copies the results to a final destination, and cleans everything up. Script number three is highly scheduler-dependent or may not exist at all; it depends on the job scheduler you are using to manage the different jobs on your HPC system.

The GRASS session setup is divided into two components. First we have to set up the GRASS installation: some environment variables and also paths, like where the GRASS libraries and executables are, and so on. This is all done automatically on import. Then comes the actual session, which means we have to define where the spatial data are located. A little bit more on the inner workings, though I won't go into too much detail: there is a special file called gisrc for the settings of the GRASS session, and the most important point for HPC computing with GRASS is that we need temporary environments that sandbox the whole computation, so that the different compute nodes do not interfere with each other; in this case it is a temporary mapset that I am creating. I do all the processing there, and at the end I copy everything back. Cleaning up can be as simple as rm -rf on that folder: gone, and done. More information is of course on our Wiki page about how to set up a GRASS session more or less automatically, for batch processing; there are also a few more examples there.

The important part of the job manager we are using here is that we must have a queue. Something like this: I start eight jobs, four jobs are running, and for the next four jobs there are no hardware resources available, so they sit in the queue and start when the first four jobs are finished.
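Here is a minimal sketch of the "script number two" idea described above: create a throwaway mapset, run the processing inside it, copy the results out and clean up. The paths, names and the processing call are placeholder assumptions, and the manual WIND-file approach follows the pattern documented on the GRASS wiki for batch jobs.

```python
# Minimal sketch of a per-job sandbox: each cluster job gets its own
# temporary mapset so that parallel jobs do not interfere.
import os
import shutil
import subprocess

gisdb = "/data/grassdata"              # placeholder GRASS database
location = "latlong"                   # placeholder location
job_id = os.environ.get("SLURM_JOB_ID", str(os.getpid()))
mapset_path = os.path.join(gisdb, location, "job_" + job_id)

# A mapset is just a directory; its WIND file (computational region)
# is seeded from the PERMANENT mapset's default region.
os.makedirs(mapset_path)
shutil.copy(os.path.join(gisdb, location, "PERMANENT", "DEFAULT_WIND"),
            os.path.join(mapset_path, "WIND"))

# Run the actual processing (script number one) inside the sandbox;
# 'grass' stands for the GRASS >= 7.2 startup command with --exec
# (the exact command name depends on the installation).
subprocess.check_call(["grass", mapset_path, "--exec",
                       "python", "process.py"])

# Copy the results to their final destination here (workflow
# specific, omitted), then clean up: rm -rf, gone, and done.
shutil.rmtree(mapset_path)
```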
Now comes a problem that we, or I, have experienced in the past. Let's assume we have a number of jobs; they all do their processing independently, and finally I want to keep all the results somewhere on a final destination storage. If every job writes to that final destination storage at the same time, that is the moment when the whole system goes down and crashes. So it depends; you can push HPC systems to their limits, and it depends on how they are set up; it is worth trying out their limits first.

Another example of what we did recently is land surface temperature processing. In this case we have a time series of about 30,000 maps which we processed with HPC methods. Essentially we want to get from gaps in space and time on the left-hand side to the fully reconstructed map on the right-hand side, and we want to do this 30,000 times, in parallel. The most important thing when you do HPC processing: you need to have a good admin who fixes the system after you broke it. And just for clarification, I'm the one who's breaking it, I'm not the one who's fixing it. Thank you for your attention.

Disk space? Maybe not; disk space is usually fixed, and the only solution is to buy more disk space. It's easier to buy more memory: if you want one terabyte more disk space, that is easy to buy, but one terabyte of memory is a little bit more of a problem. Another solution may be to run fewer jobs in parallel and use some external storage where you have enough disk space for the final results; then you can avoid the high disk consumption for intermediate data. I am actually never running the HPC system at full CPU power, because the bottleneck is reading and writing the data.

Usually we do not create any temporary files, but a more complicated workflow, like the one for land surface temperature reconstruction, consists not of one single processing step but of 15 or 20 steps, and we have to monitor each step. Each step produces output which then becomes input for the next step, and depending on the processing you cannot simply pipe it through memory to the next step, because memory would explode. We also sometimes need to keep intermediate results in order to figure out what went wrong (a sketch of such a staged pipeline follows below). If everything went fine, we simply delete them with the powerful command rm -rf, everything, and we have our disk space back for the next job.

Yes, that is part of the optimization, of course: you don't want to create too many intermediate products or intermediate data, but it depends on the complexity of the workflow. For NDVI reconstruction this is straightforward; we actually do not have any intermediate data, because the HANTS procedure immediately produces the final output. For land surface temperature I have 30 intermediate products, because the workflow is complicated, and I have to check that the results I want to use as input for the next step are correct. So this is... or you can't do it.
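As a toy illustration of such a staged workflow with checkable intermediates, here is a hedged sketch; the module calls, map names and steps are placeholders, not the actual land surface temperature chain.

```python
# Toy sketch of a staged workflow: each step writes a named
# intermediate map that can be inspected if something goes wrong,
# and all intermediates are removed once the final result is safe.
import grass.script as gscript

intermediates = []

# Step 1 (placeholder): fill small gaps by neighborhood averaging.
gscript.run_command("r.neighbors", input="lst_raw", output="lst_step1",
                    method="average", size=3)
intermediates.append("lst_step1")

# Step 2 (placeholder): convert Kelvin to degrees Celsius.
gscript.mapcalc("lst_final = lst_step1 - 273.15")

# Inspect the intermediates here if the final result looks wrong;
# if everything is fine, free the disk space for the next job.
gscript.run_command("g.remove", type="raster",
                    name=",".join(intermediates), flags="f")
```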