So, fine, we are connected visually and acoustically. Great to have you here, and thanks for having me here to introduce myself. I'm Peter Baumann from Jacobs University in the north of Germany. And I'm going to bring to your doors what's called data cubes, but as you will soon see, in terms of programming that simply means arrays, multi-dimensional arrays. We are talking about large ones. We have been doing that research for a long time, and one effect has been that our concepts have been brought into ISO standardization, and the SQL query language is currently being extended: a standard is in the making that adds arrays to SQL. That actually is a core part of the PhD thesis of Dimitar Misev, a PhD student of mine from Macedonia, and I want to tell you a little bit about what that standard is going to be, what it means, and what we are doing in terms of implementing it. But first, allow me five seconds of advertisement for our master program in data engineering at Jacobs University, which by the way is an all-English program; we have more than 100 nations on campus, so it's all English. So anybody wanting to join us, be very welcome. Somewhere I have acoustic feedback, and over the talk I will find out where I can place myself so that I can still reach the laptop while not generating feedback. Okay. I am a database guy, so whenever I see something, I like to think in terms of the data structures behind it. If you look at what big data is today, where the proponents are, I see that the data structures behind it are actually not so many. Interestingly, my hypothesis is, please correct me if you find me wrong, that we have a few core data structures that make up at least a large part of big data. So if I abstract away from those randomly picked domains, then I actually find very few data structures. They conflate into sets: you have large numbers of elements that just form sets, unordered sets.
For this, we have relational databases, we also have NoSQL databases; works, sort of. We find hierarchies. Okay, hierarchies can be modeled for example in XML, and in other things like JSON as well. Graphs are something extremely interesting. Like, for example, the Facebook graph with a billion nodes and maybe a couple of billion edges. Doing graph algorithms on that graph, that's an exciting thing. Therefore graph databases are a very important and exciting research area. And then we have data that happen to be lined up on some grid, gridded data, like raster images for example, but also weather forecasts or confocal microscopy images, you name it, there are a lot of them. These are the ones we are focusing on. And incidentally, if you look at what that means in practice, coming back from the abstraction, then you find that typically this is sensor data, image data, image time series, simulation output, or statistics data cubes. There you go: data cubes is a common term for it. What it is in the end is arrays. So the standard in a nutshell is going to look like this. It will be Part 15 of SQL, and it adds the capability that a column in a table can hold arrays, can hold very large arrays. We know that already from blobs, but those are just byte strings; now we talk about something that has structure inherent in it. So I can define the standard stuff, like in green, and I can add the definition of a scene that is a satellite image, which happens to be an array of 5,000 by 5,000 extent, where each pixel in this case happens to have seven bands; that is Landsat 7, which hosts seven bands. Okay, once we have that definition in the database, we can indeed do queries. Now the data dictionary knows about it, and so we can do operations like band one minus band two over band one plus band two (pardon the typo in this query). This is something that researchers in remote sensing call the NDVI, the normalized difference vegetation index.
That's a very common formula. Now what does it do? It takes the array and applies the operation that you see here on every pixel simultaneously. That is a concept that you find in many languages like MATLAB, like IDL, Python, you name it. It's a common technique, and it is now available in a database query. You can use it also in the WHERE clause: if you collapse everything you do on the array into a single Boolean scalar, then you get a true/false, a yes/no, and that means you can utilize it for filtering the data. So essentially what we do is marrying, or injecting, image or signal processing language, or more generally linear algebra, into standard SQL. And that means suddenly a lot of domains can benefit which traditionally have been outside of databases. Who has been maintaining satellite images, I mean the pixels, inside a database? Not many people. Weather data, and so on. So it's interesting indeed to do that. On the side, there is another effect we gain, because suddenly we are able to have red queries and green queries united in one. What does that mean? Green queries are on metadata, the small, smart, descriptive data, whereas the red query parts are on the data, the big, huge, clumsy data that typically you could only download. So this age-old subdivision into data and metadata we can overcome, and we get a common information space for users. So this logical modeling already has its advantages. This is what will become the standard. The status is that we are just entering the second committee draft in ISO, and then it goes into draft international standard and so on; by the third quarter of 2017, we believe, the full adoption cycle will be finished and this will be a fixed standard. So we have an idea of how to embed arrays into databases. What can we do with it? We have seen one class of operations. We can do much more; just by way of example, you can do matrix multiplication.
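The per-pixel ("induced") semantics just described can be sketched in NumPy terms. This is only an illustration of the idea, not the standard's syntax; the array size, the random values, and the vegetation threshold are made up:

```python
import numpy as np

# Hypothetical 7-band scene, shrunk to 4x4 for illustration
# (the talk's example is a 5000 x 5000 Landsat 7 scene).
rng = np.random.default_rng(0)
scene = rng.integers(1, 255, size=(7, 4, 4)).astype(np.float64)

b1 = scene[0]  # "band one" of the query
b2 = scene[1]  # "band two" of the query

# The NDVI expression from the query, applied to every pixel at once:
# (band1 - band2) / (band1 + band2)
ndvi = (b1 - b2) / (b1 + b2)

# A WHERE-clause style filter: collapse the whole array to one Boolean,
# e.g. "does any pixel exceed an (invented) threshold of 0.2?"
has_vegetation = bool((ndvi > 0.2).any())
```

The point is that a single expression drives an operation over all cells; the database can then decide how to parallelize or tile that evaluation.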
Top right, you have the formula mathematically, and if you translate that, you can see that the operations we use here map relatively one to one. So in the red part, we define an array that has an extent from zero to m in one dimension and from zero to p in the second dimension. We bind iteration variables i and j to it, and then we define, for each of the elements of this newborn matrix, what its contents should be. That contents is given in terms of an expression where, in the innermost part, we find a[i,k] multiplied by b[k,j]. So we take individual numbers and multiply them, like in the matrix formulation top right. Then we need to aggregate: the big sigma is materialized by the condense operator with plus, the user's plus operator, to collect everything. And by the way, we have another iteration, the k iteration, which happens to sit at the bottom of our big sigma. So the blue part is an aggregation operation that collects the results of expressions, and the red part establishes a new array. Those two together allow us to formulate a lot of things in image processing and statistics, like for example convolution operations; you can do rollups in OLAP and other fancy stuff. The second example shows a histogram, which is again the same structure: you have the red part establishing a new array, which in this case happens to be one-dimensional, 256 buckets, and you have the green part, which is the aggregation part. Actually, excuse me. If we step back a little bit, the other operations that I showed you before can be expressed via a combination of these two, or even just one. In the end, all the operations of this language, or this sub-language, can be reduced to these two operators. That is not only conceptually beautiful, it also allows us to define optimizations and a few things that can internally make the system faster. Okay. Sorry. So we have a concept. How can we implement it? Our answer to that is called rasdaman. That fancy name stands for "raster data manager".
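The two core operators can be mimicked in plain Python. The names `marray` and `condense` below echo the talk's array constructor (the "red" part) and aggregator (the "blue" part), but this is a sketch of the semantics, not the standard's actual syntax:

```python
import numpy as np

def marray(extent, body):
    """Array constructor: build a new array over the given extent,
    evaluating body(i, j, ...) for every cell."""
    out = np.empty(extent)
    for idx in np.ndindex(*extent):
        out[idx] = body(*idx)
    return out

def condense(op, extent, body):
    """Aggregator: combine body(k, ...) over the extent with op
    (the big sigma of the formula)."""
    acc = None
    for idx in np.ndindex(*extent):
        v = body(*idx)
        acc = v if acc is None else op(acc, v)
    return acc

# Matrix multiplication C[i,j] = sum over k of A[i,k] * B[k,j]:
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
n = A.shape[1]
C = marray((A.shape[0], B.shape[1]),
           lambda i, j: condense(lambda x, y: x + y, (n,),
                                 lambda k: A[i, k] * B[k, j]))

# Histogram: a one-dimensional array of 256 buckets, each bucket
# counting how many cells of img carry that value.
img = np.array([[0, 1, 1], [255, 1, 0]])
hist = marray((256,),
              lambda v: condense(lambda x, y: x + y, img.shape,
                                 lambda i, j: 1 if img[i, j] == v else 0))
```

Both examples are instances of the same two-operator pattern, which is exactly the reduction argument made above.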
It traces back to a time when I did not dare to call it arrays, because people in image processing think of raster images. But in the end, it is exactly that idea: SQL plus n-dimensional arrays, getting a query language that allows any query, any time, on any volume of data cubes. And therefore, internally, we have established an architecture that relies on tile streaming, so that we can evaluate queries on objects that are larger than our main memory, and they typically are. If you say you have a terabyte of main memory, then I say I have a petabyte data cube. So it's always bigger. Okay. That is actually in operational use. If you go to planetserver.eu, which is about Mars, Moon, and other planetary objects, that is hosting 20 terabytes of data on rasdaman community, which is open source and which you can download as source code, as an RPM, or as a virtual machine. This is something that is more and more recognized today. It took a long time, actually way more than a decade, but meantime it's recognized, for example, by the European Space Agency. This is something that has let us stumble into standardization in geo services, because they were the first ones to offer us big data. Historically, medical data were not that big; today they are, but not when we started out. And so we are mainly busy today in the geo sector, where we have also received a couple of awards and recognitions, but let's go ahead. We wanted to talk about implementation. How can we implement such a language? First, we need to take care of the data. Well, obviously a petabyte is too big, even for today's disks, so we need to partition. Typically, what systems of that kind do is some homogeneous partitioning: you cut the array into pieces of the same size, and via a directory lookup you can then easily spot the tiles, or partitions, that are affected, and so you can load them from disk into main memory for evaluation.
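The homogeneous partitioning and tile streaming just described can be sketched as follows. Tile size and the directory layout are invented for illustration: cut the array into equal pieces, look up which tiles a query box touches, and aggregate them one at a time so memory use stays bounded by one tile:

```python
import numpy as np

TILE = 100  # made-up edge length of a square tile

def tiles_for_box(x0, x1, y0, y1):
    """Directory lookup for regular tiling: which tile indices does
    the query box [x0,x1) x [y0,y1) touch?"""
    return [(tx, ty)
            for tx in range(x0 // TILE, (x1 - 1) // TILE + 1)
            for ty in range(y0 // TILE, (y1 - 1) // TILE + 1)]

def streamed_sum(array, box):
    """Evaluate an aggregate tile by tile: only one tile is 'in
    memory' at a time, so the whole object never has to fit in RAM."""
    x0, x1, y0, y1 = box
    total = 0.0
    for tx, ty in tiles_for_box(x0, x1, y0, y1):
        tile = array[tx * TILE:(tx + 1) * TILE, ty * TILE:(ty + 1) * TILE]
        # Clip the tile to the query box before aggregating.
        cx0 = max(x0, tx * TILE) - tx * TILE
        cx1 = min(x1, (tx + 1) * TILE) - tx * TILE
        cy0 = max(y0, ty * TILE) - ty * TILE
        cy1 = min(y1, (ty + 1) * TILE) - ty * TILE
        total += float(tile[cx0:cx1, cy0:cy1].sum())
    return total

big = np.ones((300, 300))
total_in_box = streamed_sum(big, (50, 250, 50, 250))
```

In a real engine the tiles would of course come from disk rather than from slicing an in-memory array; the lookup-then-stream structure is the point.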
We do that in a more general way, so that you can actually adjust the partitioning scheme to your application, to your query workload. That is a little bit more involved, but it gives you nice tuning parameters. These tuning parameters are offered to the administrator, so the user doesn't see them; you saw nothing about partitioning in the previous queries. The administrator, however, gets an extra clause in the insert statement, the tiling clause, that says: I want to use the tiling scheme "area of interest", and by the way, these are my areas of interest; dear system, these I want to have as fast as ever possible. For the other regions, do something meaningful, say, a tile size of one million elements. And then we can do a couple more things, and in the end the system knows how to organize data on disk so that we get at it fast. Wait a moment, some people say, that's overkill; regular partitioning is really okay, and it's sufficient. Really? Let's look into OpenStreetMap. They deploy a tiling that looks like the one you see here. It's on vector data, but the issue is the same. If you want to go into New York City, you have very fine-grained access by people, very small boxes that they want to retrieve, whereas in the Southern Pacific you can have one big blue tile. So actually, it is worth adjusting to the access behavior of users. On the right-hand side, you see a simulation of two neutron stars rotating around each other, and you can see nicely the adaptive refinement of the grid. So this way, I would counter: yes, it makes sense to go into that work and implement arbitrary tiling. So much about the data. Now, hooray, once we have partitions, we can nicely parallelize, in that we send each partition to a different node, or we put it on a different machine even. That's what we actually are doing, and we have done that to the extent that a single query was split over more than 1,000 Amazon nodes, by doing the following.
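The payoff of adjusting tiles to access patterns can be shown with a toy cost comparison. The array size, hot-spot location, and both tilings are invented; the idea is that a hot spot served by one exactly-fitting tile transfers far fewer cells than coarse uniform tiles that merely overlap it:

```python
def cells_fetched(tiles, box):
    """Total cells read when every tile that intersects the query box
    must be loaded whole. Tiles and box are (x0, x1, y0, y1) extents."""
    x0, x1, y0, y1 = box
    total = 0
    for tx0, tx1, ty0, ty1 in tiles:
        if tx0 < x1 and x0 < tx1 and ty0 < y1 and y0 < ty1:
            total += (tx1 - tx0) * (ty1 - ty0)
    return total

# A 400 x 400 array; the hot spot straddles the center.
hot = (190, 210, 190, 210)

# Regular tiling: four 200 x 200 tiles; the hot spot touches all four.
regular = [(x, x + 200, y, y + 200) for x in (0, 200) for y in (0, 200)]

# Area-of-interest tiling: one small tile fitted to the hot spot,
# plus coarse tiles partitioning the rest (none of which intersect it).
aoi = [(190, 210, 190, 210), (210, 400, 0, 400), (0, 190, 0, 400),
       (190, 210, 0, 190), (190, 210, 210, 400)]

regular_cost = cells_fetched(regular, hot)
aoi_cost = cells_fetched(aoi, hot)
```

Here the fitted tiling reads 400 cells where the uniform tiling reads 160,000, which is the kind of gain the tiling clause is meant to expose.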
We take an incoming query, and we first look: where do the data sit? In this made-up example, we might have data set A right away; okay, that's us anyway. Data set B is close by. It is sitting in the cloud, on another node, on another disk, so it's relatively easy to fetch. But still, we don't want to fetch the data. Rather, we ship a subquery, so this other node can do something for us: it does the computation on its data and just sends back the results. In this made-up example, it's actually a single integer that comes back. So it really pays off. It pays off even more when we are not talking about a cloud but about a federation of data centers. And that is something that we indeed have working. ECMWF, close to London, wants to federate data with NCI Australia. We cannot ship data cubes, so instead we ship queries and get back the results. We can even drive it to the extreme that we have a moving target, like a satellite up there. And actually, this is not vision, this is not science fiction. But you see, that's really fun for us. We have been accepted by ESA for an onboard experiment on that satellite, OPS-SAT. This summer we will go on board, and there will be a rasdaman instance there which will be federated, and then the satellite becomes a database server that we can trigger with queries. So yes, there's a lot of fun in it. Okay, the architecture is like this. We have a central workhorse, which is the rasserver process, the rasdaman server process, that can put its pixels into a database. This is how we historically started; it can be PostgreSQL, for example, or any other relational database. At some point, people also wanted to have a version without the database underneath. Excuse me. And so we developed our own storage manager, which is round about a factor of two faster, because it doesn't have the overhead that PostgreSQL, with its functional richness, has, which we don't use. So there are two options today to use that.
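The ship-the-query-not-the-data pattern can be sketched with two toy "nodes". The node names, data sets, and the aggregate are invented; the point is that only a small result travels over the wire, never the array:

```python
import numpy as np

class Node:
    """A toy data-holding node that evaluates shipped subqueries locally."""
    def __init__(self, name, data):
        self.name = name
        self.data = data

    def run(self, subquery):
        # Evaluate the subquery on local data and return only the
        # (small) result; the array itself never leaves the node.
        return subquery(self.data)

# Data set A sits locally, data set B on a remote node.
local = Node("A", np.arange(1_000_000, dtype=np.int64))
remote = Node("B", np.arange(1_000_000, dtype=np.int64))

# Instead of fetching B's million cells, ship a subquery; a single
# integer comes back, as in the talk's made-up example.
remote_max = remote.run(lambda arr: int(arr.max()))
combined = int(local.data.sum()) + remote_max

bytes_shipped = 8                    # one 64-bit integer travels
bytes_avoided = remote.data.nbytes   # 8 MB that stayed on node B
```

A real federation would ship a query string and plan the split with an optimizer, but the asymmetry between result size and data size is exactly what makes federating ECMWF and NCI feasible.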
This all is SQL/MDA, or our rasql; it's syntactically still a little bit different, because the ISO folks wanted adjustments, but it's the domain-agnostic query language. On top of that, we have an application layer that adds geo semantics, that implements the OGC standards of Web Coverage Service, Web Coverage Processing Service, et cetera. In short, it knows about coordinates, and it knows about regular and irregular grids, and that kind of thing. Okay, so that is the overall architecture. The rasservers can be multiple processes, and they can run on the same or on different machines, without a single point of failure; it's not a centralized architecture but a peer-to-peer style. Good, so far so good. Of course, when I talk at big data conferences, people say: big data, ah, Hadoop. And I say, no, not Hadoop. And they say, why not? Why do you not use Hadoop? It's so fast and scalable. First of all, Hadoop is scalable but not fast. But actually, Hadoop has an important property missing. Hadoop is very good for set-oriented data, remember my intro, but it does not know about arrays. Now, if I have my red pixel that I want to access, it's extremely likely that in the next millisecond I will want to have the blue one next to it. As long as the array is small and fits into your RAM, Hadoop can deal with that as well; in the end it's Java, and Java knows arrays. But hey, we talk about big arrays. Yes, multi-terabyte objects. So you need to shuffle them around, and this is what Hadoop doesn't know. Consequently, this is shown by benchmarks. People at George Mason University, the University of Wisconsin-Madison, and so on have done benchmarks where indeed the green one, Hive, was faster than rasdaman when it comes to accessing a single pixel. Hooray, good. But whenever you want more than that, performance times actually skyrocket with those systems, whereas our architecture is specialized and is faster.
Just to make that very clear, I'm not saying we are better than Hadoop in general. We are just specialized in a very particular niche, in a very particular data structure, and therefore we can do better there. In other domains, like indexing web pages, we would fail disastrously; we cannot do that. This is where Hadoop can excel. Okay, this whole thing is embedded in one of our projects, our flagship project actually, the EarthServer initiative, where we do exactly that. We take three-dimensional and four-dimensional data cubes and make them accessible for agile analytics. Agile means you can send any query, anytime, via the query language paradigm. We utilize three-dimensional time series, x, y, and time, and we have four-dimensional weather data. So we have a setup where we find, for example, European Space Agency Sentinel satellite data at MEEO; this is currently exceeding 250 terabytes. We have Plymouth Marine Laboratory; they do ocean color analysis. We have the National Computational Infrastructure in Australia, and the European Centre for Medium-Range Weather Forecasts in Reading, close to London. And now, if you think about federating NCI and ECMWF, you have exactly that distribution situation. Okay, so that is what we are doing in this alliance across three continents, now in year five. So yet another year to go, and then we will see what we find next. Actually, I forgot one: there is PlanetServer up there, Mars; I mentioned that earlier. So that is another one. And okay, there we don't want to federate with Earth climate, admittedly. So this is the geo domain. Actually, we have done experiments in many other domains out of curiosity. We have looked into the Earth sciences from 1D through 4D. The life sciences are very interesting: gene expression analysis, for example, looking at the fruit fly embryo, and human brain imaging. We are just teaming up again to continue that research, and further areas like computational fluid dynamics, et cetera.
You see, I have a faible for technical, scientific applications and am not so much into monetary applications, because my own bank account is not big data, so it doesn't pay off. Okay, this was the technical part. Now let me go to the meta level and report on a few other things. Hey, wait, we have so much time left. I thought I could not do it, but I can. Fortunately, I can show you a little bit, just in case you're interested. rasdaman.org: if you direct your browser to that page, you get this one, where you see the documentation, and you get the source code, as I said. So please feel invited to download. If you want to see the downloads, we have been counting them for some time. You see that there are a few; this is strange, somebody seems to have had many failures in downloading, so don't count that peak. That spike is for sure not realistic; realistic, I believe, is this one: on average about 10 downloads per day, something like that. If you want to play with it, yes, go to standards.rasdaman.com. This may seem a strange URL, you may think, but the main purpose actually is that we want to demonstrate the use of geo standards, so that we tell geo folks, like satellite data centers: hey, it makes sense to offer services based on the standards, because they are easy to use, they are scalable, and so on. So we have lots of demonstrations for different situations in life, but as we are talking here about the SQL extension, I would like to draw your attention to this one, the rasql console. We have some sample data online, and you can type a query in here and then you get a result. So I take something that gives us some colors, so, blah blah blah, select, encode, and then some query, and we retrieve it and get back the image. So if you want to try out how it feels, be our guest and have a look at that. In another talk that I gave this morning, I presented the Python interface.
There are quite a few Jupyter notebooks available now that also allow you to immerse yourself in that, but there's still more to come; we have many items on our plate where we have to continue. Let me mention one thing; it's very important to me. I may now seem to be evangelizing SQL as the intergalactic world-speak. Definitely not. We database people have learned our lesson. Initially we thought we had invented a natural language, called the Structured English Query Language, and maybe not many in this room would find SQL a natural language. So we have dropped that idea. But SQL still is an excellent client-server programming paradigm, because it allows you, just by string manipulation, to compose some code and then ship the code to the data. You tell the system what to do. You don't care about the optimization and the distribution, because the high-level language just abstracts that away, and for this purpose, I believe, languages like SQL really have their place. But then you would actually want to embed this into your application, which ideally has some visuals, some graphical interface, and that's what you really want to offer to your users. Okay, so, disclaimer: no, this is not the new intergalactic world-speak. Actually, if you go into this one, you will find one link at the end: OSGeo. We found it natural, since we started in 2008 to fork the open source version from the pre-existing, fully commercial one, to join the open source community. And so we found our way into OSGeo, and I would like to make a few comments on that now, on the meta level. We had the opportunity to gain some experience, and we found a couple of fantastic folks there, but we also found some room for improvement, and so what I'm unfolding now is based on, number one, my being an OSGeo charter member for three years or so; we have been through the incubation procedure; and, as I said, we have been engaged in open source software for about eight or nine years now. So the one thing is maturity.
I believe that open source, too, needs some coherent management. For example, and that was not my complaint but somebody else's last year, there's no inventory of decisions taken. How do you know what has been resolved earlier, and that your decision today is not countering an earlier one, and maybe you could base your decision on past wisdom? Also management, incubation management: we took six and a half years. PyWPS actually took eight years, if you take into account that OSGeo is only 10 years old. So that is taking up quite some time. At some point I suggested: hey, OSGeo folks, make it simple, just apply your incubation criteria to yourselves first. But there was not much response to that; no, that doesn't fit, we don't need that. And something else I observe, unfortunately, is that typically insiders recommend each other when it comes to elections, for example into the OSGeo charter membership. The next thing is that it's good to have a broad focus, but sometimes it can be too broad. We want to brand good projects, in the sense that it's good software that we can use, but we should not conquer the project, and we should not impose particular models. So I don't like "design by committee" very much. If somebody has a great idea and is willing to invest in it and implement it, let that person do it, and let's see what comes out. Do not take away a project and maybe lose this expert. The next thing, and this is the most serious one, which worries me most. I want to provoke you; I call it software communism. They say all software should be free. Well, I'm very much in favor of free software; we are doing that ourselves, with considerable effort. But why should all software be free? So we have the extreme commercial model, I put it like Esri, the market leader in GIS, and then you have OSGeo on the other extreme, and only this is what they want to look at. What about the ones in between? We are losing so many.
We could join forces, going up to here, and then we would have a much larger basis to counter the purely commercial part, if you want that. And by the way, these ones in the middle are very often the small and medium-sized enterprises. So whom do you really hit if you hit everything to the right of OSGeo? You mostly hit the small companies, not the large ones; the large ones don't care so much, while the small ones suffer a real disadvantage. And this is where I say: let us have a more inclusive approach. I believe it's very important to have this open source idea prosper, and we try to contribute our humble part to it. Let us think about all possible models that we can accommodate, and not be artificially exclusive. Okay, so that was my provocation of the day. Let me wrap up. So, arrays, coming back to the technical level. That, I believe, is important. It's important for many, many applications that in the past have been ignored. I can show you papers that write: we take a pixel matrix and put it into an unstructured blob in the database. And consequently, you can do nothing with it; you can just deliver the image. And this is not what we want in the days of big data. So actually, it's good to have that support in some way, and with clear standardization: not yet another silo, but a clear interface. And that interface is going to be SQL/MDA. So there remains the question: what's above that? Let's have spicy clients; that is not so much our game, we are on the back end. And at the lower end: how can we have efficient implementations that are scalable, that have high performance, and that are easy to handle for the application programmers, not for the end users once again? With rasdaman community, we try to make a humble contribution to that, driven by a database perspective, driven by the query language perspective, which has let us stumble into standardization, as I said. And today I'm the editor of the OGC data cube standards.
So, Web Coverage Service, Web Coverage Processing Service. And in ISO, likewise, we have more activities on this running. And as we have a little bit of time left (normally I always take too much time, so I said to myself: Peter, be careful, not too many slides), I'm now in the comfortable situation that I can go into a little more detail. Actually, there are quite a few activities. I talked to you about ISO SQL/MDA, but ISO is also interested in adopting the OGC standard for coverages, which is data cubes in the end, data cubes plus more. And in INSPIRE, we find the same thing. For those of you who don't know INSPIRE: this is the European legal framework for a common spatial data infrastructure. This sounds impressive, and actually it's a huge piece of work, because we want no less than that all the institutions in the European Community that offer geo data should do it in a compatible way. That is a really huge project. It's not just mapping agencies at the top level; it goes down to the level of municipalities. And they all should adjust to open standards. And for the coverage part, so ortho images, elevation data, and that kind of thing, coverages are to be used. The good thing now is that all these converge. So we don't have competing standards on the same thing; we have a notion of coverage that carries over the different organizations. So you know that it will just be interoperable if you follow these standards. Okay, coming back, help, now I'm lost. This was it, okay. So we have interfaces that can be used. We look for scalable implementations. I mentioned on the fly, on the political level so to say, my observation: I believe OSGeo specifically (I generalized it here, forgive me) needs to be more inclusive in terms of business models. And being a scientist and technologist, let me still finish with the data cubes. As opposed to the single satellite images that we have, one cube, I would contend, says more than just a million images. It's just more tractable.
And this way we hope to contribute to the big data tillage. And with this I would like to finish for now. Thank you very much for bearing with me. Can you hear? Any questions, please? And if you do, we'll come running to you with the microphones so we can be heard on the recording. That's great, thank you very much. Is there a timeline for this to make it into PostGIS? So the support for arrays, is it going to be adopted in PostGIS, or is it already there? I cannot speak for the PostGIS people. Actually, at the time they started with PostGIS Raster, I told them: hey, let us team up, that would be a nice endeavor. But they said, no, we want to do our own thing. We had a public discussion on that. And now they have developed their own way of doing it, and they would need to adjust. That of course would mean some effort, and so I cannot speak for them. I know that they are, excuse me, I know they are aware of it; we had some exchange on it. But is there somebody here from the PostGIS community? Could you maybe respond better than I can? Okay. It seems more logical to do it this way. I would agree. So I don't know. It would be great if it makes its way. Of course, I would love to see that, because it allows us not only to be interchangeable, in the sense of avoiding vendor lock-in, but also to federate different systems, so we can establish the system-of-systems idea. I would love that. I'm open to collaboration. Okay, great, thank you. Thank you. You mentioned querying combining metadata and data. Can you tell us a little bit more about that? Yes, definitely. Actually, we can just pull up the example again and see the color scheme here. So the green part is, symbolically, the metadata here. Of course, you normally have many more metadata, but these metadata have the characteristic that they are small attributes: the classical attributes, numbers, strings, and so on, and dates in this case.
So this is what you define in the table, and you also define that you have the data attribute hosting the pixels. So conceptually it's in the same table. Physically, of course, not; you would not want to do that, it would devastate your performance. So the system internally must actually be clever enough to split that. This is what we are doing with mediator technology: we split such a query as we have it below, and we find components that go into the classical relational part and are solved there, and others that go into the array database part and are solved there, and then we reunite the results. That goes well unless we have a mixed join. If we have a join that involves both data and metadata, then it gets spicy, and how to optimize that is a current research area for us. That's a difficult part. By the way, we are doing that not only for relational stuff: in collaboration with Athena Research, we do the same thing with XQuery and arrays, and in collaboration with Uppsala University, Tore Risch, we do the same thing with RDF data, with SPARQL; it's called polystores today. We want to inject arrays into other data models and see how well this can work. Does that sort of answer your question? Yeah. Okay. Excellent. Thank you. I guess we're done then. Good. Thank you again. Thank you very much. And enjoy Boston.