So, this is a recap of a presentation that I gave at the fall AGU meeting in 2018, and it's a summary of work we did on accessing data stored in Amazon's S3 storage system using the Hyrax OPeNDAP server. My co-authors are Nathan Potter and David Fulker, and this work was sponsored by NASA through a contract with Raytheon. So, here's what I'm going to talk about. First, I'm going to give you some background about the data server and the configurations of the server that we set up to serve these kinds of data. Then I'm going to talk about optimizations to the algorithms we implemented, then take a look at some graphical data showing that those optimizations really did result in improvement, and finally my conclusion is going to be a comment about how generally applicable these techniques are. The background: the Hyrax data server has been modified to serve data stored on S3 in a way that's competitive, in terms of performance, with data stored on a spinning disk. We evaluated several approaches, and we're going to wind up with the notion that caching metadata, parallel access, and connection reuse all provide really significant improvements when accessing data from S3. That last bit is really the most important part: it's important to separate the metadata from the data and cache the metadata so that you can access it quickly, because it's very difficult to access anything quickly once it's on S3. Parallel access is really important, as is connection reuse, because you're trying to make the most of the bandwidth you have while, at the same time, dealing with HTTP, which has a lot of latency. Reusing connections that have already been made means you don't have to repeatedly pay that high-latency connection cost.
And when you put all of those together, you can really see significant improvements compared with a more naive access mechanism. So, the architectures we used for this evaluation: we have a baseline architecture, and then an architecture where we took data, stored it on S3, and, when we wanted to use it, pulled it off S3 and cached it locally. We call that the caching architecture, and I'll go into it in a bit more detail in a moment. The other architecture is subsetting. Subsetting means the data are stored in S3 and you subset them directly from S3 using HTTP range GETs; I'll talk about that a bit more as well. About the baseline architecture I won't say much, other than that it's the server working with exactly the same data set, but stored on a spinning disk rather than on S3. In all cases, all three configurations ran in the Amazon cloud. So, for example, the baseline architecture was the Hyrax data server running on an EC2 instance, and the spinning disk was an EBS (Elastic Block Store) volume. The actual accesses were made from another machine also running in the Amazon cloud, so for the most part we didn't have to pay cloud egress costs, and it also means all of the network timing is within the cloud. Okay, the caching architecture. In the caching architecture, the first of the two architectures that put data on S3, the data files are stored on S3. They're not modified in any way, they're just stored there. They happened to be HDF5 files, but it actually doesn't matter what kind of files they are. They were transferred from the spinning disk to S3, and the only special thing you have to know about them is which object in S3 corresponds to which original file. When you go to serve data from these files, you first move the file to a cache that's local to the machine where the Hyrax server is running.
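That transfer-then-cache step can be sketched as a small least-recently-used cache keyed by S3 object name. This is purely illustrative and not Hyrax's actual code: the `fetch` callable stands in for the whole-object S3 transfer, and the class and parameter names are my own.

```python
from collections import OrderedDict

class FileCache:
    """Whole-file cache: fetch the entire object once, evict the least
    recently used file when the cache is full."""

    def __init__(self, fetch, max_files=2):
        self.fetch = fetch          # callable: s3_key -> file contents
        self.max_files = max_files
        self.files = OrderedDict()  # s3_key -> cached contents

    def get(self, key):
        if key in self.files:
            self.files.move_to_end(key)     # mark as recently used
            return self.files[key]
        data = self.fetch(key)              # pay the full-transfer cost
        self.files[key] = data
        if len(self.files) > self.max_files:
            self.files.popitem(last=False)  # evict least recently used
        return data
```

Note that `get` pays the whole-file transfer cost on any miss, which is exactly the disadvantage discussed below: a handful of wanted bytes still costs a full transfer, and an evicted file costs that transfer again.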
Then you work with the file as if it had always been on a spinning disk; it's merely a cached file. In fact, it's really no different from a file that's compressed, where you have to decompress it and stick the decompressed file in a cache to work with it. So it's essentially building on code that's been around for decades. What are the advantages of this approach? Well, first of all, it works with any file, and it's easy to use with legacy software because almost all legacy software is already doing these things. The files are easy to maintain and easy to obtain because no format translation at all is needed: if the Hyrax server can serve the file, it can obviously serve the file once it's been moved to the cache. And there's minimal metadata configuration; there's not a lot of pre-processing at all. The disadvantages are that the initial cost of transferring the whole file has to be paid for any access at all, and that can get really expensive if all you want is a handful of bytes out of a 300-megabyte file, because you're transferring 300 megabytes every time. And it's a cache, right? So there's all the extra work associated with maintaining a cache. Caches are nominally smaller than S3, so things get thrown out of the cache: something that was in the cache at one time might no longer be there, and you may have to transfer the same file many, many times. None of that is really optimal, and it turns out, in fact, that it's also slower than the subsetting architecture. The subsetting architecture, in terms of its component diagram, looks pretty much the same, actually. Again, in our case, the data stored in S3 were regular files: in fact, exactly the same HDF5 files used in both the baseline case and the caching case. But here we use a technique that I call virtual sharding.
So, each file is broken down into the regions of the file that hold the different parts of the variables, and that metadata is stored separately from the file, or object, in S3; it's stored in a special metadata store that's local to the Hyrax server. What happens is that when we decide we need, say, five variables from an HDF5 file that contains maybe 800 or 1,000 variables, we can simply go in and, using range GETs, extract the data for just those five variables. We can do that because we have the metadata for those entire granules: metadata that tells us where those different parts, those different pieces of information, are distributed within the file, all stored in that metadata store. So that's the notion of virtual sharding: we break the file into pieces virtually. It's similar to the database technique of sharding. In fact, if we were to actually break these files into separate S3 objects, one for, say, each portion of the file that contains data, we would have a whole bunch of URLs; and in effect, using range GETs, we do have a whole bunch of URLs. So it's very similar. In fact, before we did this particular bit of work, we implemented a third architecture that used actual sharding into S3, and found that the performance of virtual sharding and the literally sharded files was almost identical. Of course, there are some disadvantages to sharding data, but that's a different talk. So, the advantages of this approach: it's faster than the caching architecture; no data cache is needed, so there's a cache we don't have to maintain; and only the data you actually need are transferred out of S3. Minimizing the number of transfers out of S3 is actually not that big a deal: you don't pay by the byte, you pay by the transfer, and the per-transfer cost is so small as to be almost insignificant.
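The virtual-sharding lookup just described reduces to a mapping from variables to byte regions, from which the HTTP `Range` headers for the range GETs fall out directly. The sidecar layout and names below are my own illustration, not the actual Hyrax metadata store format.

```python
# Hypothetical sidecar metadata: variable name -> list of (offset, length)
# byte regions inside the granule, which is stored as a single S3 object.
sidecar = {
    "temperature": [(512, 4096), (8704, 4096)],
    "pressure":    [(4608, 4096)],
}

def range_headers(metadata, variables):
    """Build one HTTP Range header value per byte region of the requested
    variables, so only those bytes are fetched from S3."""
    headers = []
    for var in variables:
        for offset, length in metadata[var]:
            # HTTP byte ranges are inclusive on both ends.
            headers.append(f"bytes={offset}-{offset + length - 1}")
    return headers
```

For example, `range_headers(sidecar, ["pressure"])` yields `["bytes=4608-8703"]`: one small range GET instead of a whole-granule transfer.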
But if you're only transferring the data you need out of S3, and you then have to transfer the data out of the cloud, you're probably only transferring the data you need out of the cloud as well, and there a big savings can be had. The disadvantages are that, even though you keep the original file intact, by using this virtual sharding technique you effectively have a new format, and you have to write new code to implement these reads. And because you're working with a file format like HDF5 that's very old and has gone through a number of iterations and improvements, that code is actually fairly complicated to write. Also, before you move the files into S3, you have to make these additional sidecar metadata files, so you have to do a bit more pre-processing than you do in the caching architecture. [Audience question: Is the sidecar metadata just more S3 objects, or is it held in some other database?] Well, I'm going to talk about that in a second, but in reality it doesn't much matter. It could be stored on S3; we talked to our contacts at Amazon and they said, hey, you could store this in S3, that'd be really cool. You can also store it in EFS, or in EBS, or on the EC2 instance's quasi-spinning disk. It pretty much doesn't matter, and I'll get to that in a minute, because how you deal with that metadata is one of the optimizations. Anyway, the subsetting architecture optimizations we looked at are optimizing the metadata storage, exploiting parallelism in the data access, and reusing the HTTP "connections". I put connections in quotes there because HTTP is technically connectionless, but it's typically running on top of TCP, which is a connection-oriented protocol, and that's actually where the latency comes in. And as I said, latency is kind of a big deal.
So, one of the things we discovered when we started working with these files (and this actually doesn't have that much to do with S3 or the cloud, though it turns out to be a bigger deal in the cloud) is that our software was spending a lot of time just building metadata responses. That's because these HDF5 files we're getting from NASA are very large: they often have hundreds of variables, and 800 or 1,000 variables is not uncommon. And every one of those variables has hundreds of attributes. So if there are 1,000 variables and each has hundreds of attributes, there are hundreds of thousands of attributes. What was happening is that every time we provided a metadata response, not a data response but a metadata response, we were rebuilding that structure, rebuilding all that information for every variable. You can see in the figure off to the right here that it was taking, on average, nine to a little more than ten seconds to build up the metadata responses. What we found was that if we built the responses once and cached them, we could deliver them out of the cache in effectively no time at all. We cached them to spinning disk, although, as I said, we could have cached them to S3, and it would have been almost as performant. What's cached is literally the metadata response that we ship back, already encoded for transmission over the wire. What it has in it, in addition to the regular OPeNDAP metadata, are the offsets and sizes, the locations of the data, for the particular variables. So we took the regular OPeNDAP metadata responses and augmented them with this additional metadata so that our server could make use of it as well. And since it's exactly the response that comes in over the wire, we can easily build binary C++ objects to compute on these metadata too, and that's software that's pretty old and pretty stable.
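The fix just described, building each metadata response once and serving it from a cache afterwards, can be sketched like this. The builder callable and the `(granule, kind)` key scheme are hypothetical stand-ins, not what Hyrax literally does internally.

```python
class MetadataCache:
    """Cache fully-encoded metadata responses keyed by (granule, response
    type), so the expensive build over hundreds of thousands of attributes
    happens only once per granule, not on every request."""

    def __init__(self, build):
        self.build = build   # callable: (granule, kind) -> encoded response
        self.responses = {}

    def get(self, granule, kind):
        key = (granule, kind)
        if key not in self.responses:
            # Slow path: parse the granule and encode the response once.
            self.responses[key] = self.build(granule, kind)
        return self.responses[key]
```

Because what's cached is the already-encoded wire response, a hit costs only a dictionary lookup, which matches the "effectively no time at all" result above.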
So, given that, we've optimized the metadata responses significantly: effectively, we don't touch the granules at all for metadata responses. We pre-compute the responses and hold them in a place where we can get at them quickly. And like I said, you could do that from S3, but we actually did it from EFS. This graph here (sorry to switch back and forth on you) shows, at the bottom, the time it took to generate a particular type of response called a DDS: first when the metadata store was running on Elastic Block Store, then when it was running on Elastic File Service, which is a shared file system, and then without the metadata store at all; and then the same again for the next metadata response and the next. You can see the pattern is pretty obvious. Okay, so: big win if you store your metadata in a fast-access data system as opposed to generating it on the fly every time. Not a huge surprise, right? But it's worth knowing. In addition, among the subsetting architecture optimizations, it turns out that working with these HDF5 files and pulling out the data that corresponds to a particular variable is fairly complex algorithmically. This is not included in these slides, because remember, this talk was originally jammed into 12 minutes, but an HDF5 file is made up of a number of variables, and some of those variables are chunked. I'm sure you're pretty familiar with this, but it means that in, nominally say, two dimensions, each of the squares that make up the two-dimensional array is a separate chunk, stored separately. And that can be done in n dimensions: if you have a three-dimensional array, there are three dimensions of chunking, and so on. Each one of those chunks is individually accessible, and we have the metadata to access all of those chunks.
And one of the things our code does is say: well, if you only want a region of that array, figure out the chunks that fall within that region and read only those. So you might have an array with 1,000 chunks but be interested in data that fall within only 20 of them, and the code will read only those 20 chunks. It turns out to be pretty complicated to do that, though, especially in n dimensions. So we took the baseline code here, which is the baseline implementation of the algorithm, and we looked at a couple of different algorithmic optimizations. The first one was optimizing for when the stride equals 1. Just like netCDF, and just like OPeNDAP, HDF has this notion of strides: you can say, for each dimension of the array, that your subset starts at a particular index, ends at another index, and skips over so many cells as you go along; that skip is the stride. If you optimize for stride equals 1, then you can use block memory moves, and you can see that using block memory moves pretty much cuts the time in half. Of course, it only applies when the stride equals 1, but that's a really common case, so it's worth optimizing for. It also turns out that you can choose your chunks optimally, and if you do, you can pull them out in the order in which they'll need to be assembled. Optimal chunk selection gives you another 20 or so percent performance improvement. Then you can parallelize that. It's actually easier to visualize with a one-dimensional array: imagine the array is broken up into 10 chunks. You know you can read all 10 chunks in parallel, but you also know that you can insert the data you need into the result in parallel as well, because none of the bytes overlap each other.
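The two ideas above, reading only the chunks a requested region touches and using block moves when the stride is 1, can be sketched as index arithmetic. This is a toy version under my own naming; the real reader also deals with strides during selection, decompression, and HDF5's layout details.

```python
from itertools import product

def chunks_for_region(chunk_shape, start, stop):
    """Return the indices of the chunks that intersect the half-open
    region [start, stop) of an n-dimensional chunked array."""
    ranges = []
    for size, lo, hi in zip(chunk_shape, start, stop):
        first = lo // size             # first chunk touched, this dimension
        last = (hi - 1) // size        # last chunk touched, this dimension
        ranges.append(range(first, last + 1))
    return list(product(*ranges))      # cartesian product over dimensions

def copy_run(dst, dst_off, src, src_off, count, stride=1):
    """Copy `count` elements; stride 1 becomes a single block move,
    other strides fall back to element-by-element copying."""
    if stride == 1:
        dst[dst_off:dst_off + count] = src[src_off:src_off + count]
    else:
        for i in range(count):
            dst[dst_off + i] = src[src_off + i * stride]
```

For a 2-D array chunked 10 by 10, a request for rows 5 through 24 of the first 10 columns touches only chunks (0, 0), (1, 0), and (2, 0) out of however many the array has; everything else is never read.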
So you'll never have the case where you need to block on one data insert for another data insert to take place. Parallelism provides you a bit more of a performance boost. I put two asterisks next to this, and I'm afraid what they referred to has been lost, so I'm not quite sure why they're here. But I said parallel reads, and in reality it's parallel transfers, because it's not only the reads but also the inserts into the response. So these are the architecture improvements, and the thing to keep in mind is that the baseline architecture was fairly complicated to begin with, and it only gets more so as we go along. And then the last thing, which is maybe the most specific to S3, is that connection reuse and parallelism play a huge role in optimizing the transfer times. Again, S3 uses HTTP, and HTTP has a moderately large latency: plain HTTP has about a 70-millisecond latency, and HTTPS is close to 300 milliseconds. So if you're transferring a thousand chunks, you're going to pay a significant penalty in terms of time. In this case, I think we averaged this over 20 transfers, and you can see that the difference between asking for one chunk after another over a new connection each time, versus one chunk after another with connection reuse, cuts the time in half. That it cuts the time in half is not the important point, though; the point is that with reuse, that latency, 70 milliseconds or whatever, is paid only once, while without it, it's paid 20 times. It's interesting that if you do parallel transfers but make a new connection every time, it turns out to be about the same as the serial time with connection reuse. So parallel without connection reuse is about the same performance as serial with connection reuse, and that makes sense as well.
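The latency arithmetic behind those comparisons can be written down directly. This toy cost model is mine, not something from the talk: the 70 ms latency is the figure quoted above, the 10 ms per-chunk transfer time and the worker count are made up, and it deliberately ignores bandwidth sharing. Even so, it reproduces the pattern described: pooled parallel transfers win, and parallel-without-reuse lands near serial-with-reuse.

```python
def transfer_time(n_chunks, latency, xfer, reuse, parallel, workers=8):
    """Toy wall-clock model: every transfer costs `xfer`; every NEW
    connection costs `latency`. Parallel mode spreads chunks over
    `workers` concurrent streams."""
    # Chunks handled serially on the busiest stream.
    per_stream = -(-n_chunks // workers) if parallel else n_chunks
    if reuse:
        conn = latency               # connection latency paid once
    else:
        conn = latency * per_stream  # a fresh connection for every request
    return conn + per_stream * xfer
```

With 20 chunks, 70 ms latency, and 10 ms per transfer, the model gives 1600 ms serial with new connections, 270 ms serial with reuse, 240 ms parallel with new connections (close to serial-with-reuse, as observed), and 100 ms parallel with reuse.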
If you imagine spraying out 20 requests for different parts of the data, you're going to pay the connection cost 20 times, but all at the same time, right? You're not paying it one after the other. In this last case, we combined the two: we created a connection pool and connected it up beforehand. So in all these other cases the connections were made at the time of the transfer; in this last case the connections were pooled, and effectively no connections were made at transfer time at all. And this is a thing you can do with S3 that you can't do in general with a typical web server: S3 allows keep-alive times that are extremely long. A typical web server will shut you down after only a few seconds; S3 will let you keep connections alive for several minutes. You can then watch them die, and as the connections die, you just make new ones, so the server always has a pool of ready-made connections to any given S3 bucket. I think it's per bucket, but I'm not quite sure, actually. Okay, so you can put all this stuff together, and it does boost performance. So let's take a look at some plots that show all three architectures. Again, the spinning disk is the baseline, and that's EC2 running off an EBS volume; in every case, the data are the same. The caching architecture is the same data, but now on S3, and again, when we go to access them, every granule we want even a byte from is first transferred to a local cache, and then the server works with it there, effectively as if it were on a spinning disk. And in the subsetting one, we go in and read bytes out of the file. Here you can see that the subsetting architecture starts out being pretty good when we barely want anything, but somewhere around 40 variables it stops being as good as the caching architecture. And that was pretty dispiriting, actually, when we did this.
When we first did this a couple of years ago, I thought, oh man, that's awful. But these are big files, the thousand-variable ones, so the fact that if you just want 60 variables out of a thousand-variable file it's easier to transfer the whole file over is not so surprising. And again, we're going to see some improvement there. There are some other things to note when you look at this graph. One is that there's really high variability in all these access times; the standard deviation is just crazy. And what's weird is that the blue and the yellow, that's working with S3, and these big spikes you see are S3 just doing weird things. I don't really understand it, to tell you the truth, other than that they say to expect high variability. But look at this green: this is an EC2 instance talking to an EBS volume, and every once in a while it just decides to take twice as long. What's up with that? I don't know, I really don't, and if you have any clues about this, that'd be great. We plan on talking to our Amazon contacts about it, because one of the ways you can improve the performance is obviously to get rid of these standard deviations: if we can get more predictable performance, then we can evaluate the various optimizations more directly. But anyway, when we make the optimizations I've talked about previously in this talk, what happens is that the subsetting architecture falls much more into line with the other two, and overall its average transfer times drop below the caching architecture's. Now, if you've got a sharp eye, one thing you'll notice is that it looks like those two lines actually converge. And they do converge, but way out around 10,000 variables. So for any of the files we're working with, the subsetting architecture is faster.
And it's faster by a non-trivial amount; I mean, it's a couple of seconds, like five seconds faster. It also takes a non-trivial amount of time to get these 60 or so, or 100, variables: five or ten seconds to get that kind of data out of the server at this point. There's still another thing worth noting, and that's that in this graph the standard deviations are much smaller. These data were collected on October 10th on an m4.xlarge instance; the previous graph's data were collected on June 16th, and I don't know what instance type was used, but I think it was not an m4.xlarge. I think it was actually a machine with more processors, but I'm not 100% sure about that. I can't really explain all of the variance differences here; I think we really need to talk to our contact points at Amazon to learn a bit more about this variability. You'll notice that this graph also shows the same sort of wild variability, even for the spinning disk. I mean, you would think that all of these bars would be almost exactly the same, but they're not, and this, remember, is transferring data entirely within the Amazon cloud. So there is definitely more to learn here. So, here are some conclusions. Optimizing for S3 can provide a large enough performance difference to affect algorithm selection. The complexity of some of these things, like using parallelism and queuing things up and whatnot, is not really trivial, so it would benefit users if they were packaged in a way they can easily make use of, like a web API, for example. And then, and this is kind of an interesting thing: we were working with HDF5 files, and the Pangeo project has been working a lot with Zarr. I was just talking with Rich Signell about this the other day.
If you look at Zarr, its actual internal structure is very similar to the HDF5 chunked array structure; in fact, it uses the exact same compression algorithm, LZW with byte shuffling. So I have a sneaking suspicion that if you apply the kinds of techniques we used here to HDF5 files, you might see performance similar to what you see with Zarr files. But anyway, that's the end. Again, Raytheon sponsored this, and there's a gang of folks who were involved in that work. Thanks.