Can you all see my slides? Yes? Okay, all right. So as I said, I'm a senior architect at The HDF Group. I've been working here for four years now. Before that I was a dev manager at AWS, so I've gone through a reverse career progression, from manager to developer.

First, for those who are not familiar with HDF5, let me briefly describe what it is. You can think of it as a C API, a file format, and a data model. It's a way of taking the common objects you'd have in a scientific application and pulling them together. So in this picture you have a folder, which is the root group, and that can contain, say, a three-dimensional data array or a raster image. You can have subgroups, and a subgroup could contain a 2D array, a table, and so on. Furthermore, each of these objects can have its own set of metadata: attributes, which are smaller pieces of data that describe the data. For example, for an experiment, you might have an attribute that says what time the data was collected. This is useful because in scientific fields you often collect a lot of data and it's hard to keep it organized: it gets scattered across different files and you lose track of what relates to what. With HDF5 you can keep it all together. And it's a binary format that supports compression, so it's much more efficient than, say, using CSV files to store data.

HDF5 has been around for 20 years now, so it's quite mature. It's very popular here in the U.S. with the Department of Energy and NASA, and around the world. It's used for lots of applications, primarily simulation output and scientific instrument data collection.

A challenge we have now is the growth in the amount of data. This slide is from a NASA presentation and shows how the collection of HDF5 data from satellites has grown from about half a petabyte in 2000 to over 20 petabytes now. And it's only projected to grow more, because there are more satellites, each satellite has more instruments, and those instruments are collecting more data at a time. Traditionally, NASA has distributed this data through repositories where people can download files. That was fine when the average file size was, say, 100 megabytes. But when you have files that are multiple gigabytes and collections that are multiple terabytes, it becomes impractical for users to download all that data to their local systems. So NASA is looking at storing data in the cloud. One challenge with regard to HDF5 is that it was designed before the cloud became a common platform, so there are various issues with using HDF5 in the cloud, primarily because the HDF5 library requires a POSIX file system.

So a few years ago we started a project, sponsored by NASA, where we created an HDF data server, if you will. This is HDF5 optimized for the cloud: it uses object storage, namely AWS S3, rather than POSIX file storage, and rather than a library it's a service that runs as a cluster of Docker containers. By running the system as a cluster of containers, you can scale out the service across multiple instances in the cloud. It's feature-compatible with the HDF5 library, and it's implemented in Python using asyncio. Clients interact with the service using a REST API, so the REST API is fundamental to how clients engage with the service.
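As a rough illustration of what that looks like on the wire, here is a sketch of a read request using Python's requests library. The endpoint, domain path, dataset UUID, and query parameters below are placeholders and assumptions, not the definitive API; the actual routes are documented in the HDF REST API reference.

```python
# Sketch of a read over the REST API. The endpoint, domain, dataset id,
# and query parameters are illustrative assumptions, not the exact API.
import requests

endpoint = "https://hsds.example.org"        # hypothetical server endpoint
domain = "/home/myusername/tall.h5"          # hypothetical server-side "file"
dset_uuid = "d-00000000-0000-0000-0000-000000000000"  # hypothetical dataset id

resp = requests.get(
    f"{endpoint}/datasets/{dset_uuid}/value",
    params={"domain": domain, "select": "[0:10, :]"},  # first ten rows
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json().get("value"))
```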
But since applications have already been written against either the C API or other language APIs, like Python, we've created SDKs so clients can keep using the same API they were using before. Now, rather than talking to local files, they're talking over the web to the data server. Given that the data is stored on S3, there's effectively no limit to the amount of data you can store, and multiple clients can read and write the same data source at the same time, which is actually something the regular HDF5 library doesn't support. And as I mentioned, we can scale out the service, and as I'll talk about later, we parallelize requests to the service to speed up performance.

Here's how the client-server aspect works. In the red box you have the data server; it's off somewhere at a known endpoint. You can have, say, a web application that uses Ajax to talk to the server. You could have a C or Fortran application that uses the HDF5 library, and the library invokes a plugin, so rather than talking to local files it goes across the web to the server. And for Python we have a package called h5pyd, which again talks to the REST API.

How we designed this is actually a radical change from how HDF5 works with POSIX. We took the contents of HDF5 files (as you recall, there are groups and there are datasets) and we shard them: we chop them up into smaller pieces and store each piece as an object in S3. So if you imagine this grid as an array, we block it up into these heavily outlined regions. We call them chunks, and each chunk is stored as a separate object. When a client wants to read a particular region of the array, like the yellow region here, the server figures out which chunks are needed to serve that data and does the selection to deliver just the data the client needs.

I mentioned the parallelism aspect. The service is actually implemented as a two-tier set of containers. There's a front-end tier called service nodes, which handle requests from clients, and a back-end tier of data nodes that partitions the object storage space. When a client requests data that spans multiple objects, those requests can be parallelized across a number of data nodes. So if you have 100 data nodes and you're accessing data that spans 100 chunks, each of those data nodes can be reading data in parallel, and that speeds up performance greatly.

I mentioned we use Python with asyncio. Asyncio is a fairly new Python feature, and what's interesting is that rather than using multi-threading to handle multiple tasks, it uses cooperative task switching. For a data server, a request is often blocked waiting on some kind of data transfer, and asyncio lets you suspend that task while it's waiting on I/O and switch over to other tasks, so you get really good CPU utilization.

h5pyd is the Python client. It's based on h5py, the popular Python package for regular HDF5. We took the same API and translated it so that, again, it talks to a server rather than to the library. There are actually some additions beyond the h5py feature set. For example, in regular desktop usage you just do ls to see what files you have. Since the files on an HDF server are not mountable as a regular file system, we have utilities that let you see what files are there. Similarly, since you cannot use the regular POSIX chmod to assign permissions, there's a set of utilities around what's called an access control list to control who can read, write, or do other actions to the files. And there's also a query interface, which lets you do SQL-style selections to pull out just certain rows of a dataset.
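Going back to the chunk sharding for a moment, here is a small self-contained sketch, not the server's actual code, of how a rectangular selection can be mapped to the chunk indices it touches. This is the mapping the server performs before fanning reads out to the data nodes.

```python
# Illustrative sketch (not the server's actual code): map a rectangular
# selection onto the chunk indices it intersects.
from itertools import product

def chunks_for_selection(starts, stops, chunk_shape):
    """Chunk indices covered by the hyperslab [starts[i]:stops[i], ...]."""
    per_dim = [
        range(start // c, (stop - 1) // c + 1)
        for start, stop, c in zip(starts, stops, chunk_shape)
    ]
    return list(product(*per_dim))

# An 8 x 12 array stored as 4 x 4 chunks: reading the region [2:6, 3:9]
# touches 6 of the chunks, and those 6 reads can go to data nodes in parallel.
print(chunks_for_selection((2, 3), (6, 9), (4, 4)))
# -> [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```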
Then there's the command-line interface. Again, it's a set of tools for uploading data, downloading data, managing permissions, and so on.

In addition to that, we recently launched what's called HDF Kita Lab. This is a hosted JupyterHub environment. I hope you can still hear me. It lets you connect to a hosted environment so that, rather than connecting from your desktop, which may be very far away from where the server is, you're running a Python environment within the same AWS region as the data server and the S3 storage objects. So you get very fast access. Say you're doing analytics over a very large dataset: rather than moving the bulk of the data from Amazon to your desktop, it all happens within the Amazon data center. Data flows from S3 storage, through the data server, to the container that runs your Jupyter environment.

Here's how the architecture works. A user connects to the JupyterHub endpoint. Once they sign in, JupyterHub spins up a new container with their environment. That environment has a disk volume attached to it, which is used as a kind of scratch pad, a local POSIX disk, and the environment is configured to talk to the data server: the service nodes and data nodes, which in turn talk to S3.

You can try this out right now if you want. Register at hdfgroup.org/hdfkitalab, and once you've registered, sign in at hdflab.hdfgroup.org. Each user gets the equivalent of a dedicated Xeon core with a gigabyte of RAM, 10 gigabytes of persistent disk, and up to 100 gigabytes of cloud storage. We charge a minimal fee of $10 a month to cover our expenses, but the first month is free, and if you use the special ARTC tech talk coupon code you get two months free. Here are some links for more information about everything I've talked about.

If I have a few minutes, I'd like to quickly demo how this works. Let me sign out. Okay, so you come to hdflab.hdfgroup.org and sign in using your HDF Group registration. Once you're signed in, you have an environment: there's an FAQ, and you have a terminal. hsinfo is one of the command-line tools I mentioned; it shows the server I'm connected to. I can do, say, hsls and see my home folder. This is not a POSIX directory; it's a path managed by the server, so I'm seeing content that's stored in S3 and managed by the HDF server. I also have a local file here called tall.h5. There's an HDF5 library utility called h5ls that shows the contents of that file. And there's a tool, hsload, that takes that local file and uploads it to the server. Now that it's loaded, I can do hsls again, and I see basically the same structure, now replicated on the server.

So what's the advantage of having this content managed by a server rather than just sitting on the local disk I have in JupyterHub? Well, for one, you can share it with other people. Traditionally with JupyterHub it's very hard to share content among users, but here anyone who's logged onto the system can see the content I've uploaded. This makes it possible to have, say, shared folders of common data files, or multiple processes all writing to the same file and aggregating data, and so on. We also have a collection of notebooks that illustrate usage.
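From a notebook in that same environment, the server-managed copy of tall.h5 can be opened with h5pyd much like a local file would be with h5py. Here is a minimal sketch; the domain path is an assumption and depends on your registered username.

```python
# Minimal sketch: open the server-managed copy of tall.h5 from a notebook.
# The domain path is an assumption; it depends on your registered username.
import h5pyd

with h5pyd.File("/home/myusername/tall.h5", "r") as f:
    print(list(f.keys()))              # same top-level groups hsls showed
    for name, value in f.attrs.items():
        print(name, "=", value)        # file-level attributes (metadata)
```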
So let me clear all outputs. If you're not familiar with Jupyter Notebooks, it's kind of like MATLAB: you have content in cells, you execute a cell and it runs, and you can go back, change the code, and rerun it, and so on. In this example, we took 7,850 files that NASA published (I think it's one file per day of satellite data) and we merged those into one file served by the server. So rather than a collection of 2D data slices, it becomes one 3D data cube, and it's much easier to do analytics on a single data block than to manage thousands of smaller files.

Okay, so here I import these packages. h5pyd is the Python package that's part of the SDK, and I open the file. All this content looks just like it would if you were accessing local files on disk, except you're using h5pyd rather than h5py. The shape of this is 7,850, which is basically the time dimension; there's one extent for each file. And 720 and 1440 map to latitude and longitude. We can see the metadata for this: there's a fill value, and here I'm extracting one attribute called long_name and printing it.

Now I'm going to actually pull data out of the dataset. You can imagine the entire dataset might be too large to fit in memory, so what I'm going to do is pull out one slice of it. The Python syntax is that you have a dataset reference, you give 1240 for the 1240th time slice, and the colons say pull in all the lat/lon values. I do that, and it took just 71 milliseconds. So in those 71 milliseconds, the request has gone from the notebook running in Amazon, to the data server, to S3, fetched the data, and brought it back to my notebook. I'm actually cheating a bit, because the server caches data that's been accessed recently. Let me change this to a different slice, and it should be a bit slower. Okay, so now it took almost half a second to fetch the same amount of data. But if we run it again, it's fast again. S3 is mainly intended as long-term archival storage rather than fast random access, so in the Kita server we keep a large RAM cache, and recently accessed data can be fetched much faster than by going back to S3.

A nice thing about the notebook environment is that it's easy to plot, so I can plot this data and see it, and I can zoom in on it. Here I'm creating a histogram of the values. Now, I mentioned that we don't store the entire dataset as one object; we shard it, and this chunk layout describes the dimensions of each sharded object: it's 1 by 720 by 1440. Why that's relevant is that if we do this kind of request, instead of accessing data that's aligned with the chunks, I'm accessing orthogonal to the chunk layout. In this request we've had to access 500 chunks, that is, 500 S3 objects. But because it's done in parallel on the server, it's actually fairly fast. If you ran the same code on a workstation with local files, it might actually be slower than this. And I can plot that too.

So that's just a small taste; there's really a lot more to it. But I encourage people to go try out the JupyterHub, and if you're interested or need more information, please feel free to contact me. Okay. Do you have any questions?
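For reference, the notebook cells in that walkthrough look roughly like the sketch below. The domain path and dataset name are hypothetical placeholders, since the actual NASA file isn't named here; the chunk-aligned versus orthogonal access pattern is as described above.

```python
# Rough sketch of the demo's notebook cells. The domain path and dataset
# name are hypothetical placeholders, not the actual NASA file used above.
import h5pyd
import matplotlib.pyplot as plt

f = h5pyd.File("/shared/nasa/sample_cube.h5", "r")   # hypothetical domain
dset = f["temperature"]                              # hypothetical dataset name

print(dset.shape)        # e.g. (7850, 720, 1440): time x lat x lon
print(dset.chunks)       # e.g. (1, 720, 1440): one chunk per time step
print(dset.attrs.get("long_name"))

# Chunk-aligned read: one time slice maps to a single chunk fetched from S3
slice_2d = dset[1240, :, :]
plt.imshow(slice_2d)

# Orthogonal read: a time series at one grid point crosses ~500 chunks,
# which the data nodes can fetch from S3 in parallel
series = dset[0:500, 360, 720]
plt.figure()
plt.plot(series)
plt.show()
```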