So, people from Purdue University, like J.P., and David, me, and Steve, and Abhishek, and Vijay. Let me first introduce our institution. We are from the University of Southern California, but from Arlington, Virginia, not from LA. USC has the Information Sciences Institute, a research lab with locations on both the West Coast and the East Coast, and we are from the East Coast. We've contributed quite a bit to the OpenStack community, including heterogeneous architecture support. For the Folsom release, we pushed code for scheduling on heterogeneous architectures. So now, with Folsom, if you specify flags like instance type extra specs in your nova.conf file, and specify a CPU architecture like x86_64 or ARM or something like that, then the scheduler can handle it with the proper instance type (there is a small illustrative sketch of this below). We also did some work on big shared-memory systems, like the SGI UV system. And we have worked to support GPUs under OpenStack; we plan to push the code for GPU support in the Grizzly release. We also worked hard on bare-metal provisioning, with many companies, including NTT docomo, HP, NEC, Calxeda, and a lot more. Since the Essex release, the Tilera architecture has been supported as a bare-metal compute node, and we are trying hard to push the code for PXE and IPMI support for bare-metal provisioning. Yesterday we had a pretty good design session and we agreed on most of the design decisions, so we hope that code will be merged early in Grizzly. For high-performance computing support, as I mentioned a little bit, we have worked hard to support GPUs; so far we have worked on LXC support for NVIDIA GPUs, and we plan to push that code in Grizzly. We also plan to work on supporting GPUs under Xen, and we plan to push that code in Grizzly as well. We are also working on, and very interested in, networking; we are testing some high-performance networks like InfiniBand, and we will try to collaborate with anybody who is interested in that field. High-performance computing support is the main topic of today's session. And as a last advertisement, we are also hiring, so anybody who's interested, just let us know. Okay, Abhishek?
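A minimal sketch of the heterogeneous-scheduling idea mentioned above, assuming a flavor's extra specs carry a key such as cpu_arch that is matched against advertised host capabilities. The key names and data structures are illustrative assumptions, not the actual Folsom scheduler code.

```python
# Illustrative sketch only -- not the actual Folsom scheduler code.
# The idea: an instance type carries architecture hints in its extra_specs,
# and the scheduler keeps only hosts whose advertised capabilities match.

def host_passes(host_capabilities, instance_type):
    """Return True if this host satisfies the flavor's architecture hints."""
    extra_specs = instance_type.get("extra_specs", {})
    for key, wanted in extra_specs.items():
        if host_capabilities.get(key) != wanted:
            return False
    return True

# Invented example: a flavor that requests an ARM compute node.
arm_flavor = {"name": "arm.small", "extra_specs": {"cpu_arch": "arm"}}
hosts = [
    {"hostname": "node1", "cpu_arch": "x86_64"},
    {"hostname": "node2", "cpu_arch": "arm"},
]
print([h["hostname"] for h in hosts if host_passes(h, arm_flavor)])  # ['node2']
```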
Hi, my name is Abhishek. I'm actually from Purdue; I'm a PhD student there, and I worked over the summer at ISI on this stuff. This is a short presentation on our ongoing work. It's not really complete, but we have some performance evaluations and some ideas, and we're just putting them forward here. So here's what we are going to go through: HPFS in the cloud, meaning what you would do if you actually tried to run high-performance programs in a cloud environment today; then how we can do better, in terms of integrating with OpenStack and in terms of performance. We did implement some of what we think are better alternatives, and we got some performance evaluation numbers. Then we talk about those, and then we conclude. The work is still ongoing, so we'll just document what we're doing right now.

So basically, the motivation: I think if you look at the literature it's pretty clear that high-performance storage is critical for HPC applications, because they need high aggregate bandwidth and it should be scalable as you scale your program. Storage is used for accessing input and output data. A lot of it is just writing, like storing checkpoints, and also for executing algorithms which don't fit into memory. Those are the main use cases. But regardless, the point is we need high-performance file systems or storage. The papers that we find, both from academia and industry, all point to the same thing: I/O and the network are the major bottlenecks for running HPC applications. CPUs do fine nowadays because the virtualization technology is good, but communication- and I/O-intensive benchmarks will perform poorly if you just run them as-is on the cloud. Since ISI is focused on providing cloud for HPC applications, that's the thrust of the program, so this is one of the things we are looking at: can we do this better?

So what are the storage options we have? At a very high level, it's basically virtual block storage, whether it's Nova volumes or Cinder or whatever, and you have web-based key-value storage, which is not really relevant for HPC applications right now. So basically what you have is virtual block storage. Now, a high-performance HPC application would need a parallel file system, right? Parallel accesses. So the way to do it would be to deploy the parallel file system, which we call the virtual parallel file system, on top of the actual block storage. Basically you provision the storage and then you deploy your file servers and file clients. It's inefficient because it goes through several layers of virtualization, and the user has to manage it entirely. So even if it's called a high-performance offering by a cloud provider, you basically go in there, install all your stuff, and run it yourself. The management can certainly be improved, and even the performance can be improved. Those are the two areas we have to look at.

Just continuing on that, these are the steps you would follow if you actually tried to install an HPFS system: you provision the VMs separately as best you can, and you might or might not have control over where the VMs are placed; you install and configure the HPFS servers and clients; and then file access happens through multiple virtualization layers, depending on how you install it (we'll go through that a little more later). And obviously it's managed by the user, which is not very attractive at all. So improvements can be made on both fronts. We have actually focused more on the performance part here, but by management we mean: can we do the entire thing through OpenStack, or some cloud management framework (here it's OpenStack)? That would be nice. For our work, we targeted the Lustre file system as the backend HPFS installation.

So we'll just go over it a little bit. Lustre is a high-performance parallel file system. It can be found in most, well, not most, about half, of the supercomputers that exist nowadays. It can handle petabytes of storage with very high aggregate throughput. The idea is very simple: it has a metadata server (MDS), typically one per installation, which handles the metadata information for the files, and then there is a series of object storage servers (OSSs) which manage the actual I/O data. They are backed by storage volumes, which are called metadata targets (MDTs) for the MDSs, and object storage targets (OSTs) for the OSSs. A file is typically striped over all or some of those object storage servers.
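A minimal sketch of the striping idea just described: a file is cut into fixed-size stripes placed round-robin across OSTs, so different parts of the file live on different servers. The server names, sizes, and stripe size are invented; real Lustre layouts are configured with Lustre's own tools, not code like this.

```python
# Minimal sketch (not Lustre code): a file of `size` bytes striped round-robin
# across OSTs in fixed-size stripes, the way a file is spread over some or all
# of the object storage servers.

def stripe_layout(size, osts, stripe_size=1 << 20):
    """Return a list of (ost, offset, length) chunks for a striped file."""
    layout, offset = [], 0
    while offset < size:
        ost = osts[(offset // stripe_size) % len(osts)]
        length = min(stripe_size, size - offset)
        layout.append((ost, offset, length))
        offset += length
    return layout

# Invented example: 8 OSTs spread over 2 OSSs, and a 4 MiB file.
osts = [f"oss{s}-ost{t}" for s in (1, 2) for t in range(4)]
for chunk in stripe_layout(4 * (1 << 20), osts):
    print(chunk)
```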
And the aggregate bandwidth would be the sum of the OSSs', so it's easily scalable, assuming you can stripe over all the OSSs. It works as an application-level software protocol, if you like, over TCP/IP and also InfiniBand; that's just a configurable option in Lustre, so in that sense it's good. And it's also open source. So this is a common file system that people use; I guess you guys already know it. We were targeting this as an example.

So let's go through the options that we have. This is the one that I was talking about. If you look at the diagram, you have the VMs here, which are provisioned on hosts, and you have the Lustre clients running inside the VMs. When I say Lustre clients running, it basically means that you mount the actual server; the client is running if you can mount it successfully. The underlying storage is still virtual block storage, Nova volumes or the comparable thing in other offerings, but the technology is the same. So there is obviously the volume-server bottleneck, which can happen if you have only one volume server; you can have multiple volume servers, but that's kind of hard to manage in the current framework of OpenStack. And the actual MDTs and OSTs are still just block storage on the VMs. So in this scenario, a file access in the guest goes through block device emulation in the hypervisor, and then again a file access on the host. Those are two steps which could be eliminated if we just had direct file access; we are mostly going towards bare metal here in terms of the file system, but that gives you a lot of performance. Also, when we run these Lustre clients inside the VMs, they go through the virtualized network as well, so if there is network emulation, that overhead is there too. So this is the default setup right now.

So let's talk about how this can be done better, not considering performance for the moment. Even in this framework, we can improve the deployment, right? You can manage it better. One idea that we have, and are working on, is to allow users to deploy the VPFS system as a whole from OpenStack, just like the different kinds of VMs that we have. Right now we deploy one piece at a time, but we could have different configurations of file systems where we launch all the VM instances together, along with the backend storage targets, and take care that only one OSS runs on each actual node, because otherwise there's a bottleneck problem. As well as it can be done, but at least have some control over what we are doing. So that would be like a meta-service over the OpenStack service, and we are trying to get that incorporated. Once we do that, we can have different flavors of file systems, just like we have flavors of VMs. We would still be using the virtual VPFS that we talked about, but this is something we think can be done and could be useful. At that point you could go into OpenStack and deploy your parallel file system framework with just one click, which should be more user friendly.
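A purely hypothetical sketch of what a "file system flavor" for the one-click deployment idea above might look like; no such service exists in OpenStack, and every field name here is an invented assumption.

```python
# Hypothetical sketch of a "file system flavor" that a one-click parallel file
# system deployment service could consume; all field names are invented.
fs_flavors = {
    "lustre.small": {
        "mds_instances": 1,      # one metadata server
        "oss_instances": 2,      # object storage servers, one per physical node
        "osts_per_oss": 4,       # block volumes attached to each OSS
        "ost_size_gb": 100,
        "stripe_count": -1,      # -1: stripe files over all OSTs
        "anti_affinity": True,   # keep OSS instances on distinct hosts
    },
}

def describe(flavor_name):
    f = fs_flavors[flavor_name]
    return (f"{flavor_name}: {f['mds_instances']} MDS + {f['oss_instances']} OSS, "
            f"{f['osts_per_oss']} x {f['ost_size_gb']} GB OSTs each")

print(describe("lustre.small"))
```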
But then in terms of performance, we can do these alternative things. The first thing that we thought about is actually not very complicated, it's just simple stuff, but it improves the performance a little bit. Basically what we are talking about here is that the actual parallel file system is deployed bare metal, if you like, or natively: the servers are not running in VMs, which is possible. At that point, you run the clients in the VMs and access the actual files through the network. So on the server side the file access is direct; it does not go through the block-level emulation, which we avoid, and that gives you better performance. The clients are still accessing the files over the virtualized network, so that part remains, but the servers are at least running directly on metal, and there is no emulation there. So that would be one of the configurations that could be used, and we use that.

Then, basically improving the client side, we find that using a virtual file system pass-through helps; I don't know if you know it, I guess this is pretty common too. The file system client now runs on the compute host, not inside the VM, and the VM connects to it through a VirtFS client, which is what we used. It's basically a virtualized file system pass-through which works over the virtio transport with the KVM hypervisor. This is a zero-copy operation, which is nice; the memory is shared with the host now. If you go through the virtualized network, there will be at least one copy of the file in memory before it actually goes out. So this is supposed to give you some performance, which is what we see. So these are the two other ways that we can tackle this. Implementing this has some issues, which we will talk about, but these are the performance improvements we think we can make.

Continuing on, at this point we can provide HPFS as a service, as opposed to just calling it VPFS, because the actual file system is running on metal. The issue that we face is that we have to limit allocation per user. This allocation is now right on the metal, not through block storage. It can be done because, for example, Lustre supports the quota concept, where if you have a user you can limit that user's quota. So that's fine, but you have to have a user first. For that we need to connect the virtual and native user namespaces. That's one of the challenges, and it's also a security challenge: once you access from inside the guest, you need some kind of control over what the guest is actually accessing, because now it's accessing the actual files, right? So what we are trying to do is use a unified LDAP database for Keystone and for the native user authentication. That would sort of solve this problem: now we have the same set of users inside the guest and also outside, and at that point we can use ACLs and whatever authentication frameworks are already available, and that would solve the namespace problem. So we are trying to use a single LDAP backend for Keystone and native authentication. This is one of the issues we are facing, but I think it's solvable. That's where we are right now. If this works out, then we can actually provide HPFS as a service through OpenStack to the users.

The other thing that we are working on is just another way to use a parallel file system: we are using Lustre as the backend for the virtual block devices.
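A rough sketch of the volume-backend direction described here and elaborated below, assuming the host mounts Lustre natively and each "volume" is simply a file on that mount handed to the hypervisor as a virtual disk instead of an iSCSI export. This is not the team's driver and not real Cinder or Nova code; the mount point and function names are assumptions.

```python
# Conceptual sketch only -- not the team's driver and not real Cinder/Nova code.
# The idea from the talk: the host mounts Lustre natively, and each volume is a
# file created on that mount, attached to the guest as a virtual disk rather
# than being exported over iSCSI from an LVM-backed volume server.
import os

LUSTRE_MOUNT = "/mnt/lustre-volumes"   # assumed host-side Lustre mount point

def create_volume(name, size_gb):
    """Create a sparse file on the Lustre mount to back a virtual volume."""
    path = os.path.join(LUSTRE_MOUNT, name)
    with open(path, "wb") as f:
        f.truncate(size_gb * 1024 ** 3)   # sparse allocation
    return path

def delete_volume(name):
    os.remove(os.path.join(LUSTRE_MOUNT, name))

# A real driver would then hand the returned path to the hypervisor (e.g. via
# libvirt) as a file-backed disk for the instance, instead of an iSCSI target.
```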
It's not directly related, and it's not very innovative in the sense that there are other backends, like Ceph, already in the code; we are just trying to do a Lustre one. There was a Lustre Blueprint I saw, but the code is still not there, so I guess this would be a contribution, if not a very innovative thing to do. Basically, the HPFS servers run natively, the clients run on the compute nodes, and this one is simple: instead of using iSCSI we can use the actual Lustre clients running on the host, and instead of using an LVM kind of driver we use a Lustre driver. That's straightforward, and we are doing that too. But that's not providing the same thing we were talking about; it's just providing a bit more bandwidth and reliability to the virtual block storage backend. We are working on that too. In OpenStack right now, the volume service uses logical volumes created on a volume server and exported over iSCSI; you guys know that. There is a one-to-one correspondence between the volume server and the backend right now, although there is talk of changing that; actually there has been an implementation changing that in the current release. But anyway, the point here is we can use Lustre instead: we don't need iSCSI anymore, because we can just run the Lustre clients on the actual host, and at that point we can connect it through libvirt. So we actually did that, and it works out okay.

So we implemented this, and then we did some evaluation, some experiments with it, just to see what kind of access works better. We used Lustre as the backend, as we talked about, and KVM instances on RHEL 6. The VirtFS framework works with KVM only, but as an idea, I guess similar frameworks could exist for other hypervisors; here we are using KVM. And we used IOR, which stands for Interleaved Or Random; it's basically an HPC benchmark and it can be configured to simulate HPC program accesses. There are papers about that; it's a popular benchmark to use. So we used that to get some numbers out of our frameworks. What we did is we used two clients accessing a Lustre installation. The clients could run inside VMs, or outside the VMs on the hosts, as we talked about. So there are two clients, and the Lustre installation has one MDS and two OSSs, and each OSS has four OSTs. It's not a very big installation; it's a typical small installation, and each file is striped over all OSTs. For the IOR configuration, we are using the MPI-IO interface, one task per client, writing to and reading from separate files, and to avoid caching, if client A writes a file then client B reads it, and vice versa, so that there is no read caching on either node. And we have 8 GB files. The transfer size means the size of data that's transferred per operation from memory to disk, and the block size is basically the total size per file here. In IOR, if you have a loop, it's the total data transferred per iteration of the loop, let's say, but here we have just one block per file.

And what we found is this. These results are normalized with respect to the VPFS performance, which would be the typical configuration that you would use, right? Sorry. What we see is that if you use the file system pass-through rather than the client in the VM, we get a big bump in performance, at least in write.
The pass-through actually does better than the client-in-VM configuration; native is much higher, obviously, but even the pass-through gives you something like a 15% improvement, from around 190 to 210 or so. The read improvement is not that much, but it's still better. So it seems using a file system pass-through is the way to go to provide an HPFS service, and that could be incorporated into OpenStack as well; the technique itself is mostly orthogonal to OpenStack. So that's what we found out. Those are the experiments we did, and we are still working on it with a bigger installation, because this one is not very big, and basically working on the framework to provide this as a service.

So that's pretty much it. As a conclusion: an HPFS service is an important requirement; that's what we feel, and the people we have talked to today feel that too. We presented our ongoing work on effectively integrating HPFS in a cloud environment. We identified and implemented possible methods of HPFS access from a VM instance, and the current methods are substantially less efficient than using a file system pass-through; that's what we found. We are still working on implementing this, and the ongoing work is what I talked about: investigate the results a little more; HPFS as a service is okay, we just need to do the orchestration and the scheduling, which is the first part we talked about; for the second part, if we provide a native file system as a service, integrating the guest and host user domains is one of the challenges, and limiting the user quota is the other one, which sort of depends on the first. We are also working on a driver at the backend of the volume service in OpenStack. At some point all of this will come together, and then we should have a file system service provided through OpenStack for HPC. So that's pretty much it. If you guys have any questions, David is here; David was the one guiding me through this during my internship, so thanks to him and the team.

Yes? Could you speak a little bit more loudly? No, it's very common, standardized and open source, and usually... The question was: is Lustre something that we are building, or is it already available? The answer is that it is available; it is very, very common in the HPC domain.

What is the version of Lustre you used? Right, we used 2.1.

So my understanding is that until version 2.4 next year, the metadata server does not support clustering, the object metadata server; it only supports two nodes, it does not support scaling out or clustering. Right, the metadata server is just one per installation, and two if you want failover. That doesn't limit the scalability, because the metadata server is used once per file access: it just gives out the metadata, the file name information, and once a client has that, it can go to all the object storage servers, which can be many, and collect the actual file data. I think that's how most parallel file systems work, so that's not a very big bottleneck; Lustre has been operating in spite of that in HPC environments for a long time now. It is like that in 2.1, and as far as I know in 1.8 as well. But that's fine.

Another thing: could you go back maybe one or two pages to the diagram? There is a great loss moving from the native to the virtualized environment, for example the drop from 300 to 200, or the drop from 150 to almost 100. So do you have any suggestion about when this kind of setup should be used and when it is better not to use it?
So obviously it's just the client part; the server part is native in both scenarios. We don't have a suggestion right now. We are looking at that, at why there is such a big degradation and what we can do about it. There is one idea that could be done: the file system pass-through basically calls the OS APIs, and we could avoid that; we could make it call the Lustre APIs directly on the host, which would help somewhat. That's just programming, and we are doing it, but I don't know how much it will give; that's what we are investigating. Can we do something better? That's what's vexing us. We do not have an analysis yet of where the overhead is and how much it is. But here we at least show three different implementations of HPFS for a virtualized environment, which is a structured way to provide it as a service, and that would be great. Native is the best, but so far there are security concerns and that kind of thing, so we do not have a clear answer now, but we will investigate more.

Yeah. Sorry, I had my hand up before. The VirtFS access that you have, is it specific to...? It is specific to KVM and Linux; it has nothing to do with Lustre. You can access the host file system as well, very easily. VirtFS was the paper which came out in 2009 at the Linux Symposium, and they didn't use Lustre as the backend; it was just connecting to the host file system. We use Lustre, and it's very straightforward. But it's not specific to the host file system.

So it wasn't clear to me that you actually solved the namespace mapping issue. No, we don't have working code to submit. We are looking at the LDAP formats, basically how Keystone uses it and how we would use it for the native user authentication, and trying to map them, but we haven't solved it yet.

So I think this is part of a much broader problem which isn't really specific to HPC: people want access to existing file systems without having to dump things into an object store or go through some other interface. Yeah, that's correct. We have some, I don't know, customers who have the same problem; I cannot divulge more, I guess, but yeah, that's a very common problem. They had a Lustre installation and then they sort of had to throw it away. If we solve it, we'll let you guys know, I guess. It's important.

Well, it sounds like you should be involved in the session tomorrow morning. Sure, sure. Like the storage volume plugin for Cinder, because they're looking at a similar thing. Okay, thank you.

Yeah. The tests that you mentioned, were they...? No, these tests were Ethernet only. InfiniBand, we have the hardware, but we haven't tested it yet. The diagram was generalizing it, but yeah, the results are on Ethernet. But that is something we are definitely going to do, because we have the hardware now.

Please. We cannot say that this result is the final result; that would not be fair. It's not exhaustive. We are working on it.

Did you guys look at GPFS, and how would it compare with Lustre? No, we haven't. We took Lustre; this is a research institution, and Lustre is one of the representatives. It's a lot of work to set these things up, and a lot more work to make a VM work with the installation, so I don't think we are going to look at other parallel file systems. Lustre is what we are looking at because, to some extent, it should be representative, right? The mechanism is the same in most parallel file systems. But yeah, we are looking at Lustre now.
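For reference, a sketch of how a virtio-9p (VirtFS) pass-through share is commonly wired up with KVM and libvirt, as discussed in the questions above. The directory names and mount tag are invented, and exact options vary by libvirt and kernel version, so treat this as illustrative rather than the exact setup used in the talk.

```python
# Illustrative only: how a virtio-9p (VirtFS) pass-through share is commonly
# described to libvirt/KVM, and how a guest would mount it. Directory names and
# the mount tag are invented; exact options vary by libvirt/kernel version.

# Host side: a <filesystem> element in the libvirt domain XML exposing a host
# directory (here, a natively mounted Lustre client) to the guest over virtio-9p.
host_side_domain_xml = """
<filesystem type='mount' accessmode='passthrough'>
  <source dir='/mnt/lustre'/>     <!-- host directory, e.g. a native Lustre mount -->
  <target dir='lustre_share'/>    <!-- mount tag seen by the guest -->
</filesystem>
"""

# Guest side: mount the tag with the 9p filesystem over the virtio transport.
guest_side_mount_cmd = (
    "mount -t 9p -o trans=virtio,version=9p2000.L lustre_share /mnt/lustre"
)

print(host_side_domain_xml)
print(guest_side_mount_cmd)
```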
Did you look at the latency or just the bandwidth? Because I would expect that to be one of the really big differences in running in a VM as opposed to running on bare metal, right? Because on bare metal that's one of your big benefits. Sorry, do you have any numbers on how the different approaches affected the latency? Not really, no; these are bandwidth numbers, but that's something to look at, yeah. It will make more sense to do that once we have the InfiniBand behind it to test. Yeah, yeah. Okay.

Is there anyone already working on deploying this kind of setup? Well, we are getting hardware in the next month that's pretty similar to this, so I'm going to be running Lustre for our VM disks, and that's all we're going to be running, over FDR InfiniBand, so we'll be looking at pretty similar stuff. But I don't know whether we'll be able to deploy that in production or whether we'll have to do something a little more standard. Thank you. Thank you.