Hello, everyone. I'm Rajil Kumar, a graduate student at Northeastern University, and I've been working as a research intern at the Massachusetts Open Cloud under the guidance of Evan Weinberg at Boston University and Chris Hill at MIT.

So why do we believe HPC and cloud could be brought together and complement each other? If we look at HPC clusters, the workload is effectively infinite: the job queues are always piled up, and jobs are always waiting for resources. Clouds, on the other hand, are generally over-provisioned, sized with peak workloads in mind, so they tend to stay underutilized most of the time. If we could bring these two together, HPC could supply workloads to the idle resources in the cloud, and they could benefit each other. However, this has to be done without impacting the cloud workload: as a cloud provider, you cannot turn away a customer's resource request just because HPC jobs are running. So we have to make sure the HPC jobs are dynamic and resilient to this preemption of resources.

We started with a simple use case: running single-node HTC jobs. HTC (high-throughput computing) jobs focus on task completion rather than performance; the end goal is simply to get things done. So we have an HPC cluster that gets backfilled by HTC jobs. These are single-node, self-sufficient HTC jobs consuming the idle cycles of the HPC cluster. Now, if a high-priority workload comes in, these HTC jobs get killed by that high-priority workload and have to start all over again. Let's say a job was 60% or maybe 90% of the way to completion when a high-priority job arrives: the HTC job has to be killed and re-queued, and it starts again from zero.
So effectively, in this situation, the idle cycles are being consumed but you gain nothing out of it; it's as good as the cluster remaining idle. So we asked: can we run these HTC jobs inside virtual machines on a cloud, and suspend and resume the jobs and the virtual machines as and when the resources have to be released back to the cloud?

So we came up with this implementation. Say we have an HPC cluster on the left, where the purple nodes are HPC nodes running HPC jobs and the blue ones are running HTC jobs. On the right we have an OpenStack cluster, with the green nodes showing the cloud workload and most of the resources lying idle. We have a resource monitor that keeps track of resource utilization on the cloud. The first thing we needed for this implementation was a gateway between the two clusters, so we created a gateway on each cluster and connected them via OpenVPN. Next, we implemented a control daemon that drives the overall allocation of jobs from the HPC cluster onto the OpenStack cloud. It keeps track of the jobs on the HPC cluster, listens to the resource monitor for the cloud's resource utilization, and based on these two factors decides when resources are available to allocate jobs onto the cloud. Once resources are available, it connects to a provisioning and configuration management system, asks it to provision and configure the resources, then federates them into the cluster and gets the jobs allocated. It also controls the suspension and resumption of the nodes when required. So say HPC jobs come into the cluster and the control daemon decides it has to free up the resources running HTC jobs; it first checks with the resource monitor: do we have the cloud resources to get this done?
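The control daemon's core decision logic can be sketched roughly as follows. This is a minimal illustration, not the actual MOC implementation: the thresholds, function name, and data shapes are all assumptions.

```python
# Hypothetical sketch of the control daemon's decision loop.
# Thresholds and names are illustrative assumptions, not the real code.

HIGH_UTIL = 0.80  # cloud utilization above this -> reclaim resources (assumed)
LOW_UTIL = 0.50   # cloud utilization below this -> room for HTC work (assumed)

def decide(cloud_util, suspended_nodes, queued_htc_jobs):
    """Pick the next action from cloud utilization and HTC job state.

    cloud_util      -- fraction of cloud resources used by cloud tenants
    suspended_nodes -- virtual nodes whose HTC jobs are currently suspended
    queued_htc_jobs -- HTC jobs still waiting in the HPC queue
    """
    # Cloud tenants come first: reclaim resources from running HTC VMs.
    if cloud_util >= HIGH_UTIL:
        return "suspend"
    # Utilization is low: prefer resuming suspended jobs over starting
    # new ones, so partially finished work continues instead of restarting.
    if cloud_util <= LOW_UTIL:
        if suspended_nodes:
            return "resume"
        if queued_htc_jobs:
            return "provision_and_allocate"
    # Otherwise leave things as they are.
    return "hold"
```

In a real deployment this function would be called periodically with fresh readings from the resource monitor and the HPC scheduler's queue.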
Once it has the available resources, it connects to the OpenStack cloud, provisions the virtual machines, configures them, and federates them into the same cluster. Once these virtual machines are added to the cluster, the HTC jobs get allocated to those nodes. Now, some time down the line, utilization of the cloud goes up, and the control daemon decides it has to free up resources for the cloud and suspend those HTC jobs. So it connects to the OpenStack cloud again and suspends the virtual machines that were running the HTC jobs. The resources are freed, and the job state is stored as it was, so that when the jobs later resume, they continue from the same place rather than starting all over again. Once resource utilization goes down, the control daemon figures out it has to resume jobs. It checks whether there are already suspended jobs in the cluster or whether it has to allocate new jobs from the queue. In this case it knows there are a few suspended jobs, so it resumes those on the cluster, and they resume from the point where they were last suspended.

Apart from making all these modifications, the configuration, and implementing the control daemon, we did one more thing. We made some modifications to SLURM, the workload manager for the HPC cluster, which handles resource and job scheduling. By default, the SLURM controller keeps polling all the compute nodes in the cluster to track each node's health, and it marks a node down and removes its jobs if the node is unreachable for a specified period of time. We observed the same behavior for the suspended virtual nodes: if you suspend a node, the controller can no longer reach it.
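The suspend and resume steps described above map onto standard OpenStack server operations: suspending a server saves the VM's memory state and releases its compute resources, and resuming restores it, which is why the HTC job inside can continue where it left off. Here is a minimal sketch of how a daemon might drive this through the OpenStack CLI; the wrapper functions and node names are assumptions for illustration, not the project's actual code.

```python
import subprocess

def openstack_cmd(action, server):
    # Build an OpenStack CLI invocation. "suspend" saves the VM's RAM
    # state and frees its compute resources; "resume" restores it, so
    # the HTC job inside picks up where it stopped instead of restarting.
    assert action in ("suspend", "resume")
    return ["openstack", "server", action, server]

def apply_to_nodes(action, servers):
    # Issue one CLI call per HTC virtual node (illustrative wrapper;
    # assumes OpenStack credentials are available in the environment).
    for server in servers:
        subprocess.run(openstack_cmd(action, server), check=True)
```

For example, `apply_to_nodes("suspend", ["htc-vm-01", "htc-vm-02"])` would suspend two hypothetical HTC virtual nodes.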
So we made modifications to SLURM so that we could assign a state to a specific virtual node and notify SLURM that this node is temporarily unavailable and that it should keep the job state intact rather than removing the jobs, so they can be resumed whenever the node comes back. We have completed this implementation, run it on the testbed with satisfactory results, and we're now working on running the actual workloads.

As for the future prospects of this implementation: we would like to harden it and utilize the full data center's resources, so that we could run all these jobs on the actual cluster with dedicated networks between the two clusters. Currently we run single-node jobs, and we would like to extend this to multi-node jobs as well. In the current implementation we only move jobs from the HPC cluster to virtual nodes on the cloud, but with some additional machinery in place we could move a job between the virtual machines and the bare-metal nodes as required. And lastly, we would like to extend this to container frameworks.

In conclusion, what did we achieve? We could create a dynamic cluster that expands and shrinks on the fly with the least overhead to the existing clusters. We get better productive utilization, because we process more jobs on the HPC cluster rather than killing and re-queuing them, and we get better resource utilization of the cloud cluster on the resources that were otherwise lying idle. That's all from me. Thank you. Thanks for your time.
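As an aside on the SLURM change described above: the talk's actual modification adds handling inside SLURM itself, but stock SLURM already lets an operator change a node's state with a reason via `scontrol`. The sketch below shows how an external daemon could approximate marking a node temporarily unavailable and bringing it back; the function names and node names are assumptions, and this external approximation is not the patch the talk describes.

```python
def mark_unavailable(node, reason="VM suspended; jobs will resume"):
    # Build a stock-SLURM command that drains the node with a reason,
    # so the scheduler stops placing new work on it. (An approximation;
    # the talk's patch instead adds a dedicated state inside SLURM.)
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]

def mark_available(node):
    # Build the command that returns the node to service once its
    # virtual machine has been resumed.
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]
```

In practice these command lists would be executed with something like `subprocess.run(..., check=True)` on the SLURM controller host.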