So indeed, my name is Damien. I work at the HPC center of the university in Louvain-la-Neuve, something like 30 kilometers south of here. And from our HPC center, we have seen the big data world slowly coming at us from far away, but every year coming a bit closer. We have heard about the convergence, we have read about the convergence. And as I myself have a background in what we called, at the time, data mining, I did not want to be caught by surprise when the convergence hit us. So I looked into the thing, and in the next 20 minutes I will share with you some insights about concrete steps we can take (do I need to speak louder? OK), about concrete steps we can take in HPC centers to make the convergence something concrete in the context of scientific computing.

Because you know, scientists are never happy. Some of them have beautiful equations, they have models, but what they want in the end is to have data. Others have data, lots of data, but what they want is to have beautiful equations, to have models. Maybe you are wondering what Shrek is doing in there, and maybe you have seen that in the previous slide there was a dwarf from Snow White. The reason is that, in the literature, authors have looked at the kinds of problems that consist in going from data to models or from models to data, and they gave the name HPC dwarfs to those which are compute intensive and big data ogres to those which are data intensive. And when those ogres or dwarfs become too large to fit on a laptop, typically, it is not always the case, but typically, the dwarfs move to HPC clusters and the ogres move to clouds.

What people love, or like, about clouds is the fact that they are instantly available: you just click somewhere and, as long as you pay for it, you have access to some resources. Also, it can be either self-service, where you manage your virtual machines and your virtual networks, or software as a service, where you just use your browser to access some resources, click here and there, and you have your computations going, for instance. What people also like about the cloud is the elasticity, shrinking or increasing the resources as you want, and the fault tolerance: when some resource fails, it can be replaced with another resource instantly. So the cloud is very versatile in a way.

By contrast, clusters are close to the metal, and people like the fact that they offer high-end, dedicated hardware such as GPUs, accelerators and so on, and that access to the resources is exclusive: when you are given a set of CPUs, only your job is able to use them. So they are much more designed for performance than for versatility. But more and more, people from both sides have been looking at what the other side had to offer. The interests of both sides are converging, and we can see that very clearly from the public cloud actors, for instance, all of which now offer an HPC solution in the cloud.

So the main question I want to address here is: as an academic HPC center, what should we do to manage this convergence of interests? Most large HPC centers in Europe have started looking at clouds, and so we also need to have a look at what the cloud can offer. To do that, I first need to review a bit what the cloud offers and what the cluster offers, to see how we can merge both.
A cluster is typically built on high-end hardware: costly CPUs, a lot of memory, very fast networks, very fast disks. On top of that you have the operating system, most of the time with RDMA, which is used a lot to bypass the operating system and make sure that performance goes from the application to the hardware very smoothly. Then on top of it you typically have an MPI stack and a parallel file system, everything managed by a resource manager, or scheduler. And on top of that you have the HPC ecosystem, the user ecosystem.

The cloud is a bit more versatile. You also have a layer of hardware here, mostly commodity hardware, so no high-end CPUs but entry-level processors maybe, regular networks, regular disks and so on. On top of it you have the operating system, and on top of the operating system you have, most of the time, the hypervisor, which is the layer that offers versatility at the expense of performance. Then, as I am sure you know, in the cloud you have three levels of services. Classically there is the infrastructure-as-a-service level, where you can manage virtual machines, virtual networks and block storage. At the very top you have software as a service, mostly accessed through the web or apps, where you abstract the application away and just use it as a service. In between you have platform as a service, which can be related to the HPC platform in the sense that you have some storage solution, typically a distributed file system such as HDFS, and you have MapReduce or Spark or other frameworks with which you can perform your computations, all managed by a resource manager. You can already see that when we want to merge both stacks there will be some conflict in the resource management, because both stacks have one. And then you have the big data ecosystem on top of that.

Just to make the big data and HPC ecosystems a bit more concrete, this is a list of typical things offered by both stacks. In terms of languages we see, most of the time, C, C++ and Fortran in HPC, and Java, Python, R and Scala in big data. What you see here is just a list, but actually the ecosystem on the big data side is more diverse than the ecosystem in HPC, so there are things from the big data side that could be very beneficial in the HPC world.

So now I am going to detail five ways that cloud or big data technologies can be embedded into clusters, into HPC. The first one is virtualization. With virtualization we can bring more control to the users in an HPC environment, and also bring more isolation. There are mainly three ways we can bring this virtualization onto the cluster. The first one is simply to install your regular cluster stack and then, on top of it, install a tool such as pcocc, which allows spinning up virtual machines inside a Slurm allocation, for instance. So it is a way to add virtualization on top of the existing HPC stack. You can also work the other way around; it is often called HPC on demand, or HPC as a service: first deploy a cloud stack on your hardware and then deploy the HPC stack inside the virtual machines. One tool you can use for that, which is open source as well, is TrinityX, which uses OpenStack behind the scenes to create virtual clusters that you can deploy for your users. Those two options use virtual machines. The third one is to use containers. We have seen a presentation about Singularity just before; it is not the only option, but it is a very popular one. Containers are at this point in time very popular in HPC to bring virtualization to the users and allow a user to have, for instance, a software stack which is only available on Ubuntu running on a cluster where the operating system is something different, Red Hat for instance.
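To make this a bit more concrete, here is a minimal Python sketch of what such a containerized job step could look like, assuming Singularity is installed on the compute nodes; the image name and the command are only placeholders:

    # Minimal sketch: run one step of a batch job inside a container, so that
    # an Ubuntu-only software stack can run on a Red Hat compute node.
    # Assumes Singularity is available; image and command are placeholders.
    import subprocess

    IMAGE = "ubuntu_stack.sif"            # hypothetical container image
    COMMAND = ["python3", "analysis.py"]  # hypothetical user workload

    # 'singularity exec <image> <command>' runs the command inside the container.
    subprocess.run(["singularity", "exec", IMAGE, *COMMAND], check=True)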
The second way we can use the cloud in HPC is through what we call cloud bursting. The idea is that you have an HPC cluster available on premises, and when the demand, the requests for computational power, goes beyond what you can offer, there are ways to provision virtual machines in the cloud to make your cluster bigger, and actually to bring elasticity from the cloud to the cluster. Every scheduler has this option now; in Slurm it is called elastic computing. You define a set of compute nodes which are of type cloud, and when your users request more power you can activate them, provision the virtual machines, and from the perspective of your scheduler they are then seen as additional compute nodes available to the users. A small sketch of the script the scheduler could call to provision such nodes is shown a bit further below.

Besides virtualization and cloud bursting, a third way we can enrich clusters with cloud technologies is to offer additional storage paradigms. Typically, what you have on a cluster to store your data is file systems, maybe four or five different file systems: one which is parallel, one which is NFS, another one for archiving, and so on. But the cloud has many more storage types to offer, and one of the big problems clusters have to face is what we call the ZOT-files problem: zillions of tiny files. Clusters are mostly designed to cope with very large files, but what we see more and more is users coming with millions or billions of very small files containing very little information, and in the end storing more metadata than they are storing actual data. We think that by replacing some of the file systems with object storage, such as Swift or Ceph, we can try to move those users towards the kind of storage which is much more suitable for that.

Another thing we can do is add Hadoop connectors on top of the existing parallel file systems. Most parallel file systems now offer, besides their usual POSIX interface, a Hadoop connector that allows running MapReduce or Spark workflows directly on top of the parallel file system without the need to install HDFS. And finally, a last kind of data storage solution that can be useful in a cluster environment is the whole ecosystem coming from the NoSQL world. So maybe, besides the file systems, we need some Elasticsearch, or we need MongoDB, Cassandra, InfluxDB or Neo4j, to be able to store data in a way which is consistent with, and suitable to, the type of data. For instance, if you have graphs, rather than storing CSV files with an adjacency matrix, you store them in a Neo4j cluster.
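To give an idea of what moving such users towards object storage could look like, here is a minimal Python sketch, assuming an S3-compatible endpoint such as a Ceph RADOS Gateway; the endpoint, the credentials and the bucket name are only placeholders:

    # Minimal sketch: storing many tiny results as objects in an S3-compatible
    # object store (for example a Ceph RADOS Gateway) instead of as millions of
    # small files on the parallel file system. Endpoint, credentials and bucket
    # name are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://objectstore.example.org",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="simulation-results")

    # One object per tiny result, instead of one file (plus metadata) each.
    for i in range(1000):
        s3.put_object(
            Bucket="simulation-results",
            Key=f"run-42/sample-{i:04d}.json",
            Body=b'{"value": 0.0}',
        )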
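Coming back to cloud bursting for a moment, the sketch below gives a rough idea, in Python, of the resume script that Slurm's elastic computing could call to power up the cloud-type nodes; the actual provisioning call is only a placeholder:

    #!/usr/bin/env python3
    # Rough sketch of a Slurm ResumeProgram for cloud-type nodes.
    # Slurm calls this script with a hostlist expression of the nodes to start;
    # the actual provisioning of the virtual machines is left as a placeholder.
    import subprocess
    import sys

    def provision_vm(node_name: str) -> None:
        # Placeholder: call your cloud provider's API or CLI here.
        print(f"provisioning a cloud VM for node {node_name}")

    hostlist = sys.argv[1]  # e.g. "cloud[001-004]"
    # 'scontrol show hostnames' expands the hostlist into individual node names.
    nodes = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for node in nodes:
        provision_vm(node)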
Then there is a fourth way: we can bring additional programming paradigms to our clusters. Typically, what clusters offer is an MPI stack, a way to write parallel programs which gives a lot of control to the user, but it is also non-trivial to write correct and performant MPI code. At the other end of the spectrum there are job arrays. Many clusters just allow you to run job arrays, which are dubbed embarrassingly parallel computing because there is no communication at all, so they are very easy to implement, but they are not very efficient, sorry, not very expressive in terms of what you can do with them. The tools offered by the big data ecosystem, with the MapReduce paradigm, come in between those two extremes, MPI and job arrays.

There are several ways you can use those tools on a cluster. The first one is to use MapReduce or Spark, for instance, in standalone mode. I will not discuss this here, because then you lose the benefit of having several compute nodes working together. What you can do instead is use a tool such as MyHadoop to deploy a Hadoop framework inside a regular job allocation, be it from Slurm or PBS and so on. It is just a set of scripts that you call from your submission script, which spin up HDFS storage and offer the MapReduce or Spark API for your workflows (a small sketch of the kind of Spark script a user would then run is shown a bit further below). You can go a bit further and try to disguise the HPC scheduler behind the big data, or Hadoop, platform with tools such as hanythingondemand, HAM or Magpie, which offer the MapReduce paradigm and, behind the scenes, simply request the correct resources from the resource manager, just to make sure you can use your scheduler as a Hadoop solution.

Another thing you can try is to take advantage of the fact that the Hadoop ecosystem is very fault tolerant. People have looked at ways to run the Hadoop stack on top of the idle nodes of a cluster: on your cluster, some nodes are allocated to jobs, while some other nodes are just waiting for jobs to come or for other resources to be freed. You can leverage those Hadoop tools and actually inform the Hadoop scheduler, YARN or Mesos, of Hadoop nodes that appear when jobs finish, or that disappear when jobs start. It is just a matter of playing with the prologue and epilogue scripts on your cluster to be able to have both stacks. It takes away a bit of efficiency in terms of scheduling and in terms of computing, but people have shown that it is totally doable. And then the goal, the far objective, is to have a unified big data and HPC stack. Maybe someday we will have that; some large actors are playing in the field and trying to develop something like that. I am not sure it will come out as free software, but anyway, people are working on this.

The fifth and final way is to enrich a cluster with web applications, and allow users, for instance, to submit jobs through a web interface rather than asking them to go through SSH. For some people, using the command line is a bit of a hassle, and they would prefer to be able to click here and there to submit their jobs, or to use tools such as RStudio or Jupyter to submit jobs interactively, in notebooks and so on. Also, in the same software-as-a-service mindset, you can add tools like NextCloud to your cluster to allow your users to access the files residing on the file systems from their phone, from a web interface or with a desktop client, and let them share those files with people who do not have access to the cluster.
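As a small illustration of how such a web portal could talk to the scheduler, here is a minimal Python sketch, assuming the web node is allowed to call sbatch on behalf of the user; the batch script name is only a placeholder:

    # Minimal sketch: a helper a web portal could use to submit a job on the
    # user's behalf by calling sbatch. The '--parsable' flag makes sbatch print
    # only the job id. The batch script name is a placeholder.
    import subprocess

    def submit_job(script_path: str) -> str:
        """Submit a batch script and return the job id as a string."""
        result = subprocess.run(
            ["sbatch", "--parsable", script_path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(submit_job("analysis_job.sh"))  # placeholder script generated by the portal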
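And coming back to the MyHadoop scenario mentioned above, here is a minimal sketch of the kind of Spark script a user could run once Spark has been spun up inside the job allocation, assuming the input sits on a shared file system; the path is only a placeholder:

    # Minimal sketch: a word-count Spark job of the kind a user could run once
    # MyHadoop (or a similar tool) has started Spark inside the job allocation.
    # The input path is a placeholder on a shared (parallel) file system, read
    # directly through the file:// scheme rather than HDFS.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-in-allocation").getOrCreate()

    lines = spark.read.text("file:///scratch/myuser/corpus.txt").rdd.map(lambda r: r[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, n in counts.take(10):
        print(word, n)

    spark.stop()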
So these are just five directions we can take to make clusters look a bit more like something that is friendly to big data workflows and that leverages cloud technologies.

And so, if we were to design such a cluster now, what we would need is compute nodes that come in three distinct flavors: some with very fast local storage, very fast SSDs; some with accelerators, GPUs or anything else, some of them dedicated to GPGPU or to machine learning and deep learning; and some compute nodes with very high memory. The software stack on those compute nodes would let you run either a regular job, or something that is held in a virtual machine or in a container. Then, besides your typical file systems, you would also have nodes dedicated to databases, to NoSQL databases, which are the ones depicted in green here. You would have the Hadoop connectors on top of your file systems to make sure that Hadoop workflows can run easily on the cluster. And then you would have basically three types of front-end nodes: the typical login nodes, where users can use SSH to submit jobs and so on; web nodes that run RStudio or Jupyter, to let users work with notebooks and those kinds of interactive ways of using the cluster; and data transfer nodes that would have the current tools, such as GridFTP and so on, but also tools such as NextCloud to give users an easy way to share their files, and also the data transfer tools from the Hadoop stack.

And so I hope that, if we can one day do this, we will make scientists happy again. I thank you for your attention.

So the question is: are we actually doing what I described here? Yes and no. Yes, in the sense that every single piece has been tried somewhere, at some place; and no, in the sense that we are still not offering a single user interface for all these services. We have NextCloud somewhere, we have RStudio in some other place, we have a typical cluster, we have MyHadoop, but we do not have a single cluster that offers all these things at the same time. But that is what we are trying to achieve.