Thank you for letting me into this room. It's very crowded outside, so it's a great honor to be standing inside. I will talk about a consulting job I did.

First, some words about me. I did my PhD in computational physics a long time ago. I was involved in projects like Minix for the 68k and the mathematics library for Linux, where I did the high-precision math. I ported Perl and Python to the Psion 5, worked on flight simulators, and made some other contributions. I am a maintainer of msktutil, an Active Directory Kerberos integration tool. At the moment I am on the PMC of the Apache Bigtop project; more about that later. In my real life, my other-world life, I'm a software architect for connected e-bikes.

What I'm talking about is how to attack a problem when you have a consulting job with a big data label on it. The problem I had to solve comes from cheminformatics, from the development of medical drugs. The problem at hand: you have a large database of molecules that can be produced quite cheaply, for instance, and each molecule is written down in the SMILES notation, a compact text notation for its chemical structure, like this one here. I don't remember the name of it; I just picked one at random. The task is to look for substructures in these molecules that have some medical effect, that can make you healthy, so to speak.

That's not a new problem, and there is a commercial solution for it: you take an enterprise database, put something like a cartridge into it, and then you can run a SQL query and get results. The problem is that it takes very long, about a day or so, it's not very reliable, and it's very expensive. So the customer looked for a big data approach to this problem.

Fortunately, there is an open source project called RDKit, a beautiful library for cheminformatics. You read in a molecule from its SMILES notation and you get a molecule object, a Python object, and then you can simply ask whether it has a substructure match against another pattern. That gives you true or false (a small sketch of this follows below). That's fine.

So it's always the same thing. We have the ingredients: a time-consuming job and a large data set. Make it fast. The environment we wanted to use is a big data cluster or an HPC cluster; we have both. So what can we do?

First, how not to scale out. You can simply benchmark the job and see that reading the SMILES notation and constructing the molecule object is the most time-consuming part of this whole problem. So we can read everything in once and serialize the molecules, which is called pickling in Python. You dump the objects to a file, so you do not have to reconstruct all the molecules every time (also sketched below). That's a huge gain in runtime, but it does not scale anything, and we are looking for scaling: the program should get faster the more machines we put in.

Looking at the problem, it is simply an EP problem: embarrassingly parallel. We can divide the database into small chunks and throw each chunk at a different machine. Quite easy. And the big data approach is to distribute the algorithm to the machines, not to bring the data to one machine. So the beautiful framework for us is Spark: plain Spark core, no special ingredients needed.
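To make the RDKit step concrete, here is a minimal sketch of such a check. The molecule and the query pattern are made-up examples; SMARTS is RDKit's usual query language for substructure patterns:

```python
from rdkit import Chem

# Parse a molecule from its SMILES string (aspirin, as an arbitrary example).
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Parse the substructure query as a SMARTS pattern (here: any benzene ring).
query = Chem.MolFromSmarts("c1ccccc1")

# MolFromSmiles/MolFromSmarts return None if the input cannot be parsed.
if mol is not None and query is not None:
    print(mol.HasSubstructMatch(query))  # True or False
```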
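And the pickling trick might look roughly like this; a sketch assuming one SMILES string per line, with placeholder file names (RDKit molecule objects support Python's pickle protocol):

```python
import pickle
from rdkit import Chem

# One-time pass: parse every SMILES string and serialize the molecule objects.
with open("molecules.smi") as src:  # hypothetical input file
    mols = [m for m in (Chem.MolFromSmiles(line.strip()) for line in src)
            if m is not None]  # skip unparsable entries

with open("molecules.pkl", "wb") as dst:
    pickle.dump(mols, dst)

# Every later run loads the ready-made objects instead of re-parsing SMILES.
with open("molecules.pkl", "rb") as f:
    mols = pickle.load(f)
```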
And we use the RDD paradigm, the Resilient Distributed Dataset: the files are chopped into partitions, so we can distribute them to all the machines. Spark is beautiful because it runs both on HPC and in a big data environment; it is not tied to HDFS, for instance. So all I did was read the Spark documentation (I had never used Spark before), and it's quite easy. I read in a text file (in fact I'm reading in the pickled file, but never mind that), so I have the input, and I can distribute the algorithm, the matching function, to the cluster and do some aggregation to get benchmarking results. That's all. Compare that with what I remember about constructing an MPI job: all the initialization you have to do, and it isn't fail-safe. This is inherently fail-safe, because Spark handles it if a node crashes or misbehaves. The only thing to take care of is that every node can read the file; if I have some kind of shared file system, I can distribute the data and run in that environment. (A sketch of such a job is at the end of this transcript.)

So HPC mode is quite straightforward: use the cluster file system, dump the Spark jars onto it, put the data in the same directory, and use the standalone mode of Spark. It does not use data locality, but this job is CPU-bound, not data-bound.

Now let's compare that with the big data setup. Here I have to make some advertisement for the Apache Bigtop distribution, and I have only one minute left. Apache Bigtop is a distribution of big data components, used by Google, Cloudera, Canonical, and ODPi. It contains all the usual stuff. It's important that we have a compile environment; we have package repositories; we have provisioning with Docker Compose for testing, of course (the OpenStack provisioning is broken right now, sorry); we have deployment templates based on Puppet, and orchestration with Juju Charms, the Canonical thing, if one likes to have it. We have an automated testing environment, and we support non-Intel architectures. Here is a glimpse of the CI system: the Linux distributions are CentOS 7 and 6, Debian, Fedora (Fedora on PowerPC), and Ubuntu on ARM64, and the components are Hadoop, Giraph, Hive, and so on, the whole big data ecosystem.

So the big data setup is: deploy the cluster with the Puppet scripts, use HDFS, put the data into it, run Spark in YARN mode, and unpack Spark 2 by hand, because Bigtop does not ship it yet. And it works. A preview of Bigtop 1.2: we will have Spark 2, but unfortunately it is not finished yet, and we need more help. Please join us at bigtop.apache.org.

Let me conclude. The problem runs much better in the HPC environment, because it is compute-intensive, not a data pipeline; the big data environment is optimized for data throughput, not for compute runtime. And the problem scales really well: the runtime goes down as 1/n, where n is the number of machines we put in. On the plot you see the total runtime I have to wait for the job, against the commercial solution, which is a flat line because it is a fixed environment, and which is too expensive for the customer to pay. So we will have to investigate how much further we can speed it up. Thanks.

Thank you, sir. Any questions? One question.
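For reference, a minimal sketch of the kind of Spark job described above, assuming one SMILES string per line in a file that every node can read; the path and the query pattern are placeholders, not the actual job:

```python
from pyspark import SparkContext
from rdkit import Chem

sc = SparkContext(appName="substructure-search")

# Ship the query pattern to all workers once; we broadcast the SMARTS
# string itself and parse it on the worker to keep the sketch simple.
query = sc.broadcast("c1ccccc1")

def matches(smiles):
    """True if the molecule parses and contains the query substructure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return mol.HasSubstructMatch(Chem.MolFromSmarts(query.value))

# Every node must be able to read this path (shared file system or HDFS).
hits = (sc.textFile("file:///shared/molecules.smi")
          .filter(matches)
          .count())

print("molecules with a substructure match:", hits)
```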