Hello, good afternoon. My name is Rafael Monnerat, and I am going to talk about hyper-convergence, or rather, hyper-convergence meets big data. I work for Nexedi, from Paris. I will present the hyper-convergence that we do with SlapOS and how we deploy big data projects using Wendelin: how we deploy it, how we normally upload data, and at the end of the presentation I will run some quick demos.

The goal of this presentation is not necessarily that you go and use SlapOS, or that you do big data with Wendelin. Rather, the merger of the two, as we have been using them, is in our view a key direction for the future of hyper-convergence with big data: collecting data and automating big data deployments for the Internet of Things.

These two tools, and this presentation, reflect how Nexedi works with its customers. Nexedi is one of the largest open-source publishers in Europe: despite being a small company of just 30 or 40 employees, we have produced a large amount of open-source software, and I will cover two of those projects today. This is the stack that I am going to base myself on today. These tools were mostly created to address a need of a customer who could not find an alternative solution to the problem they had, and during the presentation I will give some examples of how SlapOS was designed and implemented, and how, as it evolved, the tool came to cover topics which are not exactly covered by other tools.

So this is the stack, fully open-source and mostly based on Python, except for Fluentd, which is written in Ruby and was not written by Nexedi; we simply could not find an equally reliable solution in Python on the market today. Wendelin Core (wendelin.core) provides out-of-core PyData, which means that we can process data which is larger than the RAM of the computer.
NEO is a distributed database; for those who know Zope, it is a distributed ZODB. ERP5 is an open-source ERP, not so exciting in this presentation. SlapOS is the tool I will present next. re6st is something we developed to provide interconnected mesh networks worldwide. Fluentd collects data, and scikit-learn is for doing machine learning and related tasks.

So this was the start of SlapOS. It started in 2009 or 2010; I do not remember the exact date today. When it was first built, we were proposing at that time to put servers in people's homes. So we designed a system that could be distributed in a way that works across more than one data center: it could run on Amazon, Rackspace, OVH or any other provider, and it was also able to host services in people's homes or offices in a distributed way. It started to work very well, and then, with the Internet of Things and other projects coming up, SlapOS became a tool that could be installed on machines in cars or trucks to provide mobile cloud; we have an ongoing project on this. It can be used to host services in people's set-top boxes; in France, for example, there is the Freebox, and you could produce an equivalent of it with SlapOS. It is used in wind turbines to collect data in order to signal when preventive maintenance is needed. And it is also used to create Internet of Things routers: you take a Raspberry Pi, install SlapOS on it, and then you can collect data from the various devices connected to the network, or manage certain services at your home or elsewhere.

So now we have gone beyond data centers. We can manage at the same time a mobile cloud and a normal CDN running in data centers, using exactly the same system, without significant modifications to any of the parts. And it is important to show that SlapOS can provide nodes everywhere, and it uses a central server.
For now it is only one master, but in the future there can be more than one, controlling any number of computers and devices with SlapOS installed.

So, to illustrate a bit: what is SlapOS? SlapOS runs on whatever Linux distribution is available today as a base. There are three core components present on all machines: SlapOS Core, which coordinates everything; Buildout, on which we are based, so that we can rebuild software from scratch or use cache automation to share already compiled software between two machines of the same architecture; and supervisord, which manages the processes.

On top of that, you have the software releases. A software release is a kind of equivalent of a group of packages, placed in the system in a special way so as to provide the binaries needed to run whatever service those binaries are meant for. And you can have several configurations on one machine: the same machine can run more than one version of MariaDB, or Apache, or whatever process, at the same time, without them conflicting with each other. The software releases themselves do not provide any running process; they are just the software sitting there. It is the software instances that run the services. You can think of it this way: installing a package only provides the binaries, and the instance side tells how the binaries will run and how the service will be composed. The software instances are a bit like micro-containers: they are containers, but lighter, in the sense that they do not add the overhead of copying the same group of files everywhere. That is why there is this major separation.

So this represents, for example, one machine running anywhere. It can be a machine in a data center, or my laptop, which I carry around everywhere, and it could be hosted in a car if there were a need for it.
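To make that separation concrete, here is a toy sketch in Python. This is purely illustrative, not SlapOS code: one software release provides shared, read-only binaries, and several instances configure independent services on top of it.

```python
class SoftwareRelease:
    """Installed once per machine; provides binaries, runs nothing."""
    def __init__(self, name, version):
        self.name, self.version = name, version

class SoftwareInstance:
    """Describes how the binaries run: one release, many instances."""
    def __init__(self, release, port):
        self.release, self.port = release, port
    def command(self):
        # Each instance starts the shared binaries with its own config.
        return "%s-%s --port %d" % (self.release.name,
                                    self.release.version, self.port)

mariadb = SoftwareRelease("mariadb", "10.1")
# Two services from the same binaries, no conflict: different ports.
a = SoftwareInstance(mariadb, 2201)
b = SoftwareInstance(mariadb, 2202)
print(a.command())  # mariadb-10.1 --port 2201
print(b.command())  # mariadb-10.1 --port 2202
```

The point of the model: adding a second MariaDB service costs only a new instance configuration, never a second copy of the software.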
One important thing to remark is that it can run virtual machines, a bit like what OpenStack does, and at the same time other services which are not virtualized, which are just processes. So a machine can combine many different kinds of services in a distributed way, and a cluster can use one machine or several machines; it depends on how the composition is configured. Later in this presentation I will show the big data case with Wendelin.

Based on this configuration, sharing services and sharing computers, we can supply several projects and run them all at the same time on the same machines. And as you can see from this list, they are significantly different in terms of goals. We are running, for example, a worldwide CDN today, which is present in China, so we have services in China. We have KVM clusters for big data in Teralab, a French project at the Institut Mines-Télécom that provides big data to large French companies. We have Wendelin for big data, which I will mention shortly; it is in production today to provide preventive maintenance for wind turbines in Germany. We have distributed test nodes, a kind of equivalent of Jenkins, distributed over several machines worldwide. We have a system that can produce VM images: we automate the work that Packer does for generating VMs, so we can generate pre-built VMs. And we also use it to provide Chromium OS images for Chromebooks: we have our own distribution of Chromium OS, called NayuOS, and we use SlapOS to build the images, for ourselves or for anyone who wants them.

And to make all this work, we needed to standardize the way it is installed everywhere.
If we had 20 different ways to install the same thing on different architectures, it would require much more effort to deploy anything. I will come back to this.

So how do we deploy it? We have a one-line installation script that asks whether you want to connect to a master. If you want to connect to some master, you tell it which one, and your device is connected to that machine. For example, this laptop is connected to the master, so I can use the master to deploy services on my laptop. If you have a mobile cloud, it is the same: I can control machines from the master and deploy whatever service to whatever machine is connected to it. We use Ansible, which is also a Python tool, to automate the setup of the node. It allows us, with the same one-liner, to set up a Raspberry Pi, a laptop like this Chromebook, or a production data center: Ansible takes care of the particularities of the system being installed. In this way we can support a very large number of Linux distributions, just by using this single command. If your preferred distribution is not supported for whatever reason, we will be happy to add it; we add them on demand.

You do not actually have to always connect to a master; you can deploy a standalone node just by skipping the questions and running these two commands here. If you type `slapos configure local`, your computer gets configured to use any software that is available for SlapOS, and the second command prepares the machine's partitions, folders and so on. When you install SlapOS, you also get a command-line tool that allows you to supply and request, and a console to automate the deployment of the software you want to deploy. So here is an example of a request and a supply: I am supplying the software release of a monitor to a computer.
And then I am requesting to run one instance of this monitor on that computer; it is the equivalent of setting up a monitor for one wind turbine, for example. Here are just variations of the requests that can be done. When you deploy this monitor, you also deploy Fluentd, which is what collects the logs from the machines.

Which leads me to the Wendelin part. Since SlapOS is everywhere, and standardized in a way that lets us put it anywhere, we were able to quickly deploy the Wendelin stack, a tool providing big data analysis and out-of-core Python. The advantage of using SlapOS in this case is that it does not require hours of setup to get the stack: even a data scientist with no background in setting up a cluster can set it up. It also allows people who are not data scientists to have the full stack, for example scikit-learn, NumPy, an out-of-core distributed database, and the IPython/Jupyter Notebook, ready to use for making some kind of calculation. So both sides benefit from the quick setup, by not spending time learning how to pip-install NumPy on ARM on your Raspberry Pi, or exercises like that.

Wendelin was also designed to work on commodity hardware, so it does not require a super powerful machine to be deployed. Keeping in mind the dimensions of what you are going to do, you can do big data with machines that you can buy in a supermarket: buy a few i7 machines and you can start doing big data. It is quite easy to find an i7 with 16 or 32 gigabytes of RAM anywhere, and SSDs are becoming cheaper and cheaper, so you can buy a one-terabyte SSD quite easily these days. And as everything was designed to be distributed with SlapOS, you can buy several cheap machines and then you have big data. You do not have to spend 100,000 euros buying expensive hardware to start doing big data.
So this stack is composed of average hardware. Of course, people who can afford it may buy more reliable hardware, but it does not require special servers. We use SlapOS; ERP5 serves as the base tool, with NEO providing the object database that we are going to manipulate soon; and scikit-learn provides machine learning and other features that you can use in big data. ERP5 is also used to provide an already old tool comparable to joblib: we have had distributed, asynchronous background processing, what ERP5 calls activities, for 10 or 12 years already. So since I started, I have been doing asynchronous programming.

But the stack is nothing if the data does not arrive: you can only do big data if the data reaches the tool. We use Fluentd, mostly because it is one of the most reliable tools that exist today. We ran a test in the office when we were selecting a tool: we put a laptop in our normal office and left it on over the weekend, pushing data to Wendelin, which was the system under test; we were not analyzing Fluentd itself. And it lost just one record out of a million over the space of two weeks, which is very, very reliable, because a laptop is turned off and on all the time: suspend is enabled, the person goes home with the laptop and connects from another network, it suspends again, goes to 3G, then back to Wi-Fi. Through all of this, only one record was lost. For the places where we cannot afford to run a Fluentd process, we can just run an HTTP server and crawl it with Fluentd. And what we extended in Fluentd in this case is the ability to stream binary data, as I will show soon: we can stream sounds in the .wav format to Wendelin and plot an FFT out of them.

So how do we deploy this? Here we learned how to request, onto whatever computer, a monitor which comes with Fluentd.
And here, in just two lines, we can request Wendelin. The full stack, with scikit-learn, NumPy, wendelin.core and several other scientific tools installed on it, is available just by typing these two commands on whatever node you want. Or, if you are in a standalone fashion and do not want to connect, you just want an instance in your VM on Amazon, you type this command and you get everything. Soon we are going to release, it was not ready for this conference, ready-to-use images for QEMU, Digital Ocean, VMware, VirtualBox and so on, which will provide ready-to-use, or rather ready-to-try, instances of Wendelin. So you will not have to go through the pain of installing big data anymore.

So here is the configuration; this is what I ran earlier today to upload the data. You generate a file which basically says: look at this folder for WAV files; save the position, to know what you have already sent or not; then you can tag your data. You can have different tags for different data and send them to different ingestion policies, so you can classify, or shard, or do whatever you want with your data. Here I just used the Wendelin plugin, which already comes with the monitor I just mentioned; but if you are installing Fluentd from Treasure Data, you can easily install it, it is just a file in a folder. Then you say where you are sending the data, and which user and which password. And here you can see that it found six WAVs and streamed everything to this ingestion policy. So in a few minutes you can start to ingest files into your big data system. It probably also works with the other equivalents of Fluentd, like Logstash and the other one whose name I forget, which is also written in Ruby.
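A configuration of the kind described above looks roughly like this. This is a sketch from memory, not the exact file from the talk: the plugin names (`bin`, `wendelin`), the paths, the endpoint URL and the credentials are all illustrative assumptions.

```
# Illustrative Fluentd configuration; plugin and parameter names are assumed.
<source>
  @type bin                              # assumed binary-file input plugin
  path /srv/sensors/sounds               # look at this folder for WAV files
  pos_file /var/run/fluentd/sounds.pos   # remember what was already sent
  tag wendelin.wav                       # tag used to classify/route the data
</source>

<match wendelin.**>
  @type wendelin                         # assumed Wendelin output plugin
  streamtool_uri https://master.example.com/erp5/portal_ingestion_policies/default
  user alice                             # credentials for the ingestion policy
  password secret
  buffer_type file                       # buffer on disk if streaming gets heavy
</match>
```

The shape is standard Fluentd: a `<source>` that tails files and remembers its position, and a `<match>` that forwards everything under the chosen tag to one ingestion policy.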
You can write compatible plugins just by making POSTs and making sure that you are consistent when you send the data. So this is what I just showed. You can limit the buffer and so on: you can buffer in memory, or, if you have too much data streaming, you can buffer on disk. And you can just run it manually like this: just run `fluentd` and pass it the configuration file. Or, depending on your setup, and whether you are using SlapOS or not, you can write a very complex configuration file. It takes just a few minutes. This was just what I showed: when you run it, it should simply report that it sent the data. Then, if you use different plugins, for example to collect syslog or the machine resource consumption, which we use too, you just have to adjust the source part. I can show other examples later if I have time.

So where does the data go? I will jump through this quickly. The data goes here: this is the UI of ERP5, in which you can simply store the data. And with this input, you can create the whole ingestion side, ready to use with the configuration file that you saw. It creates an ingestion portal where you can use Python so that, depending on the tag, for example, you write into different data streams. I am not doing anything complex here: whatever arrives, I just put in the same data stream. I can go to the data streams and just search for the WAV: here is the file I sent earlier, with the amount of data. I can manually upload a file, which will overwrite the data that is already there, and I can append to files manually. So you do not have to rely only on the front end to send data: you can upload some data that you need to manipulate by hand, or you can make POSTs, for example, to upload some data store that you have.

Then, how do we use this data? This data stream represents several WAV sounds that were streamed to this computer.
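Since ingestion ultimately boils down to an HTTP POST of raw bytes, a compatible client is easy to picture. The sketch below is self-contained and illustrative: the endpoint path and the `reference` query parameter are made up, and a tiny in-process HTTP server stands in for the real ingestion policy.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []

class IngestionHandler(BaseHTTPRequestHandler):
    """Toy stand-in for an ingestion endpoint: store whatever is POSTed."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append((self.path, self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), IngestionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# POST one chunk of binary data, tagged via the URL (illustrative scheme).
url = "http://127.0.0.1:%d/ingest?reference=demo.wav" % server.server_port
req = urllib.request.Request(url, data=b"\x00\x01\x02\x03", method="POST")
urllib.request.urlopen(req)
server.shutdown()

print(len(received), received[0][1])  # one chunk received intact
```

Any log shipper that can issue such POSTs consistently, Fluentd, Logstash or a hand-rolled script, can feed the same pipeline.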
I will show the data arrays when I start doing the demos. So here are the demos. I have SlapOS installed everywhere and my Wendelin set up; at normal speed, you can set up everything in one day or less. And I go to my demo. I hope I still have IPv6, otherwise I will not run the demo, just show it. And believe it or not, I have IPv6 here, even if you don't.

Instead of using the normal IPython notebook, we made a small extension to it. Not the entire tool: we just created a different kernel, with some magics, which makes our life easier and more reliable when we deal with data. I assume you know IPython minimally; if you get lost, raise your hand, because this will be the time to do it. So, some magics. These, one to four, just say where we are going to connect our notebook; it is just a reference. Then, when you finish, you get an object called `context`. You can see it as a kind of proxy; it is not exactly a proxy, but whatever you call on it, say `context.getId()`, you are doing a remote call. Let's see. I don't have IPv6 now... I do. So I can just call whatever I want: I am doing a remote call to that object. Here I don't have a relative URL. Based on this, I can get the data that I streamed: the data that you saw is under this path. But of course, nobody remembers the id of every object they want to manipulate, so you can search for objects using the catalog: you can make queries to the database to find out where things are, the paths and so on. You can just use `portal_catalog`.

Then it starts to get interesting. Even though the data is out of core, and you are not calling the methods in the IPython notebook itself but doing remote calls to manipulate the objects, you can still do bad things. For example, this call gets the entire data in the stream: if there is one terabyte of data, it will fetch everything, as a string, into your browser. Not good at all.
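The `context` object can be pictured as a thin proxy whose attribute lookups turn into remote calls. Here is a toy, self-contained sketch of that idea; it is not the real kernel, and the transport is simulated in-process, where the real one talks to the ERP5 server. The method names and values are illustrative.

```python
class Context:
    """Toy stand-in for the notebook's `context` object: looking up a
    method returns a callable that performs a (here simulated) remote call."""
    def __init__(self, transport):
        self._transport = transport
    def __getattr__(self, name):
        def remote_call(*args, **kw):
            return self._transport(name, args, kw)
        return remote_call

# Simulated server side: in reality the object lives on the ERP5 master.
server_object = {"getId": lambda: "wav-0001",
                 "getSize": lambda: 33 * 10**9}

def transport(method, args, kw):
    return server_object[method](*args, **kw)

context = Context(transport)
print(context.getId())    # looks like a local call, executes "remotely"
print(context.getSize())
```

The convenience is also the trap described above: a call that returns the whole payload will happily ship a terabyte back to the notebook.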
So the only thing you have to take care of is to use a different approach. Not so much different, but you have to take care not to load all the data at once when you want to manipulate it. Here are just two examples where we stay out of core, instead of making the IPython notebook handle all the data: we only manage small chunks of data at a time. These are just a few imports, because I need SciPy to handle the WAVs that I want. Then I wanted to compute an FFT without loading the entire data. And if you read the code of SciPy's read function, it expects a file. Since it expects a file, I could use Python's StringIO; however, if I used StringIO, I would have to load the entire data to give it to StringIO. That is why I wrote this class as a wrapper, to make an out-of-core stream look like a file. When I pass this file reader, it behaves like a file, but without loading the entire data. You can imagine the data being one terabyte: with this wrapper I can manipulate a one-terabyte file as if it were an ordinary file, without requiring one terabyte of memory.

Here I just take one channel. And here I am saving... oh, I am running out of time. So here I am just saving the arrays and plotting them: I plot the arrays and compute an FFT here. I get the array, I save the array, I re-get the array in order to stay out of core, and then I save it and plot it. I can save images to the database too. Here, as well, are the previous times I invoked the FFT, so I can see the files. Or I can re-get, much later, an array that I already processed, and replot it. So I can save and recover the arrays that I am using.

And now I move to the second demo, which is emulating asynchronous processing. It is the same setup as before. Here I run a calculation to see how much data I have on the site: 33 gigabytes.
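The file-like wrapper idea can be sketched as follows. This is a self-contained toy version, not the actual Wendelin code: `RemoteStream` simulates chunked remote reads of an ingested WAV, and the standard-library `wave` module stands in for SciPy's reader so the sketch runs anywhere. Any parser that expects `read`/`seek`/`tell` can then consume the stream without it ever being loaded whole.

```python
import io
import struct
import wave

class RemoteStream:
    """Stand-in for an out-of-core data stream: bytes are only handed
    out in slices, never all at once (in reality each slice is a remote call)."""
    def __init__(self, data):
        self._data = data
    def getSize(self):
        return len(self._data)
    def readChunk(self, start, end):
        return self._data[start:end]

class StreamAsFile:
    """File-like wrapper: lets libraries that expect a file object read an
    out-of-core stream chunk by chunk, without loading it into memory."""
    CHUNK = 1 << 16
    def __init__(self, stream):
        self._stream = stream
        self._pos = 0
    def read(self, size=-1):
        total = self._stream.getSize()
        if size is None or size < 0:
            size = total - self._pos
        end = min(self._pos + size, total)
        pieces = []
        while self._pos < end:
            step = min(self.CHUNK, end - self._pos)
            pieces.append(self._stream.readChunk(self._pos, self._pos + step))
            self._pos += step
        return b"".join(pieces)
    def seek(self, offset, whence=0):
        if whence == 0:
            self._pos = offset
        elif whence == 1:
            self._pos += offset
        else:
            self._pos = self._stream.getSize() + offset
        return self._pos
    def tell(self):
        return self._pos

# Build a tiny WAV in memory to stand in for the ingested sound data.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(struct.pack("<4h", 0, 1000, 0, -1000))

stream = RemoteStream(buf.getvalue())
with wave.open(StreamAsFile(stream), "rb") as r:
    print(r.getnframes())   # frames parsed without loading the file at once
```

With SciPy installed, the same wrapper could be handed to `scipy.io.wavfile.read` to get the samples for the FFT, which is the point of the class: a one-terabyte stream behaves like an ordinary file.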
And here I run the same calculation, but saving the computation in batches of 10, as background processing: I push the processing into the background, and later I check whether the processing has finished. Then I can run the same calculation again, but doing a kind of map-reduce, using a cluster of instances instead of programming it myself at the level of the IPython notebook.

So this is the kind of thing you can do after one day of setup. I went through the asynchronous part very quickly. You can also follow the tutorial at the link, I will make it available on Twitter and on the site, to plot data directly in the browser using JavaScript; there is a short tutorial. And you can pip-install the core of Wendelin and use it without installing the full stack: you can use Wendelin's out-of-core features just on your computer, to make small calculations which exceed the RAM that you have.

Well, thank you very much. I ran a bit over, sorry; I stole a bit of your coffee break. If anyone has questions, please ask, or feel free to go to the coffee break.
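For reference, the batching in the second demo can be sketched in plain Python. This is illustrative only and not Wendelin's activity API: each partial sum here stands for one background activity processing a batch of 10 streams, and the final sum is the reduce step that combines them.

```python
def partial_sums(sizes, batch=10):
    """'Map' step: one partial result per batch of 10 streams. In the
    real system each batch would run as a background activity on the
    cluster rather than inside the notebook process."""
    return [sum(sizes[i:i + batch]) for i in range(0, len(sizes), batch)]

def total_size(sizes, batch=10):
    """'Reduce' step: fold the partial results into the final answer."""
    return sum(partial_sums(sizes, batch))

sizes = [3, 7, 11] * 10          # pretend sizes of 30 ingested streams
print(total_size(sizes))         # 210
```

Because each batch is independent, the partial sums can be computed concurrently and polled later, which is exactly the "start in the background, check if finished" pattern from the demo.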