Hi, everyone. Please welcome Andreas. So, good morning, everyone. I'm Andreas, and today I'm going to talk about analyzing data with Docker. Before I start, I want to thank the organizers again for inviting me to the conference. It's really great to see a lot of people here for the second or third time, and I'm really excited to speak to you today about this. My own background is in science: I've been working in physics, and I've been using Python since about 2009 for my own work. In the last five years, I've been mostly working on data science problems, also, of course, using Python as my main tool of choice. So, we're going to attack this problem as follows. First, I'm going to give you a small introduction to data analysis, explain the different scales and the different types of analysis that we can do, and why it can sometimes be difficult. Afterwards, I'm going to talk briefly about Docker so that we all understand what it is and how we can possibly use it. Then I want to give you some examples of how we can containerize our data analysis using this technology. Finally, I want to talk about some other possible approaches, show you some relevant technologies that you can use, and give you some outlook into the future of containerized data analysis. Okay, so let's get started. Data analysis is a pretty large field, and as a data analyst, I like graphs, so here you have a graph of the different scales and the different types of analyzing data. I've tried to segment this a bit, from small scale to large scale and from interactive to automated methods. If you look, for example, at the upper left quadrant here, you have automated, small-scale data analysis tasks. This would typically be some scripts or Python code that interacts with your data, for example a local database, and does some analysis on it in a non-interactive way. 
In the lower left quadrant, you have things that are interactive and possibly user interface-based. A good example of this would be the IPython notebook, where you can analyze your data in an interactive and straightforward way, very easily using graphical methods and various types of data sources as well. If you go to large-scale data analysis, we have things like Apache Hadoop, which is a mostly non-interactive technology that allows us to perform data analysis tasks at very, very large scales in a batch way. In the lower right quadrant, on the other hand, you have tools which also help us deal with very large datasets but which are more interactive than traditional, for example, MapReduce-based approaches. Examples of this would be Apache Spark or Google BigQuery. So, what kind of tools am I going to talk about today? Everything, of course. I want to show you that using containers can help us in all of these areas. Now, if we have lots of tools for data analysis, you might ask yourself, what is actually so difficult about this? Well, in my own experience, and maybe from your experience, several things. First, sharing your data and tools is not exactly easy. As a scientist, I experienced this myself. I started my PhD in 2009, and I used Python for a lot of things. Back then, my analysis workflow would basically involve a few hacked-together Python scripts and some data files that I would keep in directories. Sharing those files, the data and the tools that I used, was possible, of course, but it was not easy, and it was certainly not straightforward to give other people access to these kinds of things. This leads, of course, to problems in reproducing results. Here we see a cell in the process of reproducing, and it can do that because it has all the necessary information available to it. 
If you try to reproduce our results in science or in other contexts, it's not that easy, because often we're lacking the context and several critical parts of the data analysis process. Another thing that is difficult in data analysis is scaling. As you probably know, at a small scale you have a lot of tools available that you can use to analyze your data. I mentioned IPython and the IPython notebook earlier, and there are a lot of different ways to handle, for example, the plotting and the processing of data at this scale. But if you go to larger scales, you normally need a totally different set of tools. The normal tool set that you use for your small data sets doesn't apply anymore in this world. You need technologies like MapReduce and Hadoop, and that means you need to rewrite a lot of your data analysis tools when your data gets bigger. So, how can Docker help us overcome some of these problems? Well, let's first try to understand what Docker is actually about. Docker is basically a tool that helps us deploy applications inside software containers. When I say software containers, you're probably thinking of virtual machines, but that's not the right analogy, because Docker containers work on a process level and isolate different aspects of the operating system, for example processes, resources, and the files that an application sees. This means that some aspects, for example the kernel that your containers are running on, are shared between them. This is exactly what makes Docker very interesting, because it provides a more lightweight way to isolate applications from each other. So this is the basic idea, and of course we need a lot of tooling to make this idea convenient. Docker provides a high-level API that helps you manage, version-control, deploy, and network your containers. 
If you look at the core concepts of Docker, at the base we have the image, which you can imagine as a frozen version of a given system that contains the whole file system we need to launch a given container. As you can see here, images are versioned, and some images are based on other images. We also have images that are not based on anything else, which we call base images. We'll see later why version-controlling images and building them on top of each other is a great idea. You can keep your images on your local computer, of course, but what makes them convenient to use is to put them into a registry. Docker has its own registry on Docker Hub, but it's also possible to run your own private registry server. Now, a container in this sense is nothing but a running instance of an image. Each of the containers here has a given image associated with it. Philosophically, or conceptually, containers are ephemeral. That means that the state of a given container is not saved when it stops working. So, in order for containers to be useful and do any data processing, we usually want to attach some resources to a container, as shown here. Containers can run on any number of hosts, and each host that a container runs on runs the so-called Docker engine, which is responsible for managing, starting, stopping, and monitoring the containers on a given host. Now, one of the really great things about Docker, which I like a lot, is the ability to network containers together. This is a quite recent feature, and it basically abstracts away the networking of different hosts. We can completely ignore the physical constraints of our network and construct virtual networks that connect different containers to each other, which of course is very useful if we have applications that rely on multiple containers and multiple services that need to talk to each other over the network. 
To orchestrate all that, there are a couple of tools. For example, there's Docker Swarm, which makes it easy to deploy Docker containers on a cluster of machines. And if you ask yourself how you manage all this: it's the Docker API, which provides a REST interface that allows you to create containers, manage them, monitor them, and do everything that is possible in the Docker ecosystem. The command line interface, which you will mostly use to interact with Docker on your machine, is nothing else but a client to this Docker API. Good. So what do I like about Docker? Well, one thing that I really think is great is that images are space-efficient. They are space-efficient because they're based on a so-called layered file system, which you can imagine somewhat like an onion, where you have different layers and you can just add layers on top of an existing layer. Here I have an example of the image that we're going to use later in our data analysis. You can see that in the beginning, when we created this image, we downloaded a lot of data, about 124 megabytes, which corresponds to the Ubuntu base image that we used. Then we did some shell scripting and installed some things. We called, for example, apt-get update to get the newest package lists, which added about 38 megabytes to our image size. Then we installed Python 3 on the image, and afterwards we installed our analysis script. You can see that the last steps, where we added the script, consume only very little space, just a few kilobytes. And this is really great, because it means that if you make small changes to your containers (sorry, to your images), the size of your images on disk will not grow linearly with the number of those images. That means you can build a lot of different versions of your software without worrying about filling up your disk with all the different image files. 
Another thing which is really great is that containers have very little overhead. What I mean by this you can see here, in two graphs that I took from an IBM paper from late 2014, where the authors compared the performance of native Linux with various virtualization technologies, in this case Docker and KVM. We're seeing two things. One is the write latency of disk operations, shown as a cumulative distribution function, where we want to be on the left side if we want to be fast. The other thing we're seeing is the input-output operations per second for different use cases. You can see that Docker imposes very little or almost no overhead compared to the native solution, whereas for another virtualization technology, KVM, you can see that there's a significant performance drop. And I don't want to badmouth any of these virtualization technologies, because they're doing something that is very different and they're providing things that are impossible to do with Docker, but you can also see that by doing this, they're incurring a performance penalty. With Docker, we don't have that, so we can operate our applications at the same speed as if we were running them on a native system. Another thing which is great, of course, is that containers are self-sufficient. This means that as soon as we have an image that we can run with Docker, we have everything that we need to run our application. We don't need to install any dependencies on the host system, except Docker, of course, and we can rely on the fact that the application bundles everything that it needs inside the container, or inside a set of containers, so to say. 
This makes things like sharing tools for data analysis, or sharing data itself, much easier than relying on a workflow where we would need our users to install a lot of different dependencies on their systems, which might be problematic because versions change, systems change, and it's always difficult to manage all these different dependencies. If we can bundle them into an image and run it as a container, then all of these problems disappear. So, in that sense, containers can be seen as Lego blocks for data analysis. Or, if you want to view this in a more functional context, you could see them as a unit of computation, where you have certain inputs, for example configuration data, your data files, and possibly other networked containers; you perform some computation on that; and you produce an output. This is a very powerful idea, because it allows us to construct data analysis workflows that are reproducible and can easily scale to large systems. Here, for example, we have a use case where we take log files from different sources, for example Apache logs and Nginx logs, and use two containers to map out the interesting information in those logs, then use another container to aggregate those results, then use a container to filter those results for things that are interesting to us, and pass that on to other containers that, for example, put that information into a business intelligence system, into a monitoring system, or into an archive. Okay, so we've talked a lot about theory. Now I want to show you a very simple example of how to actually do this in practice. The thing that we're going to look at is log file analysis. We're going to download some data from the GitHub archive, process it and extract some interesting information, and then perform a reduce step to get a summary of that information over all the log files that we're interested in. 
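To make this unit-of-computation idea a bit more concrete, here is a toy sketch of the map, aggregate, and filter pipeline just described, in plain Python. The log lines and function names are invented for illustration; in the containerized version, each of these functions would run in its own container:

```python
from collections import Counter

def map_log(lines):
    # map step: extract the HTTP status code (the last field) from each line
    return Counter(line.split()[-1] for line in lines if line.strip())

def aggregate(counters):
    # aggregate step: merge the per-source counts into a single Counter
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total

def filter_interesting(counts):
    # filter step: keep only server errors (5xx), the "interesting" part
    return {code: n for code, n in counts.items() if code.startswith("5")}

apache_lines = ["GET / 200", "GET /a 500", "GET /b 200"]
nginx_lines = ["GET /c 500", "GET /d 404"]

result = filter_interesting(
    aggregate([map_log(apache_lines), map_log(nginx_lines)])
)
# result == {"500": 2}
```

Each stage only sees its inputs and produces an output, which is exactly what makes it easy to move each stage into its own container later.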
The code for this is available on GitHub if you're interested. As you can see, the basic workflow is very simple: we have our analysis script that takes some log files from GitHub, launches an analysis process, and then produces some output. Okay, and now please keep your fingers crossed, because we're going to do a live demo. Good, so you can see we have several files in this directory here. If you look at the analyze file, you can see that we're importing a bunch of standard libraries here and defining our data directory. I can show you that the data directory actually contains a bunch of gzipped JSON files that we're going to analyze. And, I mean, the first question that you probably have now is, who is pushing commits to GitHub on the 1st of January? Well, obviously a lot of people. To analyze those files, we have several functions here in our script. We have one function that lists all the files in the directory that have a .json.gz ending. Then we have the analyze_file function, which takes a filename, initializes a dictionary of word frequencies, then opens the file using gzip, goes through each line of this file, and decodes it using the json module. It then checks if the data contained in a given line is a push event, and if that's true, there's a commits entry in that event that we can use to extract a number of words from the commit messages. Here we just split each message on non-alphanumeric characters, and for each of the words that we obtain like this, we increase the count in our word frequencies. Then we return that, and that's it. Then we have the reduce function, which takes a result as produced by the analyze_file function and just adds the counts in those results together, producing a global dictionary of all the different words and their frequencies. And the main block of our script does nothing else than use the get_files function to list all the files in the directory. 
It then analyzes each of these files, reduces the results, and prints out the statistics. So if we run that, it will take some time, going through each file, calling the analyze function, and calling the reduce function at the end, and you can see we get a pretty straightforward result. If you ask yourself who is pushing all those commits to GitHub, well, it's apparently JavaScript developers. And you can see that the good Python developers seem to be taking a day off on New Year's Day. Good. So, a very simple, very straightforward way to analyze this data. Now let's have a look at how we can take this data analysis and containerize it. To do that, we're going to make some changes to our workflow. Instead of having our analysis script work directly with the data, we use it to first create a Docker image, and then we use a supervisor script, also written in Python, to create a bunch of containers based on this image. Each of these containers takes a chunk of the data, analyzes it, and produces an output that we can then, again with the supervisor, reduce and convert into the result that we are interested in. Okay. So let's go back to our directory and first have a look at how we create the Docker image. You see here that we have a so-called Dockerfile in our directory, which is a file that specifies how the image we want to create is built. You can see that we are basing our image on the Ubuntu 16.04 base image. We're saying that I'm the maintainer of that image, and then we are doing a bunch of simple steps. First, we update the apt cache so we can get an up-to-date version of all the packages available. Then we install the python3 package in our system, and finally we copy the docker_analyze.py script, which is in the same directory as the Dockerfile, into the container at this location here. The final line specifies the command that is run when the container starts up. 
In this case, it's the Python 3 interpreter that runs the file that we just put there. We can use Docker to build that file. We just call docker build and tag the resulting image with the name that we want to use. As you can see here, we did basically nothing, because the image already existed, but you can see that Docker went through all of the steps, checked if it already had an image corresponding to the version that we want, and then successfully created a new image with the given name. Now we could run that image manually using the run command, which is a bit complicated, so let's go through it here. Basically, we're saying docker run; we're saying that we want to run it with a given user ID and a given group ID; we want to expose all the ports of the Docker container; we then specify certain environment variables, which I will explain a bit later; and we say that we want to mount this directory here as the data directory and this directory as the output directory. Finally, we specify the name of the image that we want to run. If we do that, we just receive the output of the container that is being run, and as you can see, it already finished. Now let's have a look at our analysis script. Like before, we have a Python script that operates on a data directory and produces output in an output directory, and we have one function, called analyze_file, that takes a filename and does the same kind of map operation that we saw before in our traditional analysis script. Now we don't have any reduce function, as I will explain later; instead, we only have a main block that takes the input filenames from the INPUT_FILENAMES environment variable, goes through each one of them, calls the analyze_file function, and writes the result into the output directory that is mounted into the Docker container. 
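Reconstructed from this description, the containerized analysis script might look roughly like the following sketch. The /data and /output paths and the INPUT_FILENAMES variable follow the demo's conventions; the exact field names in the GitHub archive events are assumptions:

```python
import gzip
import json
import os
import re
from collections import Counter

DATA_DIR = "/data"      # mounted read-only into the container
OUTPUT_DIR = "/output"  # mounted read-write into the container

def analyze_file(path):
    """Map step: count word frequencies in commit messages of push events."""
    word_frequencies = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") != "PushEvent":
                continue
            for commit in event.get("payload", {}).get("commits", []):
                # split the commit message on non-alphanumeric characters
                for word in re.split(r"\W+", commit.get("message", "")):
                    if word:
                        word_frequencies[word.lower()] += 1
    return word_frequencies

if __name__ == "__main__":
    # the supervisor passes this container's chunk of files via an env variable
    filenames = os.environ.get("INPUT_FILENAMES", "")
    for name in filter(None, filenames.split(",")):
        result = analyze_file(os.path.join(DATA_DIR, name))
        with open(os.path.join(OUTPUT_DIR, name + ".out.json"), "w") as out:
            json.dump(result, out)
```

Because the script reads its inputs from an environment variable and fixed mount points, it has no dependencies on the host beyond the directories the supervisor mounts in.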
As I said, we need an orchestrator, or a way to start these containers, and for this we wrote a simple Python script. Again, we specify our container name, the data directory, the output directory, and the number of containers that we want to launch, so the degree of parallelization of this problem, if you want. The first thing that we do here is use the Docker Python API to create a Docker client connected to our local Docker engine, then retrieve the files from the data directory, analyze each file in a container, and reduce the given output files. Maybe we can step through this in a bit more detail. The function that analyzes files in a container takes a number of files and then creates a so-called host config, which specifies the different directories that we want to mount into the container; in this case, we want to mount the data directory in read-only mode and an output directory in read-write mode. This host configuration we can then pass to the create_container function. We also pass in the container name, the user ID that we want to use, the host configuration that we just created, and the environment variables, which just contain the list of files that we have given as a parameter to the function. The main function looks like this: we first retrieve all the files that we want to analyze. We then chunk those files up into pieces of four or five, depending on our parameter N. Then, for each of those chunked file lists, we create a container that performs the map step for each of these files. We append those containers to a list so that we can use them later, and then we wait for all the containers to finish their work of mapping the files. As soon as this is done, we can call the reduce_output_files function, which takes all the files that have been created by the containers in the output directory, reduces them, and produces the result that we're interested in. 
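A sketch of what such a supervisor script can look like. Note that this is written against the current Docker SDK for Python (docker.from_env, containers.run) rather than the older docker-py Client API that the talk used, and the image name and directory layout are assumptions matching the demo:

```python
import os

ANALYZER_IMAGE = "analyze-github"        # assumed tag of the image built above
DATA_DIR = os.path.abspath("data")       # gzipped GitHub archive files
OUTPUT_DIR = os.path.abspath("output")   # per-container results land here
N_CONTAINERS = 4                         # degree of parallelization

def chunk(items, n):
    # split the file list into n roughly equal chunks, one per container
    return [items[i::n] for i in range(n)]

def main():
    import docker  # Docker SDK for Python (pip install docker)
    client = docker.from_env()
    files = sorted(f for f in os.listdir(DATA_DIR) if f.endswith(".json.gz"))
    containers = []
    for file_chunk in chunk(files, N_CONTAINERS):
        if not file_chunk:
            continue
        # one detached container per chunk, performing the map step
        containers.append(client.containers.run(
            ANALYZER_IMAGE,
            environment={"INPUT_FILENAMES": ",".join(file_chunk)},
            volumes={
                DATA_DIR: {"bind": "/data", "mode": "ro"},
                OUTPUT_DIR: {"bind": "/output", "mode": "rw"},
            },
            detach=True,
        ))
    for container in containers:
        container.wait()  # block until every map container has finished
    # ...then reduce the files written to OUTPUT_DIR, as in reduce_output_files
```

The chunking via slicing with a stride distributes the files round-robin, so each container gets a roughly equal share of the work.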
So if we run this now (we have to do it with Python 2, because I have only installed the Docker API for that version, but it also works with Python 3, of course), we call python2 docker_parallelize.py. This will launch the containers for the individual files and wait for the results, and as you can see, it's even a bit faster than before, and in the end we get exactly the same result as before. You can see the files that have been created in the output directory by the containers here. So it's pretty straightforward to actually go from a workflow where we use plain Python to a containerized workflow where we also use Python, but based on Docker. This was, of course, a very simple example; I wanted to show you the basics of this approach, and in real life the complexity would be higher for any real data analysis application. There are certain advantages and disadvantages associated with this approach. One advantage is, of course, that, as I said, it's easy to share your data analysis workflows, because now that we have an image with our scripts, we can just push it to Docker Hub, for example, and anybody can download that image and use it locally on his or her machine. Each analysis step is self-sufficient, in the sense that the container doesn't care about its environment. As you've seen, we only specified the input files and the output directory for the container, and everything else was inside the container, so there are no dependencies that we need to run this analysis, apart from the input and the output data. As I also showed you, containerization makes it pretty easy to parallelize our analysis process. For this example, we ran everything on a single host, but as I said, with Docker Swarm it's also possible to run this kind of analysis in a multi-host cluster environment, so we can easily parallelize our workloads to hundreds or even thousands of Docker containers. 
And the nice thing is also that, with the image-based approach, we get versioning of our data analysis scripts included for free. As for disadvantages, there are also a few. It's, of course, a bit more complex, because we have to prepare our containers for the analysis. We need to install Docker on each machine that should perform the data analysis, obviously, and we have also lost a bit of interactivity and flexibility in doing our analysis. So which parts are actually missing from this workflow? For me, three things. First, as we've seen, we need a lot of orchestration to make sure that we have all the containers running as they should. For the simple case that I showed here, that was not so important, but for any real-world data analysis you probably need databases, maybe task queues, so you have a lot of different things that you need to put together and launch in the right order, and you need a lot of orchestration capabilities to do this in a straightforward and effective way. Another thing is, of course, dependency management, because in most real-world data analysis contexts, you want to perform only the steps of your data analysis that you really need to perform. For example, if you have several types of data and they depend on each other, and only this part here or this part here changes, we do not want to perform all of our data analysis again. We want to redo only those things that are really necessary with the changed data sets. And finally, we also need a way to manage the resources. In our example, we already produced a lot of output files, and in real-world data analysis you will produce many more of those files, and it's also important to manage and version-control those things, for which Docker, unfortunately, does not provide any good means right now. 
So I was tinkering with Docker a bit in my own time, and I ran into these problems, so I decided to start writing a small tool called Rooster, which is built on top of the Docker API. If you were to summarize it in one sentence, you could say that it's a make for Docker. It provides basically the three functionalities that I talked about before: resource management, container orchestration, and dependency management. I have to say it's still an early prototype, but I want to show you a bit how it works. The basic concept of Rooster is a so-called recipe, which specifies three things. First, we have the resources that we want to use in our data analysis; then we have the services that we need to run, for example databases, et cetera; and then we have a sequence of actions that we want to perform in order to carry out the analysis. The resources part includes things like versioning, dependency calculation between the different resources, backing them up, copying them, and distributing them to the machines where we want to perform the analysis. The services section deals with things like starting up the services, including in the right order, provisioning the resources to those services, and networking them together. The actions section is then concerned with scheduling the different actions that we need in our data analysis, monitoring them, performing exception handling, and finally doing some logging for us. Okay, again, I want to show you a small live demo here. What we are going to look at is, again, a very simple example where we want to load a CSV file into a Postgres database. If you look at the recipe for this data analysis, we can see we have a resources section where we specify all the resources that we need for this kind of analysis. 
First, of course, we have our CSV file, which comes from the user resources, which we want to mount as read-only, and which has a URL, electricity.csv in this case. Then we have the Postgres data, which is the database where we want to put the data. Here we tell Rooster that the state of this database depends both on the CSV file and on the converter script that we are using to create the database, that it should create the resource if it doesn't exist, that the URL is postgres, and that it's also a user resource that we want to mount in write mode. Finally, we have the converter script that performs the conversion between the CSV file and the Postgres database; this comes directly from the recipe, and it is contained in the converter URL. So much for the resources. The services are listed here. In this case, it's only a single service, namely a Postgres database, which uses the postgres image, exposes this port here to the outside world, and makes use of the Postgres data resource that we defined up here. You can see that we mount this resource at this location, where Postgres will be able to find it and use it to initialize or work with the database. Finally, we have the actions section, which in this case also contains only a single entry, which uses the Python 3 image that we created before and executes the convert.py converter script that takes the data from the CSV file and loads it into the database. This container needs access to both the converter script and the CSV file, obviously. Now we can launch this recipe: we just say rooster run and then the name of the CSV-to-Postgres recipe. You can see that several things are happening now. What Rooster did was first check that all the images we require are available on the system, and then initialize the resources. 
In this case, it copied or initialized the Postgres data, made sure that the input data was there, checked that the script we need is present in the recipe, then mounted those resources, created the Postgres service, and finally launched the analysis steps, the action phase, giving the action access to the Postgres database through a virtual network. This took a while to run, and you can see the output here of both the Postgres container, which created our database, and the Python container, which ran the script that inserted the rows into the database. You can see that we inserted about 35,000 lines of CSV into the Postgres database. The resulting data is put here. You can see that Rooster also takes care of versioning your data, using a UUID-based approach where we always copy the previous version of the data and provide a link to the parent, so that we can go back in time and, for example, revert to a good state of our database in case anything goes wrong in our analysis. All right. Now, this is, again, a pretty simple case, but it also works for more complex problems where we have different services and more action steps that depend on each other. Of course, there are still some open questions here. In the example that we looked at earlier, we used files to communicate the results of our analysis between containers, but there are also different approaches: we could, for example, use the network, or even use the Docker API to communicate them. Right now there's no canonical way to do this, so to say. Another open question, especially for distributed systems, is of course how to make the data available to the containers. There, Docker doesn't provide a good solution; we can probably rely on technologies from the MapReduce world, for example the Hadoop Distributed File System, but it's also not clear what the optimal way is to do this kind of thing. 
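The copy-and-link versioning scheme described here can be illustrated with a small standard-library sketch. To be clear, this is not Rooster's actual code; the snapshot function and the .parent file are names made up for illustration:

```python
import json
import os
import shutil
import uuid

def snapshot(resource_dir, store_dir, parent=None):
    """Version a resource by copying it under a fresh UUID and recording
    a link to the parent version, so we can walk back in time later."""
    version = uuid.uuid4().hex
    dest = os.path.join(store_dir, version)
    shutil.copytree(resource_dir, dest)
    with open(os.path.join(dest, ".parent"), "w") as f:
        json.dump({"parent": parent}, f)
    return version

def parent_of(store_dir, version):
    # follow the link back to the previous version (None for the first one)
    with open(os.path.join(store_dir, version, ".parent")) as f:
        return json.load(f)["parent"]
```

Reverting to a good state then just means walking the parent links back to an earlier UUID and mounting that copy instead.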
Of course, there are some other technologies that are interesting in this space. I wanted to briefly show you two of them here. One of them is Pachyderm, a US-based startup that provides an open-source tool for data analysis using Docker. The great thing about their solution is that they provide a version-controlled view on top of your data, so they basically have version control for large data sets, and they make it very easy to build a dependency graph-based analysis workflow. I talked yesterday to one of the founders, and it's a really great product. Compared to Rooster, it also already works reliably, so if you want something that works, at large scale as well, you should definitely check it out. Another thing that I wanted to mention here, which is not directly related to Docker but which also helps you with managing your dependencies in data analysis, is Luigi, a library that was built by Spotify and that can help you build complex data analysis pipelines where you have a lot of interdependencies between your individual data analysis steps. Luigi figures out how to run your analysis and how to run only those steps of the analysis that are really required. Good. So, to summarize: containers are by now a pretty mature technology, and they're probably here to stay. They are very useful in a variety of data analysis contexts. They don't solve all of our problems with data analysis, though, and that means that we need additional tools to handle them effectively; some of them I showed you. I also showed you how you can use Python in conjunction with Docker for this kind of approach to data analysis. Okay. So with that, I'm at the end. If you're interested in the Rooster tool, you can find it here on GitHub. Contributions are highly welcome, and I think we have time for some questions. So thank you. Thank you. This is useful, exciting and... 
Yeah, I have a question about running this on a cluster. How does Docker Swarm use... So if you have a powerful single machine, let's say, or you have several of those machines, but they are powerful, multi-CPU machines. How does it scale? Will it use all the cores on that powerful machine? Are there any other bottlenecks? Okay. I didn't do any performance evaluation of that, but Swarm basically transparently handles the distribution of your containers to the different systems. And the great thing about Swarm is that it has almost the same API as the Docker core engine, so we can, for example, use it from Python exactly like you would use Docker on a single machine. And as I said, the containers are completely isolated from each other, so each container runs in its own process. Hence, if you have a multi-core machine, you can of course make use of all the cores, and the operating system will take care of allocating resources to each of these containers. In that sense, a container is not much different from a process running on the operating system. Is that answering your question? Okay. Maybe that would be too much overhead, but did you consider dockerizing Apache Spark for this MapReduce thing, like just putting Spark workers in Docker containers? So I think in general, Docker provides a great way to build a local setup where you can test out technologies like MapReduce and Spark in an environment on your own machine. So I think it's definitely possible to have a setup, for example, running Spark, if that's your question. And the other way around, it's also of course possible to use Docker containers from inside the Spark ecosystem or inside Hadoop. I know that Hadoop, for example, has a runner that can make use of Docker containers to perform the map steps. So both of these technologies can be used in conjunction with each other. Is that...?
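The point in the answer above, that a container is essentially just an isolated OS process which the scheduler spreads across the available cores, can be illustrated with plain Python processes standing in for containers. The `analyze` function here is an invented stand-in for one containerized analysis step, not part of any Docker API.

```python
import multiprocessing as mp

def analyze(chunk):
    # Stand-in for one containerized analysis step: each chunk is
    # handled by its own isolated worker process.
    return sum(x * x for x in chunk)

def run_parallel(chunks, workers=2):
    # The OS schedules the worker processes across the available
    # cores, just as it does with containers.
    with mp.Pool(processes=workers) as pool:
        return pool.map(analyze, chunks)

if __name__ == "__main__":
    print(run_parallel([range(0, 1000), range(1000, 2000)]))
```

Because each worker is a separate process with its own memory, this sidesteps the GIL for CPU-bound work, which is the same reason isolated containers on one host can saturate all of its cores.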
Yeah, but I think your main purpose is to make it as small and self-contained as possible. So I thought maybe Hadoop is a very big thing, and Apache Spark is more like a lightweight Hadoop for MapReduce. So I just thought that might solve your problems with distributing work, serializing results, et cetera. Okay, that's an interesting point. I didn't look into this, but it's possible that it's a good fit. Any more questions? There? You talked a lot about dependencies. But I think there are two kinds of dependencies, and we should not confuse them. At least we should focus on the possible and evident differences. One is code dependencies: dependencies between software packages, versions, and so on. And the other is data dependencies, like models that are built on data, which is built on other data, and so on. Maybe it's kind of a theoretical question, but how do you see these two different concepts of dependencies interacting? Is there going to be a single tool or instrument that can solve both, or are we going to build completely different tools to solve them? I think that is the question. As I said, I think images are a great way of solving the dependency problem with software, so we can use images to make a reproducible environment for analyzing the data that we have, where we are sure that all the dependencies and all the software code, for example, are at a given state. And for managing the dependencies of the data, we need a different tool, because Docker is, in my opinion, not the right choice for doing that; for example, Pachyderm and other technologies have some support for these kinds of things, where you have large datasets that you want to version control and manage in that sense. And personally, I think that code can also be treated as data in that sense.
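The idea that code can be treated as data can be made concrete by content-addressing both: hash every input to an analysis, whether it is a script or a dataset, the same way, so that the whole run is identified by a single fingerprint. This is an illustrative sketch (the function names are invented), not how Rooster or Pachyderm actually implement it.

```python
import hashlib

def content_id(payload: bytes) -> str:
    # The same fingerprinting for code and data: only the bytes matter.
    return hashlib.sha256(payload).hexdigest()[:12]

def inputs_fingerprint(inputs: dict) -> str:
    # Combine the ids of all inputs (scripts and datasets alike) into
    # one id for the whole run; any change to any input changes it.
    combined = ",".join(
        f"{name}:{content_id(data)}" for name, data in sorted(inputs.items())
    )
    return content_id(combined.encode())

run_a = inputs_fingerprint({
    "analysis.py": b"print('insert rows')",  # code as an input
    "data.csv": b"a,b\n1,2\n",               # data as an input
})
run_b = inputs_fingerprint({
    "analysis.py": b"print('insert rows')",
    "data.csv": b"a,b\n1,3\n",               # one changed byte of data
})
print(run_a != run_b)  # True: changing either code or data changes the id
```

Sorting the inputs by name makes the fingerprint independent of insertion order, so two runs with identical inputs always get the same id.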
If you look at the different inputs of your container, as I showed them before, you could also take the software, that is, the scripts that are used for analyzing the data, as data themselves. So in that sense, you can treat those two things under the same paradigm, I think. It is, of course, always a question what the best practical way of handling these things is, because the scale is very different: code is usually quite small and manageable, whereas data can be very large and cannot be managed effectively using, for example, source code version control systems. Does that answer your question? Okay. Good. Any other questions? No? Okay, so thanks again.