Okay, hi. The next talk is going to be held by Chesnay Schepler, about big data analytics with Python using Stratosphere. Please welcome him.

Okay, so one thing that I have to raise before we start: between handing in this talk and actually holding it, the Stratosphere project has been accepted into the Apache Incubator program and had to be renamed due to a name conflict. It is now known under the name Apache Flink, "flink" being a German term for something that is fast and agile. For the remainder of the talk I'll refer to it as Flink, since the main feature that we are presenting today will be part of one of our first Apache releases.

So I'm going to first talk a bit about what Flink in general is, and then come to the actual topic: our new Python API that exposes some features of Flink to Python developers.

So what is Apache Flink? It is a distributed runtime for big data analytics, written in Java, with a big focus on ease of programming; it offers a rich set of operators and automatically optimizes your program. The project started in 2009 as a joint research project by several universities here in Berlin and was later turned into an open source project, and the project is now moving to Apache under the name Flink. Version 0.5.2 is the last release outside of Apache, and the next release, 0.6, is scheduled within the next weeks.

Flink operates in the same space as systems like Apache Spark or Hadoop MapReduce. It shares common traits like scalability and user-defined functions, but combines these with database technologies like declarativity and optimization, so a lot of decisions about how the system works deep down are left to the system rather than the programmer.

When you're writing a program for Flink, what you're essentially creating is some kind of data flow: you have data sources, you apply operations on them, and you output the result somehow. So in this case we have some data sources.
We then apply map functions, reduce functions, joins. One thing that sets Flink apart is that these steps are not done step by step: we don't run the whole map operation first and then pass the whole resulting data set into the reduce function. Instead, we stream records on the go into the next operation, so that in some cases your whole program is running at the same time.

The Flink stack looks something like this. At the very bottom you have some storage where you have your data. Flink itself does not depend on any distributed file system, so you can work on local files exclusively; HDFS works especially well, but you can also store your data in databases, for example. For cluster managers, Flink also works with several; YARN works especially well here, once again.

Then we have the Flink runtime and optimizer, and several APIs built on top of them in different languages. We have the Java API, which is our main baseline, a Scala API, Spargel for graph computation, Meteor for JSON, I believe, a SQL API that is currently in development, and the main reason for me being here: our new Python API, which is built on top of the Java API. One thing that should actually be pointed out is that all APIs use the same optimizer and the same runtime, so they are very deeply connected.

In summary, the key features of Flink are that you have various developer APIs that are easy to use. When you write a program, you write the plan inside the API along with your user-defined functions; while some operations are done on the Java side, like grouping and sorting, a lot of the computation is actually done in the API's respective language. The optimizer takes care of, or controls, how certain operations are carried out, for example how a join is carried out, and it takes into account where the operations are carried out physically.
For example, it tries to place subsequent operations onto the same machine, so that you don't juggle the data across the whole cluster. The Flink runtime is its very own system; it's not built on Hadoop or other systems. And something that I haven't touched on earlier is that we treat iterations as first-class citizens and optimize them separately.

So much for Flink in general; we're now going to look at the Python API. This is a word count example: if you want to know how often each word appears in a text, this is what you would do. On the left side you have the plan, basically; on the right side you have your user-defined functions.

So let's just go through it step by step. First we get an environment; think of it as the common root of your program, something like saying "I'm writing a new program". We first read data from a text file, so every line is treated as a separate string, and then apply a flatmap function to it. We're passing an object of our user-defined function, and the output type. Having to pass the output type is not very Python-like; this is a side effect of using the Java API underneath. Java has a very strict type system, so if you looked at the same program in Java, you would see type arguments virtually everywhere. This is one of the last remnants of that.

The tokenizer splits the line into separate words and collects them, each with an initial count. We then group the data based on the words, so we partition the resulting data, and apply a reduce function to each group separately that sums up the counts, and then we write the data out. The call to execute at the end actually starts the plan. So after you call the read-text method, the data set does not contain any data; it is just an abstract representation of the data that only really exists once the program is executed.
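To make the shape of that plan concrete, here is the same word count expressed in plain Python, with no Flink involved. Generators mirror two properties from the talk: the flatmap stage streams records one at a time rather than materializing everything, and nothing is computed until the result is actually consumed, just like the lazy data sets described above.

```python
from collections import defaultdict

def tokenize(lines):
    # Flatmap stage: one line in, many (word, 1) records out.
    # As a generator, it yields records lazily, one at a time.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def group_and_sum(records):
    # Group-by-word plus reduce stage: sum the counts per group.
    counts = defaultdict(int)
    for word, count in records:
        counts[word] += count
    return dict(counts)

lines = ["to be or not to be"]
print(group_and_sum(tokenize(lines)))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Note that `tokenize(lines)` by itself produces no data at all; only the consuming `group_and_sum` call drives the pipeline, which is a rough analogue of calling execute on a Flink plan.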
So, why Python? Python is already heavily used in data analytics jobs, so it just seems like a good fit to provide Flink's capabilities to Python developers. It also allows Java developers to access Python's extensive standard and machine learning libraries.

In the grand scheme of things, this is how the whole process looks: you write your code in Python; when it's executed, it creates an intermediate representation of your program, which is then funneled through the Java API to create an actual Flink plan, which is then shipped across the cluster and executed. The Python user-defined functions are stored as serialized data inside the execution graph.

At runtime, the data flow looks something like this. You have the Flink runtime that encompasses everything, and a Java operator; since the plan is written in Java, we need some encapsulating object for the whole Python operation. When the operator is started, or opened, it starts a Python process, deserializes the object we passed in the plan, our tokenizer object for example, and then pipes the data through the user-defined function.

When exchanging data between processes in different languages, the type system becomes a natural issue. What we have currently is a fairly rudimentary system: we map basic types to each other, and Python tuples and lists are converted to Flink tuples. Flink tuples are a fixed-length, type-safe container, somewhat similar to Python tuples; similar enough that you can use them interchangeably. There are restrictions, though: these tuples cannot be nested currently, and they can't be longer than 20 elements. That's just due to time constraints; we haven't gotten to that yet.

A more severe problem is, of course, that we currently don't support arbitrary objects, which is due to the fact that we wanted to do it properly. We could use the pickle module to just pickle the data if it's not one of
these basic types, and unpickle them when they come back. But there are several use cases where this is not enough: for example, if you want to sort this data, or group on it, it won't work, since the sort operations are executed on the Java side. So, in order to avoid really ugly programming patterns, we decided to leave it out for a while and spend a bit more time on these things, so that when we say we support arbitrary objects in our Python API, we actually do.

Closely related is the protocol for how we exchange data between these processes. Initially we used Google Protocol Buffers for this, but they were way too slow for us, so we switched to the struct package on the Python side and byte buffers on the Java side, which work surprisingly well together, basically right out of the box.

Since we only changed the protocol around three weeks ago, it's pretty much designed for the Flink tuple type and around the current restrictions of the type system, so a few details will most likely change in the future. The gist of it is that we serialize the fields and add a few extra bytes. For every field we serialize, we add a type byte to it; this could actually be removed for a normal data flow, since the output types always stay the same, but for now it's there. Then we have a meta byte, which is used for several control signals, like marking the last element in an iterator, used for things like grouped computations. And then there is a size value, which represents the number of elements in a given tuple. For something that is not a tuple, this would have a special value.
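As an illustration of the idea only: the type tags, byte layout, and function names below are made up for this sketch, not Flink's actual wire format, and the meta byte is omitted for brevity. A type-byte-per-field framing scheme built on the struct package could look like this:

```python
import struct

# Hypothetical type tags; NOT the values Flink actually uses.
TYPE_INT, TYPE_STR = 1, 2

def serialize_record(fields):
    """Pack a flat tuple as: field count, then one type byte per field."""
    out = struct.pack(">i", len(fields))  # "size" header: number of fields
    for field in fields:
        if isinstance(field, int):
            out += struct.pack(">bi", TYPE_INT, field)
        elif isinstance(field, str):
            encoded = field.encode("utf-8")
            out += struct.pack(">bi", TYPE_STR, len(encoded)) + encoded
        else:
            raise TypeError("unsupported field type: %r" % type(field))
    return out

def deserialize_record(buf):
    """Reverse of serialize_record: read the count, then tagged fields."""
    (count,) = struct.unpack_from(">i", buf, 0)
    pos, fields = 4, []
    for _ in range(count):
        (tag,) = struct.unpack_from(">b", buf, pos)
        pos += 1
        if tag == TYPE_INT:
            (value,) = struct.unpack_from(">i", buf, pos)
            pos += 4
        elif tag == TYPE_STR:
            (length,) = struct.unpack_from(">i", buf, pos)
            pos += 4
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unknown type tag: %d" % tag)
        fields.append(value)
    return tuple(fields)

record = ("hello", 1)
assert deserialize_record(serialize_record(record)) == record
```

The per-field type byte is exactly the redundancy mentioned in the talk: in a fixed plan, every record in a data set carries the same tags, which is why they could in principle be negotiated once up front instead.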
I'm way too fast. So, the Python API currently provides a subset of the features of the Java API. You can read data from text files or CSV files, or provide objects directly within your plan; you can write out as text or CSV files, and print to standard out. Operation-wise, we actually support most of them; the most important missing one is iterations, which we couldn't get done in time. For the most part, the Python API is currently in a state where you can try it out. It's not production ready, but for testing it is certainly usable.

One thing that we're thinking about for the future: the input and output of data is currently handled completely on the Java side by something that we call input formats, which are similar to Hadoop input formats; they define how certain data is read and how it is handed to Flink. In the Scala and Java APIs you have the option of defining your own, and we want to provide the same functionality in the Python API, so that the whole pipeline is completely in Python, from data input to transformations to output.

Now, to actually use the API: we tried it on Linux, on Python 2.7 to be more precise. We know that there are two minor issues that prevent it from running on Python 3, just some types like long and basestring, all that fun fancy stuff, but it should generally also work on Python 3 with minor modifications.

You can run it locally, just on your PC, to try it out: all you have to do is basically download Flink and start it, and you can already run it. So depending on how fast you can download and move to a terminal, you can get it running in a minute, basically. For a cluster you need a bit more work; you have to set up Flink and configure it, all that stuff. As of right now, you need HDFS, which is used to distribute the files among the cluster.
So the Flink package, as well as the files the user provides, are distributed automatically among the cluster, so you don't have to do it manually. This also means that if you make a change on the main node, this change is automatically propagated across the cluster, which is surprisingly convenient.

So, I was way too fast, but I hope that I could pique your interest in Flink a bit. If you want more information, you can go to the website, flink.incubator.apache.org. If you want to try out the Python interface, you can go to the branch that I set up, which also contains detailed documentation for the Python API. And with that, thank you for your attention. I'd like to take some questions.

Yeah, so we don't have a microphone here, so you would have to repeat the questions. Okay, yes?

Could you repeat the second part again? ... Spark, yes, they're similar systems. So the question is: why should one use Flink compared to a similar system like Spark? One thing that we do better is that we provide a better interface for iterations, or generally a more efficient one; we try to reuse a lot of operations to prevent repetitive computations. One thing that should be said, though, especially regarding Python, is that Spark is generally more mature currently. It's a tough question, to be honest. Yes, please?

So the question was whether the data is only replicated or also partitioned when executing on the cluster. As far as I know, it is partitioned.

So the question was, if I got it correctly: what happens when we have partitioned the data and need to access another partition from another machine? I'm actually not sure about that; I've never looked at that code, to be honest. I've only been with the project for around three months, and most of that time I worked on the Python API, so I'm not very well informed about the really deep internal parts of the system.