Hi everybody, thanks for the introduction. Yes, my name is Peter Hoffmann. You can see my Twitter handle, @peterhoffmann, and you can find the slides afterwards at github.com/blue-yonder. Before I start my talk, a little bit about me and about Blue Yonder. So what do I do? I'm a software developer at Blue Yonder. Blue Yonder provides predictive analytics as a service, and with more than 100 data scientists I think we have one of the biggest data science teams in Germany. Our stack is mostly Python, and we are building a platform on which we run our machine learning algorithms.

As you can see here, we are ten people from Blue Yonder at EuroPython, and we have nine talks. So after me you still have the chance to see three other people from Blue Yonder. Tomorrow you can see Moritz talking about testing and fuzz testing, you can see Christian talking about bulk data storage with SQLAlchemy, and you can see Florian talking about bots. And last but not least, I think Philip will present what Blue Yonder really does.

So let's start with what Spark is. Spark is a distributed, general purpose computation engine. It has APIs for Scala, Java, R and Python, and it is mostly used for machine learning and distributed computing. Spark has one core API, the resilient distributed dataset (RDD), and all the other APIs sit on top of this core API. Spark runs on a cluster, that is, on multiple machines. You can use different schedulers to run Spark on a cluster: the standalone scheduler, Hadoop YARN, or Mesos.

On top of Spark core sit several libraries. The most important one is Spark SQL, or the Spark DataFrame API. Then there is Spark Streaming, where you can do stream computing based on micro-batches, there is the MLlib library for machine learning, and there is the GraphX library for graph processing. Spark itself is written in Scala and runs on the Java virtual machine, and it is responsible for memory management, fault recovery and interaction with other storage systems. Spark sits on top of the Hadoop stack, so it can access every data source that the Hadoop stack provides, and quite a few more.

The core abstraction of Spark is the RDD, the resilient distributed dataset. An RDD is a logical plan to compute data based on other datasets. RDDs are fully fault tolerant, so the system can recover from the loss of single nodes in your cluster or from failures in the calculation of parts of your RDDs; Spark will then rerun the calculation and try to recover from the machine failure.

There are two basic ways to interact with RDDs. The first one is through transformations. A transformation always takes one or more RDDs as input and produces an RDD as output. Transformations are always lazy: they are not computed on the fly, but only when you call an action on an RDD.
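A minimal sketch of such a chain of lazy transformations in PySpark, assuming a running SparkContext named sc and a hypothetical log file on HDFS; none of these lines triggers any computation yet:

    lines = sc.textFile("hdfs:///data/events.log")        # input RDD, nothing is read yet
    errors = lines.filter(lambda line: "ERROR" in line)   # transformation: lazy, returns a new RDD
    codes = errors.map(lambda line: line.split()[0])      # another lazy transformation
    # Nothing has been computed so far; the lines above only build up the lineage
    # of the final RDD. Work starts only once an action such as count() is called.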
Actions are the last step in a calculation plan, where you really want to collect the data. You can take some rows of your data, you can collect all of it, or you can count the result set; only then will the calculation really be run and you get your data back.

Spark tries to minimize data shuffling between the nodes in your cluster. In contrast to the Hadoop stack, it does not write all intermediate results to the file system, but tries to keep them in memory. Therefore, if your data fits into the memory of your cluster, Spark is much faster than traditional MapReduce stacks.

If you combine multiple transformations on your RDDs, you get the RDD lineage graph. That means that, based on the partitioning of your input data, you can have a lot of transformations one after another, and Spark tries to group these transformations together and, where possible, run them on the same node. Many transformations are element-wise, which means they only work on one element at a time, but that is not true for all operations: operations like groupBy or joins work on multiple elements. As I said earlier, actions are then used to get the result back and return it to your driver program.

If you know the traditional MapReduce programming model, there are only map and reduce steps, while Spark has many more transformations. It has the map and reduce computations, but it also has things like flatMap, filter and sample, you can do unions of multiple datasets, you can do intersections, you can group the data by keys, you can aggregate by keys, and you can do full, inner, left outer and right outer joins of your datasets.

What is important is that Spark knows the partitioning of your input files and the data locality of your partitions, because it always tries to run your calculations where the data is. Spark tries to bring the algorithms to your data and to minimize shuffling data around in your cluster. So you have a set of partitions, which are atomic pieces of your dataset, you have a set of dependencies on parent RDDs, and you always have functions which calculate an RDD based on its parent RDDs. Spark needs to know the metadata of your data, and where your data is located, to be able to do data-local computation, so that data shuffling, which is expensive and will really slow down your calculations, is only done when necessary.

As I said earlier, Spark is implemented in Scala and runs on the Java virtual machine. So what is PySpark? PySpark is a set of bindings, or APIs, which sit on top of the Spark programming model and expose it to your Python programs. Here is the famous word count example. You always start with an input RDD; that is some kind of basic file system operation, here loading a text file from HDFS. Then you have the normal MapReduce steps: you split the lines by whitespace, you emit each word with a number, and then you do a reduce step where you calculate the occurrences of the words.
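A minimal sketch of that word count, assuming a SparkContext named sc and hypothetical HDFS paths:

    text = sc.textFile("hdfs:///data/input.txt")            # input RDD from HDFS
    counts = (text
              .flatMap(lambda line: line.split())            # split each line into words
              .map(lambda word: (word, 1))                   # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))              # sum the counts per word
    counts.saveAsTextFile("hdfs:///data/wordcounts")         # action: triggers the computation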
As Python is dynamically typed, your RDDs can hold objects of different types. That is not possible in the Java version, but the Scala version also has this possibility. At the moment PySpark does not support all the APIs that are supported in the Scala version. For the DataFrame API nearly everything is provided, but for streaming PySpark always lags one or two versions behind the Scala APIs.

Here you can see how it is done. You always have a driver on your local machine; as I will show later, you can have an IPython session or run your normal Python program. This connects to a SparkContext, which talks over Py4J to the Java virtual machine on your host, which then talks to the workers, and each worker again talks to Python or to the JVM, depending on what kind of calculations you do on top of the RDDs.

Then there is relational data processing in Spark. That is a relatively new API; it was added in Spark 1.4 only two months ago. It is a new way to work with your data on a higher level, through declarative queries and optimized storage engines. It provides a programming abstraction called DataFrames, and it also acts as a distributed SQL query engine; I will show you later how. What is really nice is that the query optimizer, the Catalyst optimizer, works the same for Java, Scala and Python, so you get the same speed in your Python programs as you would get with Scala programs.

The DataFrame API provides a rich set of relational operations, and you can interact with it through different APIs. You can connect to it through JDBC, so a Java program can talk to it through the normal JDBC API, you can talk to it directly from user programs in Python, Java and Scala, and you can also switch between the DataFrame API and the raw RDD API.

So what is a DataFrame? A DataFrame is a distributed collection of rows grouped into named columns with a schema. It has a high-level API for common data processing tasks, that is projection, filtering, aggregation and joins, and it has metadata, sampling and user defined functions, so you can define a user defined function in Python and use it in your SQL queries. As with RDDs, DataFrames are executed lazily. Each DataFrame object only represents a logical plan for how to compute the dataset, and the computation is held back until you call an output action.

A DataFrame, you could say, is the equivalent of a relational table in Spark SQL, and you can create it through various functions using a SQLContext. Once you have created it, you can operate on it through a declarative domain specific language. Here we just load a people.json file, which has some rows of JSON, and then, like you know it from SQLAlchemy or maybe from pandas, you can do filtering, selection and projection, and get your data back.

If you compare these two statements, the first one is the declarative Python way and the second one is the SQL way, and they result in the same execution plan in Spark itself. It is all declarative: whether you write it in Python, you get the same speed as with the plain SQL one, or the one you define in Scala. Why is this possible? It is possible because Spark has the Catalyst query optimization framework, which works for all languages that use the DataFrame API. It is implemented in Scala and uses features like pattern matching and runtime metaprogramming to allow developers to specify complex relational optimizations.
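A minimal sketch of the two equivalent queries on the people.json example mentioned above, assuming a SQLContext named sqlContext and a hypothetical file with name and age columns; both statements go through the same Catalyst optimizer and produce the same plan:

    df = sqlContext.read.json("people.json")                # DataFrame with an inferred schema

    # Declarative DataFrame DSL
    df.filter(df.age < 21).select(df.name, df.age).show()

    # The same query as plain SQL against a registered table
    df.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age < 21").show()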
And as you can see here, this is from the Spark website: if you work with plain RDDs, the Python version is always slower than the Scala version. But if you sit on top of the DataFrame API, where you only use declarative statements, then you get the same speedup as in Scala.

So how do you talk to your data? As I said before, Spark works on top of Hadoop, so through the data source API you can access all the Hadoop file systems and drivers that are available there. You can talk to Hive tables, you can read Avro files, CSV files and JSON files, you can read and store data in the Parquet columnar format, and you can also connect to normal JDBC databases.

I will go into a little more detail on the Parquet data format, because I think it is a really great way to store and work with data from Spark. Parquet is a columnar format that is supported by many data processing systems, and you can store Parquet data in chunks in an HDFS file system. Parquet automatically preserves the schema of the original data. As you can see here, if you have a table with three columns, normal row-oriented storage writes row after row, while column-oriented storage saves your data in column order. This has several advantages. First, within one column the data is usually similar, so compression works much better if you encode the data column-wise in blocks. Second, if you have data with many columns and you do not want to access all of them every time, it is much faster to access only some columns at a time.

The DataFrame API is able to do predicate and projection pushdown. That means that if your underlying storage can work with vertical or horizontal partitioning, the Spark DataFrame API can push the predicates down to your storage engine; it does not have to read all the data into Spark, but lets the storage engine do the hard work. You can see here vertical partitioning: you only want to have column B, and maybe you have some predicates on some rows, so you say, I only want the rows where A is A2 and C is C4 or C5. The storage engine can split this up, and only the result will be read into Spark for further processing.

The DataFrame API does not only support tabular data. It has basic types like numeric types, string types and byte types, but it also provides support for complex and nested types, so you can build tree-like data and access it from the DataFrame API.
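A minimal sketch of the Parquet round trip and the pushdown idea described above, assuming a SQLContext named sqlContext and hypothetical paths and column names; the projection on column b and the predicate on column a can be handled by the Parquet reader itself:

    df = sqlContext.read.json("hdfs:///data/events.json")
    df.write.parquet("hdfs:///data/events.parquet")          # columnar storage, schema preserved

    events = sqlContext.read.parquet("hdfs:///data/events.parquet")
    # Projection and predicate pushdown: only column b of the matching rows
    # needs to be read from the underlying storage.
    events.select("b").filter(events.a == "a2").show()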
You always have to provide a schema, or the data already has one; there are two ways to get a schema into your DataFrame. The first one is schema inference. This works with typed input data like Avro, Parquet or JSON files, or for example with dicts, where Spark can guess the data types for your DataFrame. Or you can specify the schema yourself. Here we want to read in a normal CSV file, and we define a StructType with several fields and add to each field the type of that field.

Here is an overview of the important classes for Spark SQL and DataFrames. You have the SQLContext, which is the main entry point for the DataFrame and SQL functionality. You have the DataFrame, which is the distributed collection of grouped data, and you have Column expressions, which you use to work on your DataFrame. A Row is a row in the DataFrame, GroupedData is what you get back from aggregation methods like groupBy, and, as I said earlier, there are the types which describe the schema.

So when you have a DataFrame, it looks a bit like pandas, I think: you can select, you can filter, you can group by and work on the data as you would on a local machine, but you are really working on a cluster. That is what I want to show you with a little example. It uses the GitHub archive, which stores all the events that happen on GitHub. For the last half year that is 27 gigabytes of JSON data, about 7 million events. Now, all my colleagues said don't do it, but we will try it anyhow: a live demo, connected to the cluster. The font is a little bit too small, but I think we will manage. I also wanted to show you the cluster itself, but that will not work at all, so I will only go through the programming statements.

We always start with a SQLContext; that is our entry point where we connect to the cluster. It is a cluster with four machines, 40 cores each, and about one terabyte of RAM in total. What we do first is read a single text file, a JSON file with one hour of one day, into the cluster. You can see here how to get it: you take the context, read the text file and take the first element, and dump it out. What we see is normal JSON, but it is hierarchical JSON. As I said earlier, that is no problem, because Spark can work with hierarchical data. Now we read it in as JSON, so Spark auto-detects the schema. What we see is that for each event we have an actor, that is the person who committed to GitHub, we have a created_at field, we have some payload, and at the bottom we have the type of the event, like a pull request or something like that. Spark automatically detects the schema, so we don't have to do anything.
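A rough sketch of those first demo steps, assuming a SparkContext named sc, a SQLContext named sqlContext and a hypothetical path to one hour of GitHub archive data:

    path = "hdfs:///data/github/2015-01-01-12.json"
    print(sc.textFile(path).take(1))            # peek at the raw, hierarchical JSON

    events = sqlContext.read.json(path)         # schema is inferred from the JSON
    events.printSchema()                        # e.g. actor, created_at, payload, type, ...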
So now let's try to work on the whole data, not only one hour of one day, but on all events from the last half year of GitHub, and have a look at how many events we have got. So now, from my MacBook, which doesn't have that much memory, we are working on 70 million events. Let's have a look at how many events there were in the Apache Spark repo in the last half year: roughly 60,000. Let's have another look at who the top committers in this Apache Spark repo are, and show them. All the calculation is done on the cluster, and only the result comes back to my IPython notebook. That is pretty cool, because you now work on much bigger machines and don't have to do the calculation on your laptop. You can also always register a DataFrame as a table, so that you can run normal SQL statements instead of using the declarative Python language. So that is the demo.

Then a little summary. What is Spark? Spark is a distributed, general purpose cluster computation engine, and PySpark is an API to it. The resilient distributed dataset is a logical plan to work on your data. DataFrames are a high-level abstraction; they are a collection of rows grouped into named columns with a schema, and the DataFrame API allows you to manipulate DataFrames through a declarative domain language. So thanks for your attention, and any questions?

Okay, so we got this working after all. The main point why I wanted to show the demo is that I wanted to show you htop on all the cluster nodes. That is what I really like: once you have talked to a cluster with 160 cores and one terabyte of RAM, that is fun. So, any actual questions? If you have any other questions, come to our booth, come to the other talks, or come and see me around.

Great. Thank you Peter again.
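A rough sketch of the demo queries described above, assuming the GitHub archive events have been loaded into a DataFrame named events; the field names and the exact repository filter are illustrative and may differ from the live demo:

    spark_events = events.filter("repo.name = 'apache/spark'")
    print(spark_events.count())                               # events for this one repository

    top_committers = (spark_events
                      .groupBy("actor.login")
                      .count()
                      .orderBy("count", ascending=False))
    top_committers.show(10)

    # The same aggregation as plain SQL, after registering the DataFrame as a table
    events.registerTempTable("events")
    sqlContext.sql("""
        SELECT actor.login AS login, COUNT(*) AS cnt
        FROM events
        WHERE repo.name = 'apache/spark'
        GROUP BY actor.login
        ORDER BY cnt DESC
        LIMIT 10
    """).show()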