Hello everyone. I'm Martin, a solutions architect at Cloudera, which means that I help customers with Spark; Cloudera is one of the vendors out there that you can get to support you on Spark. I'm also a PMC member of Apache Flink, so I have quite some experience with both of these systems. I'm going to walk you through a couple of use cases in this talk, focusing mostly on the APIs, not so much on performance, and a little bit on the architecture. Of course, I'm happy to take questions on those as well. The takeaway message I would like to give you is that these systems are evolving a lot in terms of their APIs, and you can go much further with them than you would imagine in terms of use cases.

The setup we are going to use for this is the Big Pet Store example. It's nothing really fancy: the model is a map with a couple of pet stores, and customers shopping around in these stores generating transactions, and we will take it from there. Batch work has already come up a couple of times today, and I will continue with that; I guess that's something everyone has to do to some extent. Then I will elaborate a little bit on the model, and we'll see how to do a little bit of data generation with Spark and Flink, and how to use their SQL APIs, or the APIs that are very close to data frames, if you are familiar with those. Then we'll do a little bit of machine learning, and finally prediction in streaming, or near real time. It's going to be quite a lot of code, because I also had to cut some slides from the theoretical part. So what I suggest we try to do together is this: I will tell you the story, the code will be there, and following the story without necessarily understanding the code 100 percent is completely fine; that's what we are shooting for today.

So, the infamous word count example. Who has already seen a word count in their life? Okay, good enough; I put it up there precisely because it's only half of the people in the room. Word count is the hello world of big data, and that's exactly what it is: it gives you the first impression. It has certain merits, and we should give it those. It may be a little hard to read if you are sitting in the back, but it shows you what the map-reduce paradigm does for you: even though you have a huge chunk of data, and even though you will never have a global view of it on any single machine because it doesn't fit on a single machine, you can still produce a global count for a given key in it. Word count demonstrates exactly that power, but it doesn't give you much more. People sometimes also use it for benchmarking a big data installation, and that's what I'm getting at: okay, I have installed Spark, now let's run a word count and see whether it works or not. Well, it will benchmark the shuffle phase a little bit, so at least you exercise the network communication, and maybe the disk I/O at the beginning and at the end, but that's basically it. What happens if you want to do stream processing or machine learning or graph processing? Is it going to be enough for you? Most likely not, and that's where we start bashing word count, of course.
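For those who have not written one, here is a minimal word count in Spark's Scala API. This is a generic sketch of the idea rather than the speaker's slide; the input and output paths are simply taken from the command line.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    sc.textFile(args(0))                      // input text, possibly far bigger than one machine
      .flatMap(_.toLowerCase.split("\\W+"))   // break lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))                 // emit (word, 1) pairs
      .reduceByKey(_ + _)                     // shuffle, then a global count per key
      .saveAsTextFile(args(1))

    sc.stop()
  }
}
```

The whole program is the shuffle in `reduceByKey` plus some I/O at either end, which is exactly why it only exercises a small part of a cluster.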
The other thing that basically everyone who is talking today has to do is give a shout-out to BigTop, because BigTop is cool, so I have to do that as well. This project in general is part of Apache BigTop; you can have a look at the code there and at the references to the Big Pet Store model. BigTop not only gives you the ability to install your big data systems with Puppet scripts and Juju charms and all that cool stuff, but once everything is installed, you can also benchmark it. I would definitely say this is not the most mature part of BigTop; it's a new initiative where contributions are very welcome.

In terms of the model we are using, I already mentioned that it is the Big Pet Store model. There is a scientific paper on it. The idea is that it can generate data for a big data setting in such a way that, whether it is small-scale or large-scale, it remains relevant for testing your big data applications; it is based on realistic distributions, which is why it might be useful for you.

Okay, so the two systems we are going to deal with today, mostly at the API level, are Spark and Flink, two data processing systems. I would like to highlight just one fundamental difference between the two in terms of architecture; the other parts pretty much match each other. When you look at the Spark runtime, the basic abstraction of Apache Spark is the so-called RDD, the Resilient Distributed Dataset API. That is the batch API: it assumes your data is already sitting on your local file system or HDFS or S3 or wherever, and it is a finite data set. That is what Spark runs on natively, and streaming is built on top of it. The main difference in Flink is that the DataStream API (the streaming one) and the DataSet API (the batch one) are on the same level. So whether you have a finite data set or something near real time, maybe giving answers under a second, both have a native implementation; streaming does not get translated into batch jobs. That has different implications. Because of Spark's design it is much easier there to integrate a streaming and a batch job, since a streaming job is just a sequence of batch jobs, but it can be limiting in a couple of cases, so it is a trade-off. With that, we can define the mapping we are going to go through today: the batch APIs are matched up, and we will look at the streaming ones, the machine learning ones, and the SQL/table ones. This talk is a little too short to also include the graph ones, but maybe next time, for Vasia, I will have to do that as well.

So let's go to the coding part. Ronald, one of the authors of said paper, also coded the whole generation process in Java classes. Both of these systems have Java and Scala APIs; I'm going to stick with the Scala one because it's just easier to read on a slide. Who considers themselves a Scala developer? Okay, a couple. Java developer? Good enough. Maybe Python, that also helps. Okay, I'm safe, thank you very much. So basically Ronald has already provided us a way to generate these classes on a single thread, but I would like to use these distributed systems to generate a whole bunch of this data in a distributed fashion. That's basically my point, and of course it's pretty easy to do that, as the sketch below suggests.
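The following is a minimal sketch of that fan-out in Spark. The `SingleThreadedGenerator` class is a stand-in for the single-threaded Big Pet Store generator classes mentioned above; its name, its API and the output path are illustrative only, not the actual Big Pet Store code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for the real single-threaded generator: one instance per task,
// each producing its own slice of the data from a different seed.
class SingleThreadedGenerator(seed: Int) extends Serializable {
  def generate(transactionsPerTask: Int): Seq[String] =
    (0 until transactionsPerTask).map(i => s"""{"seed":$seed,"txn":$i}""")
}

object GenerateData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bps-generator"))

    val numberOfTasks = 10          // how many generator instances run in parallel
    val transactionsPerTask = 1000  // how much data each instance produces

    sc.parallelize(0 until numberOfTasks, numberOfTasks)   // one seed per partition
      .flatMap(seed => new SingleThreadedGenerator(seed).generate(transactionsPerTask))
      .saveAsTextFile("hdfs:///tmp/bps/transactions")

    sc.stop()
  }
}
```

In Flink the shape is the same, except that the pipeline starts from an `ExecutionEnvironment` instead of a `SparkContext`, as described next.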
The way we do it here: for this data generation to work, we need to generate the so-called ground truth, which is basic customer information (customer IDs and names), plus basic information on the stores and on our product categories; in this case, product categories like a dog leash or cat food or whatever you would buy in a pet store. In Spark, you have basically two main ways of passing data to your distributed operators. Remember, these will get distributed across the cluster; maybe you will have 10 instances running and generating your data. One way is of course to parallelize some collection that you already have, or to read from a file; that is the standard data set that flows through your pipeline. But there is also an alternative: if you have a smaller data set that you would like to propagate to all of your nodes, and the stores are going to be such a data set in our example, you can use Spark's broadcast variables to achieve that. Spark then has this so-called functional API; we have already seen a lot of functional APIs today in the dev room. One of its main features is that you take these operators, for example a map, a reduce or a flatMap, and pass your user-defined function into them. You see one example here; I just adjust the time at which the generation process starts. That is basically the whole structure of how you would call the data generation in Spark.

Let's switch to Flink. I again generate the ground truth. One thing changes: in Spark, we called the basis for building this job graph the Spark context, and now we call it the execution environment. Of course, relying on Ronald's generator stays the same. What is slightly different is how we attach extra information to our functions: in Flink we use the rich function interface, whereas in Spark we used another solution for that, and the way broadcast variables are used also differs slightly. But as you can see, the basics map onto each other at the same level. And this slide is just a summary of what I have already said.

So now we have written our output file, in JSON, just for the sake of being able to look at it, and let's do a little bit of ETL on top of the data we already have. This is usually where you would start such a process, because someone has already provided the data for you. Let's set two goals for this. We would like to eventually feed this data into a recommender system, so we need customer-product pairs, where a customer has purchased the given product. But the format we have in the JSON says that a given customer generated a transaction that has these ten products in it, so we would like to get into the pair format instead. First things first, we do not actually have an ID for our products yet; we only have IDs for the product categories. So we can select the distinct products and zip them with a unique ID; these are all functions already included in standard Spark core. Then you join this information together with the customers, who already have an ID, and you end up with the two IDs you need. It is just some basic mapping plus the join, as sketched below. And what is really interesting: this is the batch API, and if I switch to Flink, practically nothing changes; it is exactly the same.
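As a rough illustration of that ETL step, here is a hedged sketch in Spark. The `Transaction` case class is an assumed, simplified shape of the generated JSON (the real Big Pet Store classes are richer), and the function name is made up for this example.

```scala
import org.apache.spark.rdd.RDD

// Assumed, simplified shape of one parsed JSON record.
case class Transaction(customerId: Long, products: Seq[String])

def toCustomerProductPairs(transactions: RDD[Transaction]): RDD[(Long, Long)] = {
  // Give every distinct product description a numeric id of its own.
  val productIds: RDD[(String, Long)] =
    transactions
      .flatMap(_.products)
      .distinct()
      .zipWithUniqueId()

  // Explode each basket into (product, customerId), join on the product string,
  // and keep only the two ids, which is the format the recommender expects.
  transactions
    .flatMap(t => t.products.map(product => (product, t.customerId)))
    .join(productIds)                                      // (product, (customerId, productId))
    .map { case (_, (customerId, productId)) => (customerId, productId) }
}
```

The Flink DataSet version is almost line-for-line the same; the visible differences are the ones discussed next.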
The only part that changes is the way Spark and Flink like to talk about their join, basically the signature of the join and the way they handle keys (the keys here are a notion of the pair RDD), but otherwise it is the same; you could almost take the code and recompile it with Flink instead of Spark, and the other way around.

Okay, one next step. We have already heard from many speakers today that it is not necessarily enough to provide the Java and the Scala APIs, because people who have worked with previous systems may be used to SQL or R or Python, so why not provide an interface for them as well? Both systems give you SQL interfaces. Of course, they usually support a standard that is a little older: I believe Spark, as of Spark 2, supports the SQL 2003 standard, and Flink is currently on the SQL-92 standard, so ANSI SQL in both cases. But they are catching up and making sure that you can also run complex queries on top of these. For example, in this query I would like to look at the data and select the stores that have the most transactions, and you can achieve that with two SQL statements. One of the downsides of using SQL is that you somewhat lose the type information you have in your data: you just type this declarative query as a string, and you lose the information that this ID was an int and this name was a string. If you also want that, you can use the Table or DataFrames APIs. We also heard today that it is very difficult to write a UDF in Hive, for example, and there I think Spark and Flink give you very nice solutions. For example, I would like to register this month function, which is a plain Scala function, so that I can use it in my SQL statement. It is literally this much code: you write the Scala function, you call register, and then you can use it in the query. I think it is super convenient, and it gets propagated to your worker nodes.

Now let's port this to the Table API: the SQL I showed you was in Spark, and now I am going to show the Table, or DataFrames, API in Flink. Both systems actually have both; I just do not have enough time to visit everything. This is pretty close to SQL: it is still Scala, but it feels like SQL. The statements you write are table.groupBy and select and join and where, yet you are still in Scala, you still keep the type information, and going back to standard Flink or standard Spark code is way easier. Here I accomplish the same thing: selecting the most important stores.

Okay, another topic. We finally have these customer-product pairs, and we would like to feed them to a machine learning system. If you are familiar with scikit-learn, then the API should look pretty similar to you: both systems have this estimator/predictor API that is very familiar from scikit-learn. You have a model, you call fit on it, and then you can predict with a test data set; that is what we would call here. The recommender implementation we are using is called ALS, but there are also other ways to solve this problem. Instead of predicting right here, I am just saving the model, and we will use it later in a streaming topology. So with this much code, and this is actually how much code you have to write in MLlib, and it would be very similar in Flink ML, you can train a machine learning model, save it, and reuse it later, as sketched below.
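Two of the steps above, in hedged sketch form, assuming Spark 2.x. First, registering a plain Scala function as a SQL UDF; the column names, the UDF name and the query are illustrative, not the speaker's exact slide.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bps-sql").getOrCreate()

// Assumed: the generated transactions were written as JSON with at least
// a storeId column and an epoch-millisecond timestamp column.
val transactions = spark.read.json("hdfs:///tmp/bps/transactions")
transactions.createOrReplaceTempView("transactions")

// A plain Scala function registered so it can be used inside the SQL string.
spark.udf.register("monthOf", (millis: Long) =>
  java.time.Instant.ofEpochMilli(millis).atZone(java.time.ZoneOffset.UTC).getMonthValue)

// Stores with the most transactions, broken down by month via the UDF.
spark.sql(
  """SELECT storeId, monthOf(timestamp) AS month, COUNT(*) AS txns
    |FROM transactions
    |GROUP BY storeId, monthOf(timestamp)
    |ORDER BY txns DESC""".stripMargin).show()
```

Second, the train-and-save step with Spark MLlib's RDD-based ALS. The `customerProductPairs` RDD is the (customerId, productId) data set from the ETL sketch earlier, every purchase is treated as a rating of 1.0, and the hyperparameters and output path are placeholder values.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = customerProductPairs.map { case (customer, product) =>
  Rating(customer.toInt, product.toInt, 1.0)   // assumes the ids fit into an Int
}

// Rank, iteration count and regularization are illustrative, not tuned, values.
val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

// Persist the factor matrices so a streaming job can load them later.
model.save(spark.sparkContext, "hdfs:///tmp/bps/als-model")
```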
In Flink ML it would be about this much code as well; this is really how much you would have to write there. In terms of comparing the ML APIs, the Spark one is definitely more mature, but at the API level they are almost as close as the batch APIs are, so I think that once the Flink one catches up in terms of algorithms, it will be really easy to switch between them.

Then let's play a little bit with streaming. We already have a machine learning model, and the tricky part is that with a recommender system model, the problem you are usually trying to solve is: I have this user, they have watched five videos on my website, and I would like to recommend three more. Well, if you do not give those recommendations within roughly 200 milliseconds, the user is gone, so they are worthless. That is why we need streaming, or near-real-time processing. As I mentioned, in Spark it is really easy to accomplish this, because we just load the model that we have already prepared in batch, and then we have a query coming in from a socket: I am just typing in user IDs, and when I want to do the prediction, I go back to my batch pipeline and use the prediction from there (a sketch of this appears just before the questions). This is something that is currently not available in Flink, and it is super convenient to code in Spark; of course, you have other trade-offs in Spark. And just to give you a sense of how to switch to the Java API, this part is actually coded in Java. You do not see much difference; the only visible difference is the semicolon, because it almost looks like Scala already. In Flink I decided, and the code is available on GitHub, to code the whole prediction myself instead of relying on the implementation in Flink ML. I coded it in streaming, because it is very straightforward: it is just multiplying a matrix with a vector and then selecting the top k. In terms of differences, I think this is the part where Spark and Flink differ the most, because of the choice of architecture. Flink definitely does better when it comes to stateful processing and timeliness, but Spark is improving in that department with the new structured streaming API.

So, as a summary: you should definitely go beyond word count, with Big Pet Store for example. The batch concepts of Spark and Flink are very close, but they have their differences in streaming. You have seen a number of use cases in under 500 lines of code; it is available on GitHub, and you are very welcome to check it out. And of course, a pet Apache project is always fun. Big thanks to the guys who helped me with the project itself, and thank you very much for listening.
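To make that Spark streaming step concrete, here is a hedged sketch of loading the batch-trained ALS model and answering socket queries with it, in Scala rather than the Java shown on the slide; `sc` is the existing SparkContext, and the host, port and model path are illustrative. Because `MatrixFactorizationModel` is itself backed by RDDs, the lookup happens on the driver, which is exactly the "go back to the batch pipeline" pattern described above.

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Load the factors that the batch job saved earlier.
val model = MatrixFactorizationModel.load(sc, "hdfs:///tmp/bps/als-model")

val ssc = new StreamingContext(sc, Seconds(1))

// User ids are typed into a socket, e.g. one started with `nc -lk 9999`.
ssc.socketTextStream("localhost", 9999)
  .map(_.trim)
  .filter(_.nonEmpty)
  .foreachRDD { rdd =>
    // foreachRDD runs on the driver, so we can call back into the batch model here.
    rdd.collect().foreach { line =>
      val userId = line.toInt
      val recs = model.recommendProducts(userId, 3).map(_.product).mkString(", ")
      println(s"user $userId -> recommended products: $recs")
    }
  }

ssc.start()
ssc.awaitTermination()
```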
Do we have time for a few questions? Yes, please.

It is very interesting that you compared Spark and Flink; historically they have been inspired by similar APIs and look very similar, and then there is another project, Apache Beam, which is yet another incarnation of this, and I know there is a kind of process of converging toward it. What would be the merit of using Apache Beam as a common API instead? Have you experimented with this, and what are the limits?

Yes. So the question was: how do Spark and Flink come together in Apache Beam, and what is the future of Apache Beam, in my understanding? I think Apache Beam is a very interesting project, in the sense that the Google guys who published the Dataflow paper, which was the driving factor behind Beam, really got the timeliness of streaming applications right. Their model really goes into detail on how to handle timeliness and late events, and that is one of the main strengths of Dataflow. If you look at the current Spark and Flink implementations, neither system supports all of the features of Dataflow, but Flink is currently much closer. The basic reason Spark does not currently support all of these streaming features is that, because of its architectural choice, fault tolerance was very easy to accomplish, since batch fault tolerance was already done, but with these mini-batches you do not really have a good way to communicate state between two mini-batches; that is the main issue. That was one of the driving forces behind Spark coming out with a new streaming API, structured streaming, to make up for a couple of these semantic differences between the old Spark Streaming and Flink or Dataflow. So it is definitely getting there, but Spark has to improve in terms of semantics.

Yes, please. I think MADlib is in a little bit of a different domain, but I would like to learn more about MADlib as well. Thank you very much.