Okay. Last talk for today, and one I've been looking forward to all day. Ismaël Mejía will talk about Nexmark, a benchmarking suite for data-intensive systems built with Apache Beam. Okay. Hi, I'm Ismaël. I'm a software engineer at Talend, and I'm a contributor to Apache Beam. I am a member of the PMC and a committer in the community. I am the top non-Google contributor in this project, at least GitHub says that. That doesn't mean I'm the best; it means that I refactor a lot of code and do a lot of small commits, maybe. So that's the way it is. I work for Talend, a software company that makes data integration products, real-time systems and all this stuff around data. We have an open source solution, so I just wanted to mention it here. More important, we are building a new family of products, and we are actively recruiting. We are based in France and most people here can speak French, so if somebody is interested, you can join us. I work in a team whose main responsibility is to work on Apache projects, so if you are really into open source, you are super welcome; we have an open position now and we will have a second one in the next months. So, we are going to talk about benchmarking, and benchmarking big data. We are going to talk a little bit about Apache Beam and about Nexmark, which is a benchmarking suite on top of continuous systems. We are going to look a little bit at the implementation of Nexmark on Apache Beam, how it is done and what you can do with it. And we are going to discuss the current status of the project and some future work. So why do we do benchmarking? This may seem silly, but we always have to remind ourselves of it. We do benchmarking because we want to measure the performance of things, but we also want to evaluate the correctness of the things we are measuring. Every benchmark suite has the same four basic steps, the first being to generate some data.
Then we aggregate and compute over this data, and then we produce two kinds of outputs: measurements of how things are executing, and the results of the calculations so we can check that they are correct. And we have different families of benchmarks. We have micro-benchmarks, where we execute a very specific function, like a sort, for example. We have functional benchmarks, where we evaluate functional requirements. We have business-case benchmarks; for those of you who know a little bit of the database world, this is like the TPC suites, where for example you model a CRM-like system and create all kinds of tricky queries against it to see how things behave. And finally we have the data mining and machine learning kind of benchmarks, where we execute a really complex machine learning algorithm and measure how it behaves on our cluster. So what's the problem today with benchmarks in the big data world? The problem is that we don't have many pseudo-standards. We have TeraSort, which is a benchmark for sorting random data. We have the TPCx-HS family of benchmarks; we will discuss those a little afterwards. But in the end the real problem is that we don't have a common model to create these benchmarks, because every system implements things differently: Spark has one way to write code, Flink has a really different way, classical Hadoop is different too. Also, the existing benchmarks really focus on Hadoop infrastructure, and today not all of these systems run on Hadoop; there are many things running on Mesos, and things are starting to run on Kubernetes too, so those benchmarks are not that appropriate anymore. Some of them also mix the measurement of storage-related concerns with the actual processing of data.
And more important for this presentation, there are really few things in the streaming world, with streaming semantics, for benchmarking. So, the state of the art: as I said, we have TeraSort, which just sorts some random data. We have TPCx-HS, which is more about how to evaluate Hadoop distributions; that is basically its use. We have the TPC-DS implementation on Spark, which is really complete, but it works only on Spark. And we have others like the Intel ones, HiBench and BigBench; they cover many cases, they have functional benchmarks, and they have a little bit of streaming too. And finally, the Yahoo! streaming benchmark was published two years ago, and it really moved the streaming community, with back and forth between the different systems, Spark and so on, each trying to do better on this benchmark. But Nexmark, which is the benchmark I want to discuss today, came from a paper written in the early 2000s, which was an exciting time for research on streaming systems. These guys wrote a paper that was never formally published; it was just left on some academic site. They proposed a really simple but nice example system: an online auction system. So we have people who arrive and want to sell things, so they create auctions, and there are people who want to bid on these auctions, because in the end they want to buy something. This is nice because it's really easy to explain as a business case, and it's rich in the sense that we can express a lot of queries around it, things like how much time it takes for an auction to complete, and so on. So it's kind of nice. The original paper had eight queries in CQL, a language like SQL but with continuous semantics, streaming-like.
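To make the auction model concrete, here is a toy sketch in plain Python (not the actual Nexmark schema; the class and field names here are illustrative) of the three event types and one query in the spirit of the original paper, the highest bid per auction:

```python
from dataclasses import dataclass

@dataclass
class Person:        # someone who can sell or bid
    id: int
    name: str

@dataclass
class Auction:       # an item put up for sale by a person
    id: int
    seller: int      # Person.id of the seller
    category: int

@dataclass
class Bid:           # an offer made on an auction
    auction: int     # Auction.id being bid on
    bidder: int      # Person.id of the bidder
    price: int

def winning_bids(bids):
    """Return the highest bid seen so far for each auction."""
    best = {}
    for b in bids:
        if b.auction not in best or b.price > best[b.auction].price:
            best[b.auction] = b
    return best

bids = [Bid(1, 10, 100), Bid(1, 11, 250), Bid(2, 10, 80)]
print({a: b.price for a, b in winning_bids(bids).items()})  # {1: 250, 2: 80}
```

In a real streaming setting this query runs continuously over an unbounded stream of bids, which is exactly what the CQL semantics, and later Beam, are about.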
And the Google guys took this paper; one Google engineer, Mark Shields, implemented it on top of Dataflow, and they used it a little bit to evaluate how Dataflow performed. So, what is the link between Dataflow and all this? Well, as some of you may know, the Dataflow paper is the logical basis of all of Beam, and Google Cloud Dataflow is Google's system for executing Beam pipelines. What is exciting for me about Beam is that there were two big lines of work in the big data world: all the things happening at Google that nobody outside knew about except through the publications they did, and the Apache world, where people were implementing some of these ideas, like Hadoop or HBase. There were two parallel lines of evolution, and of course in the Apache world there were really nice ideas too, things like Spark that really changed the way we did things. In the end, Apache Beam is a little bit like Google finally coming to open source and putting both lines together, so it's kind of nice. So what is Beam? Beam is a unified programming model, let's say a unified SDK or API, where the idea is to have one single way to express both batch programs and streaming programs, with the advantage of being portable, so we can execute it on different systems, not only on Google Dataflow but also on Spark or Flink, because we have translators for those. And that is, quickly, the ambition of Beam. Every Beam pipeline answers four questions; I will go quickly over this because it's a little outside the benchmark path. More important, we have SDKs today in Java and Python, and there is ongoing work on a Go one. We have some libraries on top of these: Scio is a Scala library on top of Beam, and there is an SQL module starting now. And we have the runners, which are the systems where we execute all this.
Of course we also have connectors to different data stores, Kafka and all the Hadoop things; those are the IOs. But for me the most interesting part is the runners. So here are all the systems that are supported or partially supported by Beam today. Those in the small square are the new ones that are coming; the others are mostly already on master. One thing to note is that Beam offers a direct runner, which is a local one. It's not distributed; it's there so you can write your pipelines on your computer and test that they are valid. And it enforces a lot of things: it serializes everything, just to check that everything is serializable, and so on. So, where are we today on Beam? As we saw, we have a quite vibrant community, at least by Apache standards. We had the first stable release in April of last year, and currently we are in the vote for release 2.3 this week. And I'm glad to say that Nexmark broke the vote this week; I'm kind of proud of this, because we found a regression with Nexmark. There are some exciting features coming in Beam. There is the portability framework, the Fn API, which makes it possible to mix languages, in the sense that you can write your Python program and execute it on a Flink cluster through Beam. We are working on schema-aware PCollections, which means we can have easier APIs, like those of Spark, for example. We are also building new libraries for different subjects. So this is also an invitation to come and join Beam; it's a perfect time, because there are still a lot of things to be built. Okay. So before, I said that Beam answers four questions, and these four questions are there to unify the batch and streaming semantics. What is the problem we have? The problem is that we have events that happen in time, and the horizontal X axis here corresponds to event time.
So that is the moment when an event happened; for example, I'm using my phone and I click, this is the moment when I clicked. And the vertical axis is the moment when this event arrived at the server where I'm processing it. Ideally we would process events at the same moment they happen; that would be this diagonal line. But that is not true in practice. The line that tracks this progress we call the watermark, and it indicates, more or less, up to which point we believe no more delayed events are coming; it is an approximation, and we will see it in detail afterwards. But what are we answering here? We are trying to sum all these numbers. They are arriving continuously, because we are streaming, but we don't know when this is going to end, so we have to decide when to calculate. That is what these questions answer. So, what are we computing? We are just summing; that's the first question. But when are we summing this stuff? We can say, okay, I want to know the results every two minutes. So we group all the elements that arrive into event-time windows of two minutes, and we sum within each window. In the end we have a graph like this: we still have events arriving in time, and we have the separation in event time, but we have not yet said when we are going to compute. We do that with the concept of the watermark; that is the green line that tells me, more or less, that no more events are arriving for a window. So when a window is complete and no more events are expected, I materialize the sum, and it becomes yellow in the graph. But as you see, some events can arrive late, like this one, and we need some tools to go back and update the results. This is what Beam offers: a set of ways to express these kinds of scenarios. We can say, okay, if late data arrives within this amount of time, I can take it and add it back into its window. But I'm going too far into the details of the Beam model.
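The what/when story above can be sketched as a toy model in plain Python (this is not Beam code, just the idea): fixed two-minute event-time windows, a watermark that triggers materialization of a sum, and late data refining an already-emitted result.

```python
from collections import defaultdict

WINDOW = 120  # window size, in seconds of event time

def window_start(event_time):
    return event_time - (event_time % WINDOW)

class WindowedSum:
    def __init__(self):
        self.sums = defaultdict(int)  # window start -> running sum
        self.emitted = {}             # window start -> materialized result

    def on_event(self, event_time, value):
        w = window_start(event_time)
        self.sums[w] += value
        if w in self.emitted:         # late data: refine the closed window
            self.emitted[w] = self.sums[w]

    def on_watermark(self, watermark):
        # materialize every window that ends at or before the watermark
        for w, s in self.sums.items():
            if w + WINDOW <= watermark and w not in self.emitted:
                self.emitted[w] = s

agg = WindowedSum()
agg.on_event(10, 7)
agg.on_event(70, 7)
agg.on_watermark(120)  # window [0, 120) closes with sum 14
agg.on_event(30, 7)    # a late event for the already-closed window
print(agg.emitted[0])  # 21 after the late refinement
```

Real Beam triggers and allowed-lateness policies are much richer than this, but the mechanics of windows, watermark and refinement are the same.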
But I suppose that some of you don't know all this stuff, so I just wanted to cover some of it. Basically, a program in Beam is just a collection of steps: you take an input, you apply all these transformations, windowing and so on, and you write to an output. We have these families of processing operations: element-wise transformations, which are like map in the MapReduce model; grouping transformations, which are like the shuffle and reduce in MapReduce; and windowing and triggers, which let us play with these time transformations and decide when we want to trigger the computation of the aggregations. So this is, more or less, the Beam model in a nutshell. Now, Nexmark on Apache Beam. The story, quite quickly, is that Google contributed it around version 0.2, and then it was kind of abandoned; the author left Google, so the project was a little bit in limbo. Another colleague and I took it, and we started to refactor it and bring it back up to date with all the changes in the API, and especially to make it generic so it could execute on every runner. It was supposed to be like that already, but it was still a little tied to the Google stuff. We evolved some of the queries, and we finally got it merged in December of last year, so now it's part of master in Beam. The advantage of using Beam for benchmarking is that it addresses some of the issues we discussed before: we have a rich model in the semantic sense, we have one codebase that allows us to execute on every system, and we can also benefit from the metrics API that Beam includes. So, let's discuss the implementation quickly. We have just four big boxes: a launcher that executes one configuration of the benchmark; a generator that creates the events in the system, events being new people, new bids and new auctions; the metrics that calculate everything; and the configuration of the benchmark.
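The three families of operations can be illustrated with a toy sketch in plain Python instead of Beam: an element-wise map, then a grouping (the analogue of the shuffle), then a reduce.

```python
from itertools import groupby

# A small stream of event kinds (illustrative data, not Nexmark events).
events = ["auction", "bid", "bid", "auction"]

pairs = [(kind, 1) for kind in events]                    # element-wise (map)
grouped = groupby(sorted(pairs), key=lambda kv: kv[0])    # grouping (shuffle)
counts = {k: sum(n for _, n in vs) for k, vs in grouped}  # reduce

print(counts)  # {'auction': 2, 'bid': 2}
```

In Beam these would be a ParDo, a GroupByKey and a Combine, respectively; windowing and triggers then decide how the grouping is sliced in time.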
This is an example of some of the queries we have in the system. There are now 13 queries: the original paper had eight, and the Google guys added five more, so we have 13, and they cover different areas of Beam. I won't go that deep into the details, because you would need to see some of the Beam concepts in more detail, but to explain a little: every query has this structure. There is a collection of events that happen in the system; we filter the events we care about, for example if we only care about auctions, we filter the auctions; then we apply the transforms in Beam; and finally we take the result and output it. For example, we discussed windowing before, this idea of grouping every ten minutes; well, there are different, more advanced ways of windowing, and these queries cover them. One thing I forgot to say on the last slide: some of these queries could be answered in an easier way, but we chose a deliberately convoluted way in order to test more, to cover more of the Beam model; in some cases, not all, but in some. Also, we cover things like triggering on an amount of data, not only in the time dimension: for example, in some cases you can say, once I have 20 samples of something, I can produce an approximate result. These kinds of things are covered here. We also cover the case of data that is out of order. What happens if a bid arrives before its auction? We store the bid until we receive the auction, and then we can compute, because otherwise we couldn't. This is also evaluated by some of the queries. So our conclusion is that we cover most of the Beam API, that we have a pseudo-realistic example, and that we can run this on every system; well, not on all of them, but we have covered some, as we will see.
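The out-of-order case above can be sketched as a toy join in plain Python (not Beam's state API, just the idea): bids whose auction has not been seen yet are buffered, and flushed once the auction arrives.

```python
from collections import defaultdict

auctions = {}                # auction id -> auction payload
pending = defaultdict(list)  # auction id -> bids buffered before the auction
joined = []                  # emitted (auction id, bid) pairs

def on_auction(auction_id, payload):
    auctions[auction_id] = payload
    for bid in pending.pop(auction_id, []):  # flush any early bids
        joined.append((auction_id, bid))

def on_bid(auction_id, bid):
    if auction_id in auctions:
        joined.append((auction_id, bid))
    else:
        pending[auction_id].append(bid)      # bid arrived before its auction

on_bid(42, "bid-a")             # out of order: auction 42 not seen yet
on_auction(42, "vintage lamp")  # auction arrives, buffered bid is flushed
on_bid(42, "bid-b")             # in-order bid, emitted immediately
print(joined)                   # [(42, 'bid-a'), (42, 'bid-b')]
```

In the actual queries this buffering is expressed with Beam's grouping and state facilities rather than explicit dictionaries, but the behavior being tested is the same.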
We use Nexmark on Beam mostly as an integration test, and what we are trying to achieve is not a comparison of which system is better than another; it's more that, between releases, we don't introduce serious regressions in any runner. So we are going to see a little bit of how we use Nexmark, and then the warning slide. Comparing systems is a really tricky thing. First, the implementations of the runners in Beam have different levels of support; some are way more advanced than others. Also, the native systems, Spark, Flink, Dataflow, have different characteristics; they were conceived with different design philosophies, so each is naturally better in some kinds of scenarios. So it's not easy to compare one to the other. Also, and this is one I especially like, all of these systems have a lot of knobs, and if you have done some Spark or Hadoop programming, you know what hell this can be: you change one variable in Hadoop and everything is different. And finally, we are benchmarking distributed systems, so we can have the bad luck that one machine breaks and the results are not easily comparable with the others; or, if you are in the cloud, you have a noisy neighbor that changes the way you are executing. So it's not easy. In particular, there is a really interesting blog post from about three weeks ago from the Flink guys at data Artisans, which I really recommend, where they also discuss this. Anyway, back to Nexmark: what is cool is that if you want to execute the benchmark suite on another system, you just change the runner; here you are executing with Spark, here with Flink, and that's it. To change the mode in which we generate the data, that is, batch or streaming, you just flip a switch. And if you want to run this on your local machine or on a cluster, you just change this. And this is the actual code; I'm not bullshitting here.
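To make the "just change the runner" point concrete, a Nexmark launch from the Beam tree of that era looked roughly like this (treat this as a sketch: the Maven profile names and the exact set of flags vary between Beam versions, so check the Nexmark documentation for your release):

```shell
# Run the Nexmark suite on the local direct runner, in batch mode
mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main \
  -Pdirect-runner \
  -Dexec.args="--runner=DirectRunner --suite=SMOKE --streaming=false"

# Same suite on the Flink runner, in streaming mode: only the profile,
# the --runner value and the --streaming switch change
mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main \
  -Pflink-runner \
  -Dexec.args="--runner=FlinkRunner --suite=SMOKE --streaming=true"
```

The pipeline code itself is identical in both invocations; that is the portability argument of the talk.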
And also, if you want to use the tools of your own system, for example in Flink, you just call flink run, send the jar, and the job executes. I just want to remind you that this is a synthetic benchmark, so we are generating the data; it is not read from a data store, at least for the moment. You can configure the benchmark with many different parameters: the amount of data generated, but also ratios, like how many bids you want per auction, and things like that. We can also generate artificial CPU and disk load, for those of you who want to test that in your clusters. This is an example of the output of a local execution, where I get the time every query took, the number of results, and the throughput; nothing special here. More interesting is a comparison: for example, here in blue is the Flink runner, and bigger is better, and the small one is the direct runner. As you can see, there is a really huge disproportion, but it makes sense, because as I said before, the direct runner serializes everything just to enforce serializability. We have a way to configure the direct runner, and this connects with the knobs issue: if I disable all the serialization checks on the direct runner, what I get is this, and it is much more similar. Of course there are still some differences, but this is one example. This is another example where we compare two versions of the Spark engine, 1.6.0 versus 1.6.3, just to see if we have regressions; same code in both. As you can see, for some queries we had some differences, but again they are quite similar; we cannot say anything is really different. So, what is the current status of the project and the future work? We have this support matrix for every runner and every query; those numbers are the JIRA issues that are still pending before we have full compatibility. And there are new runners.
They are already in Beam, but we have not executed Nexmark on them yet; we have just added them to the matrix. We track most of the Nexmark work in the Beam JIRA, so if you are interested, you can take a look there. And what's next? The ongoing work is to fix all the pending issues on the runners that are not fully there yet. We really want to automate this in our CI infrastructure, so if you are an expert in Jenkins or Kubernetes, you are super welcome, because I don't want to do that alone, although it is interesting work too. Also, there is a guy from Google working on support for streaming SQL in Nexmark, so all the queries that are written today in Java, at least the ones we can easily translate to SQL, will exist in SQL too. One thing that I really want to do quickly is to extract the generator so it can be used by other systems: for example, if someone wants to implement Nexmark elsewhere, they can just take that jar and use it. And finally, there is also somebody working on a Python implementation of this, because that is part of the portability story we have in Beam. Well, of course, you are welcome to contribute; as I said, there are many areas where this can be improved, in Nexmark or in Beam in general. I want to say thanks to all the people involved in this. Usually during these presentations you see a single person talking, but I have to be honest, open source is like this: we have many, many people who each do a little bit and make these things happen. So finally, here are some references for those of you who want to go deeper into the subject. And well, that's it. Thank you. Questions, maybe? Thank you for the talk. I was just wondering, in the comparison, Flink versus direct runner, is this single-node performance? Yes, this is single-node performance, because the direct runner is only local. So, yes. Okay, thanks.
And as I said before, we don't want to get into that; Beam is like this with the other big data projects, I cannot make enemies with the Spark guys or the Flink guys. I don't want to play that game anymore. Other questions? Thanks for the talk. So in the compatibility matrix, do you think most of the pending problems are with Spark? No, not really. The thing is, we have a capability matrix on the Beam website where we cover all the different functionalities of the Beam model. But for the case of Nexmark, there are some tricky things; since you mention Spark, we currently don't have side inputs for the Spark runner in streaming mode, because they are quite tricky to implement, so we don't have them yet. And it's not only Spark. I mean, where was the matrix... yeah, here. For example, you see that the one with the most pending issues is Spark, but that's not exclusively because the model of Spark is different, micro-batch rather than pure streaming, or something like that; it's the level of implementation of each runner that makes the difference. And remember that Google Dataflow is not there because we were really focusing on the open source stuff; I hope the Google guys do this one day. So do you expect that, both for the benchmark and the rest of the Dataflow model, there will be a convergence, and most of these problems will be ironed out? Yes, of course, we are behind this. Actually, one interesting output of this work is that we managed to file something like ten JIRAs against the different runners; I was annoying the runner authors about these things. For example, Aljoscha from the Flink guys, I was always telling Aljoscha, I want this. But, I promise, the Spark guys, some of them quit the project, and now I'm also one of the guys supporting that runner, so I will have to fix it myself. Any more questions? Okay, thanks again. Thank you everyone for spending your day here, or part of your day here.