Okay. Hola. It's a pleasure to be here, and I'm going to speak in Spanish. No, it was only a joke, but it's really nice to be in Madrid, and I really like Spain and Spanish culture, so it's nice to be here. So hi. Today I'm going to talk about how we can get data from Kafka to Parquet. First, to give you a little bit of context: I work at bol.com, and bol.com is the biggest e-commerce platform in the Netherlands and Belgium. We're a huge web shop that sells everything from toys, books, and electronics, and lately we even started selling alcoholic beverages, which is kind of awesome. Then a little bit about me. I'm Gabor. I work as a data engineer at bol.com. Previously, I worked at a research institute in Hungary, where I was mostly involved with contributing to open source projects like Apache Flink streaming, and I also implemented some machine learning algorithms on top of these distributed systems. Since then I've been working in the measurement and recommendations teams at bol.com. Today I'm going to talk about a really simple problem, or at least a seemingly simple problem: you have a data stream, and you want to turn this stream into files, batched by time. Although this seems easy, we struggled with it a lot and tried several solutions. We learned a lot from this struggle, and I hope you're also going to learn a little bit. I'm just going to go through our solutions and then conclude. So what is this problem exactly? First, we have click data: events about users viewing products, clicking on recommendations, all sorts of things. We store this data in Kafka, and I think you probably know Kafka. How many of you know Kafka? Okay, yeah, sure, I'm at the right place. Simply put, it's a distributed message queue. We store the data there because we can get at it in real time, and of course we need a distributed message queue, because we have a huge amount of data.
Currently, we have around 10,000 events per second in this specific topic, and it's going to increase, because our company is going to grow and we're going to measure more and more. So the challenge is putting that data into files on HDFS, the Hadoop Distributed File System, specifically. And why do we need this? We need it because, yeah, everybody is going streaming nowadays, but we still have analytics teams that are using tools that cannot read Kafka, and if you just want to gain insights from historical data from one year ago, plain old files can be easier to work with. So basically our task is: how are we going to get from Kafka to HDFS files? Before I go into that, I want to touch on one topic, which is the difference between batch and stream processing. You probably mostly have an idea about it. Batch processing is when you have a finite amount of data, and you have a program that takes this data and outputs a result, and that's it. The program is finished. Maybe you run it daily, but it plainly, simply finishes. In stream processing, you have a continuous, infinite data stream. You process the data 24/7, your job is running 24/7, and you're continuously creating output. To give you an analogy, and I like this analogy: let's say you have a problem, you want to fill a pool with water. Batch processing is a bit like having buckets, carrying buckets of water around, and emptying them into the pool. It seems like a terrible job. And stream processing is like when you have a water hose, and you just hold the water hose and fill the pool. In this analogy, the water is your data, and in batch processing, the batches are the buckets, basically. But how does this map to our problem? In our case, we have a stream of data in Kafka. We have that water hose, that stream. And we want to fill up buckets, basically.
We want to create batch files so that people can use the data in batch processing as well. In expectation, it should be easy. In reality, we can go crazy with that water hose like that dog, and we actually went a little bit crazy with this. But why? Why is it difficult? It's just dumping data, right? It's difficult because we have some requirements. First, we want a scalable solution, and that's not surprising: we're at a big data conference, and like I told you, we have 10,000 events per second and it's growing, so we want something scalable. Second, we also want to dump the data exactly once. What do I mean by that? If we have one set of messages in Kafka, we should have exactly those messages in the files. I don't want to lose any messages in the files, and I don't want any duplicates. Why do I care about duplicates? Because let's say there's a data analyst in our company, Katie, for example, and she will need to read this data several times. She doesn't want to start every single job with a distinct. Katie doesn't like distinct. Katie wants deduplicated data. So we are thinking about our data consumers and not ourselves. Exactly-once is one requirement. Another requirement is event time: having these files partitioned by event time. What do I mean by that? Let's say we are reading event streams. One event comes in at 9 a.m.: a user clicked on something. But actually the user clicked on that item at 8:55. So the time when the event happened and the time when we got to process the event are a little bit different; the event is a little bit late. Where should we put that event then? Let's say we have batch files of one hour. Should we put it in the file for 9 a.m. or for 8 a.m.? We want to put it at 8 a.m. Why? Because, again from a data consumer perspective, Katie, our data analyst, doesn't want to have to think about which files to look at if she needs only the events from between 8 and 9.
She doesn't want to look at the neighboring hours and files. She just wants to say: I want the events that happened between 8 and 9. So we need files in event time. And last but not least, we want a columnar format, because it's more optimized for reading. Again, Katie doesn't like loading files that take a long time. So that's what we want. To give you a little bit of an idea why this columnar format is better, let's say you have a data set of customers viewing products. You have a customer ID, the product ID viewed, and a product category. There are basically two ways to store this data on disk. One is the row-oriented format. In this case, you store the data record by record: first all the fields for the first record, so the customer ID, the visited product, and the category, then the second record, then the third record, and so on. Whereas in a column-oriented format, yeah, it's pretty trivial: you store it by column. First you store all the customer IDs, then all the visited products, then all the categories. That seems easy. But why is the row-oriented format not so good for analytics, for reading the data? Let's say we have another data analyst, Peter, over there, and he doesn't like slow loading. And Peter is not interested in the categories. Peter is only interested in two columns, just the customer ID and the product visited. In this case, if we store the data in a row-oriented format, the system that Peter uses needs to read all the data anyway, because it's reading it row by row, every row, so it has to read all the category values too. And if we're talking about big data, this can be an issue; it can make things a lot slower. In contrast, with the column-oriented format, because we store the data in columns and Peter is not interested in the category, we can just decide not to read that column.
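To make the row-versus-column idea concrete, here is a toy sketch in Python (this is just an illustration of the layouts, not how Parquet is actually implemented): we lay the same records out row-wise and column-wise, and count how many values a reader has to touch when, like Peter, it only needs two of the three columns.

```python
records = [
    {"customer_id": 1, "product_id": "p9", "category": "books"},
    {"customer_id": 2, "product_id": "p3", "category": "toys"},
    {"customer_id": 3, "product_id": "p9", "category": "books"},
]

# Row-oriented layout: values stored record by record.
row_store = [
    v
    for r in records
    for v in (r["customer_id"], r["product_id"], r["category"])
]

# Column-oriented layout: one contiguous list per column.
col_store = {
    "customer_id": [r["customer_id"] for r in records],
    "product_id": [r["product_id"] for r in records],
    "category": [r["category"] for r in records],
}

# Peter only needs customer_id and product_id.
# Row-oriented: every value must be scanned, categories included.
values_read_row = len(row_store)
# Column-oriented: the category column is simply skipped.
values_read_col = len(col_store["customer_id"]) + len(col_store["product_id"])
```

With three records, the row-oriented reader touches 9 values while the columnar reader touches only 6; with billions of records and wide schemas, skipping whole columns is where the real savings come from.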
And that makes it basically faster. Both our data analysts, Katie and Peter, love fast loading. So we want a column-oriented format. And the go-to column-oriented format in the Apache Hadoop and big data landscape is Parquet, so we want Parquet. Let's get back to our requirements. I told you about the requirements, but what tool are we going to use to solve this problem? The first thing that popped into our minds was Apache Flink, which is a distributed stream processing system. How many of you have heard about Apache Flink? OK, cool, so this is not going to be really new. Apache Flink seems suitable for these requirements. Why? Because it's distributed and scalable, it gives you exactly-once guarantees even if failures happen, it has abstractions for handling event time, and it has Parquet output and input connectors. So that seems cool and easy. So in our problem statement, the remaining question mark is the tool: we are going to use Flink for this, but we'll see how it goes. Let's go to our first solution. We first wanted to use Flink windowing. Windows are just chunks of events based on time, so you can have a window from 8 to 9, or 9 to 10, and so on. What happens is that Flink will store the data for one window, and then when the window finishes, it will do something, an aggregation. In this example, I'm collecting events over two hours. Let's say again that we want to write hourly batch files, batch files for every single hour. First, you can see that in the Kafka queue I marked events with brown and blue squares. The blue squares are events between 19 and 20 o'clock, and the brown ones are events between 18 and 19. You can see that the events are out of order, and that's also what happens in reality. You don't see your events come in order, because we're dealing with distributed systems.
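The hourly event-time windowing just described can be sketched in a few lines of plain Python (a hand-rolled simulation of what Flink's windowing abstraction does for us; the function names and the watermark callback are illustrative, not Flink's API): events are buffered per hour bucket, and a bucket is only "fired" (written out) once the watermark tells us no earlier events can still arrive.

```python
from collections import defaultdict

def hour_bucket(event_time):
    """Start of the hour (in epoch seconds) an event-time belongs to."""
    return event_time - (event_time % 3600)

windows = defaultdict(list)  # bucket start -> buffered events (all in memory!)
closed = {}                  # bucket start -> finished batch ("the file")

def on_event(event_time, payload):
    # Assign by event time, not arrival time, so out-of-order
    # events still land in the right hourly batch.
    windows[hour_bucket(event_time)].append(payload)

def on_watermark(watermark):
    # Every bucket that ends at or before the watermark is complete:
    # in the real job this is where the batch file would be written.
    for start in sorted(windows):
        if start + 3600 <= watermark:
            closed[start] = windows.pop(start)

# Out-of-order arrivals between 18:00 and 20:00:
on_event(18 * 3600 + 100, "a")
on_event(19 * 3600 + 5, "b")    # a 19-20h event arrives early
on_event(18 * 3600 + 2000, "c")  # an 18-19h event arrives after it
on_watermark(19 * 3600)          # nothing before 19:00 can still come
```

Note that everything sits in `windows` until the watermark fires it, which is exactly the memory problem discussed next.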
I also made a simplification here, because I'm showing just one ordered queue. In reality, you have a distributed queue in Kafka: you have multiple partitions per topic. But you can easily generalize my examples to multiple partitions. So we start reading the data, and we collect it per window in Flink. We have a window for events between 18 and 19 o'clock and so on, and we just read the data. And as soon as we know that one window has finished, that we've seen all the events before 19 o'clock, we write that file. Flink just keeps this data in memory. So that looks pretty straightforward and easy. But how do we handle failures? Oh, we're lucky in this case, we can celebrate: it works out of the box. We don't really even need to care about fault tolerance, we don't need to understand it, because Flink will handle it for us if we are using these windowing abstractions. But there's always a but. We might use too much memory. Like I said, this is also a bit of a simplification, but Flink keeps this state, these windows, in memory. In our case, if somebody decides we don't want batches of one hour but batches of two hours, and we're using the same amount of memory, we need to keep two hours of data in memory. So maybe we keep collecting and collecting in memory and boom, at one point it runs out of memory. We could just increase the memory on the machines, add more processing units and so on, but it feels a little bit bad, because we want to do a simple thing. We just want to read some data and dump it into files. Why should I keep all this data in memory? And we actually experienced this: when the load changed, we got these out-of-memory exceptions, and we decided it would be better to come up with a solution that doesn't require that much memory. So we went on to the next solution, which is called the bucketing sink.
In Flink, the output connectors are called sinks; we are sinking the data, so you can have a sink for Cassandra, a sink for Kafka, or for any other database. And the bucketing sink is basically something that writes into files called buckets, and buckets are files based on time. So this seems exactly what we need: we want buckets of time for every single hour, and that should work well. In this image, as you can see, we are still reading the data from Kafka and dumping it into files, but we don't need to store it in memory. As soon as we read a record for some bucket, say an event between 19 and 20 o'clock, we just dump it into the file. So basically nothing, or at least much less, is stored in memory. That sounds good. But what about handling failures? Here comes the slightly more tricky part, and we need to understand Flink's fault tolerance mechanism a little bit. Flink handles failures by making checkpoints. How does it make checkpoints? It marks one point in the data source when a checkpoint happens; I noted that with the green line, as you can see. And when the data processing reaches that checkpoint, Flink marks the position in the files that has been written up to the checkpoint. I also marked that with small green lines at the files. So the checkpoint completes and we keep on processing. We're writing more data to the files and boom, suddenly a failure happens. Yeah, failures can happen. For example, in our case we had a commodity Hadoop cluster, and people were sometimes maintaining nodes, and if somebody takes away the node your job was running on, then your job is going to fail. It's that easy. So you need to account for failures.
In a failure scenario, what Flink does is roll the reading back to the checkpoint, and because Flink marked the position in the file when the checkpoint happened, it can simply cut off the end of the file. It can just truncate the file and keep on writing the data. This is the big idea that ensures exactly-once guarantees. If we kept the data at the end of the files, we would have duplicates; but by knowing exactly the position in the files at the checkpoint, we can just truncate them. So that sounds good and it should work well, but there's always a but. In our case, HDFS didn't support truncating; it didn't support cutting off the end of files. So we just could not do that, and that's kind of sad, but at the time we were doing this, it was a problem. The good thing is that the Flink community has since fixed it, and Flink can now support systems that don't allow truncating by copying the relevant parts of the files: we know up to which position the data is still correct, and it can copy that data. But there's another but. Parquet doesn't allow flushing. That means at the checkpoint, we cannot know the position we were writing at. So that leads to a problem: we just cannot know what the position was when our checkpoint was taken, so at a failure, we don't know which parts of our files to cut. That led us to another idea: closing files at checkpoints. What do I mean by that? We don't just write one file for every hour. We keep a directory per hour, and we write multiple part files into it. First, we write one part file, but when we get to a checkpoint, we close the file that we've been writing and start a new file. As the writing goes on, we only write to the new file, and when a failure happens, we can simply remove that still-open file and roll back to the checkpoint.
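The "close part files at every checkpoint" trick can be sketched like this (a simplified Python simulation of the idea, with made-up class and method names, not Flink code): because finished parts are sealed at a checkpoint and never touched again, recovery never needs to truncate anything; it only has to throw away the one still-open part and let the replayed source rewrite it.

```python
class HourlyBucketWriter:
    """Toy model of one hourly bucket written as multiple part files."""

    def __init__(self):
        self.finished = []   # part files sealed at a checkpoint: never touched again
        self.open_part = []  # records written since the last checkpoint

    def write(self, record):
        self.open_part.append(record)

    def on_checkpoint(self):
        # Seal the current part file; subsequent records go to a new one.
        self.finished.append(self.open_part)
        self.open_part = []

    def on_failure(self):
        # Drop the unfinished part. The source is rewound to the checkpoint,
        # so these records will be written again: no loss, no duplicates.
        self.open_part = []

w = HourlyBucketWriter()
w.write("e1"); w.write("e2")
w.on_checkpoint()            # ["e1", "e2"] is now safe on disk
w.write("e3")
w.on_failure()               # "e3" is discarded with the open part...
w.write("e3")                # ...and rewritten after the source replays it
```

This sidesteps both missing pieces at once: no HDFS truncate and no Parquet flush position are ever needed.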
That actually helps us, because we don't need truncating and we don't need to know the position; we just close the files at every checkpoint. So that kind of solves our problem, but there's always a but. In order to do this, we needed to change the plain bucketing sink code, which feels kind of tricky. And I think that's one really good point of open source projects: when you're doing something and it doesn't work, but you can take the code and change it, that's pretty useful, and in Apache projects the code quality is normally really good. So what we did was look at the bucketing sink code and change it to our needs, to close these files. And even better, Flink now supports this closing of files at checkpoints out of the box, with something called the streaming file sink. So the community actually solved this for us as well, but we had been experimenting with this before the release of Flink 1.6, which got this feature. A really huge kudos to the Flink community, because that's awesome: we won't need to worry about this anymore. But we had another problem. Like I said, we'd been writing data, but we also cared about late events. We said we want to keep late events as long as they are at most 12 hours late. What happens in that case is that as we write those files, at first there will be large files and a small number of files. But as we move on to late events, let's say we're writing the file for nine o'clock in the morning at seven o'clock in the evening: that's still within 12 hours, but by then we're not going to get many events for 9 a.m. anymore. Maybe some files will only contain one record, one single record.
So we end up with a lot of small files, and that's bad for HDFS, because HDFS is not really good at handling small files. And it's not just this specific HDFS technology: handling a lot of small files is not really good for any processing system. For instance, some people in the company wanted to process this data with Apache Pig, and Pig, before doing any processing, reads the data and reads the metadata about the data sets to estimate their size and so on, to create optimizations, and it just couldn't handle too many files; we got something like a "too many connections" error. And that's not really good, because then we cannot easily support Katie and Peter, our data analysts. By this time, we had tried a lot of things, and we were asking: why? Why can't this simple thing be done easily? Why are this streaming fault tolerance and all these systems so complicated? Why do we need to understand them? So one teammate of mine, Karst, came up with an idea: why don't we just run a daily job, or an hourly job, that reads all the data in Kafka and dumps out the relevant events for one hour? That's pretty simple, right? So we did that. Basically, for the events between 18 and 19 o'clock, we read the whole Kafka queue and dump only the events relevant to that hour; then later we run another job, and we dump only the events between 19 and 20 o'clock. That's simple, of course. So I kind of feel that sometimes sticking to batch processing, carrying those buckets, can still be cool, and we can still be happy about it, because we avoid all the complexity. Of course, this is not a perfect solution, because first of all we need to reprocess the data: every time we run, we may need to process the whole Kafka queue, or at least additional data anyway.
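The whole batch job boils down to one filter: scan the consumed topic and keep only the events whose event time falls in the hour being materialized. A minimal sketch (the function name and the in-memory `events` list standing in for the consumed Kafka topic are illustrative):

```python
def dump_hour(events, hour_start, hour_seconds=3600):
    """Keep only events whose *event time* falls inside the given hour."""
    return [
        e for e in events
        if hour_start <= e["event_time"] < hour_start + hour_seconds
    ]

# Stand-in for the consumed Kafka topic (times as epoch seconds):
events = [
    {"event_time": 18 * 3600 + 10, "payload": "a"},
    {"event_time": 19 * 3600 + 1,  "payload": "b"},
    {"event_time": 18 * 3600 + 99, "payload": "c"},
]

# Would be written out as the 18:00-19:00 batch file:
file_18 = dump_hour(events, 18 * 3600)
```

Because the filter is on event time, late events automatically land in the right hourly file on the next run, with no windowing, checkpointing, or truncation logic at all; the price is re-reading data on every run.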
So that's not really perfect, and it also doesn't support the so-called semi-real-time case. I call it semi-real-time because there are some analytics people who don't want daily batches; they want to see what happened in the last half hour. We were getting into the Christmas season, with Black Friday coming the next week, so gaining insight into what happened in the last half hour or the last 10 minutes was really important, and maybe these people cannot use Kafka directly; they still want to use files. And if we run a batch job once a day, then our latency is one day, and that's not acceptable for some use cases. So it doesn't work for that. Of course, for the reprocessing problem we can use some optimizations in this batch processing. For instance, Kafka now supports indexing based on event time, which means Kafka keeps track of the positions of time ranges in the data. So you don't need to read the whole Kafka queue to get the time frame that you need: you can rely on this indexing and only read the relevant parts. So that's basically it; that's what we did. But of course I'm not totally satisfied with it, so what would be a proper solution for the future? We haven't done this, but one idea is the following. We were aiming for plain simple files, and those work well in practice for a lot of scenarios, but what if, instead of storing the batch data in files, we just use a database? Actually, the team is working on something like that, because we're using Google Cloud for some things, and they are working on getting this data into BigQuery, Google's analytics solution, instead of plain files. So that's one other solution: you could use different tools, and a database like BigQuery will take care of partitioning your data on time, so you don't need to worry about it.
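The event-time indexing optimization mentioned above can be sketched as a simple search over a time index (this is a toy stand-in for what Kafka's time index gives you via the consumer's offsets-for-times lookup; the function here is illustrative, not the Kafka client API): given record timestamps that are non-decreasing per partition, find the first offset at or after a target time, so the batch job can seek there instead of reading the topic from the beginning.

```python
import bisect

def first_offset_at_or_after(timestamps, target_ts):
    """timestamps[i] = record timestamp at offset i (non-decreasing).

    Returns the first offset whose timestamp is >= target_ts,
    or None if every record is older than the target."""
    i = bisect.bisect_left(timestamps, target_ts)
    return i if i < len(timestamps) else None

# Toy partition: six records with these timestamps at offsets 0..5.
ts = [100, 150, 150, 200, 260, 300]

start = first_offset_at_or_after(ts, 150)  # seek here, skip offset 0
```

In the real job you would ask the Kafka consumer for the offsets matching the start and end of the hour and consume only that slice, turning "re-read the whole topic" into "read roughly one hour of data."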
And you could use another tool, not Apache Flink but maybe Kafka Streams, and I'm sure there are hundreds of solutions out there that might solve this problem better. We went with Flink because we were already using Flink, and it seemed like a good idea. Another idea concerns our last problem, the lot of small files that Apache Pig, HDFS, and other systems cannot really handle: you could just write the small files and merge them at the end. That's another solution. Or, like I said, we had small files because we were keeping late events for up to 12 hours. You could just say: I don't care about that small number of late events, let's drop them, and let's maybe only keep late events that are five minutes late, but not 12 hours late. So these are all options. But to get back to a problem I mentioned: one interesting question is how do we support semi-real time? There are different solutions for that, and everybody's talking about Kappa architecture and Lambda architecture nowadays. Kappa architecture basically means you do everything in streaming. That sounds cool, because you have one code base, one system that can handle both streaming and batch workloads and all these things. Whereas Lambda architecture means you have a batch system combined with a streaming system, and these are two separate systems: you maintain two code bases, and you specialize them for these two use cases. So in our example, how could we support semi-real time? Maybe we could keep doing the same batch processing, reading the data and writing it in daily batches, and then create another job that simply drops the late events. That sounds good because, first, we get to keep all the late events in the historical data.
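The late-event-dropping side of that Lambda-style split could be as small as a single predicate (a sketch under the assumption that we measure lateness against processing time; the function name and the five-minute allowed lateness are illustrative choices, echoing the "five minutes, not 12 hours" option above):

```python
def keep_for_semi_realtime(event_time, now, allowed_lateness=5 * 60):
    """Forward an event to the semi-real-time files only if it is at most
    `allowed_lateness` seconds behind the current processing time.
    Dropped events are not lost: the daily batch keeps the full history."""
    return now - event_time <= allowed_lateness

now = 10_000  # current processing time, epoch seconds

on_time = keep_for_semi_realtime(9_900, now)        # 100 s late: keep
very_late = keep_for_semi_realtime(now - 3600, now)  # an hour late: drop
```

The appeal is that the semi-real-time path never has to reopen old files for stragglers, while the batch path still delivers the complete, exactly-once history.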
So we get to keep exactly the same set of events, but we can still support our semi-real-time use cases and still satisfy our data analysts who want the events of the last few minutes or the last half hour. So this was just one example of this Kappa architecture versus Lambda architecture question. I think a lot of people are excited about Kappa architecture, me included. But in some cases Lambda architecture, separating these two domains, can keep things simple, because the two sides get specialized solutions. So the main takeaway I want to give you is that streaming is not trivial. We, and other people, can say that Kappa architecture is the way to go, but I think in a lot of cases there's a lot of complexity. In our case it was the fault tolerance: we needed to understand the fault tolerance mechanism, whereas in a batch system we would just restart a job if it failed. In streaming, it's completely different and more complex. So I would say streaming is not trivial; but I should be fair and say streaming is not trivial yet. And what is awesome is that I saw talks yesterday from data Artisans, where Aljoscha was talking about these topics, and Confluent is also here at the conference, and I heard a talk about Apache Pulsar. These people are all working on this Kappa architecture idea: they are working on making it easy for us to just use the Kappa architecture, use one code base, and use it for streaming and batch alike. I think in the coming years it's going to be more usable, but I would say for some use cases it's not there yet, and it's just difficult, a struggle. So I would say: keep it simple, and if you have a use case where you feel streaming would be awesome but batch processing is still feasible, then go for batch processing.
But if you're adventurous and you say, yeah, let's try streaming, then go for streaming, but understand the system, because understanding Flink's fault tolerance and how these things work was really important for us to deal with the problems and to understand what can go wrong. So in that case, try to understand your system better. That's basically it. I also wrote a blog post about this topic; it doesn't cover exactly everything I talked about here, but you can check it out, and you can reach me on Twitter. I'm happy to answer questions. I saw some hands raised over there. [Audience] Really nice talk. I wanted to ask you a couple of questions. I recently worked on a project where we were doing this kind of movement, moving data to Parquet, and at some point, and it's not the first time I've had this discussion, we tried to decide whether to go with Parquet or with Hive's ORC, the Optimized Row Columnar format. I don't know if you have done any performance analysis regarding that, and what your conclusions were. And regarding the small files problem you've been facing, I just wanted to know if you have considered applying some kind of dynamic partitioning, grouping the time buckets with less utilization, to improve the performance. That's all. [Gabor] Yeah, I'm not sure what I can say about the dynamic partitioning, but to answer the first part of your question: the Parquet format is just one choice, and like every choice it's a trade-off. I talked about our requirements, but you can loosen those requirements. For instance, you could just use Apache Avro or any other format you want, if you can still get the performance your analytics teams need. So using a simpler, row-oriented format could help you, because then you wouldn't have these problems with the flushing and all these things. So that's something.
But yeah, in general, I don't know, it really depends on the use case what you can do. Oh, here's one. Yes, so that's one other thing: we do transformations, and especially in the age of GDPR we need to anonymize some parts of our data sets. That's one of the reasons why we need a distributed processing system and not just a simple solution that dumps the data directly, because then we can also do arbitrary transformations on these records. Yeah, okay. I don't see any more hands, but I will be staying here for a while, so you can catch me for questions afterwards if you want. Thanks. Thanks, everybody. Thank you.