Hello everyone, thank you very much for joining this presentation; I hope you'll find it useful. My name is Ruslan, I work as a data architect at Bolt, and today I will share our experience of migrating a unicorn-sized company from batch to real-time stream processing. I guess most of you are fairly familiar with Bolt: we are a leading micro-mobility and transportation platform in Europe and Africa. This slide shows the volume of money raised by different players in the market. It may be a bit outdated, but it conveys an important message: it shows how critical efficiency is for success in today's business.

These are the topics we will cover today. We will start with the requirements users have towards data nowadays. Then we will talk about why a company would want to adopt stream processing and migrate away from batch. Then we will discuss where to get the events to process, then the actual internals of stream processing, and we will wrap up with the problems we ran into along the way and the lessons we learned.

So let's start with the requirements users have towards data. Data can originate in many different sources: backend microservices, frontend services, databases (SQL or NoSQL), files, and ultimately even user input. The first and foremost requirement is that data has to be consistent across different systems and services. Users will not tolerate seeing one number in one place and a different number somewhere else. The second requirement is that users don't actually care where the data originated, whether the source was structured or unstructured; they want an easy way to query it and to get to all of its fields and values. We all live in the era of big data, meaning the volumes of data we have to store, process and handle only grow over time, but users don't care about that either: they want access both to the fresh, current data and to all the historical data. And last but not least is the speed of delivery. Users don't want to wait for data to propagate through the systems; the faster it arrives, the better.

Now that we know the requirements people have towards data, let's discuss why a company would actually want to adopt stream processing. Remember how we briefly talked about the importance of being efficient in today's business; that is exactly the point of this slide. Stream processing puts data into motion: it allows data to flow through your systems, which means the company can build so-called reactive microservices, services that react or respond to a change in some state or to some event. Imagine you are buying something on Amazon. When you click the buy button, many actions have to happen: an invoice has to be generated and sent to your email, inventory in the warehouse has to be reserved, the shipment process has to be initiated, and all of those actions are triggered by one simple click.
Adopting stream processing is what makes this possible, and it also lets you detect all those changes in real time, so you can make real-time decisions on top of your data. I guess most of you are familiar with the concept of ETL, where you process large volumes of data in batch mode. Stream processing lets you spread the load of those ETLs over a time period such as a day or a week: instead of running one heavy ETL job once a day or once a week to process all the data for the past period, you incrementally process every new piece of data as it arrives. Adopting stream processing also allows you to build services and architectures that are much more scalable.

So now that we know why you would want to put events in motion, let's talk about where you can actually get those events from. We all live in the era of microservices, and they have plenty of benefits, for example ease of deployment and being decoupled from one another. But they also come at a cost, and this is usually how it looks in a production environment: wanted or not, at some point you end up in a zoo of microservices all talking to each other, and good luck deriving any meaning from that chaos. One important property of microservices is that they don't live in isolation from each other; they need to communicate to signal to other services that something has happened. Believe it or not, this may sound easy, but it is actually one of the hardest problems in modern software engineering, because you can never be sure that a given communication between services has actually happened: services can fail, a request can get lost, and you will never know about it. Unfortunately, right now there is no engineer-friendly, easy way to guarantee that such communication took place. Those of you who have worked with similar setups know that people usually solve it with some kind of distributed transactions or state machines, and you also know how painful that is.

When we at Bolt thought about how to tackle this problem, how to migrate towards stream processing and where to get the events from, we started thinking from the other direction. We use MySQL as the relational database for all our microservices, and those of you who have worked with databases might know that most modern databases have a so-called commit log under the hood. Simply put, it is the log of all the changes the database has applied, used internally by the database to guarantee data consistency. Not many people know that this log can also be read by other applications, and that is exactly the route we decided to take at Bolt. But then the question becomes: if we read this log, the data we extract from it has to be stored somewhere, so that it can later be processed by someone else, or better yet, by many different consumers.
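To make this a bit more tangible, here is a rough sketch of what a single change event read from the commit log could look like once it is turned into something other applications can consume. This is only an illustration in Java: the type, the field names and the exact shape are hypothetical, loosely following common change-data-capture conventions rather than describing our actual format.

```java
import java.util.Map;

/**
 * A hypothetical shape for one row-change event extracted from the database
 * commit log. Field names are illustrative; the idea is simply to capture
 * what changed, where it changed, and when.
 */
public record RowChangeEvent(
        String sourceTable,             // e.g. "orders"
        Operation operation,            // what happened to the row
        Map<String, Object> before,     // row image before the change (null for an INSERT)
        Map<String, Object> after,      // row image after the change (null for a DELETE)
        long commitTimestampMillis) {   // when the database committed the change

    public enum Operation { INSERT, UPDATE, DELETE }
}
```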
One of the crucial requirements we agreed on when we started developing this pipeline was that it has to be scalable: the company grows year over year, we launch new cities and new countries, and we provide rides to more and more customers, so all of us agreed it has to scale. And whenever anyone says anything to me about scalability, there is one piece of software that comes to my mind straight away, and that is Apache Kafka. For those of you who might not be familiar with it, Apache Kafka is a messaging system that was developed at LinkedIn and later open sourced. It provides plenty of useful guarantees to engineers and its users, but most importantly it has very high event throughput and it is horizontally scalable. One of my favorite features of Apache Kafka is its write-once, read-many semantics: you write an event once, and it can then be processed by any number of independent consumers, each at its own pace and at different points in time in the future. There is no need to duplicate the event so that many consumers can process it; you write it once and process it as many times as you want (there is a small consumer sketch at the end of this section that illustrates the idea).

So now we understand where we can get those events from and that we persist them to Apache Kafka; next comes the stream processing part, so let's dive into it. There are plenty of frameworks and libraries on the market that do stream processing, and it is up to you to decide which one to use. But before discussing them, let's define what a stream actually is. A stream has two important characteristics. First, it is an unbounded flow of data: you are getting events right now, you will keep getting them in the future, and there is no point at which they will stop; they simply keep arriving continuously. Second, it happens in real time: you don't process it once an hour, once a day or once a week; it is happening at any given point in time.

Whenever we talk about stream processing, there are two types of stream processors you have to clearly differentiate between. The first are so-called stateless processors: they process every single event independently of all previous events, so there is no history whatsoever. Examples are transformations, extracting a field from an event, or repacking an event into a different form, anything you can do for each event separately. The second are so-called stateful processors: whenever they process a new event, they also rely on the history of the events they have processed before. Examples are calculating aggregations or filtering based on previous history.

Like I said, there are many different stream processing frameworks on the market, but there are two I would like to specifically highlight in this presentation because they are native to the Kafka ecosystem. The first one is called KSQL. It lets you do stream processing in a syntax that is very SQL-like and very familiar to engineers.
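To illustrate the write-once, read-many point mentioned above: two consumers in different consumer groups each receive every event on a topic, independently and at their own pace. A minimal sketch follows; the topic name, group names and consumer roles are hypothetical, not our actual setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Write once, read many: each consumer group sees the full stream of events,
// and neither group affects what the other sees or how far it has read.
public class IndependentConsumers {

    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("row-change-events"));   // hypothetical topic name
        return consumer;
    }

    public static void main(String[] args) {
        // Two independent readers of the same topic.
        KafkaConsumer<String, String> dataLakeWriter = consumerFor("data-lake-writer");
        KafkaConsumer<String, String> backendService = consumerFor("backend-service");
        while (true) {
            for (ConsumerRecord<String, String> rec : dataLakeWriter.poll(Duration.ofMillis(500))) {
                System.out.println("data lake writer saw: " + rec.value());
            }
            for (ConsumerRecord<String, String> rec : backendService.poll(Duration.ofMillis(500))) {
                System.out.println("backend service saw: " + rec.value());
            }
        }
    }
}
```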
Back to KSQL: with a single, simple SQL-like statement you define which events from which topic you want to process, which fields to extract from them, and to which destination topic you want to persist the results. KSQL is a good tool for what I would call lightweight stream processing, but sometimes you need much more sophisticated logic, for example something specific to your business where you need to do very particular aggregations or transformations. In that case the Kafka ecosystem provides the Kafka Streams library. It lets you define stream processing tasks as Java applications, so whatever you can express in Java code, you can do in your stream processing (there is a minimal Kafka Streams sketch at the end of this section). The good news is that both frameworks can be extended. If you need to implement something very much tailored to your business, you can write so-called user-defined functions: you define a Java function, which can implement an aggregation or any piece of business logic, and then call it both from KSQL, like any SQL function, and from your Kafka Streams application. It can be something closely tied to your business logic, or it can be some fancy machine learning used for fraud prevention, anomaly detection, anything.

So this is the setup we have adopted at Bolt. We ingest data from all the source databases, persist it into the Kafka brokers, do the stream processing with the libraries I mentioned, persist the results back to Kafka, and those results are then consumed by a number of different consumers. We store some events in our data lake, and we also let backend microservices consume those events and do business logic and decision making on top of them.

Which problems did we encounter along the way? When I started this project, there was no managed Kafka offering on the market; as of now there are already a few available, but there were none at the time. If you are planning to adopt Kafka, I would actually advise you to go with a managed cloud offering, because setting up and managing Kafka clusters is not the easiest task, believe me. That was not exactly a problem, but it was one relatively big challenge and obstacle we had to overcome. The second problem: with stream processing you can be getting hundreds of thousands of events per second, and if someone comes to you and says, hey, this number is not actually correct, good luck debugging it, because it is really hard to find where exactly the error is happening. The next issue is that, unfortunately, most stream processing frameworks today are JVM-oriented, so wanted or not, you will need to use Java, Scala or some other JVM-friendly language. The next thing you have to understand is one of Kafka's big advantages: it also keeps the history of events. Retention is configurable, you can set it to one day, one week, one month, one year, whatever you need, and it allows you to replay data from the past.
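Going back to the Kafka Streams library for a moment, here is a minimal sketch of what such a Java application can look like, with one stateless step (a filter that looks only at the current event) and one stateful step (a running count that depends on everything seen so far). The topic names, the key choice and the JSON-string event format are hypothetical and only for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

// A minimal Kafka Streams topology: stateless filtering, then a stateful count.
public class OrderEventsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-events-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("order-events");

        // Stateless: each event is judged on its own, no history needed.
        KStream<String, String> completed =
                orders.filter((key, value) -> value.contains("\"status\":\"completed\""));
        completed.to("completed-orders");

        // Stateful: a running count per key (for example per city, if the
        // events are keyed by city id) that depends on all previous events.
        KTable<String, Long> completedPerKey = completed.groupByKey().count();
        completedPerKey.toStream()
                .to("completed-order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```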
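And to make the replay idea concrete before we continue: one way to replay history is to wind a plain consumer back to the offsets that correspond to a chosen point in time and re-read everything from there. The sketch below shows that approach; the topic name and the three-day window are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

// Replaying history: seek a consumer back to the offsets for a point in time
// and re-process the stream from there onwards.
public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("enable.auto.commit", "false");   // manual assignment, no group offsets
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long threeDaysAgo = Instant.now().minus(Duration.ofDays(3)).toEpochMilli();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions of the topic manually so we control the offsets.
            List<TopicPartition> partitions = consumer.partitionsFor("order-events").stream()
                    .map(info -> new TopicPartition(info.topic(), info.partition()))
                    .toList();
            consumer.assign(partitions);

            // Ask the broker for the earliest offset at or after the timestamp...
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, threeDaysAgo));
            Map<TopicPartition, OffsetAndTimestamp> startingPoints = consumer.offsetsForTimes(query);

            // ...and seek each partition back to that point.
            startingPoints.forEach((tp, offset) -> {
                if (offset != null) {
                    consumer.seek(tp, offset.offset());
                }
            });

            // From here on the consumer re-processes the stream as if the
            // events were arriving for the first time.
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(rec -> System.out.println("reprocessing: " + rec.value()));
            }
        }
    }
}
```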
Let's say you have released some new logic and a few days later you realize there was a mistake in it; now you can replay that history, replay those events, and make the necessary adjustments to your business logic. But you should not think of it as replaying just one specific slice of data: as I said, a stream is an unbounded flow, so the right mental model is that you start processing from some point in the past and simply keep going from there.

Other problems we have seen along the way include data deduplication: the network is not reliable, services can fail, and eventually that leads to some events being duplicated, so it is important that your consumers are prepared for that. The next one is type incompatibility: whenever you integrate many different systems and make them talk to each other, you have to be very careful that they all handle the data and its types in the same way. And last but not least, like I said, if you want to start sourcing events from your databases, you should also think about how you handle database schema migrations, for example adding columns to a table, changing a column's type, or creating new tables.

That's all I wanted to cover today. Thank you very much for your attention, and if you have any questions, feel free to reach out. Thank you.