Yeah, thank you. Good morning, and welcome to my talk "Streaming: why should I care?" My name is Christian Trebing. I work at Blue Yonder, a German company doing retail solutions: predicting what retailers need and what they should order. So that's my background, and now let's start with the talk.

What can you expect from this talk? First I will give you some motivation on why it makes sense to look at streaming at all — that's also the title of my talk — and some background on why I came to that question in the first place. Then I will give an introduction to what streaming is and what its basics are, and show you how you can use it from Python. Finally I'll show some challenges that might await you when you go down this road — or maybe swim down that stream — so that you are prepared to tackle them.

Okay, let's have a look at the situation we faced as a team for quite some time. We have a data processing application; as I said, we do machine learning, and we get lots of data from our customers. We had a quite monolithic application. We get input data from the customer, who sends it in large XML files. We need to validate that data and make sure it has the right quality to go into the machine learning. We do the machine learning, which takes lots of input data and writes out lots of output data. And then we have the output interface, where the customer can query the results of our machine learning.

This works quite fine — the results are really good — but we have some issues with operating and extending it. We have this big database in between where we store all that data, and all of these boxes access it. We use SQLAlchemy to access it from Python and Alembic for database migrations, and all of this works fine. But if you have an application of that type yourself, you might know some of the pains that come with it.

For example, we are several teams developing this application, and now there are dependencies: we all depend on this one database. Say the customer desperately wants a new feature in the machine learning, and the machine learning team says: oh, that's great.
It's no big deal — we just have to change our query a little bit, write to one more field in the output tables, and then we're done. But another team is also working on the database: they are doing some refactoring, changing the data validation. It's very good that they do that, but the changes conflict, and they say: well, it just takes two weeks until we're finished, and we need to make sure there are no conflicts. So now you, as the machine learning team, have to wait. That's bad, but there is nothing you can do about it.

That's one challenge you might face. Another is testing: with one big application it's always hard to establish clear boundaries and a solid testing strategy. You try really hard, but this is at least not an architecture that supports you in it. I liked the testing-in-layers talk from Monday, where you see all these nice little boxes and how you can separate things into layers — for sure you can do that with such a monolithic application too, but it's just hard to do.

Then I went to EuroPython two years ago, in 2015, and heard many talks about microservices, and I really liked the idea. It solves these issues: you split the monolithic application up into several boxes and can handle those boxes separately. Each has its own data, each has its own upgrades, and the teams can develop independently. Wow, that's great! We had lots of discussions after those talks about how we could use that for ourselves.

But as I said, we have a data processing application, and for data processing we have to process lots of data. That's where we saw that we could not use this model in its purest form: we would have to transfer lots of data between the services, and we would have to store lots of data in each service, since each service needs much of that data. Unfortunately, as nice as it looked, that was nothing for us.

So there we stood. Microservices sound great, but the case studies we saw were about billing and more transactional applications, where this works fine — and as I said, we could not use it. Then, at other conferences, many people were talking about streaming — and also here at this EuroPython I have already heard many talks about it, which is great to see. But most applications used it in financial services, for stock markets, where they have high data volumes that must be processed really fast, or for online advertising, where they process click streams or logs, again very fast. Well, I thought: these are the traditional domains of stream processing, and we are not among them. So what possibilities do we have?
We were just standing in the middle: we couldn't do microservices, and streaming seemed to be the domain of other people — interesting, but maybe nothing for us.

Fortunately, during a different project I came into contact with streaming and did some evaluations there, and suddenly it occurred to me that this might work for us after all. We don't have millisecond click streams to process, but many things that are more side effects of streaming — not just the pure millisecond processing — are really, really good, and we can use them to solve the issues we have. That's my reason for giving this talk: if you are in a similar situation, I hope this will help you too and give you some ideas on how you can improve your application.

So let's do an introduction to streaming and give you a basic idea of what this is. My background, as I said, is a database-centric application, so I want to compare the two and show you where the differences are. Let's look at databases and streams. I heard the talk Martin Kleppmann gave at the Strange Loop conference in 2014, "Turning the database inside out", and I thought it was a very good way of thinking about databases and streams — because essentially, a database and a stream hold the same information. A database internally has a change log, and this change log records what changed within your database tables. This change log is essentially a stream, and from it you can reconstruct the database content at every point in time.

Look at the example here: the first entry says "change the row with key a to the value 1", and you see the table has one row, key a with value 1. Then the next piece of information comes in, "change row b to the value 5", and you see it in the table. Then comes "change row c to the value 3", and you see that in the table too. Then it gets more interesting: you get updates for existing rows. a gets the value 8, then a gets the value 4, and c gets the value 2 — and at each point in time you have a consistent table in your database. That's the basic idea of the duality between tables and streams, and databases use it internally, for example to replicate to different nodes.

But why does this matter for us? The most interesting thing in my situation was that the different services we have can be in different states. We don't have the dependency on one single state, because each stream processor, each service, can have a different offset within the stream. Take this example: service one is now at offset three, and it sees the table as a = 8, b = 5, c = 3. A different service is already at offset five, and there a = 4, b = 5, c = 2. This is totally fine — both services are in a consistent state, and if service one catches up to offset five, it will have the same information as service two.

So the interesting thing is that you can have several services operating at different speeds. And that's exactly our situation: at one point in time we might get lots of new data from the customer and write lots of data into the stream, and we have services that process it faster and services that process it slower — but as long as they are able to catch up, that's totally fine.
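To make the changelog idea concrete, here is a tiny sketch — plain Python, no Kafka involved — that replays the example entries above to different offsets and reconstructs the table each service sees:

```python
# The change log from the example: a stream of (key, new_value) events.
changelog = [("a", 1), ("b", 5), ("c", 3), ("a", 8), ("a", 4), ("c", 2)]

def table_at(offset):
    """Reconstruct the table a service sees after consuming entries 0..offset."""
    table = {}
    for key, value in changelog[:offset + 1]:
        table[key] = value  # later events overwrite earlier values per key
    return table

print(table_at(3))  # service one at offset 3: {'a': 8, 'b': 5, 'c': 3}
print(table_at(5))  # service two at offset 5: {'a': 4, 'b': 5, 'c': 2}
```

Once service one consumes entries 4 and 5 as well, both services agree — that's the catch-up behaviour just described.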
You can also add new services to that structure, which can process the stream from the beginning. You simply have more control, more possibilities for scaling your services: not all services need to run at the same speed, and you can design by your needs. Which services need to be fast? Which services are fine being slower — for example, one that just aggregates data for reporting — while on the other hand you want to answer the customer quickly?

Okay, so we have that possibility. That's great — but how can we use it? I said some services might be faster and some slower; how can we influence the speed of our services? Well, we can program better — we can just improve our code — but that has only a limited effect. Sometimes we need even more, and here another idea comes into play that is very helpful in streaming: you can partition your streams. That is always a decision based on your business domain: what partitioning makes sense for us? For example, we get sales streams — the information about which sale happened in which location — and we can use this for partitioning the stream.

So let's say we have our sales stream and three locations: Rimini, where we are now; Bilbao, where last year's EuroPython conference took place; and Karlsruhe, where our company is based. We sell spaghetti, we sell ravioli, we sell pizza, at different times and in different quantities. Now we can have one processor that works on all three partitions — that's fine, and maybe the best place to start. But if we see that we need to get faster, we have the possibility to split this up, to introduce one or two new processors, and they can work in parallel on the different stream partitions: each partition can be handled by a different processor. (A minimal sketch of producing with a partitioning key follows at the end of this passage.) This sales stream example will follow us through the rest of the presentation.

So what does it look like now? We don't have the database-centric design, we don't have the microservices, but we have the streaming platform in the middle, and the idea is that the services can be at different offsets within it. What did we gain? We have clearer boundaries between our services; we have all those ideas of deploying and operating them independently; and the data lives mostly in the streaming platform. You might ask yourself: why do we still have these data bubbles in the processors? I'll answer that later. And what about the database schema changes I talked about at the beginning, when the data validation team and the machine learning team both need to change things? Save that question too — it will also be answered later.

So, as I said, we gained independent development and upgrades, and more options for scalability. Did we throw out the databases completely, now that we have the streaming platform in between? Well, think of those data bubbles; we'll come to that later.

For me, one important question was: this all sounds so great, such a good idea, it seems to solve so many problems — it sounds like magic. And magic always makes me suspicious. Maybe there's something I don't see; maybe new problems arise that I just don't know about yet. But it is not magic. It is a trade-off.
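Here is the promised keyed-producer sketch with confluent-kafka. The broker address and topic name are assumptions; the point is that Kafka hashes the key, so all records for one location land in the same partition and keep their relative order:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

sales = [
    {"location": "Rimini",    "product": "spaghetti", "quantity": 5},
    {"location": "Bilbao",    "product": "ravioli",   "quantity": 2},
    {"location": "Karlsruhe", "product": "pizza",     "quantity": 7},
]
for sale in sales:
    # Keying by location: records with the same key always go to the same
    # partition, so one processor per partition sees a location's sales in order.
    producer.produce("sales-input", key=sale["location"], value=json.dumps(sale))

producer.flush()  # wait until all messages are delivered
```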
A database is very powerful. You get many guarantees from it: the ACID consistency guarantees for your transactions, and SQL, which is such a powerful way to retrieve the data that is in the database — you can do nearly everything. But as we have seen, this comes at a price: we depend on one single state, and scaling the database is hard. So we have to think: do we need everything the database gives us in our application, and what are the things we lose?

With streaming, we don't have the ACID guarantees anymore. What we have is an ordering on each stream partition: the streaming platform guarantees that when we feed entries into the stream, each service that retrieves those entries will retrieve them in the same order. That may seem like a small thing compared to the ACID guarantees, but it's fascinating what you can build from it — with several streams and just this ordering constraint, you are able to construct many of the guarantees you need.

And we don't have the SQL queries anymore. It is not possible to query a stream, and this is something you really have to get into your head when thinking about this. You might be used to SQL, to querying for any row with any value while doing a join on it — but that is not possible. The stream just flows through your processor, and it's your responsibility to keep state and to remember, say, what the last value of a was. You have to decide whether you can live with that, and what mechanisms you employ to help with it.

So you see what you lose — but at least I feel better now. I know it's not magic that might come back to bite me at the worst possible moment; it's a conscious decision, and you can see the trade-off and judge whether it's good or bad for you.

Okay, so much for the theory. We use Python at our company — we really love to use Python in most of our services — so how can we do this in Python? Taking a step back: Apache Kafka, as a streaming platform, is not Python; it's implemented in Java. But we'll come back to Python soon. Kafka is an example of a streaming platform: you have producers that put data into the platform, consumers that retrieve data from it, and stream processors that take that data, wrangle it, and put it back onto different streams. You can also have connectors to databases to pull data in directly. It's a really good streaming platform, used by many people, very scalable, and really battle proven.
So that's something you can build on. And there are also Kafka clients in Python — just yesterday I heard from others here who are using them too. There are pykafka, kafka-python, and the confluent-kafka client; those are the three I have seen, and other people have already published very nice comparisons of them that you can look at in detail if you want. The interesting differences between them: pykafka and kafka-python are both written completely in Python. pykafka has a very Pythonic interface, whereas kafka-python stays closer to the original Java client's interface. And then there is the confluent-kafka client, which is not pure Python — it wraps the C library librdkafka — but it is the most performant of the three. So we decided to use the confluent-kafka client.

There are many configuration options, and it's really worth looking at them, because you can use them for performance tuning. At first I was a little surprised when I used the client and it seemed slow — but that was just my test setup, where I had very few records. The default settings are not meant for a few little records: they do more buffering, for throughput. If you reduce that buffering, you get very low latency in your test setup too, which then feels much better.

So let's see how we can use these clients. The first thing is to create a producer. I just have to give the producer the bootstrap server — my Kafka node, which by default listens on port 9092. I have some data — you can see the sales data here: in Rimini, on Monday, we sold some ravioli — and we want to put it onto a stream called sales-input, using JSON to serialize the data.

Then we want to consume that. The consumer also needs to know the Kafka node to connect to, and it has some further settings, of which the topic config might be the most interesting for you: it tells the consumer whether to use only the new values appearing on the stream, or to process the stream from the beginning. The default is to look only at new values, but especially in testing it's always very helpful to start at the beginning. We subscribe our consumer to the same topic, sales-input. The most important part: we poll the consumer, check that the poll worked and that we actually got something, and then just print the received message — it shows up as the JSON string we put in. That's great; this already works as we need it to. (Both snippets are reconstructed below.)
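Here is roughly what those two snippets look like, reconstructed with the confluent-kafka client; the broker address, group id, and topic name are assumptions:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# The sales example from the talk, serialized as JSON.
sale = {"location": "Rimini", "day": "Monday", "product": "ravioli", "quantity": 12}
producer.produce("sales-input", value=json.dumps(sale))
producer.flush()  # block until the message is actually delivered
```

And the consumer, configured via the topic config to start from the beginning of the stream:

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "sales-printer",
    # topic config: process the stream from the beginning, not only new values
    "default.topic.config": {"auto.offset.reset": "smallest"},
})
consumer.subscribe(["sales-input"])

msg = consumer.poll(timeout=10.0)
if msg is not None and not msg.error():
    print(json.loads(msg.value()))  # prints the dict we produced above
consumer.close()
```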
Now, we used JSON as the serialization format, and that's a good starting point — but when you are working with many teams, it's always good to have a defined schema. In your database, the schema was defined by the database, and that is good: everyone knows what is in there. For your streaming application you can also use something more rigid than plain JSON. What we decided on is Apache Avro. With Avro you define the schema first: you give the fields and their data types, and there are many, many possibilities for defining the schema. It also has a very good compact encoding — it doesn't just write out the JSON text — which is very good if you want to save space.

But what excited me most about Apache Avro is that it also defines schema evolution. This means you can enhance your schema — you can add new fields — and Avro defines the criteria for how you can enhance the schema compatibly. For example, for new fields you always have to give a default value, because that ensures processors at an older state can still use the data: if they retrieve data written by an older service, the default value is supplied, and if they read data written by a newer version of the service, the field is already filled with what that service put there. This is really great, and it solves one issue I promised to answer ten minutes ago: different teams wanting to enhance the schema. They can use different versions of the records in a compatible way, and you don't have to reprocess all entries in the stream, as you would have done with a database upgrade script.

What does that look like in Python? We have the schema defined here: it has a name, it says it's a record with many fields, and I give the types of those fields — we have strings, we have quantities, and there are many other types too. As I said, you can say whether a field is optional and whether it has a default value.

Using this, we now switch to a different producer and consumer: the AvroProducer and the AvroConsumer. We still give the Kafka node, but now we also have to give the schema registry URL. The schema registry is where every schema is registered and from which every service can retrieve all versions of it. The registry also checks that you only do compatible schema upgrades — which is great, because as soon as you try to register a new schema version that is not compatible, it raises an exception and says: this is wrong. So you know about it from an early point on. The other things are mostly the same: we give default schemas for the key and the value, we hand our data over directly — not encoded as JSON — and the AvroProducer uses the Avro schema to encode it and writes it to the stream. For the consumer, the main new thing is again the schema registry URL, so that it knows how to interpret what is on the stream; we poll as before and can just look at the value. These examples are mainly adapted from the confluent-kafka client documentation, so you can look there to dig in deeper and find more explanation. (A reconstruction follows below.)
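A reconstructed sketch of the Avro variant, using confluent-kafka's avro helpers; the schema fields, registry URL, and topic name are assumptions. Note the `day` field carries a default value — exactly what Avro's schema-evolution rules require of newly added fields:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer

value_schema = avro.loads("""
{
  "namespace": "example.sales", "type": "record", "name": "Sale",
  "fields": [
    {"name": "location", "type": "string"},
    {"name": "product",  "type": "string"},
    {"name": "quantity", "type": "int"},
    {"name": "day",      "type": "string", "default": "unknown"}
  ]
}
""")

producer = AvroProducer(
    {"bootstrap.servers": "localhost:9092",
     "schema.registry.url": "http://localhost:8081"},
    default_value_schema=value_schema,
)
# Hand over the plain dict; the AvroProducer encodes it using the schema.
producer.produce(topic="sales-input",
                 value={"location": "Rimini", "product": "ravioli",
                        "quantity": 12, "day": "Monday"})
producer.flush()

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "sales-reader",
    "schema.registry.url": "http://localhost:8081",  # how to decode the bytes
})
consumer.subscribe(["sales-input"])
msg = consumer.poll(10)
if msg is not None and not msg.error():
    print(msg.value())  # already decoded back into a dict
consumer.close()
```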
Okay, those are the basics of writing to and reading from a stream. What do we do with that? Let's have a look at data validation — as you remember, that's the second box we had. What do we want to do there? We have the sales input, and we need to check whether each record is correct or not. We separate the valid and invalid sales records, the same way Cinderella has to separate the good peas from the bad peas in the fairy tale; that's what we want to do within this service.

Very basically, we just poll new records. Say we have a function that checks whether a sales record is valid: whether the location is a valid one, whether the quantity is non-negative, all the things you can think of. If it's valid, you write it to a new stream, sales-validated. If not, you write it to a stream, sales-error, and let other processors handle that information — for example, how to tell the customer that they sent an invalid sales record. (A minimal sketch of such a validation service follows at the end of this passage.)

A very interesting thing about streaming is that you can add additional processors very easily. Say we want something new — some more monitoring, some reporting — and we write that monitoring or reporting output to a new stream. Or we have a new validation logic that we just want to try: we don't want to put it directly into production, but we want to write its results to a different topic to check whether it works fine. With the database-centric application, we would have had to remember the processing state centrally: for every record in our sales table we would have tracked whether it had been validated yet or not. A second validation service would have been really hard, because it would have to know "this has been validated by that service, but not by me", and we'd have to introduce new fields for that. Not so here: each service knows for itself how far it has processed, so they can work independently. That might not sound so interesting, but as soon as you have tried it, it really is fun to work with, because it makes things much easier — especially when you want to try out new things a lot.
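A minimal sketch of the validation service, assuming JSON records and the topic names used above; the validity checks are just the examples from the talk:

```python
import json
from confluent_kafka import Consumer, Producer

VALID_LOCATIONS = {"Rimini", "Bilbao", "Karlsruhe"}  # assumed master data

def is_valid(sale):
    # the kinds of checks mentioned above: known location, non-negative quantity
    return sale.get("location") in VALID_LOCATIONS and sale.get("quantity", -1) >= 0

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "sales-validator"})
consumer.subscribe(["sales-input"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    sale = json.loads(msg.value())
    # Cinderella-style routing: good and bad records go to separate streams.
    target = "sales-validated" if is_valid(sale) else "sales-error"
    producer.produce(target, value=json.dumps(sale))
    producer.poll(0)  # serve delivery callbacks without blocking
```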
So these are the basics of using streaming. Now let's come to the challenges. I said we do machine learning, and machine learning — especially the training — is not something you do in a streaming fashion. For real-time answers from a trained model it might be different, but for us, computing sales forecasts on a daily basis, even that happens in batch. So how can we make these batch-like processes work within our streaming setup? Our machine learning needs all the data for a training run in one batch — or maybe several, since it works in a partitioned way — but either way, that's a lot of data. How do we get that input data? Remember: you can't query the stream; you can't say "give me the current value for all keys". So somewhere we need to save that data.

We have several options. We can just keep it in the memory of our service. We can use a separate serving database — it doesn't need to be as powerful as the database of our monolith, so a smaller one would do. Or we can use a blob store, which is simply cheaper than a database, save the values there, and let the whole machine learning application use them. And yes, that is data duplication. It feels bad at first, especially if you come from a very normalized database schema, but that's the price we have to pay: we gain several advantages, and we have to live with the duplication.

What's the idea behind such duplication — how can you explain it? What I found very helpful here is to differentiate between a write path and a read path; for reasoning about whether the duplication makes sense, that's a very useful distinction. In our old, database-centric setup we had a relatively short write path: we put the validated data into the database, and the data validation was finished. The read path: when we start the machine learning, we run a machine learning query that has to fetch all of this data with one very big join statement and feed it into the machine learning. That machine learning query really puts the database under pressure, and you have to size your database so it can serve such a big query in a relatively short time frame.

Now compare this with how we would do it with streaming. The write path is now longer: we do the data validation, write the results to our topics, and as soon as new data appears on a topic, we can already do the joining — joining the new sales data with the location data, the product information, and maybe further data — and write the result to a blob store, where it sits and waits until the machine learning starts. What we have lost is our normalized schema, because the data is now duplicated. What we have gained is a big operational advantage: the write path can be scaled much better. As the data arrives, we write it out, and it sits there until the machine learning starts. And when the machine learning starts, it doesn't have to run that big join that puts the database under so much pressure; it can consume the data in exactly the format it needs.

What would such a thing look like? As I said, we have the location data, the products data, and the sales data. We can treat the locations and products as master data — that's not such a high data volume — so we can keep them in the service as a table and join that information onto the sales stream. The joined data we append to a file, and that file then sits there and waits until the machine learning starts. This is one possibility to cope with the challenge that the machine learning is still batch. (A sketch of such a joining processor follows.)
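A minimal sketch of that joining processor, assuming master data small enough to hold as in-memory tables and a JSON-lines file standing in for the blob store; all names and fields are illustrative:

```python
import json
from confluent_kafka import Consumer

# Master data kept as in-memory tables inside the processor (assumed content).
locations = {"Rimini": {"country": "IT"}, "Karlsruhe": {"country": "DE"}}
products = {"ravioli": {"category": "pasta"}, "pizza": {"category": "oven"}}

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "ml-input-writer"})
consumer.subscribe(["sales-validated"])

with open("ml_input.jsonl", "a") as out:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        sale = json.loads(msg.value())
        # The write path: join the master data onto each sale as it streams past...
        sale["location_info"] = locations.get(sale["location"])
        sale["product_info"] = products.get(sale["product"])
        # ...and append the denormalized record to a file that simply sits
        # and waits until the next machine learning training run consumes it.
        out.write(json.dumps(sale) + "\n")
```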
You might have noticed that when I explained this, I said the location and product data are kept in the processor. So that is state we hold in a processor — and state, as you may know, is the nightmare of every distributed systems engineer, because it's hard to handle. What state do we need? For pure streaming, the data could just rush through, but in our scenario we need to know the master data so we can join that information on: that's the data you want to join with. There are also subtler cases. When you do time-window processing — you want to aggregate an incoming stream and know the sum every five minutes — then you need the data of the last five minutes so you can sum it up correctly, and you need to know when to start a new aggregation. There are different ways of doing that — hopping time windows, sliding time windows — different possibilities with different requirements on the state. (A tiny windowing sketch follows at the end of this passage.) Formerly, the database did all this for you: you could ask it for master data, you could ask for the things from the last five minutes. Now you have to know that within your service.

Fine, you might think, we just keep it all in memory; we know what came in, everything's fine. But there are challenges. A processor might fail, and then it needs to restart, and all the state in its memory is lost — where does it get the data to restart from? Or, less dramatically: one thing we really want to use is scaling. At first we had one processor taking care of all locations, and now we want one processor per location — that also changes the state distribution. Or we had three processors, one per location, and want to merge them into one — then at least the location master data now also has to be merged into that one processor. How can we do that?

There are several options. We can keep everything in memory that we might possibly need in the future — that's one option, but not the best one. We can reprocess the stream from the beginning to warm up the state, which might take a long time. Each processor could keep its own database instance and save its state there, so it can be used at restart — you just need to know which database to connect to, which can also get interesting. You can save your condensed state in a stream, which is an appealing option because you already have the streaming platform. Or you can just ask a different service that hopefully knows all the master data: please give me everything I need to know.

Several frameworks already exist to cope with this problem, and if you start with your own very naive approach, it's very instructive to look at what such frameworks do and why — you can really learn from their experience, even if they are not in your language. For example, you can have a look at Kafka Streams or at Apache Samza. Up to yesterday, my impression was that there is nothing we can use in Python, and that we'd have to think about which learnings we could take from those other systems: do we have to write our own framework, does that make sense or not?
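Before the answer to that question, the tiny windowing sketch promised above: a five-minute sum kept as plain in-process state, using tumbling windows, the simplest member of the window family (hopping and sliding windows need correspondingly more state). Timestamps and field names are assumptions:

```python
from collections import defaultdict

WINDOW = 300  # five minutes, in seconds

def window_start(timestamp):
    """Align a Unix timestamp to the start of its five-minute window."""
    return timestamp - timestamp % WINDOW

# This is exactly the state the processor must now keep itself,
# where previously the database would have answered the query.
sums = defaultdict(int)

def on_sale(timestamp, quantity):
    sums[window_start(timestamp)] += quantity

on_sale(600, 3); on_sale(750, 4)  # both fall into the window starting at 600
on_sale(900, 5)                   # starts a new window
print(dict(sums))                 # {600: 7, 900: 5}
```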
We really were searching for an answer there, and I was very glad to be in the talk from Winton yesterday: they said they have now open-sourced Winton Kafka Streams, which is a Kafka Streams implementation in Python. I'm really eager to take a closer look. They also said it's at an early stage and there are still some topics they need to solve, but it's very great to have a starting point, to check whether we can contribute to it, and to grow this functionality in Python, which would be really great for building more applications on. You can find it on GitHub under the wintoncode organization.

Well, that's the end of my talk. Let's summarize what we have learned: you have more options for your data processing application than you might have thought, but you also know the trade-offs. I want to encourage you to broaden your way of thinking about your application, to see that there are more things you could use, and to check whether you can live with the trade-offs or not. And you know the challenges and some possible solutions. So go on and build some great applications on that. That's it from my side.

[Moderator] Okay, great — thank you very much for the great talk, that's awesome. We have almost ten minutes for questions, great timing. Any question for Christian on streaming? Otherwise I do have a question. Well, I'll start with mine. Very interesting technology; I've never used streams. How do you deal — or maybe the framework deals — with missing values? Say you have a network glitch somewhere, somebody trips on a network cable, and you're missing two or three values in your stream. Or do you care?

[Christian] What the framework guarantees is that within a stream partition you get all the values. If it cannot guarantee that, you will get an error — so this is something you really can rely on. What you do have to deal with is that some streams might operate at different speeds than others. One challenge for us — this was just too complex to bring into this short presentation — is this: say the customer delivers you location and product master data, and he delivers you sales data, and now the location data is delayed because there's some issue on the node that processes it. You are at a past offset for the location data, but at the newest offset for the sales data, and now you would raise an error saying a sales record is invalid, because that location doesn't exist — from your point of view. This will come back to you, because the customer will say: I already sent you that, why do you give me an error? This is something your application logic has to cope with. We will do it like this: we have one data input service, and it always assigns delivery IDs and processing timestamps when a delivery arrives. Your validation logic then needs to check that you have also processed the most recent master data records. That would be the solution for that issue. But for the stream partitions, you can always rely on getting all values in the correct order.

[Moderator] Okay, great, thank you. Any other question? We do have a few minutes, seven minutes more or less. I can keep asking questions, folks... but if you have any question — ah, a question over there, sorry, I didn't see you.

[Audience] Hi, thanks a lot for the great talk. It's really nice to see somebody else doing the same thing that we do in our company.
I have a question about recovering State in the case that your consumer crashes Did you try some sort of seeking back in the stream to to find the latest? Latest state that you could recover from Well, that's the thing you it's not seeking back in the stream, but You can always have a snapshot of your data so That you know that you have in in some data file You save what what master data did I receive for example? And this is used also when we do some to some Acquiries for that to do some data science in there and with that stream within that Storage you're also safe on what was the offset in the stream which corresponds to that So by that you always know from which point in time you have to reprocess your stream If you want to get the updates compared to such a blob storage So this would be the way well to to search in the stream for updates that are missing Thanks, I like this approach If I can just quickly share what we did we implemented time-based seek in In the stream so each one of our Avro messages just like you we use Avro Each of our Avro messages contains a timestamp So we implemented a sort of binary search in the in the stream so we can return to a particular point in time But it is I mean it works, but it's not very elegant. So I like your solution better Yeah, yeah, I mean it's hard with time-based things as we want to think about always in a distributed systems manner And then you have different processors that might have slightly different time things So what we want to do is we want to use that stream of the delivery IDs Which then is guaranteed to always be in the same order and if you if you reference these newly created Things then you can ensure that you are really in the correct sequence Cool very excellent. So any other question. I'm standing here so I can see you Okay No other questions Well, um, let me see a question here. I have a beginner question. How do you integrate the data? Different databases, right Excuse me, can you repeat the question? How do you integrate the data of those several databases you have? data paths and they are Saving the data in several databases, right or not Yeah, I mean for each stream processors You have for each stream processor you have different possibilities to save that in there And not everyone uses a database and not everyone needs to be queried. So we want that That each service They do not need to know about what technology the other services use. So the communication really is With the streams and how each the databases are more that each service can be restarted for that or Or that also other parts of application can say well, I need to know all Locations that you have in there, but then he queries that service for that. 
[Moderator] Cool, excellent. Any other question? I'm standing here so I can see you. Okay, no other questions — well, then let me ask a beginner question here. How do you integrate the data? The different databases, right?

[Christian] Excuse me, can you repeat the question?

[Moderator] How do you integrate the data of those several databases you have? The data paths are saving the data in several databases, right — or not?

[Christian] Yeah — for each stream processor you have different possibilities to save data, and not every one uses a database, and not every one needs to be queried. What we want is that no service needs to know what technology the other services use. The communication really happens via the streams; the databases are more so that each service can be restarted. And if another part of the application says "I need to know all the locations you have", then it queries that service for it. So it's not a question of database integration: all integration should work via the streaming platform, and the services should not need to know about each other's storage. The only place where we do need to know about it is, as I said, the exploratory data science we want to do on this — which is not connected to the streams but directly to the output data. There, the data scientist knows where the data lies. But within the application, communication should be via the streams.

[Moderator] Okay, we have time for one more question.

[Audience] Yeah, one question. Do you have any particular strategy to handle the migration — updating data that is already queued in the topics of your streaming infrastructure — if the format changes due to code changes or things like that?

[Christian] Let me see whether I understood the question correctly: if it's not just the schema evolution we have — if, say, we need additional fields in the query for our machine learning — is the question how to handle that?

[Audience] The question is: I suppose there is a service that puts the data into your queue, and the data is probably formatted or marshalled in some way. If you change that format in any way, how do you tackle that?

[Christian] For that you then really have to reprocess your data. If you have a different output format and what you have already written doesn't help you, then you need to reprocess that data.

[Moderator] Actually, I was looking for the next speaker, sorry. So we have no more time for questions, sorry — I'm looking for the next speaker; please identify yourself and come to the podium. And let's thank Christian again for the great talk and the great questions. Thank you all.

[Christian] Thank you.