Yeah, thanks a lot for coming, everyone, and special greetings to everyone watching this on the stream at the moment. Today we want to talk a little bit about our experiences and the story of how we've applied stream processing over the years and what we've learned, and towards the end we have an exciting announcement about how we think we can make stream processing with Python quite a lot better.

So, in short, who are we? We are two developers at Winton, which is a global investment management and data science company; you might have already talked to us at our stand downstairs. We've been in the game for quite a while now, about 20 years. Our core belief is that the scientific method can be applied profitably to financial markets, which means we run a lot of data processing on financial and non-financial data and try to make a profit out of that. As part of that we apply a lot of technology and data science, and this talk is really about one part of that: how we do real-time stream processing.

I'm interested, so quick hands up, please, if you've ever heard of Kafka or done some stream processing. Oh, very good. Okay, then I'll keep this part a bit shorter. Basically, stream processing means that instead of doing classic batch processing, where you go at the end of the day, collect all of your data for the day, write it into a big file and then run, for example, a huge Spark job on it, you work on your data as a stream. Data comes in throughout the entire day and you process it as it comes in, which is quite nice, because then you get real-time, interactive data.
You can put it on a nice dashboard, and often it also makes thinking about your processing quite a lot simpler, because you don't have those arbitrary interruptions where one batch job stops and the next one starts.

One example of such stream processing: in finance we've actually been doing stream processing for a very long time; it's just the natural thing to do. For example, if you buy and sell stocks at an exchange, you can subscribe to their market data feed, and every time someone buys or sells something, you get a little message that tells you, for example, that at 10:15 someone bought 10 shares of Apple at a price of 144 dollars. You really get this flowing in; you get a massive number of messages every second. Then you do various things with it. One of the things we do, for example, is show it in various dashboards, and it feeds into various trading algorithms. Those don't necessarily care about every single trade that happens; like in the question for the previous talk, often we want a bit of an aggregate. For example, we want to know every minute how much Apple stock has been traded in total, and at what average price. That's something one of our stream processors would do: we take an input stream of high-frequency events, run a binning process, and group it down into slower streaming events that are closer to what we really care about. Then our downstream consumers don't need to do much work if they're not interested in all the details. And over time we've found that the streaming model really works well.
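The binning idea just described can be sketched in a few lines of plain Python. This is only an illustration of the aggregation, not our actual processor; the trade tuple format and the function name are made up for the example:

```python
from collections import defaultdict

def bin_trades(trades):
    """Group (minute, symbol, quantity, price) trades into per-minute bins,
    returning total volume and the volume-weighted average price per bin."""
    bins = defaultdict(lambda: [0, 0.0])  # (minute, symbol) -> [volume, notional]
    for minute, symbol, qty, price in trades:
        bins[(minute, symbol)][0] += qty
        bins[(minute, symbol)][1] += qty * price
    return {key: (vol, notional / vol) for key, (vol, notional) in bins.items()}

trades = [
    (1015, "AAPL", 10, 144.0),
    (1015, "AAPL", 30, 145.0),
    (1016, "AAPL", 20, 144.5),
]
print(bin_trades(trades)[(1015, "AAPL")])  # (40, 144.75)
```

A real stream processor would do the same arithmetic incrementally as each message arrives, emitting one aggregate message per minute instead of returning a dict.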
So now it's expanded from just handling market data, which is the core financial data of which stock has been traded, to alternative data, which means we run some web scraping, we look at news, we look at various other not-really-core financial data. We've nowadays got an entire office in San Francisco focusing on this. The other thing we found really useful is handling normal business events: something happens during the day and you want to make sure that, once it happens, this other job runs, or some monitoring page gets updated. There we've found the combination of stream processing, where we would send, for example, a "this job has run" message on an event bus, together with a more traditional database, really useful. You write your normal data into the database, and then you write a little notification onto your event stream, and that can tell various applications to do the right thing. For example, we run monitoring on it, we do our risk management based off those events, we make our trading decisions, and there are various other things.

What I really want to focus on today is the thing in the middle, where we get various raw events in and then process them, via various transformations, into the format we really care about in the end. If you're talking about stream processing nowadays, what you usually mean is Apache Kafka, the project which has won the before-mentioned streaming wars. Apache Kafka is a message broker, which means it's a little bit like a database: you install it on a server, and it provides you with things called topics. You write data that belongs together onto a single topic from a producer, and then various consumers can pick that data up and do things with it. One important detail is that topics are subdivided into partitions, and that gives you parallelism. For example, if you have trades, you would send some of your trades to partition one
and some to partition two, and then you can run multiple consumers; each consumer gets a partition, and they can share the work between them. The other nice thing about Kafka is that it has a really nice underlying model of how it deals with data: within a partition, every message is ordered, and you get messages out in exactly the same order. It's really quite beautiful and elegant how it's implemented.

Because it's so popular, it has huge support across the big data ecosystem. We heard earlier about systems that can interact with Kafka, and there are a bunch of others that can too. A lot of projects have sprung up in the last couple of years that try to make it easier for you to do stream processing on those events. We've looked at pretty much all of them, and we weren't really too happy with them, basically for two main reasons. If you're interested in the details, come talk to me later; I can talk about this for hours. But the two core reasons we had issues with those projects are these. First, they're usually pretty heavyweight: if you want to use Spark Streaming to do your stream processing, you need to run and maintain an entire Spark cluster, and then your code runs inside that cluster. There's quite a bit of overhead in that: installing it, setting it up, configuring it correctly, and then doing things like scaling up. Second, pretty much all of those projects are written for the JVM, which means Python is a bit of a second-class citizen; it doesn't really fit in, or you talk to the actual system via text or JSON input/output, which isn't great for performance.

There is one nice outlier. Pretty much one year ago, Confluent, the creators of Apache Kafka, introduced a new API for Kafka that does very similar things to these other stream processing frameworks, but does it as just a simple library.
That turns around the relationship between your code and the framework. With the others, you submit your code: you package it up and it runs somewhere inside, say, the Storm server. With Kafka Streams, it's just a library, which means the master of everything is still your normal process. You can run it any way you want; you can run it on AWS, and if you auto-scale it, you just scale it in the completely normal way, because it's just a normal process. It's also got proper, real event-by-event processing, which means that, unlike Spark Streaming for example, you don't need to batch events up, so you can get pretty good latency. And it helps you a lot with all those really tricky parts of stream processing that get difficult: stateful processing, for example for the binning, joining different streams together, aggregations. Fault tolerance and distributed processing are solved quite nicely too, and it's nowadays part of the main Apache Kafka project. But the big drawback for us is that, so far, it's Java-only, and it's not that easy to write bindings to that Java interface.

That's a bit of a problem for us, because at Winton we are big Python users. We are around 450 people in total, and at least half of them know some Python. They're not all developers; we have lots of data scientists, researchers, and operations people. If you want to know a little bit more about our Python usage, we had a talk last year by some of our colleagues that showed a little of what we do with Python. What we really want is to give all those people who know Python the ability to do stream processing nicely and easily.

What we've done so far is a kind of hipster programming, which means every time you do it, you roll your own: you just grab a Kafka client.
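The hand-rolled pattern is essentially this: connect, subscribe, then handle a callback per message. In this sketch an in-memory list stands in for the broker so the example runs without a cluster; with a real client such as kafka-python you would swap `FakeConsumer` for a `KafkaConsumer` connected to your cluster:

```python
class FakeConsumer:
    """Stands in for a Kafka consumer; iterates over canned messages."""
    def __init__(self, messages):
        self._messages = messages

    def __iter__(self):
        return iter(self._messages)

def run(consumer, on_message):
    """The whole 'framework': invoke one callback per consumed message."""
    for msg in consumer:
        on_message(msg)

seen = []
consumer = FakeConsumer([b'{"sym": "AAPL", "qty": 10}',
                         b'{"sym": "AAPL", "qty": 5}'])
run(consumer, seen.append)
print(len(seen))  # 2
```

Everything beyond this loop (state, rebalancing, exactly-once handling) is what you end up rebuilding each time, which is the pain the talk is describing.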
There are various good ones, and you connect to your cluster, and in four lines you can connect, subscribe to messages, and get a callback for every single message. To get started, this is actually a really good approach, and we've been running quite a lot just like this: various processes that update monitoring and so on. If you want to go down that route, there are two clients I would recommend. One is kafka-python, written by Dana Powers (thanks a lot for writing that), which is a native Python client: all of the protocol interactions are written in Python, and it has an especially nice, Pythonic interface. It feels very natural to use, and it's really robust; it handles your cluster scaling correctly, for example. The other recommendation is the confluent-kafka client, which is a wrapper around the C library librdkafka, which means it's really high performance: we've pumped something like a million messages per second through it pretty easily. They're both good. And since we had Parse.ly speaking earlier (thanks to them as well), they've also built a Python client, pykafka, but I haven't used it that much, so I'll just mention it.

So what we've done is write our own streaming framework quite a few times. Every time, you start with those ten lines, or four lines maybe, you build on it, and then you run it in production. In your demo setup it all looks fine,
It all looks fine and You let it lose and things slightly break or are slightly wrong because streaming is actually really really hard For example, if you want to do the spinning from earlier Where you say you group some traits together, then you do an aggregation over them and calculate some average or total Really need to be careful that you handle every message exactly once because if you double count them your aggregation is just wrong and That's really something that people have actually struggled with Stateful processing is difficult if you want to do joins or if you want to cache like the last bits You kind of run into the same problems that If you keep a local database As a lookup you need to be really really careful that you don't apply things twice and That's hard then distributing load is quite different can be quite difficult Because while the client supported usually what you have to do is you need to handle certain callbacks that get triggered When the rebalancing happens correctly and that can be quite tricky It's fine if you've been doing it for a while But if you really just a data scientists who wants to write some algorithm and some code and run that It can be quite difficult and then if you want performance out of your streaming You usually want to do micro batching which means that you buffer up. 
You don't do, say, a database write on every single message; you group messages together, and after a while, for example every second, you process all the messages you got as one chunk, and that gives you quite a lot of speed-up.

Because we want all of that, we've started an open source project under our new open source organisation, to bring Kafka Streams, which we think has solved all those problems really nicely for Java, to Python. We're doing a re-implementation of it in pure, Pythonic Python, which means you don't have any dependencies over and above a simple Kafka server. My colleague Rob will now show you a little demo of how it looks and what it can do.

Hi, I'm very excited and a little nervous to do a live demo. So I'm going to start up all the components required for the application, starting with ZooKeeper. Kafka uses ZooKeeper in the background for cluster management: for storing the topics, for leadership elections, for example. So let's start up our ZooKeeper, then start a Kafka server. I'm going to start just one instance; in production you really would want, say, three instances, three replicas, for extra resilience and so on. So let's go ahead and start up Kafka. That's now running. I'm going to start up a consumer; this is just the default console consumer that comes with Kafka. It'll take whatever is in the topic bin prices. Sorry, can you see this? Is it big enough? Great. So it will write out whatever it finds in bin prices to the console. There's currently nothing in there; we're not running our application just yet. So now let's start up the application. There it is, it's started, and we can see that it's listening to the prices topic; I think it was the fourth or fifth line down that says it's listening to the prices topic.
It will write out to the bin prices topic. We've currently got no data in the prices topic, though. I'm going to start one more consumer first; that's here in my Jupyter notebook. So let's do the imports, set up Bokeh, set up our consumer (this is very similar to the lines of code you just saw Andreas show you), and our plot. So there's our plot, ready. We've got one more thing to do, and that's to start up this loop down here, which will loop on the consumer to read in the prices from the input topic. So let's start it. Now, again, like the previous consumer, there's no output yet, because there are no inputs. So one last thing to do: let's get some inputs.

Right, we've got this generator script, and the generator script is going to generate some prices: just random prices, with normally distributed returns. So let's start generating some prices. Go back to our application: we can see it's already begun to process those. Go to this consumer here: we see there's nothing there yet. There's nothing there because we're binning the prices, and our definition of a bin is when we reach the end of a minute, or for the purpose of the example, in fact, two minutes. We will push the last two values, the last two minutes, out to the output topic. So let's have a look inside Chrome at our plot. We can see that our input prices are being read in; our bin prices will appear as these red dots here. So once we reach two minutes (I have sped this up; you can run the generator in approximately real time), one moment more and we should see the dots. There we go.
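The generator script can be imagined roughly like this. This is a sketch, not the actual demo code; the function name and parameters are made up. It produces a random-walk price series whose step-to-step returns are normally distributed:

```python
import random

def generate_prices(start=100.0, n=10, vol=0.001, seed=42):
    """Yield a price path with normally distributed returns,
    similar in spirit to the demo's generator script."""
    rng = random.Random(seed)
    price = start
    for _ in range(n):
        price *= 1.0 + rng.gauss(0.0, vol)  # apply one random return
        yield price

prices = list(generate_prices())
print(len(prices))  # 10
```

In the real demo, each generated price would be published to the prices topic, where the binning application and the notebook consumer both pick it up.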
So, as I say, if we were running similar code in production, we would obviously be interested in pushing out the bin price every minute; for the purpose of the example here, we've just waited for two prices to be available before pushing them out. This will just carry on plotting; we'll leave it going there and move back here. You can see that the console consumer here has also found the two prices. So these are two perfectly independent applications; neither knows about the other, but both are consuming the same data.

So what does the code behind this look like? Here we go. The first thing you want to do is write a topology. Our topology comprises at least one source and one sink: the source is your input topic, prices; our sink is the output topic, bin prices. And you will want at least one processor in between. The processor is where the meat of your calculation really takes place, so here the binning class is the processor. We've got some initialization, fairly standard stuff, and we have this process method. The process method is called for every value that is read in from the input topic, and we convert to the output when we find two values. You can see down here that when our store contains two outputs, so we have, say, transitioned into the third minute, we call punctuate, and punctuate simply forwards the bin values out to the output topic. So that's the flow from input to output, and what we saw in the Bokeh plot in Chrome.

So we've seen the application and we've seen some code. What comes next? The first thing, and most importantly, really, is to finish our implementation. We're at a very early stage, but we absolutely encourage contributions.
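The shape of the binning topology just described, a source feeding a processor whose process() runs per message and whose punctuate() flushes the finished bin to a sink, can be sketched in plain Python. This is an illustration of the concept only, not the actual Winton Kafka Streams API, and for simplicity the bin boundary here is a message count rather than the end of a minute:

```python
class BinProcessor:
    """Collects values and forwards each completed bin downstream."""
    def __init__(self, bin_size, forward):
        self.bin_size = bin_size   # how many values make one bin
        self.store = []            # local state: values in the current bin
        self.forward = forward     # the sink: called with each finished bin

    def process(self, value):
        """Called once for every value read from the input topic."""
        self.store.append(value)
        if len(self.store) >= self.bin_size:
            self.punctuate()

    def punctuate(self):
        """Flush the current bin to the output topic."""
        self.forward(list(self.store))
        self.store.clear()

out = []                                # stands in for the bin prices sink
proc = BinProcessor(bin_size=2, forward=out.append)
for v in [101.0, 102.0, 103.0]:         # stands in for the prices source
    proc.process(v)
print(out)  # [[101.0, 102.0]]
```

The third value stays in the processor's store until the next bin completes, which mirrors what the demo showed: nothing reaches the output topic until a bin boundary is crossed.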
We'd love you to check out our code, star it on GitHub, file some issues, write tests, write documentation, whatever you like: just get involved. We'd really like to experiment with a more Pythonic API, in particular for the DSL. The DSL is the domain-specific language that sits on top of the Streams API in Kafka, and we feel there is a very good opportunity to leverage Python's strengths to make it even easier to set up a topology using the domain-specific language. Then, optimize: Python is a great language, we all love it, we wouldn't be here if we didn't, but we're under no illusions; it can hit limits, and if and when we hit them, we will work on optimizing the performance further. This could include batching, it could include using some of the great Python libraries out there, NumPy and pandas, or even leveraging new projects like Apache Arrow. Lastly, well, lastly on this slide, but really the roadmap is longer: there are many more advanced features of Kafka; the 0.11 release recently introduced exactly-once semantics, for example. So yes, as I said before and I'll repeat again: get in touch. There's the GitHub page; I can't seem to click on it, but it is there. Get involved, check it out. Are there any questions?

Hello, thank you very much, very interesting. I've waited for this for half a year now, so thank you. I'd check it out just now, but I haven't found it; if you Google for it, it doesn't show up.

So we have to do some search engine optimization; it only went out just today.

Okay, but I already tweeted it.

Excellent.
Thank you. Questions?

Hi. In the last presentation we saw Apache Storm and how supposedly nice it is, with tracing the workflows and sink errors and things like that. Can you have that easily, out of the box, with Kafka Streams and your library?

It's a bit more lightweight, so parts of that you can have; Confluent, especially, is I think happy to sell you an entire suite of tools. On the other side, because it's just a library, it's pretty lightweight and doesn't do things like spinning up your processes, which means that if you already have existing infrastructure to do all of that (for example, we run a lot of our code on Kubernetes nowadays), you can just run it on there, and it's integrated with what you already have.

Thanks. Another one: thanks for the talk. Have you tried some kind of itertools algebra for the stream processing part, in particular?

Not yet, but I would love to talk to you about how you think that could work, because this is exactly something we would really like to make nice and Pythonic, so please come to our booth later. I would say that if you look at the current code base, it follows the Java code base reasonably closely, which we felt was a reasonable way to get started and to learn from the Java implementation. But yes, it would be fantastic to introduce new code where we can leverage, as I said before, Python's strengths.

One comment: if someone is asking themselves whether they should use streaming at all, there's a talk from him tomorrow on why you should use streaming. Yeah, thank you.

It's really great to see the project you've started, because I would also have said that this is missing in Python at the moment, so I'm very glad to try it out. I've also looked at Kafka Streams, and one thing I found particularly interesting is the scaling approach and the state management approach. Is that already included in your approach, or is that on the roadmap?
The scaling we do have working, in a simple fashion: basically, you can run two different Python processes and they will share the load between them and balance it. The state processing is a bit further out; basically, that's what we're working on next. It's really tricky to implement, which is why we want to put it in this library, but it's coming.

I think we're already over, no?

So, first, it's good that you're making this open source. I'm not familiar with Kafka Streams, but I'm a "hipster", so I kind of feel the pain you were describing. My question is: I've tried to combine RxPY, the reactive extensions for Python, with Kafka, and it sort of works, but you have to write a little code. I'm not familiar with how Kafka Streams works; is it similar to that concept of having operations on the stream, where the events come in through Kafka, with windowing and things like that?

Yes, exactly; it's basically exactly that. It's the same: you get your stream input object, and then you can map a function onto it, or you can group by, and you can split it up.

We have time for maybe two more questions.

Do you use some binary protocol to get events from Kafka? Is it performant, is it fast?

So, to talk to Kafka: basically, this sits on top of an existing Kafka client and takes away all the boilerplate you would otherwise have to write. Beneath it, at the moment, we're using the confluent-kafka client, the second one I mentioned, and that talks the normal binary Kafka protocol through a highly optimized C implementation. It's really, really fast. Where I think the current bottleneck is, is what goes from the C
implementation, where you get one callback for every single message, into Python code. Basically, what we've seen is that we really need to include some batching there, so that, for example, you get a NumPy array of all your events back, because if you do Python processing of a million records per second, in the profiler really every single "if" shows up, and it gets a bit hard to optimize. But in short: yes, we're using a really optimized Kafka client underneath, and it talks a binary protocol.

Okay, one last question.

Thank you. We recently implemented the exact same solution at my company with Kinesis, which is very similar to Kafka, and I was wondering if you've found any way to test your applications using this library without needing a full Kafka installation in your local development environment.

Okay, so I think this library makes it a bit easier, because you aren't so low-level: if you write a processor, you have a really nicely defined Python interface, and basically you can test against that. But I feel your pain when you stitch more of them together; at some point we've mostly gone for integration tests, because there are so many small things that are really hard to unit test correctly.

I'd echo Andreas's comments. I would just say, though, that the code I showed for the binning topology was the full application, so it should be, not necessarily easy, but not very difficult, to abstract that in a way that makes it more amenable to testing, because that is everything.

Okay, so thank you very much for your talk and for open-sourcing this thing.