Welcome back, everyone. We're at the Flink Forward conference, the user conference for the Flink community, started and sponsored by Data Artisans. We're at the Kabuki Hotel in San Francisco, and we have with us another special guest, Jamie Grier, who's Director of Applications Engineering at Data Artisans. Jamie, welcome. Thanks. So we've seen an incredible pace of innovation in the Apache open-source community, and as soon as one technology achieves mainstream acceptance it sort of gets blown away by another one, like MapReduce and Spark. Yeah. And there's an energy building around Flink. Help us understand where it fits, relative not necessarily to things that it's replacing so much as things that it's complementing. Sure. What Flink really is, is a true stream processor, a stateful stream processor. The reason I say it's a true stream processor is that the computation model, the way the engine works, the semantics of the whole thing, are a continuous programming model: you consume events one at a time, you can update whatever data structures you want, which Flink manages fault-tolerantly and at scale, and you can do flexible things with time, scheduling processing to happen at different times, when certain amounts of data are complete, and so on. A lot of stream processing in the past has been oriented toward analytics alone; that's been the sweet spot. Flink, as a technology, enables you to build much more complex event- and time-driven applications in a much more flexible way. Okay, so let me unpack that a bit. Sure. So what we've seen in the Hadoop community for the last however-many years was really an analytic data pipeline: put the data into a data lake, and the handoffs between the services made it a batch process. We tried to start adding data science and machine learning to it.
It remained pretty much a batch process because it's in the data lake. Yeah. And then when we started to experiment with stream processors, their building blocks were all around analytics, and so they were basically an analytic pipeline. If I'm understanding you, you handle not just the analytics but the update-oriented, the CRUD-oriented operations: create, read, update, delete. Yeah, exactly. That you would expect from having a database as part of an application platform. Yeah, that's all true, but it goes beyond that. Flink, as a stateful stream processor, has in a sense a micro, simple database as part of the stream processor. So yes, you can update that state, like you said, the CRUD operations on that state, but it's more than that: you can build any kind of logic at all that you can think of that's driven by consuming events. Consuming events, doing calculations, and emitting events. Analytics is very easily built on top of something as powerful as that, but if you drop down below these higher-level analytics APIs, you truly can build anything you want that consumes events, updates state, and emits events, especially when there's a time dimension to these things. Sometimes you consume some event and it means that at some future time you want to schedule some processing to happen. These basic primitives really allow you to build, I tell people all the time, Flink allows you to do this consuming of events and updating of data structures of your own choosing, fault-tolerantly and at scale. Build whatever you want out of that, and what people are building are things that are truly not really expressible as an analytics job. It's more just building applications. Okay, so let me drill down on that. Sure.
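The basic primitives Jamie describes, consume an event, update keyed state, emit events downstream, and schedule processing for a future time, can be sketched as a toy model in plain Python. This is not Flink's actual API; the class and method names below are invented for illustration, and a real Flink job would express the same shape with a `KeyedProcessFunction`, fault-tolerant state backends, and timer services.

```python
class ToyProcessFunction:
    """Toy model of an event-driven operator: keyed state plus timers."""

    def __init__(self):
        self.state = {}   # keyed state a real engine would manage fault-tolerantly
        self.timers = []  # (fire_time, key) pairs scheduled for future processing

    def on_event(self, key, event, now):
        """Consume one event: update state, schedule a timer, emit downstream."""
        count = self.state.get(key, 0) + 1
        self.state[key] = count
        self.timers.append((now + 1, key))  # ask to be called back later
        return [("seen", key, count)]       # events emitted downstream

    def on_timer(self, key, now):
        """Called when a previously scheduled time is reached."""
        return [("timer-fired", key, self.state.get(key, 0))]
```

A driver loop would feed events into `on_event` and invoke `on_timer` as the clock passes each scheduled time; higher-level analytics APIs (windows, aggregations) are built on top of exactly these primitives.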
Let's take an example app, whether it's, well, I'll let you pick it, but one where you have to assume that you can update state and that you can do analytics, and that they're both in the same app, which is what we've come to expect from traditional apps, although those have their shared state in a database outside the application. So a good example is that I just got done doing a demo, literally just before this, and it's a trading application. You build a trading engine: it's consuming position information from upstream systems, and it's consuming quotes. Quotes are all the bids and all the offers to buy stock at a given price. We have our own positions we're holding within the firm, if we're a bank, and those positions are the state we're talking about: I own a million shares of Apple, I own this many shares of Google, this is the price I paid, et cetera. Then we have some series of complex rules that say, hey, I've been holding this position for a certain period of time, I've been holding it for a day now, so now I want to trade out of this position more aggressively, and I do that by modifying my state, driven by time: more time has gone past, so I'm going to lower my ask price. Now trades are streaming into the system as well, and I'm trying to make trades more aggressively by lowering the price I'm willing to trade for. So these things are all just event-driven applications: the state is your positions in the market, and the time dimension is exactly that. As you've been holding a position longer, you start to change your price, or change your trading strategy, in order to liquidate a little bit more aggressively.
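The time-driven rule in that trading example, lower your ask the longer you've held the position, might look something like the sketch below. This is plain Python with invented names and made-up decay numbers, just to make the logic concrete; in a streaming job the function would be re-evaluated from a timer callback as the holding time grows, with the result emitted as a new quote.

```python
def adjusted_ask(entry_price, hours_held, initial_markup=0.01, decay_per_hour=0.001):
    """Lower the ask price the longer a position has been held, so the
    position gets liquidated more aggressively over time.  The markup
    starts at 1% over cost and decays toward zero (numbers are made up)."""
    markup = max(initial_markup - decay_per_hour * hours_held, 0.0)
    return round(entry_price * (1 + markup), 4)
```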
None of that is, and you're using analytics along the way, but none of that is just what you'd think of as typical analytics or an analytics API. You need an API that allows you to build those sorts of flexible, event-driven things. And the persistence part, maybe the transactional part, is: I need to make a decision, as a human or a machine, and record that decision. Yes. And so that's why there's benefit to having the analytics and the database, or whatever term we give it, co-located in the same platform. Yeah, there's a bunch of reasons why that's good; that's one of them. Another reason is that when you do things at high scale with high throughput, say in that trading system, where we're consuming the entire options chain's worth of all the bids and asks, it's a load of data, so you want to use a bunch of machines, but you don't want to have to look up your state in some database for every single message. Instead you can shard the input stream, both input streams, by the same key, and you end up doing all of your lookup- and join-type operations locally, on one machine. So at high scale it's a huge performance benefit. It also allows you to manage that state consistently with the input streams. If you have the data in an external database, and a node fails, and you need to back up in the input stream a little bit and replay a little bit of the data, you also have to be able to back up your state to a point consistent with all of the inputs. If you don't manage that state yourself, you cannot do it. That's one of the core reasons why stream processors need to have state: so they can provide strong guarantees about correctness. What about some of the other popular stream processors? When they chose perhaps not to manage state to the same integrated degree, what was their thinking, in terms of the trade-off
they made? It was hard. I've worked on previous streaming systems in the past, for a long time actually, and managing all this state in a consistent way is difficult. So the early-generation systems didn't do it, for exactly that reason: let's just put it in the database. But the problem with that is exactly what I just mentioned. In stream processing we tend to talk about exactly-once and at-least-once guarantees, and this is actually the source of the problem. If a database is storing your state, you can't really provide these at-least-once or exactly-once type guarantees, because when you replay some data and back up in the input, you also have to back up the state, and that's not a database operation that's normally available. Okay. So when you manage the state yourself, in the stream processor, you can manage the input and the state consistently. Okay. So you can get exactly-once semantics in the face of failure. And what do you trade? What do you give up in not having a shared database that has, you know, 40 years of maturity and scalability behind it, versus having these sort of micro databases distributed around? You give up a robust external query interface, for one thing. You give up some things you don't need, like the ability to have multiple writers, and transactions, and all that stuff. You don't need any of that, because in a stream processor, for any given key there's always one writer, so you've got a much simpler type of database to support. What else? Those are the main things you really give up. But I would also like to draw a distinction here between state and storage. Databases still obviously matter; Flink state is not storage, not long-term storage. It's there to hold the data that's currently in flight and mutable, until it's no longer being mutated, and then
the best practice would be to emit that as some sort of event, or as a sink into a database, and store it there for the long term. So it's good to start thinking about the difference between what is state and what is storage. Does that make sense? I think so. So think of it like distributed counting, which is an analytics thing: you're counting by key, and the count per key is your state until that window closes and it's not going to be mutated anymore; then you write it to the database. Right, yeah. But that internal, in-flight state is what you need to manage in the stream processor. Okay, so it's not a total replacement for a database. Yeah. But this opens up another thread that I don't think we've heard enough of. Okay, Jamie, we're going to pause it here. Okay. I hope to pick this thread up with you again. Okay. The big surprise from the last two interviews, really, is that Flink is not just about being able to do low-latency, per-event processing. It's a new way of thinking about applications, beyond the traditional stream processors, where it manages state, data that you want to keep, that's not just transient, and it becomes a new way of building microservices. So exactly on that note, we're going to sign off from the Data Artisans user conference, Flink Forward. We're here in San Francisco, on the ground at the Kabuki Hotel.
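The distributed-counting example above, where the per-key count is state until its window closes and only then becomes storage, can be illustrated with a toy sketch. This is plain Python with invented names, and it assumes events arrive in timestamp order; real Flink handles out-of-order events more generally with watermarks.

```python
def windowed_counts(events, window_size):
    """Count events per key per time window.  In-flight counts are *state*;
    a window's counts become *storage* only once the window closes."""
    state = {}             # (window, key) -> count, still mutable
    storage = []           # closed-window results, as if written to a database
    current_window = None
    for timestamp, key in events:              # assumes timestamp order
        window = timestamp // window_size
        if current_window is not None and window != current_window:
            # the window closed: emit its counts downstream, then drop them
            for (w, k), count in sorted(state.items()):
                storage.append((w, k, count))
            state = {}
        current_window = window
        state[(window, key)] = state.get((window, key), 0) + 1
    return state, storage
```

Only `storage` ever needs to reach a database; `state` lives inside the stream processor, which is why it must be checkpointed consistently with the input rather than persisted externally.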