Well, welcome everyone. I know it's almost beer o'clock on a Friday, so I'm grateful that you showed up. Welcome to the North Pole. I put a little thermometer on the board here to remind you how cold it is. But at least you won't fall asleep, I guess, or maybe it'll be hypothermia anyway. So my name is Dean Wampler. I have this pretentious made-up title at Lightbend, VP of Fast Data Engineering. Basically, I lead the development team that's building a distribution of streaming technologies on top of DC/OS: Spark, Kafka, Mesos (well, Mesos is underneath, obviously), Flink eventually, Akka Streams, and so forth. I'll talk a little bit about the pieces in this. I know you don't want a marketing talk, and that's not what this is about. But having played with these frameworks for a while on Mesos, and having actually made the choice of Mesos over the alternatives, that's really what this talk is about: what we've learned from that. By the way, if you look for the slides on the website, I forgot to check whether you can even get them yet, but I uploaded them like an hour ago, so they should be up there if you want them. And there are some general themes I want to hit on. I'm not going to do this in order, but some takeaways: first, talk a little bit about how applications and application architectures are changing. Why are they different today versus, say, five years ago, or pick some time frame, and how has that influenced everything else that I'm going to talk about? Then, why we picked Mesos, and some of the particular things about it that have proven really valuable for our needs and obviously for our customers' needs. And then a few specific insights. It won't be a highly technical talk by any means, but hopefully you'll get some useful stuff out of it. All right, so first, why do we care? Well, I'm leading the team that's developing this product, which we're calling the Fast Data Platform.
It's in beta now, and it'll come out this fall. Just two slides about it. So this is sort of a block diagram of the pieces. Well, I don't know how useful this actually is; let me just show you the interesting bits, which is what's under the little blocks. Mostly we're going to talk about the middle-right top, which is the streaming engines, and then also Kafka, and what we've learned from using those. But also, I'll get into how it turns out we're seeing a lot of need to build microservices that integrate with these other tools, much more so than maybe in a classic Hadoop environment. And then there are the other pieces you need, like management and monitoring tools, machine learning, and all that, deployed on DC/OS, so you have a commercially supported Mesos distribution. The bottom, which is a little small to read, is "bring your own storage," although we are actually supporting HDFS ourselves in this case. Okay, so enough architecture. Well, the first question that came up, say, three years ago when we really started thinking about what we should do in big data was: should we use Hadoop? So let me step back a second and provide a little context. I'd been working in the Scala community for a long time, and actually knew everybody at Lightbend, called Typesafe at the time, when they got started. But I wasn't actually that interested in working on microservices, which had been kind of the focus. Lightbend was founded by the creator of the Scala programming language and the guy who created Akka. You've probably heard about Akka because it's the A in SMACK. Actually, there was a really good talk I just went to by a couple of guys from Adobe who've been using OpenWhisk, the serverless framework, which is written in Akka. So if you had never seen it before, you got a little taste there of what Akka is about.
And I'll briefly talk a little more about it as we go along. But anyway, three or so years ago, it turned out Scala was starting to get very interesting in the big data world, because Spark and Kafka are written in Scala. So people started talking to Lightbend about it, because those tools were kind of driving them to use Scala. And then I got involved with Lightbend to think about, well, how do we take what we've learned about microservice development, Scala programming, and functional programming, and apply that to the big data world? If you know anything about me, I have a lot of talks out there on YouTube and Twitter, and I have a few where I rant and rave about how the Hadoop world is screwed up because it's not functional, we're using crappy languages, and all this. But not to get too much into that. So fast forward a little bit. A couple of years ago, we sort of crystallized this idea that the world is moving toward being more stream-oriented versus batch-oriented. Not that batch is going away, but people need answers faster. It's the usual thing that time is money: if I can extract value from data more quickly, that's better. So that's been one of the driving forces behind streaming apps. The other one is just the pragmatics of how do I serve mobile? How do I serve, say, map applications and all that stuff? You can't do that in batch, right? Well, at least not completely. You might do some data processing offline to have the data ready for real-time serving. But as we started thinking about how to support customers doing stream processing at scale, it became clear that Hadoop just wasn't really cut out for it. This is sort of my schematic diagram of a Hadoop cluster. Really, there are only three important pieces here. There's some sort of distributed storage, which is HDFS, the Hadoop file system.
There's some sort of compute, which used to be MapReduce and now is mostly Spark. And then there's this thing called YARN, Yet Another Resource Negotiator, which is roughly the analog of Mesos. You say, "I want to run this job, and the data's over here," and it figures out how to partition that into tasks, which are basically JVM processes, and run them over the cluster, over all this data. And then there's a bunch of support stuff that people put in. If you buy a Hadoop distribution today, you can get some streaming technologies, like Storm, Kafka, Spark Streaming, and so forth. But I want to make the case that that's not really good enough for what we need today, and that's one of the reasons Mesos is much better positioned for this. So what's not to like? Well, three things: the limitations in YARN, the batch orientation, and this question of what happens if I need to write other services. I'm just going to use "microservices," that buzzword, to talk about the other services I might need to write in my environment. So let's talk about YARN first. And, to be fair to the Hadoop community, a lot of these issues they are trying to fix to some level, but it's relatively old software, so it can be difficult to bring it up to what we have today in Mesos. YARN was always designed with this idea that you're going to submit a job to it, and it's going to partition the job into tasks based on what resources are available, and run them, and they're going to have some finite duration. Because it's assuming a batch job, that means I kind of know in advance how much data I'm going to be working with, so I know roughly how many tasks to schedule. Why is that important? Well, because in a streaming world, I don't know how much data I'm going to be processing.
Super Bowls happen, Justin Bieber tweets, these kinds of things happen, and suddenly you get these big spikes in traffic; you get network partitions if you run long enough, and so forth. So it's not terribly good if all I understand is batch jobs. In fact, YARN is so limited it can't even run the daemons for HDFS, like the NameNode and DataNode. And this is kind of a theme: if you're running Kafka, if you're running Storm, if you're running HDFS, you're not really going to be using Hadoop's resource management. (When did that happen? There it goes.) You're going to be kind of hardwiring these things a little bit on your own, and then only getting the resource negotiation that we're used to, that we really love about Mesos, when you run your data analysis jobs. The other thing that's really behind these days is container support. There are, in fact (I Googled this right before the talk to make sure I wasn't lying), additions for container support, but it's way behind what we're used to, like the Open Container Initiative and CNI and those kinds of things. And once again, because of that batch orientation, YARN is kind of ideal for big, fat JVMs that are reading big chunks of file systems, but not so great if I have relatively small jobs for some things and big jobs for others. And so that means that if we don't want to do batch, if we want to do streaming, it's a little bit more like we're going to roll our own services; we're going to work around the limitations of YARN and do things our own way, like running Kafka, Cassandra, and so forth. Now, one of the things I want to argue as we go along is that increasingly what we see customers doing is not only building, say, Spark Streaming jobs and running HDFS, but also writing a lot of microservices that interoperate with these things.
I'll give you an example from a lunchtime conversation I had. I might work for a healthcare company, and even though I'm writing maybe classic data warehousing jobs with Spark, I'm also writing microservices that embed the rules for regulatory enforcement of data privacy (I don't know all the terms they use for this stuff), basically anything I might need to know about healthcare billing and record management. And I really want those services to be accessible from my streaming job, so that as data's coming through, I can properly manage it per those requirements. I'm sorry about the flickering display here. So this kind of mixed workload is not so great in Hadoop either. Really, Hadoop is focused on "I'm going to do lots of batch jobs," and if you want to mix in microservices, you kind of have to sneak them in on the side or partition your cluster in some way. In fact, they even say that if you're going to run Kafka, you want dedicated nodes for Kafka; you don't want them mixed up with the rest of Hadoop. Okay, so that brings us to why we picked Mesos, and I know some of this is preaching to the choir, so I won't belabor these points, but just to sort of catalog them. I was thinking about what I like about Mesos, and I thought, I'm just going to go to the homepage for Mesos for inspiration, and I liked what I saw so much, I just figured I'd do a screen capture. So this is just mesos.apache.org, and I'm only going to talk about three of these.
I don't want to go through all of them, but there are three that I think are really important for the problems we're facing in streaming and mixed microservice architectures. First is container support, and all the benefits we get from that: isolation, running concurrent versions, people deploying a container that runs for a few seconds and then goes away, or deploying a service that's going to run for months, and all of that just kind of works. Sort of at a lower level is really fine-grained management of my cluster resources, so that when I really do need to optimize usage of CPU, GPU, et cetera, I get to do that for free; it just comes with my ecosystem. And the list of resources you can manage is more than YARN is capable of today, for example; they still haven't gotten GPU resource management done. I had a chat recently with a guy from Skymind, the company behind Deeplearning4j, and those machine learning guys, especially in the AI world, are all about GPUs now, so if you can't manage your GPU resources, it's a bad deal. Oh, and this reminds me of something else. I always find this amazing when I think back to the 90s; I've been around a little while, as you can tell from my platinum blonde hairstyle. Back in the 90s, we would stand up servers, and it was a bad deal if a server got more than 30% utilized; you'd start worrying, right? Because that was all the capacity you had, and these were massive Sun boxes too, so we only had a few of them. But these days, we want to get much closer to 100% utilization, without falling over, for all the reasons of economy. And the last one is the famous two-level scheduling in Mesos.
I love the simplicity of the Mesos scheduler model. They've partitioned the notion of what scheduling means into sort of a tandem dance between Mesos itself and your application, which is a framework. That means Mesos does not have to know everything there is to know about how to schedule a Spark job. Instead, it offers resources to Spark, and then Spark has the knowledge inside to know what resources it needs to take to do whatever it wants to do. So if I want to add my own custom database or whatever, I can write the scheduler that knows how to accept resource offers, and Mesos remains relatively ignorant about what I'm going to do with those resources. That's the opposite of the way YARN works, where they've actually hard-coded information into YARN about what it means to be a Spark job or whatever. That has some advantages, but the disadvantage is that Hadoop doesn't have the flexibility to run other "weird" things like file systems and databases. So it's this model that lets us do crazy stuff like add these new frameworks and then have them work in a fairly optimal way. So this is just a diagram of how Spark actually works on Mesos. When you submit your job, the scheduler talks to the Mesos master, and this dance goes on about allocating resources on different nodes under the aegis of a Mesos executor, and inside is a Spark executor. (I just pronounced that word differently for no good reason.) And inside that will be the tasks that actually run your job. So here's a little story; I don't know if everybody knows this, but maybe you do. If you read the original Mesos research paper that Ben wrote, there's this hilarious paragraph where he says, in order to prove our ideas about two-level scheduling, we invented this little framework called Spark to test it.
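That offer/accept dance can be sketched as a toy model. To be clear, this is plain Scala for illustration only, not the real Mesos framework API; all the names here are made up:

```scala
// Toy model of Mesos two-level scheduling (NOT the real Mesos API).
// The "master" only knows about raw resources; each framework's own
// scheduler decides which offers to accept and how much to take.

case class Offer(agentId: String, cpus: Double, memGB: Double)

// A framework scheduler: given an offer, take a portion or decline.
trait FrameworkScheduler {
  def resourceOffer(offer: Offer): Option[Offer]
}

// A "Spark-like" framework that wants 2 CPUs and 4 GB per task.
object SparkLike extends FrameworkScheduler {
  def resourceOffer(offer: Offer): Option[Offer] =
    if (offer.cpus >= 2.0 && offer.memGB >= 4.0)
      Some(Offer(offer.agentId, 2.0, 4.0))
    else None
}

// The master makes offers and stays ignorant of what tasks mean.
def runOffers(offers: List[Offer], fw: FrameworkScheduler): List[Offer] =
  offers.flatMap(fw.resourceOffer)

val accepted = runOffers(
  List(Offer("agent-1", 4.0, 16.0), Offer("agent-2", 1.0, 2.0)),
  SparkLike)
// Only agent-1's offer is big enough; agent-2's is declined.
```

The point of the design is visible even in the toy: adding a new framework means writing a new `FrameworkScheduler`, with no change to the master.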
So yeah, I figured most of you had heard that story before, but obviously Spark took off on its own. All right, let's dive into streaming a little more specifically. What are some of the characteristics, at a high level, that streaming systems have to support? I'm going to go into these in detail: continuous processing, variable lifespans of these streaming jobs, resiliency, and scalability. So what's the deal here? Well, continuous processing: in a stream, hopefully the input isn't ever going to stop coming, and if it does stop, it probably means you went out of business or something. And similarly, you're going to have to keep feeding output to downstream consumers, storage, whatever. One of the implications is that you need your framework to really scale dynamically on demand, for those Super Bowl events or Justin Bieber tweets or whatever. I'm going to come back to this one in a little bit. The other thing that I think is kind of cool in terms of optimization is tools like the Container Network Interface, which let you fine-tune how your container networking works, so that you can optimize transport between various components. So if you've got this high-bandwidth stream coming in, you could actually fine-tune how that is configured in the cluster, and maybe take resources away from other things. Stuff like that, I think, is going to further help us optimize how we use our clusters. Variable lifespans: normally a stream is going to run for weeks to months. This is a big contrast with a batch job, which might run for hours, maybe overnight. There's only so much optimization you have to do, or dynamism you need, if you're going to be done in a few hours. But if this thing has to stand up and be resilient for weeks or months, then it's a new ballgame. And of course, why would you...
You wouldn't necessarily have a streaming job that lasts only for a few minutes or a few hours, normally; as I just said, this is data that's supposed to never stop coming. But in fact, and this is something Jay Kreps from Confluent likes to harp on a lot, if you have a stream processing system, then you can treat batch as just a finite stream, in principle. So if I realize I made a mistake in a calculation, but I only want to rerun yesterday's data, I can just start up a stream job over a finite stream, which is yesterday's data, and I basically have one system that does it all. So that's why I'd still put minutes on here. And of course, Mesos is really good at managing very short-lived containers, because it's fairly lightweight. You wouldn't do seconds, necessarily, because there is some overhead with resource negotiation and offer management, but you could do minutes without too much pain. And if you're running services that are going to last forever, like the daemons that run HDFS, the NameNodes and so forth, that just works. So coming back to resilience and scalability, those last two high-level bullets: you get a lot of stuff for free that's been really useful, like federation through ZooKeeper. Actually, in the OpenWhisk talk there was discussion of not using ZooKeeper but using Akka Cluster instead for federation within OpenWhisk, the way they had modified it. But there are a lot of fault-tolerance facilities; for example, if you've submitted through Marathon, you can have the job restarted if it fails. There is one important point that I'll come back to at the end, though: if you're building stateful streaming services, then you have to have some mechanism for checkpointing the state, so that if the service goes down, you can bring it back up where it was. In fact, it turns out there's kind of a bad bug in Spark Streaming here.
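The "batch is just a finite stream" idea from a moment ago can be sketched in plain Scala. The function and data below are made up for illustration; no engine is involved. The key point is that logic written against an `Iterator` doesn't care whether the stream is a bounded replay of yesterday's data or a live, unbounded feed:

```scala
// One piece of stream logic, usable for both "batch" (a finite
// replay) and a live feed: sum revenue per user as events arrive.
def revenuePerUser(events: Iterator[(String, Double)]): Map[String, Double] =
  events.foldLeft(Map.empty[String, Double]) {
    case (acc, (user, amount)) =>
      acc.updated(user, acc.getOrElse(user, 0.0) + amount)
  }

// "Batch" run: yesterday's data, presented as a finite stream.
val yesterday = Iterator(("alice", 10.0), ("bob", 5.0), ("alice", 2.5))
val rerun = revenuePerUser(yesterday)
// rerun: Map(alice -> 12.5, bob -> 5.0)
```

The same `revenuePerUser` could be fed an unbounded iterator and emit running totals; that's the "one system that does it all" claim in miniature.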
This is the older Spark Streaming, versus the new Structured Streaming, if you know that distinction. The older Spark Streaming can successfully start up again from a checkpoint, but there are some particular kinds of data that actually get lost. Basically, they have this notion of counters that you can build in, and it turns out those get reset to zero in Spark Streaming if you restart. The reason I'm mentioning this is that I've been dealing with a customer issue the last few days on that particular point. But apparently the new Structured Streaming is much better at handling stuff like that. So this is a mechanism that you end up having to solve some way yourself, or the tool that you're using, like Spark, has to handle for you. And then scalability: it's actually not that hard, but dynamic scaling up and down is something you have to build into your streaming system. We've got all kinds of facilities in Mesos now to support this, like optimistic offers and dynamic resource allocation, as long as your framework takes advantage of them. We've actually worked with Mesosphere on the Spark framework, for example, to support these newer capabilities, so that Spark jobs can scale more dynamically on demand. There's still work to do, though, to be honest. Okay, so that's the high-level picture of the sorts of things you have to worry about with streaming apps in general. I wanted to take a few minutes and talk about some particulars of the streaming engines that we've used. We actually focus on four in this Fast Data Platform, and we picked these four because we felt that they covered a reasonable percentage of the things you might need to do. But it kind of sucks that you have four.
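To make that checkpointing point concrete, here's a toy sketch in plain Scala. Nothing here is Spark's actual checkpointing API; it just illustrates the principle that any state not written to the checkpoint (like those counters) silently comes back at its initial value after a restart:

```scala
// Toy checkpoint/restore for a stateful stream job (no real engine).
case class JobState(offset: Long, counters: Map[String, Long])

var checkpoint: Option[JobState] = None

// Process one record: advance the offset, bump a per-key counter.
def process(state: JobState, record: String): JobState =
  JobState(
    state.offset + 1,
    state.counters.updated(record,
      state.counters.getOrElse(record, 0L) + 1))

// Run for a while, then checkpoint ALL of the state...
val s1 = List("a", "b", "a").foldLeft(JobState(0, Map.empty))(process)
checkpoint = Some(s1)

// ...crash, restart from the checkpoint: the counters survive.
val restored = checkpoint.get
// If only `offset` had been checkpointed, counters would restart at
// zero, which is essentially the old Spark Streaming behavior above.
```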
I'd rather have one, or maybe two, but that's the world we live in: if you really want to cover everything, you kind of need to be able to pick from a list of four, in our opinion. Some other criteria that we used, which I won't get into in a lot of detail here, are pragmatic things like: are these actually viable projects? Somebody wrote this great blog post a couple of years ago listing like 11 or 12 Apache projects that claim to be streaming engines. So you have this paradox-of-choice problem. It's sort of like going into Best Buy to buy a refrigerator, and you see this line of refrigerators that goes on for the length of a football field, so you walk out of there because you're afraid of buying the wrong refrigerator. I don't really like giving people too many choices because of that problem, so what I've tried to do is distill it down to four that cover the spectrum. I'll talk briefly about how each one works for particular kinds of problems, and then discuss how they fit into Mesos and how Mesos supports them. So, to go over the characteristics you might use: now I'm at the level of "I'm going to write an application, and I need to figure out which of these engines to pick." The first thing you might ask is: what's my latency budget? This is actually pretty important. If you're going to do things like authorize credit cards, the rule of thumb I've heard in the banking world is that, for usability reasons, you have 200 milliseconds to refresh a web page (you've probably heard that number), and it turns out that of all the things that have to happen between me clicking "buy" and getting a response, the bank gets about 10 milliseconds to make a decision about your credit card purchase.
That is way too short for something like Spark Streaming, which is still mostly a mini-batch model: it likes to get data in chunks and then use the efficiency of the cluster to process it en masse. But with something like Akka Streams or Kafka Streams, you could reasonably do 10-millisecond individual processing of events, you know, complex event processing. Sorry about that again. Volume is also important. Some of the engines I'll talk about really don't scale horizontally, so they may be really great at low latency, but not if you need to do a lot of stuff, like process the Twitter firehose. (Maybe I'll just hold this thing; I think this is a bit of an old Mac, and it might be the connector. Oh, it does? Okay, all right, well, I won't touch it then; maybe that's better.) So if you are processing the Twitter firehose, something like Spark or Flink is really great, because they can partition that stream as it's coming in and then do things in parallel. But if you don't have quite that scale and you need lower latency, then maybe another tool is better. What kind of processing are you going to do? It seems like everybody is layering SQL on top of their streaming engine; you may know the Confluent guys just announced this for Kafka at Kafka Summit, like a week or two ago. If you want to do machine learning, there's a really interesting problem in the streaming context, and that is that I'd like to be training models incrementally. Let's just say spam is evolving (more quickly than it really is); I'd like to be adjusting my spam filters to the evolving threat. There's this notion of concept drift in machine learning, where my model is getting stale over time.
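The train-slowly/serve-quickly split that this leads to can be sketched in plain Scala. Everything here is made up for illustration (no ML library, and the "model" is just a threshold): a background trainer periodically swaps in a fresh model, while the low-latency scoring path only ever reads the currently installed one:

```scala
import java.util.concurrent.atomic.AtomicReference

// A made-up "spam model": flags messages with too many links.
final case class SpamModel(version: Int, threshold: Double) {
  def isSpam(numLinks: Int): Boolean = numLinks > threshold
}

// Shared, atomically swapped reference to the model being served.
val current = new AtomicReference(SpamModel(1, 5.0))

// Hot path: score one event with whatever model is installed now.
def serve(numLinks: Int): Boolean = current.get.isSpam(numLinks)

// Slow path: "retrain" (here, just tighten the threshold) and publish.
def retrain(): Unit = {
  val old = current.get
  current.set(SpamModel(old.version + 1, old.threshold - 1.0))
}

val before = serve(5)   // false: 5 is not > 5.0
retrain()               // installs v2 with threshold 4.0
val after  = serve(5)   // true: 5 > 4.0
```

In a real system the slow path might be an hourly Spark job and the hot path an Akka or Kafka Streams service; the atomic swap of an immutable model is the part that carries over.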
One of the best managers I ever had used to say that software has a half-life: whether you're touching it or not, it's decaying, maybe because it's growing irrelevant or something, and I think that's true in machine learning. So back to the point: we'd like to be able to train models, but also serve them with lower latency than we can possibly train them. That sometimes forces a choice. Maybe I'll train with Spark, because it's pretty good at that kind of batch or mini-batch stuff (say, every hour I'm going to update my models), but then how do I take that model and actually serve it from my low-latency stream engine, which may not be Spark? So there are some interesting problems there. And if I'm just doing simple filtering and transformations, say for ETL (extract, transform, and load) kinds of problems, then I have a lot more options, and some things are going to be better than others for other reasons. All right, moving on to the third page of these trade-offs. This goes back to what I mentioned a minute ago: I might actually need to look at each event individually, like authorizing a credit card or doing fraud detection, that kind of stuff. Or maybe I'm just doing bulk processing, where each of these records is kind of anonymous; I just want to clean it up, maybe join it with some other sideband data, and then spew it out to its downstream consumers. I don't need to process each one individually; I can exploit the efficiency of doing things in groups. And the last one is: what kind of interoperability do I need with other tools? The reason this point is here is that it turns out Kafka Streams is really designed to read and write Kafka topics, and it does that very, very well, but if you wanted it to, say, be a hook for RESTful input, then you'd have to use something else along with it. Other tools have a lot more
flexibility in terms of direct connections to this, that, and the other thing, like writing to databases. So let's talk about these four tools that I've already mentioned, how they fit into this picture, and then what Mesos does to make them really good tools to work with. They actually fall into two groups: Kafka Streams and Akka Streams have a lot of synergies, in various ways I'll get into, and then Flink and Spark fit into their own little subgroup. So, Kafka Streams. This, again, is the library that sits on top of Kafka. Actually, to be more precise, and this is important, you write an application that embeds Kafka Streams as a library, and so you manage your application, your microservice, whatever, any way you want. The exception is that if you do use this KSQL (Kafka SQL) library that sits on top of Kafka Streams, there are some services that you run in that case. But anyway, it can be pretty low latency; latency here is actually more limited by how long you let your queues, your topic lengths, become in Kafka. That's going to be your latency budget with Kafka Streams. It's not a tool you use to shard a massive pipe of data across a cluster into partitions so you can run in parallel; it's not really designed for that. It's more like medium volume, and I don't want to say low volume, because that sounds bad, but not volumes like Spark can handle. It's fantastic in that it has a lot of really good primitives built in to make a lot of common problems easy, like extract-transform-load kinds of transformations. My favorite example of this is maybe I'm ingesting raw log data and I want to parse it into some sort of record format and write it to a new Kafka topic; then downstream consumers aren't parsing strings, they're actually reading records that represent the log data. But they also have some cool table abstractions, so
if I just need to do aggregations, I don't actually need to see every record; I want to see, say, the average over the last minute or something, and they have some nice facilities for that. So if you think about what that means in terms of individual record processing (and I kind of wrote this last bullet backwards, in a way), ETL would be looking at each event one at a time, whereas the table abstraction would be more like a data flow that's doing aggregations. Anyway. Now, Akka Streams. Again, we're talking about the SMACK stack, where the A is a placeholder for Akka. Akka Streams is actually a DSL, a domain-specific language, on top of Akka actors: rather than having to write low-level actor primitives, you can write things as data flows, and it will materialize actors for you. It also is a library that you embed in your applications, so it's analogous to Kafka Streams in that way. It can be very low latency; even though there is a bit of overhead sending messages between Akka actors, it's pretty good, down into the millisecond-ish range. I mean, you wouldn't use it for controlling SpaceX rockets when they're landing, but it's pretty good for things like fraud detection, credit card authorization, and stuff like that. Again, sort of medium volume, not designed to partition your data. Of all these tools, it actually has the most sophistication in terms of the kinds of event flows you can define. It's the only one of these, for example, that lets you do feedback loops, although I really have no idea what that would look like; I'm not sure I would try, but you could do feedback loops if you wanted, and so forth. And Akka's kind of famous for doing this so-called complex event processing, like fraud detection or credit card authorization. So, I drew this diagram maybe two hours ago, so it's not very good, but it's the idea of what I mean by embedding Akka or Kafka Streams into an application.
The application here is represented as microservices that you'd be running, all processing the stream of events. Really, the reason this slide is here is to talk about how this is supported in Mesos. Once again, because we have really good support for containers, for lightweight containers as well as fairly big things, it's pretty easy to have that event stream coming out of Kafka, whether into Akka or Kafka Streams, and then these microservices reading the data. Really, this diagram is better for Akka, because it implies that I'm going to route events over to other microservices and do manipulations of some kind, depending on what the events are. I'm not really showing output, but you would imagine that maybe it goes back to a Kafka topic or somewhere downstream. Now, the other two fit into a different category: Spark and Flink are actually deployed as systems that run their own services, and then you submit jobs to them, which they figure out how to partition and run over the cluster. That's true even for streaming; it's true for queries, et cetera. Okay, so medium latency, though, because, at least as of today (and this is going away over time), Spark is actually doing mini-batches behind the scenes. When you say "I want to run a stream job," it's actually going to box up some amount of data and then do some processing over it, so there's some latency there. In the old Spark Streaming API, the latency was maybe 200 to 500 milliseconds as a minimum. So for that credit card authorization example, there's no way you can use Spark Streaming, because you don't have 200 milliseconds to wait. But it is great if you're training machine learning models and you don't mind having windows of minutes or whatever, where you just accumulate data, do an incremental training of the model, and then do something with the data
Spark will eventually get rid of this minimum latency and become more of a true streaming engine; that work is actively being done, but this is the state of the art today. I really love the SQL support in Spark; it's now ANSI SQL:2003 compatible, which is kind of impressive. And of course, if you've written your logic for Spark Streaming, you can use it for batch-mode processing too, so a lot of your offline warehousing apps can be done this way. But it isn't designed for single-event processing; it processes data en masse, in chunks.

Flink is the fourth one. A year ago I wouldn't have included Flink here, but I decided there were two important reasons to have it as part of our product. One is that, while it's designed as one of these high-volume tools, it was built as a streaming engine from the get-go, so it actually does low latency really well. If you need this partitioning of data but still want reasonably low latency, Flink is often picked by teams over Spark for that reason; it's really more of a true streaming engine. The other reason is the Apache project called Beam, which is sort of the top side of Google Dataflow. I put it that way because Google did something rather clever: they open-sourced the data-flow definition part of Dataflow (I'm overloading terms a little, I know), but not the runner, the part that actually materializes that data flow and runs it. If you're in Google Cloud, you get Dataflow and can run it there; if you're not, you need something else, and as of today, at least, Apache Flink is the best tool for running Beam data flows. Why is that important? Well, Beam to me is like the Smalltalk of our era. I'm really dating myself here, but Smalltalk was the programming language that everyone aspired to back in the '80s. Okay, yeah, I know, the '80s
actually existed; for most of you, it's just this thing that people my age claim existed. But the thing was, nobody actually used Smalltalk; they all talked about it. It was the gold standard. Well, I feel Beam might fall into the same category: it's influencing everything everybody else is doing. The reason is that they've done a really good job thinking about all the weird things that happen when you're processing data and you want actual accuracy, not approximate numbers.

For example, suppose that for accounting purposes you need to know, for every 10 minutes, how many units of a particular SKU you sold in your stores around the country; say you work for a physical retailer. That sounds not too bad: set up a batch or mini-batch or streaming job that looks at 10-minute windows, does these rollups, and bang, straight into the accounting system. Well, not so fast. Because light has a finite speed, chances are pretty high that a lot of those numbers are going to show up late, just because of the time it takes data to traverse the network. Even worse, if there's a network partition and data doesn't show up for 10 minutes or an hour, how do I handle this late arrival of data? Do I do some approximate, provisional calculations and send those downstream, but have a mechanism for sending corrections if data arrives late? These are the kinds of questions you end up asking yourself if you want to build something that doesn't just do approximate aggregations over windows but tries to be really, really accurate, like the kind of thing you'd want in an accounting system. Google has thought about all of this; Beam does a really sophisticated job of presenting it, and that's why I really like Flink.
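Those provisional-results-plus-corrections mechanics can be sketched in a few lines of plain Java. This is a toy under assumed names (`process`, a fixed 10-minute window), not Beam's actual windowing API.

```java
// Toy sketch (plain Java, not the Beam API) of 10-minute event-time windows
// that emit provisional totals, then corrections when data arrives late.
import java.util.*;

class LateDataSketch {
    static final long WINDOW_MILLIS = 10L * 60 * 1000;

    // One emitted result: which window, the window's running total, and
    // whether this is a correction to an already-reported window.
    static final class Result {
        final long windowStart;
        final int total;
        final boolean isCorrection;
        Result(long windowStart, int total, boolean isCorrection) {
            this.windowStart = windowStart;
            this.total = total;
            this.isCorrection = isCorrection;
        }
    }

    // Running per-window totals; in a real engine this state lives in the
    // streaming runtime, with a policy for how long to keep old windows.
    final Map<Long, Integer> totals = new HashMap<>();

    // Fold one sale into the totals. emittedBefore is the event-time
    // boundary up to which provisional results were already sent downstream;
    // a sale falling in an earlier window yields a correction, not a drop.
    Result process(long eventTimeMillis, int units, long emittedBefore) {
        long window = eventTimeMillis / WINDOW_MILLIS * WINDOW_MILLIS;
        int total = totals.merge(window, units, Integer::sum);
        return new Result(window, total, window < emittedBefore);
    }
}
```

The point is just that a late sale updates the already-reported window's total instead of being silently dropped; a real engine also has to decide how long to keep old window state around, which Beam exposes as allowed lateness.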
They're ahead of everybody else in supporting these semantics, if you need that kind of capability. Okay, I'm running out of time, so let me quickly finish. I showed this diagram before, but just to emphasize: both Spark and Flink run services in the cluster; you submit jobs, and they figure out how to partition them into tasks, as opposed to you embedding these tools in your applications.

The last point I want to get into, which I've alluded to, is the merging of architectures that's happening. A few years ago there were the big data people, who mostly worried about data availability and scaling to big data sets, but didn't worry as much about high availability, resilience, elastic scalability, those kinds of things; it wasn't as big a problem for them as it was for the people building the web servers in your organization, which I've just lumped under "services." But today, as we've moved to microservices and fast data, they're converging, and let me quickly make the case and then I'll quit. In a classic microservices architecture, each little microservice does its own thing and has one responsibility; I can evolve them independently, scale them independently, drop in new ones. This fits the Mesos model really well, because Mesos supports containers really well, and I can easily deploy multiple things. That may not seem to represent what's going on in streaming very much, but I think most of the time people are building very similar things in streaming architectures: whether I've got Spark or Akka Streams or whatever, I'm usually writing an app that does one thing; I'm going to deploy it, maybe a lot of instances of it, maybe concurrent versions, so I have a lot of the same concerns. And I'll just skip past the synergy stuff to get to the
point. I think what's actually happening is that it's forcing these architectures to look more alike than different. If I'm a microservices person, I used to build three-tier apps and now I build microservices; well, if I'm successful, my data is going to become dominant as my business grows, and I'm going to be worried about building something more like stream data processing. Think about what the Twitter architecture must have looked like from 2007 or 2008 to today; that's the evolution they went through. Conversely, if I'm going from Hadoop to streaming architectures, now I have to learn how to write services that live for months and resist network partitions and so on.

The last thing, one thing I think we still need to solve, and I'd love to see it solved generically in the Mesos world, is common mechanisms, other than maybe ZooKeeper, for stateful apps to persist state in a globally available way, so that if some of those processes go down, I can easily reconstitute them without losing where I was. Right now, like I said, Spark, Flink, and the rest each do it their own way, or don't do it at all. Okay, that's it. Any questions? Like, "Where's the beer?"

So I'll repeat the question for the video: if Flink is so much better at the low-latency stuff and the sophisticated semantics, why would I use Spark? I think it's a couple of things. One is that Spark is probably still your better choice if you need to do a lot of batch work. And if your organization has already bet on Spark, maybe you have an established Hadoop cluster and a Hadoop organization that's using Spark, then it wouldn't make sense to abandon it. Also, not everybody needs those semantics I described briefly with Flink. So I don't think anybody's going to use all four of these unless you're a really big organization, one of those
places that uses everything in the world. But I think that's one of the challenges: people are going to see a lot of overlap in some of these tools and just make an arbitrary or semi-arbitrary choice between them, maybe settling on two, like Kafka Streams and Flink, or Akka Streams and Spark, but no more than that.

Some more questions: what about Samza? And there are some others you could throw in that I've heard about or people ask about, like Apache Apex. I think Samza and Storm fall into the same category: tools that were pioneers in this space but have suffered a bit from "we kind of did that wrong, we kind of did this wrong, and now maybe we need to start over and build the next generation." So I think they both suffer from that a little, although if you're a Storm user, you should really look at Twitter Heron, which is their complete rewrite of Storm. I can't say I've actually used Samza, but from what I've seen of it, it falls into that camp of a really good first generation that probably needs to give way to second-generation technology. Maybe time for one more? Okay, thanks a lot.