So let's take a quick poll so I get a fair idea of who's here. Who is into big data — who is writing big data jobs as their day job? Okay. So the rest of you are here because you're curious. That actually makes a lot of sense, because I'm right there along with you. My experience has mainly been building distributed applications. I work for a company called Typesafe; I've mostly played with Scala, Akka, and the Play framework. But recently I moved to our Spark team, where we contribute to Spark Streaming. So this presentation is essentially me sharing my experience from the last six months, and the way I'm seeing this big data space evolve. I'm probably not going to give you a bunch of answers — I'm going to end up with lots of questions. Maybe that's a good thing, so you can go back and explore this field a little, because it's still evolving. I've been programming for a while, and I've been watching this big data space — I'll define these terms shortly, what big data means and all that — from the fence: this world of Hadoop and how it's evolving, lots of interesting stuff starting to happen around analytics, and now a streaming side of things coming in. And as a programmer, looking at the types of problems businesses are solving — because technology changes typically get driven by business requirements — for the folks who haven't raised a hand yet, it's becoming a very critical skill to at least understand this space and how to work with these large amounts of data. Because if you think about it, data is the problem we solve day in, day out in the kind of programming we do. What's changing is the sheer amount of data we have to handle, and the way that data is being produced. So I think it's becoming a very important, critical skill for any programmer out there. So let's define this term, big data. What does it mean to you? Shout it out — we're going to make this interactive, otherwise I'll fall asleep. Please, shout it out. What does big data mean? High-rate data, okay. What else? Unstructured, unorganized, okay. It cannot fit on one machine — you mean in memory, on the file system, you cannot fit it. Okay, what else? Volume. That's true; one of the defining problems of big data is that size itself becomes a problem. But the way I actually like to define it: big data is about storing, managing, and making money out of large data sets. That's essentially what it boils down to — people want to make money out of it. So what is fast data? Well, typically when people talk about big data, what comes to mind is a big, gigantic cluster in a data center where jobs run forever, or at least take hours and hours. In comparison, fast data essentially means the processing is fast: near real-time analysis, online machine learning — again for profit; we're doing something to make something out of it. That was the joke definition; this next one is probably a much better definition.
What fast data essentially captures — and we're going to look at examples of the different approaches people are taking here — is a general set of ideas around trade-offs. In the previous presentation there was a discussion about distributing a large computation; this all boils down to the same kind of trade-off: how can we deliver things in a timely way? That's what's happening. I want to know things right now — how are my customers doing? I put an ad out there; how is my ad performing? People want answers to questions like that, and that means processing streaming data. So what we're going to do in the next 40 minutes is parse this idea and see what it really means, and how people are building these kinds of solutions. I really like this quote; it's one of the fundamental things this space comes from: the value doesn't appear automatically out of the data — you have to do something for it. Now, the big picture. Let's step back: what really happens when you say, okay, I have this stream of data, this large data set; I agree the value doesn't come automatically; I want to get value out of it? Essentially, the whole thing starts with a problem. I have a question; I want to find an answer to it. And if you start with the problem — and this is another interesting thing about big data teams: they're very cross-functional; you'll be working with PhDs, data scientists, engineers, data analysts — the first thing that happens is you define the problem: can I express this problem as some sort of specification, in some set of terms? Once that happens, the first cut of the solution usually looks like this: we feed the data into some sort of pipeline — we'll explore what that looks like and what it means. That pipeline can be backed by a bunch of things: resource managers, HDFS storage. Somebody said one of the biggest problems with big data is storage, and HDFS, a distributed file system, addresses exactly that. And what typically happens — this is a very, very typical architecture — is that once the processing is done, you immediately dump the result into some fast storage: an in-memory store, a Redis cache, whatever, so that at the end someone can consume the result. Please ask me questions, okay? Whenever you have thoughts or comments, they're all welcome. Okay, that's the general idea; this is how people try to solve this big data problem. Now let's look at some interesting examples. Here is one example of how some companies detect network intrusion: is there a DDoS attack happening, is there an anomaly in the access to my network, and how can I figure that out? This has to be fast, real-time, right? I cannot wait for an answer tomorrow. So this is a very common pattern: a stream of data comes in, and the first thing they do is save that data somewhere as soon as possible. Then you mine that data, process it, and feed it into a subsystem where your customers and users can consume it and see what's happening in the network. Let's get more specific.
This is the fast data platform at Pinterest. What they're trying to do here, again, is essentially see how users are interacting with their application. It's the same typical pattern: the previous picture was the generic mental model; this is a more specific one, because now we're putting specific tools in the boxes. The first thing that usually happens is that the data gets dumped into something called Kafka. This is your event log — as soon as possible, save it somewhere. Then the next part of the chain picks it up. In Pinterest's case, they use a tool called Secor, which persists the data into S3, a large store. After that, the data pipeline picks it up and some sort of processing happens on top of it, so that I can come up with interesting results and answer some questions — again, it all starts with a question. Here's another example — I'll give you a bunch of examples so you can relate things a little bit — the recommendation engine at Netflix. Netflix: it's what Americans do on Friday evening, watch movies, right? I'm not sure about Canadians. Now, who has time to browse all the available movies? So Netflix is very much recommendation-driven; your page looks very different from mine. And there's a newer feature: when you go to Netflix.com, a panel shows up called Trending Now — the movies people are watching right now. This is not a raw dump of everything being watched; they run algorithms on top of it and only show the movies you might be interested in. That's an interesting problem, and that's the value-add right there. And again, it's the very typical pattern: some sort of data collection happens, then some sort of processing on top of it. Kafka shows up again — Kafka is very much becoming the de facto standard for event logging — processing happens, and the results are produced back out. Here is another interesting example: predicting breaking news. Something interesting is happening; we should watch out for it. A very interesting problem if you think about it, and again something where you cannot wait until the end of the day for the answer — you need to know right now, so you can take action on it. This example is driven by Wikipedia edits, which is fascinating. If you watch the real-time updates happening on Wikipedia, a crazy amount of data gets added. And another interesting thing about Wikipedia is that when something important happens, some significant event, a group of people jumps in and updates the relevant pages. So Wikipedia becomes an interesting indicator that something is going on — maybe you should watch a particular page. But what does all this have to do with functional programming? Why am I even talking about this? Some of you are probably bored already. Well, I agree with Dean Wampler here: big data apps are the killer app for functional programming. There is really no other way to solve the kinds of problems we've been talking about.
I have seen so many companies out there jumping onto Scala just so they can solve this kind of problem. Look at this — I hope you can read the code here. This is an example of what goes on inside one of those boxes, what I'm calling the data pipeline, where the interesting stuff happens. Typical code looks like this; it's almost dataflow programming. You take a stream of data. You map over it, transforming something. Then you flatMap, which probably means you're integrating another stream of data. We're essentially joining two streams. Then you perform these transformations one after another. This is functional programming, isn't it? And finally, when you're done, you perform the end-of-the-world, side-effecting terminal action: save as a text file, write to S3, dump it into Cassandra, whatever. So what is streaming? Remember, we talked about the definition of fast data: I need to work with streaming data, and I have to make a bunch of trade-offs, because it has to be close to real time — I have to find answers as soon as possible. So the definition of streaming is: a type of data processing engine designed to work with an infinite data set. The examples give you a feel for that, right? Trending Now — the number of movies people are watching on Netflix; I have no control over the size. The number of edits happening on Wikipedia. The number of people trying to log into your network. Infinite — these events just keep happening continuously; we cannot bound their size anymore. There are other terms used to describe this scenario, but we're going to stick with streaming. So if that's the case, what are the hurdles? Assuming I've convinced you that fast data is the way to go — because one of the fundamental, interesting changes happening here is this: two years ago it was fine for a Hadoop job to take five hours to come up with an answer. The business side is evolving too, and they're asking, why wait five hours if I can find the answer in five minutes? That's forcing a whole set of frameworks and tools to be built that address this fast data problem — these scenarios, these analytics, these algorithms people are trying to build. And there are a bunch of hurdles; it's a difficult problem to solve. The patch people use so far is called the Lambda architecture — has anyone read about the Lambda architecture? It says: yes, our streaming solution isn't great, but we still have to use it because the customer wants an answer in five minutes; what we can do is mix it with our traditional batch process and somehow combine the two results to come up with a better answer. The downside is that it's an interesting idea but a terrible hack, because it doesn't really work well in practice: most of the time you end up implementing the same algorithm twice, once for the speed layer and once for the batch layer. What we really want is to make the speed layer fast enough, good enough, feature-rich enough that we don't even need a batch layer.
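To make that concrete, here is a rough reconstruction of the kind of pipeline code I was describing a moment ago — the map / flatMap / join / save chain. This is a sketch, not the exact slide: it assumes a Spark-style API, and the bucket paths, field layout, and output location are all made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A typical "data pipeline" job, Spark-style. All names and paths are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch"))

val clicks = sc.textFile("s3n://some-bucket/clicks/*.log") // one click event per line
val users  = sc.textFile("s3n://some-bucket/users.csv")    // one user record per line

// Pure transformations: nothing runs yet, we are only describing the flow.
val clicksByUser = clicks
  .map(_.split("\t"))
  .flatMap {                                   // flatMap: drop malformed lines, keep (userId, url)
    case Array(userId, url, _*) => Seq(userId -> url)
    case _                      => Seq.empty[(String, String)]
  }

val countryByUser = users
  .map(_.split(","))
  .flatMap {
    case Array(userId, country, _*) => Seq(userId -> country)
    case _                          => Seq.empty[(String, String)]
  }

// Joining the two streams of data on userId, then transforming again.
val enriched = clicksByUser
  .join(countryByUser)                         // (userId, (url, country))
  .map { case (userId, (url, country)) => s"$userId,$country,$url" }

// The end-of-the-world, side-effecting terminal action: this is what triggers the work.
enriched.saveAsTextFile("s3n://some-bucket/output/enriched-clicks")
```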
In other words, I want my only data-processing pipeline to be the speed layer — nothing batch. The next interesting problem in this space is latency. Not only do I have to process the data, I need to come up with an answer as soon as possible. So there are a lot of interesting changes happening around probabilistic data structures. The most interesting tool I've found here is BlinkDB — one of the Berkeley projects — which does queries with approximation, with time bounds and error bounds. With stream processing we're saying: yes, it's impossible for me to process one terabyte of data and give you an answer in five minutes — I'm sorry, I cannot do that. So we turn the question around: can you give me an answer that is 95% correct? I can work with that. This is a very interesting space where we need tools like this, tools that can give time-bound or error-bound guarantees. Another aspect is that when I take this job and run it on a cluster of nodes, I need some sort of cache that works across those nodes; tools like Tachyon are addressing that problem. Then there's machine learning — algorithms that run on the stream as the data flows through the pipeline: clustering algorithms, anomaly-detection algorithms, whatever you need. And this is probably the piece most relevant to us: the computational model. All of that is fine, but if you really want me to write a Spark job or a Hadoop job, you'd better give me a good API, a good model, because the old way doesn't work — anyone who has written a raw Hadoop job probably knows what I'm talking about. I'll skip this slide; we know it's all important. So what is my wish list? There's a bunch of stuff on it — I always try to create a wish list; why not, you can always wish for something. It turns out there is no single tool out there that addresses all of these problems. What is emerging instead is a platform that combines these ideas, because one tool cannot solve this on its own; we have to put things together. Those tools are Apache Spark — which is not really close to real time, but works with mini-batches; I'm going to explain this shortly — plus another, low-latency layer. There is no tool that effectively does per-event processing and batch processing together, so what typically happens is that people combine low-latency streams — streams that can be processed with as little delay as possible — with something that can also do batching. Let's take an example. Spark here is the replacement for the speed layer I was talking about, and it has capabilities that can essentially get rid of the batch layer — that's what we're driving toward. So what happens if we really want to solve a real-time analytics problem? This is what my data pipeline is going to look like: we have an input stream of data. It could be a click stream, people viewing movies, someone logging into the network.
All this data comes in, and Spark Streaming uses a concept called mini-batching. The idea of mini-batching is that it's not "wait for all the data to be saved before I start my job"; instead, as the data is coming in, you break it into windows — batches of, say, one second or five seconds. It turns out that's good enough for 95% of the cases out there; for the other 5% we can talk about low-latency alternatives. These mini-batches then feed into the engine where the actual work happens — this is where we write that pipeline code, the functional code, the dataflow-programming code — and it spits out the processed data, the results, which get saved somewhere else. Let's look at some code now. I'm going to run it first, and then we'll go through it. This is the example of using Wikipedia edits to predict that something interesting is happening. I'm running something called the backend, which hooks into Wikipedia's IRC stream and receives the edit events in real time, as people make them. If we're lucky, we might discover something very interesting happening right now — or maybe not, I don't know. So I start the backend, and now I'm starting my actual fast data job. What this is going to do is connect to that stream, and you can see the time showing up — this is my window, this is the mini-batch — and we're only showing the top 10 or 15 here. So there's some interesting change happening on this page, Joseph Vaughn. How do you read that? Within the last — I'll show you the window size — within the last couple of minutes, that particular URL has been updated twice already. I've seen cases where a page gets updated five or six times. And this runs continuously; the way these things are designed, it never ends. I'll keep it running while we look at the code a little, to give you the context of what's going on. So this is implemented using Apache Spark — another speed-layer example. Let's actually start from the end. On line 40 of the slide, I'm getting the context — some sort of context, the runtime, the engine. I haven't started my job yet. The reason I wrap it in getOrCreate is that if my job has run previously and checkpointed some data, I want to reuse it — it's a fault-tolerance mechanism. Now let's say we're starting this job for the first time. On line 17, I'm creating the streaming context and saying my interval size is one second. That means: every second, whatever the source is, take the last second's worth of data, pack it into one batch, and give it to me. And what is my source here? A socket-based stream. So for every second's worth of data, put it in a batch and hand it to me. On line 18, we set a checkpoint directory. Again, it's a fault-tolerance mechanism for the state I'm keeping, because this is a stateful stream operation — remember what's happening: every time I encounter a URL, I have to keep track of how many times I've seen it.
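A minimal sketch of that setup — not the slide itself; the checkpoint path, host, and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/wiki-edits-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("wiki-edits").setMaster("local[2]")
  // Batch interval of one second: every second, whatever arrived in the last
  // second gets packed into one mini-batch and handed to the engine.
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)                    // required for stateful/windowed operations

  // A socket-based source: the backend that tails Wikipedia's IRC feed
  // writes one edit event per line to this port.
  val edits = ssc.socketTextStream("localhost", 9999)
  // ... the transformation pipeline goes here (see the next sketch) ...
  ssc
}

// getOrCreate: if a previous run left a checkpoint behind, rebuild the context
// (and its state) from it instead of starting from scratch — the fault-tolerance part.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```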
Then we put in some sort of limit: if I see a URL getting updated five times or more within the window, something interesting is happening with that page. Once that is in place, the next bit of code will probably excite all the functional programmers in the room — this is my dataflow programming, my transformations. While the stream is flowing through the system, each element may contain any number of lines, so I split it into individual lines and flatMap it. Why flatMap? Can anyone tell me? Anyone understand this code? Exactly: we start with a collection, and each element in it can produce another collection — each incoming chunk gets partitioned into a number of lines. flatMap transforms and flattens the structure, so instead of ending up with a collection of collections, I end up with a flat collection of lines. In this case each line is the URL being edited, along with its timestamp. Once I have the lines, I do a map — a transformation that takes each string and turns it into a JSON object. Then, on line 24, I pull specific fields out of that JSON. I know this is Scala code and I don't expect everyone here to read Scala, but that's the general idea: it's all transformations, all pure functions. And another beautiful thing about Spark is that I haven't done any work yet — it's lazy; it defers the computation. Once all that is done, line 26 is where the most interesting bit happens: my stateful computation. Having extracted the elements I'm interested in, I say: reduce by key, where the key is the URL — every time you see that URL again, increment its count. And remember, this is a windowing operation: I'm only interested in the data from the last five minutes. I don't care if a page gets updated 20 times over a whole day; that's not interesting to me. What happened in the last five-minute window? So as data elements fall out of the window boundary, I use an inverse function to take their contribution back out of the count: as an element enters, I increment the count; as the window slides past it, I decrement. Make sense? Because only a specific time frame matters to me. Then I set the window duration — the size of the window — and finally I add a filter: don't show me everything, only keep the elements that are interesting to me, the ones with a count of more than two. In the real world you'd probably use 10 or 20. Everything else gets dropped. And still, all I've done is build my processing pipeline — a description. Nothing has actually happened yet. Finally, I perform what Spark calls an action — a side effect. In this case it's print, and that is what triggers all the computation. Any questions?
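Putting those pieces together, the transformation part might look roughly like this. It's a reconstruction with deliberately simplified parsing — the real code turns each line into a JSON object and pulls the page URL out of it:

```scala
import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.dstream.DStream

// `edits` is the DStream[String] built from the socket source in the previous sketch.
def topEdits(edits: DStream[String]): DStream[(String, Int)] =
  edits
    .flatMap(_.split("\n"))                      // each element may carry several lines: flatten them
    .map(line => line.split("\\s+").head)        // simplified "parse": take the page URL field
    .map(url => url -> 1)                        // pair each URL with a count of one
    .reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,                 // add as an event enters the window
      (a: Int, b: Int) => a - b,                 // inverse: subtract as an event falls out of it
      Minutes(5),                                // window size: only the last five minutes matter
      Seconds(2))                                // how often the windowed counts are recomputed
    .filter { case (_, count) => count > 2 }     // keep only the "hot" pages

// Everything above is still only a description. print() is the action that triggers
// the computation — called inside createContext(), e.g.: topEdits(edits).print()
```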
Show me 20 elements? The top 20 elements, sure. No — so the question is where I'm saving the data. This reduceByKey operation is stateful; that state is held in memory right now, and that's why you need a checkpoint directory: Spark will dump that state to the checkpoint from time to time. The DStream is the abstraction, the type — wow, that was loud; hello — those green boxes in the picture are a DStream; it's an abstraction over those chunks. One more time, sorry? Yes, totally — I'm doing an accumulation here; that's why I call it a stateful operation. As the elements fly through, I keep track of what's happening. And the tracking is limited by a duration; that's why I have the inverse function. I'm not keeping things forever — the counts keep getting reduced as the inverse kicks in, and the filter makes sure elements get dropped back out, so my memory doesn't grow without bound. topEdits is my process description, if that makes sense. It's not the data yet; I just defined my pipeline. It's a description — it's how I've encoded my program. You see what I'm saying? That's my program, my transformation pipeline. The data I'm keeping in memory is the data produced by that reduce function — the counters — the same data I'm printing here. It's a nice abstraction. And if you think about it, SQL is kind of a subset of these functional programming concepts — good old relational algebra: a WHERE is nothing but a filter, and the transformations you do in SQL are like a map. So if you're really interested in functional programming, the big data space is the place to exercise those skills, because it's all very math-based and functional — not to mention the kinds of problems you get to solve. Moving on — back to the code, actually. Under the hood, when I describe and build this topEdits, that's me building the process pipeline. Once it's built and I say print, the program gets shipped to my nodes: hey, time to run and execute this description. And this model gives you a nice abstraction over the cluster — run it wherever you like. Can it scale automatically? Yes, it can, based on the available resources — the cluster manager is an abstraction, and if you're using Mesos, for example, it can figure that out for you automatically. It is soft real-time, though. And that's an interesting question, because it's a good segue into my next topic. I have no control over the data that's being produced, am I right? This is a hot input stream coming in; people are clicking away on my website, and I cannot stop them. So the data keeps coming, and most of my hard work happens inside this engine — that's where topEdits, or whatever we just displayed, is running. Now, let's say my window size is one second again. Every second, no matter what happens, Spark Streaming keeps producing these chunks — it has a class called BlockGenerator that creates them.
Any idea what happens if, for some reason, I'm taking more than one second to do the analysis? You're producing data every second; I'm usually fast enough, but something goes wrong and I start taking more time to process the data you keep piling on. Because remember — and this is the interesting part — I have no control over the amount of data in each chunk. All Spark Streaming says is: give me the last one second's worth of data, I don't care how much there is. So sometimes my batches will have 10 rows, sometimes 100,000. That's one of the biggest problems, and it's why I said there's no tool out there that fully addresses this. So, any guess what happens if I slow down — if the producer is faster than the consumer? Right: things keep queuing up. But how much can I queue? That was one of the big issues Spark Streaming had: you would eventually crash, because you really don't have control over the producer pushing the data at you. It didn't have a concept of back pressure. Back pressure is a mechanism by which you control the flow of the data — you control the demand. There's a new kid on the block called Reactive Streams that's trying to solve this problem, which also tells you that streaming is a very hot topic in the big data space these days, and it's evolving fast. The idea behind Reactive Streams is to tackle this very, very common problem: I cannot keep up with the data coming in, but I still want to be real-time. Saving it to the file system and dealing with it later just takes us back to batch mode, right? Has anyone heard the phrase "code is a liability"? Naresh and I were discussing it a few days back: your source code is a liability — the more code you have, the more bugs you can have, and so on. The interesting evolution is that data is becoming a liability for lots of companies now. Yes, we've spent the last five years collecting data, but we have no idea what to do with it. That's also part of the evolution toward fast data: start with the question — what answer are we looking for — run the data through the pipeline, do the processing, and once we're done, maybe we don't even need to keep the raw data, rather than saving files for some later purpose. So, going back to this idea: I want to do stream processing, but I have to worry, because I don't control how the data is being produced. Reactive Streams takes that and says: there has to be a way to solve this — a way for me to communicate my demand, my flow, the amount of data I can handle, in a non-blocking fashion. It looks like this: there's a publisher — it could be any source — and you only send the subscriber as much data as the subscriber has asked for. This is how TCP works, by the way; we're borrowing that idea. The subscriber should be the one controlling the flow of data. Mostak gave a tremendous morning session here about reactive streams that digs into this idea.
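The Reactive Streams interfaces make that idea concrete: the subscriber signals demand with request(n), and a well-behaved publisher never sends more than what was requested. A minimal sketch of such a subscriber — the element type and the processing step are just placeholders:

```scala
import org.reactivestreams.{Subscriber, Subscription}

// The subscriber — not the publisher — decides how much data may flow.
class BoundedSubscriber(initialDemand: Long) extends Subscriber[String] {
  private var subscription: Subscription = _

  override def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.request(initialDemand)        // "send me at most this many elements for now"
  }

  override def onNext(element: String): Unit = {
    process(element)                // do whatever work we need to do
    subscription.request(1)         // only ask for more once we have capacity again
  }

  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("stream finished")

  private def process(element: String): Unit = {
    // placeholder for the real per-element work
  }
}
```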
And it goes beyond this model, because you cannot just say, "I'll build a pull-based streaming system" — meaning, how about the consumer simply pulls the data? It doesn't work that way. It almost has to be dynamic, because at some point your producer might be slower and your consumer faster. So it's a dynamic mix: an ideal streaming solution can switch between push and pull. Not to mention that you can merge data and streams together. I'll skip this slide. Let's look at another example, because this one is interesting to me. So far, yes, okay, we're doing a count on a running stream — what's the big deal, I'm incrementing a counter; that's not hugely interesting in itself. But this one is: here I'm observing my network traffic. Let's frame it as a problem, because that's probably the best way to explain it. Yes, I'm saving the traffic to some sort of log so I can mine that data later. But while I'm doing that, I have a specific requirement: for every network request coming in, I need to do some analysis on it. So the per-second batching idea will not work; the requirement is per-event processing — and eventing systems and streaming are very closely correlated here. Let's take a simple example. Every time someone logs in, I want to show it on a map in some dashboard: a login from China, a login from North Korea, a login from somewhere else. And beyond producing that, we also need to figure out: does this network request look valid, or is something weird going on? I want to do network intrusion detection — maybe this request looks odd. How can I do that while the stream of events is flowing? So there's another evolving space here: how do we take traditional machine learning algorithms — which typically work by feeding the computer a set of training data, enough that the model understands, okay, if you show me a picture of an apple, I can tell you it's a fruit, or that it's red, and then you run that model on real data to do the detection — and apply them here? The problem is, first, we don't even know what an odd request looks like. It has to be completely unsupervised: I cannot tell you in advance how somebody will hack me, because if I could, I would fix it immediately. Second, I cannot give you enough training data, because I don't have that data set — the system needs to evolve and learn automatically. So there's a whole new space where people are building online machine learning algorithms — algorithms that learn as the data streams through them. In this case, if we have to solve this problem in Spark Streaming, we have the usual suspects — checkpointing and all that — and then there is a solution called Streaming K-Means. K-Means is a clustering algorithm. We start with random centers, because I don't know what the clusters look like yet — just pick three random centers for me. And the way K-Means usually works out here is that all the normal HTTP and TCP requests will cluster together; they all look very similar, one big group. Roughly, it looks like the sketch below.
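A sketch of that, using Spark MLlib's StreamingKMeans. The feature extraction here is a made-up placeholder; in reality you would engineer real numeric features (bytes transferred, duration, ports, and so on) from each request:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.streaming.dstream.DStream

// Placeholder feature extraction: turn one raw request line into a numeric vector.
def toFeatures(line: String): Vector =
  Vectors.dense(line.split(",").take(3).map(_.toDouble))

def detectAnomalies(requests: DStream[String]): Unit = {
  val features = requests.map(toFeatures)

  val model = new StreamingKMeans()
    .setK(3)                        // three clusters, as in the "pick three random centers" example
    .setDecayFactor(1.0)            // weight all of the history equally
    .setRandomCenters(3, 0.0, 42L)  // dimensionality, initial weight, seed — random starting centers

  model.trainOn(features)           // the model keeps updating as the data flows through
  model.predictOn(features).print() // cluster assignment per event; the odd ones out are the suspects
}
```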
The request that's left outside that group is the odd one out. And then what you do is create your streams, and you finally say: K-Means, train on this stream as it flows through, and give me the prediction results — has any event happened in the system that looks odd? This is a very, very interesting space where people are trying to solve these kinds of problems. Well, that's what I've gotten to explore so far. I know it wasn't very polished — lots of loose ends and half-formed ideas — but any questions, thoughts, comments? Share your experiences. That's a great question; I'll repeat it. The question is: all right, I'm communicating my demand to my publisher — what does the publisher actually do? Well, the publisher could go to its own source and try to reduce the flow there; that's one possibility. Another common implementation of back pressure is simply to drop packets, or drop the oldest items in the queue. That's also very common. For example, in one case we solved this problem for a company that wanted to do real-time traffic analysis: I want to know what's happening on Hosur Road right now, because I'm leaving work. If the data has queued up and only gives me traffic information from half an hour ago, there's no point — there are SLAs in there. So dropping can be another answer. We didn't prepare a slide on this, but thank you for asking: we did contribute a back-pressure implementation to Spark Streaming — it's PID-based, though. I don't — but I will in five minutes. No, I meant that Spark Streaming can cover maybe 95% of the cases; for the other 5%, the very specific ones, there are other tools. What I am saying is: if I can find an answer in five minutes, why should I wait five hours? And what I'm seeing is that this may stop being an issue very soon. Lambda felt like a patch to cover the shortcomings of streaming solutions, but streaming solutions are getting smarter and better and better. If I can do everything in the speed layer, I don't have to worry about it. There was a great blog post about the drawbacks of the Lambda architecture, and an alternative proposal called the Kappa architecture. But again, this is partly vision, a wish list. I'll get back to you — it's probably my jet lag kicking in — there are others; we should talk, let's look it up. What is happening is that the machine learning package in Spark is one of the most actively contributed parts of the project right now; lots of people are contributing, so it's going to get better. Yeah. Well, in the case of Spark the story is a little better — maybe we should explore that idea; I don't have a complete answer, but in Spark there is a way to share the same code so you don't have to write it twice: there's a method called transform, with which you can reuse the same processing-pipeline code in both places. Yeah — that's a trend we're seeing as well, lots of interest in this streaming field. Well, thank you very much. I hope it was useful. Thanks. Thank you.