That was loud. Thank you very much for having me here. Absolutely thrilled, and I enjoyed both of those talks a lot. Before I start, just a couple of quick questions. How many people here in the audience have worked with streaming data before? OK, actually a few. And I assume for people coming to this meetup: how many of you would classify yourselves as developers? How many data scientists? How many people do machine learning? OK, architects? OK, a handful. Let's get started. Thanks very much for having me.

This is going to be quite a different talk, much less technical, up at a higher level where you're looking way down at all these different projects. I'm going to talk mainly about a big change that's happening across a lot of industries: a move toward what we're calling stream-first architecture. We'll talk a lot about what that is and why people are making this change, because it really is a big disruption in the way people are thinking about things, certainly in the way they're architecting solutions, and on a larger scale, in the way companies are making choices about the teams they want to build and what resources they want to put into them. I'm also going to talk a little bit about Apache Flink and some other emerging technologies that support this kind of work. I'm covering Apache Flink just to introduce something a little bit different; most of the work I'm talking about, people are executing with Spark and Spark Streaming, which are very widely used, and that's certainly something I run into a lot.

Just as a reminder, I'm absolutely thrilled to be here in Singapore. This is the second time I've been here; I was here for Strata last year. It's really a fantastic place, and I'm very happy we were given the excuse to come. I do consulting for MapR Technologies. I'm also an author, more recently for O'Reilly, and originally in this area writing about machine learning, about the Apache Mahout project, for Manning. Many years ago my original career was as a research biochemist, so I worked and wrote much more in that area, but now I'm fully involved in looking at big data and how people are using it. I'm not the person who executes and implements these things; instead, I look mainly at how people are using it and how they're changing it. My introduction to all of this was via machine learning, so that's something I still have a great interest in. I am a committer on the Apache Drill and Apache Mahout projects. This is some contact information, and I'll repeat it at the end. Now, I keep wanting to look at people on this side of the room, and somehow when I turn my head we lose the voice, so let's see if I can remember to do this.

OK, so the first question is: why stream? Why are people turning toward streaming data? And really the answer is that the work we do, especially as data scientists, gets the best results when the data we use is the best fit to the way life really happens. To put it very briefly, life generally doesn't happen in batches. Working in batches is a clever, often powerful, and very useful workaround. But there are situations where working with data as a stream of messages, as data coming from a series of continuous events, is a better fit: the closer you can get to that in the way you handle data and do your analysis, in many cases, the easier the work and the better the results.
But there are some surprising benefits when you begin to work with stream-based architecture and stream-based computing, and some of them actually have to do with situations that don't involve real-time or very low-latency events at all. We'll talk about that a little. Now, I think for a lot of people the first thing that comes to mind when they think about streaming data is the Internet of Things, IoT data, and there certainly is a lot of that out there: the use of sensors, where sensors are going, the amount of data being produced, and the techniques and platforms people are evolving to do that processing, some of it at the edge, to transport it back to data centers, and to share it among data centers. This is a tremendous area and a tremendous change in industry, and it is certainly a place where we see a lot of people working with streaming data.

I was recently in the UK; I was invited to talk at the University of Sheffield in Northern England. They have a campus of seven or eight centers that make up an advanced manufacturing research center, which I think is called the Boeing Advanced Manufacturing Research Centre. It's an absolutely fascinating place, really state of the art, looking at the way manufacturing is changing: fully reconfigurable factory floors, people being guided by smart tools and smart data, engineers and design people brought very close to the actual manufacturing process itself, and changes in the way testing is done across a huge range of manufacturing techniques. We were there and talked with them about how they're using data and some of these streaming techniques. But this is just one example; you see it across so many different industries.

This isn't just about real time and sensor data. It's also about the combination of ways you can use data from continuous events. Take the example of a big industrial setting, say data coming from sensors on a drilling rig, in an oil field or in water-well drilling, where the drills have an enormous number of sensors collecting data about different parameters at a phenomenal rate, so huge amounts of data are coming in. One thing you might do with that data is literally set up real-time threshold dashboards, so you're watching to make sure things don't get out of hand. But what's even more powerful is a form of machine learning called predictive maintenance. You have good long-term records about the maintenance history of all the different parts going into that equipment, long-term records about when there have been failures and what happened with them, and you begin to look for a potential warning signature, a signature that shows up before a catastrophic failure happens. The further ahead of time you can recognize that kind of signature, the more chance you have to step in, do proactive maintenance, and avoid those problems. So that combination, a good long-term history with huge amounts of records, plus being able to look at the pattern of data that's coming in and keep it as a kind of history rather than losing it right after you use it, is a very powerful thing, again across a number of different industries. But streaming is becoming mainstream, not just in these more exotic or very specialized settings.
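Just to make that failure-signature idea concrete, here is a minimal, toy sketch of the kind of check you might run as readings arrive. Everything in it is an assumption for illustration: the baseline statistics standing in for the long-term maintenance history, the rolling-window size, and the three-sigma alert rule are placeholders, not a real predictive-maintenance model.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration only: compare incoming sensor readings against a baseline
// derived from long-term history, and flag a possible warning signature
// before a failure. All numbers and the alert rule are made up.
public class FailureSignatureSketch {

    private final double baselineMean;   // e.g. learned from years of maintenance records (assumed)
    private final double baselineStdDev; // assumed value, not from the talk
    private final int windowSize;
    private final Deque<Double> recent = new ArrayDeque<>();

    public FailureSignatureSketch(double baselineMean, double baselineStdDev, int windowSize) {
        this.baselineMean = baselineMean;
        this.baselineStdDev = baselineStdDev;
        this.windowSize = windowSize;
    }

    /** Add one reading from the stream; return true if the recent pattern looks like a warning. */
    public boolean onReading(double value) {
        recent.addLast(value);
        if (recent.size() > windowSize) {
            recent.removeFirst();           // keep only the most recent readings
        }
        double mean = recent.stream().mapToDouble(Double::doubleValue).average().orElse(value);
        // Alert if the rolling mean drifts more than 3 standard deviations from the historical baseline.
        return recent.size() == windowSize && Math.abs(mean - baselineMean) > 3 * baselineStdDev;
    }

    public static void main(String[] args) {
        // Pretend vibration readings for one drill part (all made up).
        FailureSignatureSketch detector = new FailureSignatureSketch(10.0, 0.5, 5);
        double[] readings = {10.1, 9.9, 10.2, 12.5, 12.8, 13.1, 13.4, 13.6};
        for (double r : readings) {
            if (detector.onReading(r)) {
                System.out.println("possible warning signature at reading " + r);
            }
        }
    }
}
```

The point of the toy is only the shape of the problem: a long-term history gives you a baseline, the stream gives you the incoming pattern, and the earlier the drift is caught, the more time there is for proactive maintenance.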
People have looked at using streaming data for a long time, but they've used it for very specialized projects; they don't think of it as the central part of the organization, and that's really the thing that's changing. So I'll use an example of something as common as a web-based business that's looking for real-time insights, maybe trying to update a real-time dashboard, looking at clickstream data; you can imagine how widespread these examples are. Again, these are often the first reason people come to look at streaming data and at how you deliver that data to the processing parts of the architecture. So I've taken this example from a web-based business, where you're looking at data coming in from a number of different sources, maybe log data. You have to have some sort of transport mechanism to get it to your processing, and in this case that processing might be updating a real-time dashboard. That's what typically brings people to streaming data in the first place.

But in fact there are a lot of other reasons to use streaming data beyond those real-time applications, and that comes as a surprise to a lot of people. Take the same example and expand it. If you have the right kind of stream transport and the right kind of architecture built up, you get other classes of use. Rather than using that streaming data for some very low-latency application and then discarding it, although you may want to do that too, if you have a way to persist it, for however long you choose, it gives you a medium- or long-term, event-by-event history, a long-term auditable log; that would be at the bottom of this figure, class C. Another way you might use that data is to materialize out of it a current state of the world, some sort of current status; in this case the example is archival data, or maybe a customer-360 database, and those might be stored in a search document or in a database. So there are situations where you actually need the event-by-event history of what's going on, other situations where you want immediate, real-time insights, and still others where you want to pull out an aggregate of the current situation.

If you have the right sort of architecture for this and the right kind of stream transport, you find that at the heart of these systems it's the stream transport technology itself that makes the difference, that lets you set up this sort of situation and take advantage of all the different ways you want to use streaming data. In this figure I've switched to a medical example, but the patterns are really the same across a wide range of verticals, and at the heart of it is the stream transport technology. OK, so what kind of stream transport do we consider to be, quote, the right kind to support this kind of stream-first architecture? At this point, I think it's more important to look at the capabilities rather than at an individual technology or tool, because there are always new technologies being developed. But at the moment we find that Apache Kafka, which is an open source technology, fits. How many people in here have used Kafka or are familiar with it? A large number, yeah, very popular.
We find that Kafka has really excellent characteristics and capabilities. I do consulting for MapR; MapR Technologies makes a large-scale distributed computing platform, a converged data platform, and as part of that converged platform it has a stream transport technology called MapR Streams that uses the open source Apache Kafka API. So you can take a similar approach with either one to build this kind of stream-first architecture. MapR Streams has a few capabilities that go a little beyond what Kafka does, but at this level of thinking about how to build an architecture like this, they're really quite similar. So in some ways, at this point in time, I think there are two buckets: one is the Apache Kafka and MapR Streams bucket, in terms of what kind of stream transport technology will support this, and the other bucket is everybody else.

Let's look at why. One of the biggest differences is that you want a stream transport with great performance but also persistence. With some of the older technologies you get one or the other, and there's too much of a trade-off between them to support this kind of work; at the very large data volumes we're talking about, you really need to do both well. Your stream transport needs to support multiple producers and multiple consumers, but most importantly, they need to be decoupled. Why decoupling matters so much is that you don't want a situation where you're broadcasting data to the consumers; rather, each consumer pulls the data, so the consumer doesn't even have to be online. In fact, it doesn't even have to exist yet. You can write it later; you can add a consumer that didn't exist at the time the data was delivered. So you want transport that can deliver data and have it available immediately for use by consumers, but they don't have to use it immediately; it's still there later. That's why you need the persistence as well as the high performance. And you want independence, so that as you add a new consumer, it doesn't affect what's happening with the other consumers. It sounds like one simple thing, but it makes a world of difference in supporting these new ways of architecting solutions.

It's very helpful if you have the ability to configure the time to live for those messages, so that the persistence is under your control. That's true for Apache Kafka, and it's also true for MapR Streams. It's slightly different with MapR Streams, because Streams is part of one converged technology; it's in the same system as the distributed file system and as the NoSQL database MapR-DB, all running on one cluster. In that case the partitions of a topic in MapR Streams are distributed across the entire cluster, and that means it's actually practical to set the time to live at years, so it's practical to use the stream as a really long-term auditable history. That part is a little different from Kafka, but the basic principle is the same, and for a lot of use cases you could use either one. One capability that is very useful for certain use cases and is unique to MapR Streams is geo-distributed replication: you can do direct stream replication across data centers.
And it does this by maintaining offsets. What that means is that you're actually sharing the same data, and you maintain the knowledge about the sequence in which things happened, even as you replicate it across data centers or between clusters. This can happen on premises, it can happen from cluster to cluster or data center to data center within cloud computing, and it can happen from on premises to cloud. So this is a capability of MapR Streams that opens up a lot of use cases, situations people hadn't really thought about because that capability hadn't been there before.

OK, we've talked about this: absolutely the key idea here is multiple producers and multiple consumers operating in a decoupled fashion. That's probably the single most important thing to keep in mind. And as a reminder, which I think I've already said, part of why MapR Streams functions slightly differently is that it is part of that single technology; it's basically the same code as the distributed file system and the NoSQL database MapR-DB.

Now, the basic way that Apache Kafka and MapR Streams work is really the same. You have multiple producers, and you assign data to a topic; you can have many different topics. In the case of MapR Streams you can have many more topics, hundreds of thousands or even millions, which goes to an extreme and can give you a much more fine-grained way of dealing with data, but the basic principle is the same. You assign data to a topic, and you name the topic, which makes it very easy to keep track of what you're doing. Topics are then broken up into partitions, which helps with load balancing. In the case of Kafka, a partition tends to live on a single machine, and normally you would be running a separate cluster for Kafka, so you have your stream transport on one cluster and the systems you're using for stream processing on a different cluster. If you're working with MapR Streams, you're doing both of those on the same cluster: maybe you're using Spark Streaming and MapR Streams together, all on one cluster. MapR actually ships with the entire Spark stack. But whatever tool you're using for the stream processing, it's all done together.

A difference here, especially because you can use such a larger number of topics with MapR Streams, is that there is an object, which I think is not named terribly well because it's hard to keep track of, called a stream. It's a first-class object in the file system, and it's really a collection of topics. It's at the stream level in MapR Streams that you set policies such as time to live and replication across different data centers. You also have very fine-grained control over who has access, and that can make a huge difference where you have multi-tenancy, multiple stakeholders even within your organization, and you want to control who has access to which streams of data. MapR Streams has access control expressions that are set at the stream level.
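To make the "configurable time to live" and "consumers don't have to exist yet" points from a moment ago a bit more concrete, here is a minimal sketch using the open source Apache Kafka Java API, the same API that MapR Streams speaks. The broker address, topic name, retention period, and message contents are all assumptions made up for illustration; with MapR Streams you would set the time to live as a policy on the stream object rather than as a per-topic config.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: create a topic with an explicit retention (time to live), then
// publish events to it. Consumers are deliberately absent here: because the
// messages are persisted for the retention period, a consumer written and
// started later can still read everything, which is the decoupling point.
public class SensorEventPublisher {

    public static void main(String[] args) throws Exception {
        String topic = "sensor-events";          // hypothetical topic name

        // Create the topic with a 30-day retention; this is the Kafka topic
        // config "retention.ms", i.e. the configurable time to live.
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(adminProps)) {
            NewTopic newTopic = new NewTopic(topic, 3, (short) 1);   // 3 partitions for load balancing
            newTopic.configs(Collections.singletonMap(
                    "retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(newTopic)).all().get();
        }

        // Publish a few events; any number of consumers, including ones that
        // don't exist yet, can later read them independently of each other.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>(topic, "rig-42", "{\"temperature\": 88.1}"));
            producer.send(new ProducerRecord<>(topic, "rig-42", "{\"vibration\": 0.73}"));
        }
    }
}
```

Nothing in that code knows anything about who will read the data or when, which is exactly the decoupling being described.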
Now, a big trend in industry is to think about the power of having an organization, or a part of an organization, work in a microservices style of approach, and this has really paid off for a lot of companies. If you're working with a startup, a new company, it's a very nice idea to design from the ground up in this style. If you have a very large organization designed in a more monolithic way, it's a little more daunting to think about changing over, but a lot of companies are doing it because microservices has paid off so well. What is surprising to a lot of people is that working with streaming data, having stream transport at the heart of your architecture, is actually a way to support a microservices approach. And this is true whether you use Apache Kafka or MapR Streams; the principles are the same, and they're very broadly applicable across different industries.

So let's take just this example. Our last speaker was talking a lot about how machine learning is done, the iterative process of using training data and testing data, evaluating, deploying. Something she didn't directly say, but I'm sure she meant, is that evaluation doesn't stop at the point where you feel your model is good enough and you deploy it. It's a continuing process. Even after a model is in production, you want to keep evaluating, because changes can happen in the model itself or in the code, but more importantly, you're interacting with the real world, and the data, the situations you're reacting to or learning from, change as well. So you may need to update or change that model, or you come up with a better idea, a better algorithm, a better source of data, and you want to switch to another model. It's very important to have that ability.

Now take this single example of a system to detect credit card fraud. This one is based on the idea of detecting card velocity, meaning you look at the history of transactions for a particular card and where the last transaction was done, and you have a model that can assess whether it's reasonable to think that I can be buying an ice cream here in Singapore when, about ten minutes earlier, I bought a pair of sunglasses in San Diego in the US. Not likely, OK? So you've developed a model; that's our fraud detector. In this case you're looking at a call-and-response sort of system for the data coming in. You would normally store the last card use in a database, and you would have a process running that serves as an updater, updating that database as new transactions happen, so you're comparing that historical data to the new transaction to make the decision.
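As a rough illustration of what that card-velocity check might look like, here is a small sketch. The transaction fields, the distance formula, and the 900 km/h "faster than a passenger jet" threshold are all assumptions made up for this example, not the model from the talk.

```java
// Toy card-velocity check: flag a transaction if the card would have had to
// travel implausibly fast since its previous use. Thresholds are made up.
public class CardVelocityCheck {

    static final double MAX_PLAUSIBLE_KMH = 900.0;   // roughly a passenger jet

    static class CardUse {
        final double lat, lon;
        final long timeMillis;
        CardUse(double lat, double lon, long timeMillis) {
            this.lat = lat; this.lon = lon; this.timeMillis = timeMillis;
        }
    }

    /** Great-circle distance between two points, in kilometers (haversine formula). */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** True if getting from the last use to this one would require impossible speed. */
    static boolean looksLikeFraud(CardUse last, CardUse current) {
        double km = distanceKm(last.lat, last.lon, current.lat, current.lon);
        double hours = Math.max((current.timeMillis - last.timeMillis) / 3_600_000.0, 1e-6);
        return km / hours > MAX_PLAUSIBLE_KMH;
    }

    public static void main(String[] args) {
        CardUse sanDiego = new CardUse(32.72, -117.16, 0L);                 // sunglasses
        CardUse singapore = new CardUse(1.35, 103.82, 10L * 60 * 1000);     // ice cream, 10 minutes later
        System.out.println("fraud suspected: " + looksLikeFraud(sanDiego, singapore));
    }
}
```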
What we've drawn here is a modification of the way this is more traditionally done. Instead of the fraud detector speaking directly to that updater system to update the database, and instead of thinking of the database as the universal, central source of knowledge for a lot of different parts of your organization, we've made a change: suppose you add a data stream, whether that's Kafka or MapR Streams, same idea exactly. We add a data stream, and we pull it outside the box to remind you that the box is not a physical thing; it's a project with a specific goal. The data coming from the fraud detector goes to a stream, and that stream is shared by a number of different consumers, so now different groups, different processes, have access to that same data. They're not limited by how you process it, change it, or aggregate it for this particular database. The database is no longer the central source of knowledge for a bunch of different groups; it becomes a local source of knowledge, and the data stream becomes the shared piece. I think you can see how that fits the way you want to work in a microservices sort of approach.

In this case, you're not only sharing data between different groups, where somebody may be analyzing something entirely different; you're also keeping the data in a more raw form. You don't always know what you're going to want to know later, so having access to less-processed data can be a really useful thing. And you're also able to iterate your own models. When you decide later that you want to test a new model, you're testing it against the same data. You can set up a second box like this, running the new fraud detector that you're testing offline, while you keep this one functioning online until you want the new one to go live, as the sketch below shows. So this provides a lot of flexibility, and it's making a big, big difference in the way people work.
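Here is a minimal sketch of how that sharing falls out of the Kafka API (and therefore the MapR Streams API): each downstream service subscribes with its own consumer group, and each group independently receives the full stream of transactions the fraud detector published. The topic name, group names, and broker address are assumptions for illustration only.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Two independent services reading the same shared transaction stream.
// Because they use different group.id values, each one receives every
// message; adding the second service does not affect the first.
public class SharedTransactionStream {

    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");  // a consumer added later can replay retained history
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("card-transactions"));  // hypothetical topic
        return consumer;
    }

    public static void main(String[] args) {
        // Service 1 keeps the "last card use" database up to date.
        // Service 2 feeds the same transactions to a candidate model being tested offline.
        // In practice these would be separate processes; shown together here only for brevity.
        try (KafkaConsumer<String, String> dbUpdater = consumerFor("last-use-updater");
             KafkaConsumer<String, String> modelEval = consumerFor("candidate-model-eval")) {
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> forDb = dbUpdater.poll(500);
                for (ConsumerRecord<String, String> r : forDb) {
                    System.out.println("update last-use DB for card " + r.key());
                }
                ConsumerRecords<String, String> forModel = modelEval.poll(500);
                for (ConsumerRecord<String, String> r : forModel) {
                    System.out.println("score with candidate model: " + r.value());
                }
            }
        }
    }
}
```

The design point is that the new consumer group could be added months after the data started flowing; neither the producer nor the existing updater has to change.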
Here's an example of a situation where the capability to replicate data across data centers can really pay off. I was here in Singapore last year for the Strata conference, the first time I'd ever been to Singapore, and I was very impressed. So I actually used the example I'm about to show you as the last chapter of a short little book I recently wrote about Apache Flink. It's either a very long report or a very short book; it's called Introduction to Apache Flink, and I wrote it with Kostas Tzoumas, who is one of the founders of Flink and the CEO of a company headquartered in Berlin called data Artisans. I'll tell you a little more about that book later, and a lot of the examples I'm about to talk about come out of it. I'm sorry, I'm on the wrong book; this example is actually from the Streaming Architecture book, the previous one, which I wrote with Ted Dunning from MapR. It talks about a lot of the principles we've covered here with stream-first architecture, and this example is the last one in it. That book did not use Flink; I apologize for that.

OK. So here in Singapore, I was very impressed when I saw your port and saw the phenomenal amount of container shipping done from here. So we use this situation to show how you could use streaming data, picking up data from sensors in containers on ships, and how this kind of model could really pay off. It's a slightly different way to think about using this data. You have multiple stakeholders: whoever owns the shipping company; whoever owns those containers, which may be the shipping company or someone else; people may lease space on the ship, so you have containers from multiple companies; and whoever owns the goods, the products actually being shipped. You're looking at these stakeholders in a number of different situations. The little box here represents a small cluster that would actually be an onboard cluster. In this case we are talking about MapR Streams, because this replication capability is unique to Streams. So you're picking up data from sensors in the containers. They may be measuring humidity or temperature, or they may just be indicating that the container exists on that ship at a particular time.

While the ship has been at sea, that information has been stored in Streams on this small onboard cluster. When it comes into a port, it forms a temporary link with an onshore cluster and transfers that information back to headquarters, wherever headquarters may be, maybe in Greece, maybe back in London; they want to know. Somebody else manufactured the toys that are in the container, in some other part of the world, and they want to know what's happening with their product. You can also see why, in this situation, it's very important to be able to control who has access, through those access control expressions set at the stream level. In this particular case, not everybody should be able to see what's happening with everybody else's containers or products, while the shipping company obviously needs to be able to track it all. So the ship comes into port, you make a link, you transfer that information, the ship offloads some containers, loads some more, and starts heading for the next port, which in this case happens to be Singapore. Before it ever gets there, the onshore cluster, in our example in Tokyo, copies the data in that stream to an onshore cluster in Singapore. So in Singapore, the shipping owner, or maybe a port authority, already knows what's coming: they know what's on the ship and they know what to expect. While the ship is at sea it may not be easy to have a direct uplink, but they're still collecting information on the onshore cluster about what's happening while the ship is at sea. The ship comes into port, there's another temporary link, and the copy happens again. We have a little joke at the end: when it gets to Sydney, some containers fall off the back of the boat. Sometimes you need some very real-time information about what's going on from the sensor data; maybe the humidity goes up pretty high for those containers, you know something happened, you know they're not there.

This kind of pattern applies to a lot of different industries, not only different kinds of transportation but a lot of other situations, including what goes on in telecommunications and in sharing information across data centers, say for people in the ad tech industry, where the ads themselves are a shared inventory that has to be updated very fast, and you have data centers located in different places that need access to the same data. Telecom is an area that obviously uses streaming data at huge levels. They also need very responsive machine learning models, particularly anomaly detection, so they can look at changes in usage. The previous speaker, Juliet, talked about what goes on in telecom. Her model was analyzing churn, and one of the things that can cause churn is obviously poor, interrupted service. One of the situations we see is this: you have callers whose mobile phones are all interacting with towers, and the towers are trying to send data records back to a central center, a huge amount of data, which often has to be processed in very short order, especially for these anomaly detection models. So in this case we have a situation where you have multiple towers and groups of people interacting with the tower nearest them.
Suppose there's a sporting event or a concert at one location; suddenly you have a crowd around that area, a lot of people are communicating, maybe they're tweeting or whatever they're using their phones for, and suddenly they're not getting very good service because they're overwhelming that tower. In that situation, what you want to be able to do is detect that very quickly, while it's happening, and actually tune a tower to take part of the load. But in order to do that, you have to be able to handle huge amounts of data, have your machine learning model work well, and communicate across different data centers, and you need to do it without the kinds of delays that happen when you process at a number of different levels. These are real examples from telecommunications: the data collection and handling happen at many different levels, and if you're doing that by batch, you can have delays of maybe 30 minutes at each stage. By the time you do a couple of jumps, you can see somebody saying, "I don't have very good cell phone coverage right now," while the event is happening. That's not going to be a very happy customer; they're going to be one of the ones our previous speaker has to analyze in her churn model, and you want to keep customers happy. If you have a way to tune the cell phone tower, and you have a model and a way to handle and move data across data centers fast enough to do this as a streaming system rather than in batch, you can reduce that 30 minutes per level of data processing down to a few seconds or sub-second, and that means you can suddenly respond to events as they happen. You can imagine a lot of different situations where this really pays off, not just in telecommunications, but that industry is absolutely classic for why these sorts of approaches are needed.

Now, I've been talking about the stream transport, and everything I've said up to now would be useful whether you're using Spark, Spark Streaming, Apache Apex, or Storm; and now I'll talk a little bit about Flink. It's not about which processing engine you're using; it's about how you set up the system and the architecture, and how you deliver the data to those consumers. But let's take a moment, switch, and look a little at the consumers. We'll skip through a lot of this because there isn't enough time to go into much detail. Now, I'm talking about Apache Flink because I think it's a very interesting project. It's certainly not as well known here in Asia as Spark. How many people here have heard of Flink or have used Flink? Heard of Flink first, show of hands; everybody. How many people have actually used it? Very few. Well, the fact is, Flink is very big in Europe; it's much better known there. The project originated similarly to the way Spark started from the AMPLab at the University of California, Berkeley: Flink started from a project called Stratosphere, which was at a number of research universities but was, I think, most centered in Berlin, and as a result people there know about it more. When it came into the Apache Foundation, the name had to change because of a name conflict, and so it was named Flink, which I'm told in German means agile or fast, which this processing is. And they picked as their logo a squirrel.
They picked a squirrel of a very unusual color, because it turns out there's a rare kind of squirrel in Berlin that is an incredible, bizarre shade of bright orange. So it's not just an odd logo; it's also an odd squirrel. This project came in with an already very large international community of developers and users. It came into the Apache Foundation and very quickly reached top-level Apache project status. It is actually being used in production by a number of different companies. Here in Asia, the only one I know of is Alibaba, which is using a derivation of Flink, but it's being used by a number of different companies, and a lot more people are experimenting with it. There is a company called data Artisans; have people heard of data Artisans? Yeah. data Artisans was founded by several of the people who originated Flink and are still working on Flink. The company has people in a number of different countries, but the headquarters are in Berlin. Flink has its own conference, called Flink Forward, held in Berlin each year; in 2017 there are going to be two Flink Forward conferences, one in Berlin and one in San Francisco. If you go online you'll find a lot of good information; data Artisans has a really excellent blog. MapR has a few resources too, including the book that I wrote, which they make available for free, and a lot of the content I'll point out to you is what's in the book. I'll echo our first speaker and say, you know, go and buy my book. In fact, this is an O'Reilly book and O'Reilly sells it; you're welcome to go buy it, and that's nice for me, I would get a royalty. But I will let you know that MapR makes this book available for free, and the Streaming Architecture book for free as well. If you go to MapR and you're willing to sign up and put your contact information down, you can download any of these as a free PDF. Both books are also available from MapR to read online if you don't want to download them.

So we talked about the fact that Flink is the processing step. You would still have something like Apache Kafka or MapR Streams to transport the message stream to this processing step. If you go to the book, you'll see some of the differences among the major choices for stream processing. Flink is true real-time stream processing, like Apache Storm, an older project. Apache Spark Streaming has taken a different, very clever approach, where it's actually batch processing: it does micro-batching, basically on the idea that if you keep cutting the batches small enough, it approximates true streaming. And for many, many use cases those latencies are sufficient to meet the SLAs you need for that project. But there certainly are projects where they're not. In the past, a lot of people have set up a combination of something like Spark and Spark Streaming to cover most of their work, and used something like Apache Storm for the really real-time, low-latency systems. Now people are beginning to look at replacing Apache Storm in that combination with Apache Flink. Flink has some advantages over Storm: it's much more developer-friendly and a lot easier to use. And in fact, you can use Apache Flink all the way across those latencies, because it can also work in batch. Chapter three of this little short book talks about different kinds of correctness.
And so I'll just direct you to take a look at the book; I think you'll find it interesting. The later chapters Kostas wrote, and he goes into a lot more technical detail on these topics, but just as a quick overview I'll mention a couple of them. These are some of the topics covered in chapter three, where we talk about different forms of correctness: a natural fit for sessions, and event time versus processing time, which I'm going to touch on briefly. Other issues are being able to have accuracy after failures, and whether you have a stateful system, which Flink is. "Answers when they matter" mainly means that if you can deliver at extremely low latency, in some cases that's what you need in order to have a correct picture of what's going on in the real world. And if you have things that are easier for developers and easier to maintain in operations, you're basically less likely to have errors, and that's a different, larger way to look at correctness.

Just a touch on windowing. Flink supports windowing in a number of different styles, but this is a quick comparison of the sort of issues you run into. If you do windowing from a micro-batching approach, where the horizontal blocks in this figure are events in the real world and the dotted lines show what you get when you slice this into micro-batches, there's really never going to be a time when you don't overlap one real-world event with another; it's very hard to find a clean separation between events. On the other side, with the gap approach, you see one option for how windowing is done in Flink, where you can define the window session by the gap between events, so you get a clean spacing that reflects what's actually happening in the real world. Again, it's a better fit between the way events actually happen and the way you're handling them in your computing. So take a look at the book, and there are also some good blogs on the data Artisans site that talk about windowing.

This is a reminder, in case people don't know the difference between event time and processing time. We use the example of the Star Wars movies: there's a difference between when the events in the story happened and when the movies came out, and that tells you the difference between event time and processing time. There are situations where event time will give you much more accurate results than processing time. There isn't time to go into the detail of this particular example, but there's a very good online video by Jamie Grier, who works for data Artisans, and I think if you go to the book you'll find the link to it, so you can actually go through the example he did. We also have some quick little whiteboard walkthroughs on the MapR site. These are done by Stephan Ewen, who is the CTO of data Artisans and, again, a co-founder of Flink, and he talks about using event time. He has another one where he talks about using savepoints, which are related to checkpoints, and how you can maintain stateful processing and reprocess data for bug fixes or for different deployments and so forth. They're very short videos, about five minutes long, and a good little introduction to these topics and how Flink handles them.
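Just to ground that, here is a minimal Flink sketch of the two ideas together: event time, and session windows defined by a gap between events. It is not from the book; the fifteen-minute gap, the five-second out-of-orderness allowance, and the made-up input elements are all assumptions, and in a real job the source would be a Kafka or MapR Streams topic rather than fromElements.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Sketch: count events per user in event-time session windows, where a
// session is closed by a 15-minute gap with no activity for that user.
public class SessionWindowSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use event time (when things happened) rather than processing time
        // (when the job happens to see them).
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // (userId, eventTimeMillis), a stand-in for a Kafka / MapR Streams source.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("user-1", 1_000L),
                Tuple2.of("user-1", 65_000L),
                Tuple2.of("user-2", 2_000L));

        events
                // Pull the event-time timestamp out of each record, allowing
                // events to arrive up to 5 seconds out of order.
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> e) {
                                return e.f1;
                            }
                        })
                // Turn each event into (userId, 1) so we can sum counts per session.
                .map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(Tuple2<String, Long> e) {
                        return Tuple2.of(e.f0, 1);
                    }
                })
                .keyBy(0)
                // The session for a key ends after a 15-minute gap in its events.
                .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
                .sum(1)
                .print();

        env.execute("session window sketch");
    }
}
```

The window boundary here is driven by the data itself (the gap), not by an arbitrary batch boundary, which is the contrast the micro-batching comparison was making.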
Apache Flink does not ship with MapR, but it does run on MapR; Storm, I think, still ships with MapR, and the full Spark stack ships with MapR. Apache Flink has been tested and benchmarked on MapR to see how it works with MapR Streams as compared to how it works with Apache Kafka. This is a quick diagram of how the test was set up, and it's very straightforward: Streams is used for the stream transport and Flink for the processing. This was actually an extension of the Yahoo benchmark that was used earlier to test Flink against Storm, both supported by Kafka. They hit a bit of a problem: Flink performed better than Storm, but the whole system didn't run as fast as it should have, and that came down to a network bottleneck, because again you have two different clusters, one for the stream transport with Kafka and one for the processing. One way to fix that would be to set up better networking just for the benchmark, but the people who did it used a completely artificial workaround: they stopped using Kafka and wrote an artificial data generator running on the same cluster. It's not real, but it's a way of asking how much Flink itself can process. Sometime later, the benchmark was done with Flink on MapR Streams, and there you don't have that network connector; you have a real system where you're actually taking data and transporting it to the processing system, as you would in a real-world situation, and Flink performed very, very well on MapR Streams, actually faster than it has on anything else.

So that sums this up: streaming is a good approach because it's a better fit for the way life happens. I think you'll have access to the slides, and I'm certainly happy to make them available. Here are a number of different resources you might find useful; several of these are short videos or tutorials, and there are a number of books on machine learning and other topics. Again, O'Reilly sells them all, and MapR makes them available for free download. I look around this room and think that soon maybe I won't have to put up the slide where I say please support women in technology, because I see a lot of women in the room, and that's really good news. And I thank you very much for having me here.

Yes, Flink does handle back pressure; I'm sorry, the question was how Flink handles back pressure. I think we talk about it in the book, and it's definitely discussed in a blog on the data Artisans site, but one way you can find out is that I have exactly one copy of the book here, so if you'll come up, take a read through it and see what you find. All right. Thank you very much. Yes. Okay. So I'm going to give you a very simple answer, and if you need a deeper answer, I have two of my colleagues here who can probably answer the question even better than I can. If I heard it correctly, the gentleman is saying that, especially in the financial industry, there have been examples like this of using streaming data for a long time, with other ways of delivering or transporting the data, so why take an approach with something like Kafka, or something Kafka-esque like MapR Streams? There are a number of different reasons, but I think at the heart of it, probably the single most important one is that trade-off between persistence and performance.
We're talking about very large amounts of data that have to be handled very, very rapidly, and that's not entirely unlike the systems you're describing; but some of those, things like high-performance trading, you wouldn't do on a system like this. Those are very specialized systems; they're hard to build, they're expensive, and they're used by big organizations. These are systems that can be used in a much more widespread fashion. They do give very high performance, but they also don't trade off performance for persistence. In a lot of the older systems, depending on your use case, you would pick one transport or the other depending on which was more important to you. These are situations where you need both very good performance and persistence, and the persistence, again, supports this microservices approach: it lets you have those long-term auditable logs, it lets you add consumers after the fact, and so these systems become much more generalized. The other reasons have to do with cost: these are systems that make this kind of work very cost-effective across a number of industries. So I think that's part of why we see these changes happening. Did you have anything you would add to that? Oh, he says it's good. Okay. Anything else? Yes.

Is it in a different phase than before? Yes, the short answer is yes, and it's different in several ways; see if this fits your experience. I think in the past, people saw working with real-time applications as a sort of specialized project, a specialized part of what a larger organization or industry might be doing, and they would have some way to transport or deliver the data. I mean, you could basically just use the streaming data directly, but that's probably a bad idea; if for nothing else, you want some sort of safety queue, something upstream to collect and deliver the data. So the basic idea is the same. But the question is whether you use it across a whole industry, and that's what I meant about streaming becoming mainstream. I think it's partly a difference in the technology and partly a difference in the data sources. People are now using streaming data from sources they wouldn't have looked at before; those sources were simply thrown away. There are more sources of streaming data in the sense that people are putting sensors on things they never put sensors on before. So it's a shift not just in the technology, but in who's using it, who's aware of it, and who begins to think it's a useful way to do their work. These technologies I talked about can be useful even in an isolated project, whether you take a microservices approach or not, where you need real-time results and real-time analytics, and that is usually the real driver that gets people to look at them. But we are beginning to see that people recognize there's a larger benefit if they design around this stream-first architecture. So it's a shift in thinking. And it's the same kind of shift as, for example, with MapR Streams, and the same with the MapR NoSQL database, where the technology now allows people to set up tables and set up streams with much less administration and less burden on IT. You don't have to have a big committee that decides whether you can do this.
You don't have to have a big committee that decides whether to change the way your table is set up or where you use streams. And the last thing, again, is this shift even in how people use databases versus streams. In some cases, you're actually using the stream in the way that traditionally you would have used a database, and by making that change, you have an enormously larger amount of flexibility. So it is a change in the technology, but in some ways it's more a change in the thinking of the people who are using these systems, recognizing that they can now apply this in situations where they maybe wouldn't have before.

Just out of curiosity, may I ask what sort of systems you worked with for real-time processing? In e-commerce? Yeah, and you're using Kafka now? Okay. What do you use for the processing, or do you design your own? Okay. So she says she's working in e-commerce, she's already used Apache Kafka and is using it now, she's using Spark for the processing, Hive, you said Drill, or just looking at it, and Impala, the same sort of idea. Okay, great, thank you. All right, thank you all very much. I really appreciate it.