I have Nicholas Frankel, developer advocate at Hazelcast. Nicholas, hi, how are you? Hi folks, yeah, I'm fine, and you? I'm very well. I'm suffering a bit today because we're having all kinds of technical issues, but I can see you, I can hear you, so now all I'm going to do is hope that we can actually see your screen as well, and then I can celebrate. So far, so good, so let's try it: one, two, three, does it work? Yes, I assume so. If it doesn't work, if you don't see my screen, then please shout and I will try to do something about it.

So thanks for having me for this talk, an Introduction to Stream Processing. I'm Nicholas Frankel. I'm a developer, I've been a team lead, I've been an architect, I've held a lot of technical roles, but for the last two years I've been a developer advocate. I work for a company called Hazelcast. We have two products: the first is an in-memory data grid, and you can picture an in-memory data grid as distributed data structures, the most used one being the map. Distributed means you can shard it and replicate it among several nodes. The second product is Hazelcast Jet, which does in-memory stream processing, and since I will present it later, I will stop right there for now.

In this talk, I will go through the following points and lead you to their logical conclusion. The first question is: why do we do streaming? You might have heard a lot about streaming in recent years, and why is that? There are a couple of streaming approaches and we will see them, and then I will talk a bit about Hazelcast Jet, because I need an implementation. Once we have defined streaming, I want to show you a demo, so I need data; so I will have a short section about open data in general and about a specification called the General Transit Feed Specification. And finally, I will try to impress you with a demo of a map of all the trains in Switzerland in close to real time. So without further ado, let's start.

When I started working, as I mentioned, about two decades ago, SQL databases were everywhere. You had no choice. You went to your DBA and asked, hey, can I get a data store? Yeah, here is our company-approved SQL database. It could have been Oracle, it could have been SQL Server from Microsoft, it could have been anything, but it was the way to store data. And even at the time, there were issues with the way SQL databases store data. There are at least two use cases which don't fit that model nicely. The first is analytics. Imagine you're the director of a supermarket chain: you will probably ask your analytics data warehouse what all the sales were in the previous hour, and then you decide, okay, let's have a discount on bananas, because we have a lot of stock and we didn't sell enough, and if we don't sell them, they will go to waste. The other use case that doesn't fit the SQL model nicely is reporting. Reporting is when, for example, your bank sends you the balance of your accounts every month or every year. In both of those cases, we are only reading data. And that's the problem with SQL databases: they care about both reads and writes, and a lot of what they do is prevent you from writing bad data, or at least the design of the database should prevent you from writing duplicate data, because when storage is expensive, you don't want to duplicate data.
And even if storage is not that expensive, if you want to update data that has been duplicated across multiple records, you need to go to every record to update it. So you have those two ends of the spectrum: on one side you have normalized data, because you want your data to be updated cleanly, and on the other side you have denormalized data, because you want to read super fast. It's a balance between having correct data and having fast reads, and that's an issue with SQL, because it is heavily skewed toward having correct data. When you work with SQL, you know that when you design your tables you should uphold at least the first three normal forms. Then, when you do a query, you need to join, and as the number of joins over several tables grows, the performance of your query decreases. Then you have your constraints, because you want this column to only contain integers and that column to only contain timestamps. Again, it's a lot about having correct writes; reading is another matter.

So even at that time, we recognized that there were actors with different needs: actors who wrote transactional data and actors who needed analytical data. Using the same database for both doesn't make a lot of sense. For that reason, we invented ETL: extract, transform, load. You have one database, you extract the data, you transform it the way you want, probably you denormalize it and put everything into the same table, and you load it somewhere else. Then you've got a data store that you can read from and never write to. The data can be denormalized, but at least you've got very, very fast reads.

In order to run this ETL process, you needed something, and we call it the batch model, the batch process. If you have been working in IT for more than a couple of months, you have probably come across batches, because batches are everywhere. You cannot go to a company with a decent history and not find any batch. And batches have interesting properties. The first property is that in general they are scheduled at regular intervals. Most batches are not run at the press of a button; they run daily, weekly, monthly, yearly, whatever. They have another property: they take a certain amount of time to run. You might imagine that at the beginning, if you have a batch that runs every hour, you made sure that the data it needs to process takes no more than, let's say, 30 or 40 minutes, so you've got a buffer of 20 minutes or so. But over time the data gets bigger and bigger and bigger, and the buffer gets thinner and thinner and thinner. So batches have a lot of problems, and chief among them is that the execution time overlaps the next scheduled execution: hourly batches that take more than 60 minutes to run. You don't want to be in that situation, but believe me, you probably will be, or you already have been. Also, what about the size? It can be the size in memory, because with the batch model you load everything into memory. Or when you dump it, you probably denormalize the data, so it takes a lot of storage; what about the disk? And about the overlap with the next execution, well, imagine it runs hourly and takes 30 minutes.
But what if it fails just at the end? Then you need to rerun it, and we get back to the overlapping problem again. So there are a couple of solutions. The easiest one is: if it's too big, if it fails mid-execution, let's chunk it. Instead of loading everything, we make chunks, artificial chunks, and we keep a cursor and say, oh, this chunk has been processed, so let's process the next one. But if we design it like that, what about new data that comes in? It might end up in a chunk, it might not; it's an issue.

But mainly the problem is about scaling. At some point you will have problems of scale, and that's what the big data movement tried to solve. If you start having a lot of data to process, you will need to scale, and all the SQL databases that I know of cannot scale horizontally. They have one main node, one leader, that handles the writes. You may have followers, just in case the primary node fails: then a follower becomes the primary node and is tasked with the writes. But this is the bottleneck of the whole architecture. Big data tried to say, okay, let's forget everything about SQL because it's too much of a constraint, let's try to scale horizontally, let's design data stores that can scale horizontally. And then came Hadoop and MapReduce. There was also this philosophy around transactional guarantees: perhaps we don't need them anymore, perhaps we can handle them in other ways. I'm not sure it was very successful to say, hey, it's not the database's problem anymore, it's the developer's or the architect's problem; I'm not sure that worked out that well. But in the NoSQL world, it's a lot about schema on read: you just dump the data, and when you read it, it's up to the reader to make sense of the format it was in, instead of enforcing constraints and saying, hey, when this data comes in, it needs to look like this and this and this, whether with columns or even with documents.

And then there is this thing called an event, and we have been using events for ages. We use events when we interact with a graphical user interface, for example at the click of a button; if you have done graphical user interface development, you are probably aware of events. So the idea is: instead of having batches, let's make everything an event. It's not about clicking a button, but something happens, data comes in, it's an event, let's process it. And it has a couple of benefits. The first is that it's memory friendly: instead of having two or three gigs, or petabytes, of data to handle at one time, you have one little event, and even if you have many of them, each of them separately fits nicely into a small memory space. For the same reason, they are easily processed. And there is a benefit that is actually a side effect: instead of pulling the data at regular intervals, the data gets pushed to you, and then it becomes very, very close to real time. This is a side effect, but I think it's the main benefit, because now the target of this derived data nearly reflects your source of truth. Of course, with all the transforms and filters and whatever you applied, but it follows the source closely and can be used as a derived source of truth.
However, we need to change our mindset, from a finite data set (when you handle a batch, you say, okay, I will take the data from the year before, or from the hour before, and it's finite) to something that is infinite and virtually never stops. If you have a stream processing engine, of course it can fail, but you distribute it over the network, you set up many nodes, so even if individual nodes fail, the process keeps running. So you've got an ever-running streaming process, and the data it handles can be infinite; it can never stop.

I haven't mentioned stateful streams yet, but of course you probably need some degree of analytics, and whoever says analytics probably means aggregation. That's just the next step: yes, you handle every event individually, but you can also aggregate them, either on disk or in memory, to make analytics out of them. And then you've got windows: they can be sliding windows or tumbling windows. This is classical stuff in stateful stream processing. Actually, streaming is just like ETL, but distributed, with the same operations you could do before, the same transforms, like filters or mapping or whatever. You can read from as many sources as you want and dump into as many targets as you want. You can also combine streams together if you want, or you can enrich streams with reference data. In general, events try to be as small as possible, so they will only reference the ID of something, but when you push it to your target you will probably need the whole payload, so you will need to take the ID and fetch the referenced data from somewhere. That's part of the streaming process, and it works.

And it opens a whole new world because of this real time, or I shouldn't say real time, because that's actually untrue: you need time to process the data, you need time to go through the network, so I will say close to real time, or near real time. It opens a lot of doors, like near real time dashboards or statistics. Machine learning is very hyped right now, so you can also run this stream of data through an ML model, and your ML model can learn from real data in near real time.

If you have been in the enterprise for some years, maybe 10 or 15 years ago there was this new thing called complex event processing. The idea was that in the enterprise you would put an enterprise service bus, the applications would send messages to this enterprise service bus, and you could subscribe to those. Does that remind you of something? It's exactly the concept of streaming. The idea was that some application would subscribe to multiple events, and because those events happened in a certain order, in a certain way, it could infer some additional meaning from those individual events and make something out of them. That was called complex event processing; I don't know why, but it was never really successful. Streaming is actually the realization of that idea right now.

Now, when you have those events, you probably need to store them somewhere. And right now, Kafka is the king of event storage; they are, I mean Confluent is, sorry, the event data store company.
Also, I would just like to mention that there are alternatives, and among them is Pulsar, which is also an Apache project. So Kafka is a distributed data store that persists everything on disk, which can be a benefit or not, depending on your use case. And then you can have consumers that subscribe to topics. The good thing with Kafka is that it's up to the consumer to keep the cursor to where it last read its data, so you can have slow consumers and fast consumers reading from the same topic and it's not an issue, which is a really big benefit.

As I mentioned, you might want to store your data, your events, but sometimes it might not be a great idea to store the events, process them, store them again, process them, and store them again. Perhaps what you would like is to read them from a source, process them through all the necessary steps, and just dump the final results somewhere. For those reasons, for some use cases, an in-memory stream processing engine might be a better fit. There are a lot of them: of course there is Flink, there is Hazelcast Jet, and there are also cloud offerings, so you have Amazon Kinesis or Google Pub/Sub; every cloud provider has its own stream processing engine. Also worth a word: there is a project called Apache Beam that tries to be an abstraction layer over several of the above. The idea would be: you only deal with this abstraction layer and you don't care about the real implementation. I think it's a nice idea. Of course, since there is no standard yet in the stream processing world, it's hard to be a completely leak-proof abstraction; it's a bit tricky in places, but I think it's a good idea anyway.

So let me have a few slides about Hazelcast Jet, because I need to go into the details. It's an Apache 2 licensed open source project, it can be used as a single jar, and if you're a Java developer it's very easy to use; I will say a few more words about that afterwards. We leverage the IMDG to distribute the work and to form a cluster. The good thing about Hazelcast Jet is that it has a unified API for both streaming and batching, so if you are more used to batches and you want to move to the streaming world, you can first leverage the batch API, and with a single line you can cross over to the streaming world. Of course we've got an enterprise offering, but everything I will show you right now is completely open source and free.

Hazelcast Jet, and actually every streaming engine, has these two concepts of pipelines and jobs; they might be called by different names, but the concepts behind them are pretty similar. First, you write your pipeline. It can be declarative, or at least it looks declarative, but in general it's code. You state where you will be reading from, where you will be writing to, and all the steps, all the transforms and filters and whatever you want to do. Then a client sends this pipeline, this code, to the stream processing engine. The stream processing engine receives the pipeline, and because it knows its topology, it knows how many nodes it has, it distributes the work over the network, of course following some constraints: for example, the reading and the writing might not be parallelizable, so it will enforce those constraints. And once it runs the pipeline, it becomes a job instance.
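To make the pipeline and job concepts a bit more concrete, here is a minimal sketch of what such a pipeline can look like, assuming the Jet 4.x pipeline API. It uses Jet's built-in test source, and the window sizes and counts are arbitrary placeholders, not the demo code.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;
import com.hazelcast.jet.pipeline.test.TestSources;

public class PipelineSketch {
    public static void main(String[] args) {
        // Declare the pipeline: where to read from, the transforms, where to write to.
        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(TestSources.itemStream(10))            // test source emitting 10 items per second
                .withNativeTimestamps(0)                          // use the timestamps carried by the source
                .window(WindowDefinition.sliding(10_000, 1_000))  // 10-second window, sliding every second
                .aggregate(AggregateOperations.counting())        // aggregate: count the events per window
                .writeTo(Sinks.logger());                         // sink: just log the window results

        // Submitting the pipeline to the (embedded) cluster turns it into a running job instance.
        JetInstance jet = Jet.newJetInstance();
        jet.newJob(pipeline).join();                              // a streaming job runs until cancelled
    }
}
```

The same pipeline shape works with batch sources too; swapping the source is essentially the single line that lets you cross from the batch world to the streaming world.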
As I mentioned, if you are a Java developer it's quite easy to start with Jet, because it's just a jar that you put on your classpath, and then you start it with Jet.newJetInstance(). When you do that, it starts a Jet instance, and the Jet nodes start multicasting over the network to find each other; when they find each other, they form a cluster. There are a couple of limitations to this approach. The first one is that you have the Jet load and you have your application load, and they are probably not the same, so it will be hard to fine-tune your JVM for the combination of both; for that reason it might not be that good. The other reason is that at some point you will reach a limit and you will need to scale, you will need to add more nodes. The problem is that if you reach the limit on the Jet side, you scale the cluster by adding Jet nodes, but your application is bound to them, so you scale your application as well, for no real reason. So in general this is good when you start, or for small setups, but when you start getting serious about it, you move to the other deployment model, which is client-server.

In that case you have two different parts: you have your Jet cluster on one side, and you have your client application on the other. This way, you can configure the JVMs according to the Jet load on the cluster side, and configure the JVM according to your application workload on the application side. The second advantage is that in that case you don't need to use Java at all: we offer client APIs for C++, for C#, for Python, for Go, and for Node.js. So on one side you are stuck with Java; on the other side you can use a lot of different stacks.

And this is the schema I showed you for ETL, but applied to Jet. We offer a number of connectors, input connectors and output connectors; we call them sources and sinks. In any case, if there isn't one that fits your needs, there is an API and you can write your own. And then, as I mentioned, you can do everything an ETL does, but distributed over the network by nature. I told you about enrichment, the fact that sometimes your event only carries a reference, an ID, and you need to fetch the data. If you did that with a traditional database-backed ETL, it would mean a query to the database for every event, which is not great. What Hazelcast Jet allows you to do is prefetch the data, from the database or from somewhere else, and put it in memory in the cluster. Then, when you need to resolve the ID, it's just saying, hey, give me the value for this key from a map, and that is very, very fast. And actually, that's what I will do in the demo.
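Here is roughly what that enrichment step can look like with the Jet 4.x pipeline API. The "stops" map name, the Stop class and the sample coordinates are made-up illustrations, not the actual demo code; the point is that the lookup goes to a prefetched in-memory map via mapUsingIMap instead of hitting a database per event.

```java
import java.io.Serializable;

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.test.TestSources;
import com.hazelcast.map.IMap;

public class EnrichmentSketch {

    // Hypothetical reference data: a stop with its coordinates (sample values only).
    static class Stop implements Serializable {
        final String id;
        final double lat;
        final double lon;

        Stop(String id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }

        @Override
        public String toString() {
            return id + " @ " + lat + "," + lon;
        }
    }

    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        // Prefetch the static reference data into the cluster once,
        // instead of querying a database for every event.
        IMap<String, Stop> stops = jet.getMap("stops");
        stops.put("8501120", new Stop("8501120", 46.5167, 6.6291));

        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(TestSources.items("8501120"))           // pretend these are stop ids from trip updates
                .mapUsingIMap(stops,
                        stopId -> stopId,                         // the event only carries the reference (the id)...
                        (stopId, stop) -> stopId + " -> " + stop) // ...the in-memory map supplies the payload
                .writeTo(Sinks.logger());

        jet.newJob(pipeline).join();
    }
}
```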
In general, they are pushed by governments, because governments want to open up their data: administrations have a lot of data that for ages was kept inside their walls, and they want to give you as much of it as possible to improve society in general, or at least so that some businesses can earn money with it. If you are not from the European Union, well, I've seen open data initiatives in California, for example, so check what exists where you live.

That's just the beginning of the journey, actually, because when you want to leverage open data endpoints, you face the following challenges: accessing them in the first place, the formats, whether they follow any standards, and the correctness of the data. So let's talk a bit about each. If you are a developer and I tell you, hey, I have open data, you will think about a web service. The problem is that most administrations are not IT organizations, and their IT departments are probably just tasked with handling computers. So most of the time you need to download a file, which is not great, because then you need to set up a job that downloads the file at regular intervals. The second challenge is the formats: you would expect open data to mean open formats, and yes, it can be the case, but good luck if you ever had to scrape data out of a PDF. Sometimes you just receive Microsoft Excel files, and I'm not talking about the new Excel format that is just a zip archive with several XML files; I'm talking about the old binary format that is completely proprietary. So that's the second problem.

You might have heard about this comic where they say: hey, there are 14 competing standards, that's a bad situation, I will create a new standard that covers them all, and soon there are 15 competing standards. Well, here it's the same situation. Imagine we are already in a nice place: we have a web service, it returns XML, and we expect XML to follow some kind of grammar, not just be anything. That's the problem: most of what you receive follows no real grammar. If it's JSON, of course there is no schema, but with XML I would expect one, and no. So we would like to have some standards, but no.

And finally, a fun one that is actually part of my demo. This is one of the data files I receive, and the second and third columns are actually times. I know what 4:20 is; I have a hard time knowing what a 25th hour is. What does it mean? Should I subtract 24 hours, or should I say it's one o'clock the day after? I don't know (more on how that can be interpreted in a moment). So perhaps I should skip this data altogether. If you are a bit of a data scientist, you might have heard the joke that data scientists complain about spending 80% of their time curating data instead of analyzing it. Well, that's exactly the problem they are complaining about: the data you receive, most of the time, is not what you would expect.

I'm nearly at the end before the demo. I have found a standard, something called the General Transit Feed Specification. I would have expected it to be a public standard; no, it's provided by Google, but at least it's a format that lets you describe transportation schedules, with, of course, latitudes and longitudes.
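As an aside, if those odd times follow the GTFS stop_times convention, where a time may exceed 24:00:00 to mean "past midnight on the same service day", they can be normalised with a few lines like the following. This is only a sketch under that assumption (and assumes a colon-separated HH:MM:SS format); the class name and sample date are made up, and it is not part of the demo.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Interpret a GTFS-style "HH:MM:SS" that may exceed 24:00 (e.g. "25:04:00")
// as an offset from midnight of the service day, so 25:04 becomes 01:04 the next day.
public class GtfsTime {

    public static LocalDateTime toDateTime(LocalDate serviceDay, String hhmmss) {
        String[] parts = hhmmss.split(":");
        Duration sinceMidnight = Duration.ofHours(Long.parseLong(parts[0]))
                                         .plusMinutes(Long.parseLong(parts[1]))
                                         .plusSeconds(Long.parseLong(parts[2]));
        return serviceDay.atStartOfDay().plus(sinceMidnight);
    }

    public static void main(String[] args) {
        // Prints 2020-05-15T01:04, i.e. one o'clock the day after the service day.
        System.out.println(toDateTime(LocalDate.of(2020, 5, 14), "25:04:00"));
    }
}
```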
So it's based on a bunch of static files, and I won't go through all of them, but basically what we need to remember is that first you have the static model, stuff that doesn't change that often. That includes stops: the stops of a bus line don't change that often, and you have a lot of them. And then you have the dynamic model, and the dynamic model can be, for example, the location of a vehicle. This is the model I get; I receive it as JSON in the demo. The feed message is the outermost envelope, then you just need to get to the feed entities and you have all the positions of the vehicles. And when you read the specification, you say, oh wow, that's pretty cool, what do I need stream processing for?

So I found the organization that provides this data, and it has an open data endpoint, and, miracle, it's actually a web service. Of course there is a constraint: I cannot call it more than once every 30 seconds, because they are afraid of receiving too many requests, but it's still good. They give you the GTFS static files that you need to download beforehand, they give you this REST endpoint, and you say, okay, let's do it, that's easy. And then this is the model I actually receive. And where is the position? Where is the position? A position would mean you know the exact latitude and longitude of the vehicle, which would probably require a GPS chip constantly sending data, and that's not the case here. What they give you instead is something called trip updates, and inside you have stop time updates, which tell you how late the vehicle will be when it reaches stop number X in the sequence. So you have a trip with stops one, two, three, four, up to ten, you have the scheduled times from the static files, and then they tell you, hey, at the third stop I will be late by two minutes, and it's up to you to compute the time the vehicle will actually be there (there's a small sketch of that arithmetic below). And now stream processing starts to be interesting. So this is the data pipeline: I read from the web service; the second step is to split this big JSON into multiple parts, because I need to process every update separately; then, as I mentioned, I enrich it.

Nicholas, I'm so sorry to interrupt you and cut you off there. We want to see the end of the demo, of course, but we are running out of time. Do you think you could just wrap it up quickly? I'm sorry, so I will show you the architecture diagram and I won't show you the demo; you will need to find it on the internet. You can't do that to us. I will show you a movie then. Okay, let me get back to my architecture diagram so you understand the concepts. Yep. And so here, is it streaming? Yes. So here I have this initial loader job that submits, as I mentioned, a job; the job reads the static data files and puts them into the IMDG for later consumption. Then I have this dynamic data loader that submits a second job, and this job reads data from the REST endpoint and enriches it with the data that was read from the static files. And finally, the last component is a web application that registers for changes; it gets notified about those changes and updates the position of each public transport vehicle on the map. So no demo, just a little movie; sorry about the time. I prepared it because I knew there could be an issue. So this is what it looks like.
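To give an idea of the arithmetic behind those trip updates, and the interpolation mentioned in a moment: the scheduled time comes from the static files, the feed only gives a delay per stop, and the position between two stops can be estimated linearly from the times. This is a small hedged sketch with made-up names, not the demo's actual classes.

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class TripMath {

    // Expected arrival = scheduled arrival (from the static GTFS file) + delay (from the trip update).
    static LocalDateTime expectedArrival(LocalDateTime scheduledArrival, int delaySeconds) {
        return scheduledArrival.plusSeconds(delaySeconds);
    }

    // Linear interpolation of the position between the previous stop and the next one,
    // based on how much of the travel time between them has elapsed.
    static double[] interpolate(double[] prevStop, double[] nextStop,
                                LocalDateTime leftPrevAt, LocalDateTime arrivesNextAt, LocalDateTime now) {
        long total = Duration.between(leftPrevAt, arrivesNextAt).getSeconds();
        long elapsed = Duration.between(leftPrevAt, now).getSeconds();
        double ratio = total == 0 ? 1.0 : Math.min(1.0, Math.max(0.0, (double) elapsed / total));
        return new double[] {
                prevStop[0] + ratio * (nextStop[0] - prevStop[0]),   // latitude
                prevStop[1] + ratio * (nextStop[1] - prevStop[1])    // longitude
        };
    }
}
```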
And you can zoom, and you will see the routes, and you can see the stops, because they come from the static data file, and you can see the little vehicles moving between those points. Of course, I don't know the exact position; I'm just doing interpolation, because I know the time it was at the previous stop and the time it will reach the next stop.

So let me wrap this up. Streaming has a lot of benefits compared to batching, and it's even better if we can leverage open data, but be careful, because it's the Wild West: there are no standards, and real-world data is not really great. But you can achieve really, really cool stuff. So, a couple of references: my blog, my Twitter, the Jet website, and the repository if you want to play with it yourself. Everything is on GitHub, and if you're interested in knowing more about Hazelcast and Hazelcast Jet, join our Slack channel. I'm really sorry to run late; in general I try to be faster, but today, I don't know why, I was probably feeling chatty.

And you're Swiss, I mean, come on, what's going on here? I'm not Swiss, I'm French. Oh, you're French. Okay, that explains it, you see. No, you're just a couple of minutes over, no problem. Thank you so much for that talk, I thought it was really well structured and great to follow. Unfortunately, we don't have much time for questions, which is annoying because I have a few of my own I'd like to ask you. But first of all, for people who want to see this demo, where do they have to go? Just leave that up so it's nice and clear. They should go to the Hazelcast Jet Train repository, and I will provide the link in the slides; it will be on the last slide. Fantastic, okay. And I have one very specific question, I'll just fire it at you. It says here: I do not understand why you placed Google Pub/Sub as a processing engine, maybe you confused it with GCP Dataflow. No, no, I'm just... okay, it was just to send data and to get data, sorry. Okay, well, that clarifies that.

So Nicholas, thank you so much, and sorry to rush the ending here, but we are running out of time. Once again, thank you very much for this talk. If you have any more questions, join our Slack, reach me on Twitter, my DMs are open. Thanks for the invite, and see you. And good luck to the next speaker, who happens to be a friend, Philippe Kren. Okay, great, thanks so much, Nicholas, bye. So we'll be taking a short break now. Before we do that, I just want to remind you to go and take a look at the sponsors' exhibitor section and show your support for them; if it weren't for them, none of this would be possible. So go and do that, and we'll be back in a few minutes.