Hi folks, thanks for being here for this talk, an introduction to stream processing. I'm Nicolas Fränkel, I've been a developer for two decades and I'm a recent developer advocate.

So, when I started my career a long time ago, all data was stored in SQL databases, and that has some implications. The first one is that we don't want duplication. You learned, when you were taught SQL, that we don't want duplication, because when multiple entities have a relationship with the same entity, we can just change the data in that related entity and everything will magically appear to have been updated. If you duplicate the data, say the address of a company that is referenced in several other entities, and you need to change that address, you will need to change it in every record. We don't want to do that. There is also another reason: at the time SQL systems were designed, storage was quite expensive, so duplicating data was just taking up space for nothing.

Another implication of SQL is that we actually want data quality, so we add constraints to say: hey, this is a date, this is a timestamp, this is a varchar of 20, or whatever. You put as many constraints as possible on your model so that you are sure it conforms to your design.

But when you think about it, this is all designed for writing. In some cases that's not really what interests us. When you're reading stuff, you probably want to be fast, and if you have SQL with joins, especially many joins over several tables, that slows you down. So it boils down to "I want data to be correct" versus "I want data to be quickly available".

Let me tell you about two use cases that I believe are quite interesting. The first one is analytics. I worked at some point in retail, and one of the common requirements in retail is to have analytics over the previous hour. Imagine you are a supermarket director: you've got a big stock of bananas, it's Saturday, and on Sunday you won't be open. You know that if you don't sell your bananas on Saturday, on Monday you will need to throw them out, because they will be too black to sell. What you probably want is analytics on your banana sales, so that if you are not selling enough, you can offer some discounts. So every hour you ask your IT department for the statistics and you check them, hour by hour, and you want that data as fast as possible, of course.

Then there is reporting in general. Take a bank. I don't know about your country, but in France you receive your bank statement every month, with the operations, the credits, the debits and the balance. Again, it's done at the end of the month, and it needs to be as fast as possible, because a bank has a lot of customers and you don't want the statements to take too long to compute and then print.

For those reasons we invented the concept of ETL, extract-transform-load, because you have different actors with different needs, and in that case it doesn't make sense to use the same database. You've got your transactional database, you extract the data, and you transform it so that there are no more joins; you don't care about duplication anymore.
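To make that transform step concrete, here is a minimal plain-Java sketch of denormalization. The entity names (Company, Invoice, InvoiceReport) are invented for illustration; they are not from any real system:

```java
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.toList;

public class EtlTransform {
    // Hypothetical normalized entities, as they come out of the extract step
    record Company(int id, String address) {}
    record Invoice(int id, int companyId, double amount) {}

    // Flat, denormalized record: no join needed at query time, duplication is fine here
    record InvoiceReport(int invoiceId, String companyAddress, double amount) {}

    // The transform step: resolve the relationship once, at load time,
    // so the analytics database never has to join
    static List<InvoiceReport> transform(List<Invoice> invoices, Map<Integer, Company> companies) {
        return invoices.stream()
                .map(i -> new InvoiceReport(i.id(), companies.get(i.companyId()).address(), i.amount()))
                .collect(toList());
    }
}
```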
Then, when you load it into this new database, you can execute queries that are meant to be very, very fast. There is also another reason: if you use the same database for both transactions and analytics, you put additional load on that database, and in general databases scale up to a point. SQL databases especially, as we'll see later, were not designed to scale horizontally; they scale vertically, and then you are limited by hardware or software.

So how are these extract-transform-loads implemented? Well, batches. Batches are everywhere. Even if you are a junior in the industry, you've probably already been confronted with batches. They have interesting characteristics. The first is that they are scheduled at regular intervals, like the supermarket analytics every hour, or the bank statement monthly, and they run in a specific amount of time. You probably see where I'm heading: you've got some issues.

The first issue is that when you design your batch, you have this schedule in mind, so you take great care that runs don't overlap. If your batch needs to run every hour, you make sure it runs in, say, 30 or 40 minutes at most, which gives you a buffer. But data generally increases over time, and if the ops people are watching carefully, they will see the time it takes to execute the job increase over months and years. At some point there is a chance that the time it takes to run your job is actually longer than the scheduled period. And even if that's not the case, if your batch takes 40 minutes and is scheduled hourly, what if it fails for whatever reason at 35 minutes? 35 plus 40 means you again go over the period. Sometimes, also because of the increase in data, the space the batch takes in memory exceeds capacity.

For that last point the fix is quite easy: we keep a cursor, so we don't handle everything at once, we just handle a slice of data and manage chunks. Of course, that's only possible when we don't need to validate data across different chunks. And then there is a new problem: what about new data arriving in the meantime?

For all those reasons, we wanted something that scales horizontally. Again, SQL systems were not designed to scale horizontally. The idea was: hey, we don't have much data (well, we didn't at the time), but we want data to be correct. That's the whole idea behind SQL databases, we want data to be correct, and you can scale up to a point. Big data said: okay, we still mostly want data to be correct, but the characteristic we want most is to scale. And because software and hardware limit vertical scaling, we want to scale horizontally as much as possible. So the idea behind big data and NoSQL databases was: we scale, like, web scale.

But of course nothing is free. You can scale as much as you want to get write speed, but at some point you will need to read the data. If you dump everything into your database and you don't do any checks, because constraints take time to check, then the job of making sure the data is in the correct format lands on the query.
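Here is a hedged sketch of the two options in plain Java; the Event type and the date-in-a-string payload are my own illustration:

```java
import java.time.LocalDate;

public class SchemaTradeOff {
    record Event(String id, LocalDate date) {}

    // Schema on write: validate and parse once, before storing.
    // Writes are slower, but every read gets clean, typed data for free.
    static Event onWrite(String id, String rawDate) {
        return new Event(id, LocalDate.parse(rawDate)); // fails early on bad data
    }

    // Schema on read: store the raw string as-is, so writes are as fast as possible,
    // but every single query pays the parsing/validation cost again.
    static LocalDate onRead(String rawDate) {
        return LocalDate.parse(rawDate); // may blow up at query time instead
    }
}
```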
So there is this tension: is it better to do schema-on-read or schema-on-write? If you want to be very, very fast when you write, you probably do schema-on-read, but then you move the problem, and the time it takes to validate the data, to when you read it. You get very fast writes but slower reads. On the opposite side, and depending on your NoSQL database this may or may not be possible, if you enforce constraints when you write, you slow down your writes but you make reading so much faster. So be careful: yes, you can be very, very fast on writes, but that means your queries will be slower.

Recently we saw another thing coming. Databases are meant to store stuff, but how do we handle data outside of them? The idea comes from an old paradigm in programming: event-driven programming. If you've ever worked on a user interface, you have probably been confronted with events. Instead of doing something at a scheduled time, you do something when something happens: when the user clicks a button, for example. So we can leverage that and say: hey, let's make everything event-based.

There are a couple of benefits. First, I told you about memory that might be too small to hold all the data. With one small event at a time it becomes a no-brainer, because an event is of course very small, and because of that it's easily processed. Another side effect, which is actually very beneficial, is that instead of polling for data again and again, you react to data: it is pushed toward you as soon as something happens. This is as close as you get to real time. There is no such thing as true real time, because even light takes time to move from one point to another, and we are talking about IT systems, so there will be some eventual consistency in the data coming in, but this is still very close. And it keeps the derived data fresh: you've got your source of truth and you've got derived data, and with event streaming the derived data stays as close to the source of truth as possible, in sync instantly, or near instantly.

Of course, there are also less pleasant side effects, and one of the worst is how you have to think about it. Before, you were thinking in batches, in chunks. Now you think in streams, and a stream is never-ending: when you start to stream, you never know when it ends. Actually, it ends when you stop it explicitly; otherwise it's just an infinite stream of events. You've also got new characteristics. Before, you could do aggregations over your batch. You can still do aggregations, of course, but since it's a flow of events, you need to design your stream explicitly for it: if you want an aggregation, you need to define over which window of time you want it. And there are different kinds of windows: you can have tumbling windows, say every five minutes, or you can have sliding windows. It depends on your needs, but you have to think about it.

The idea behind streaming is quite easy to understand. First, you've got sources; second, you've got sinks; and in between you do some transforms, some enrichment, some ML inference. It's pretty hyped nowadays to do machine learning inference.
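To give you an idea of what that looks like in code, here is a minimal sketch using Hazelcast Jet, the engine I'll come back to in a minute. It assumes the Jet 4.x API and uses a built-in test source; the five-minute tumbling window and the count are arbitrary choices for the example:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;
import com.hazelcast.jet.pipeline.test.TestSources;

public class WindowedCount {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.itemStream(10))              // source: 10 test events per second
         .withNativeTimestamps(0)                           // use the events' own timestamps
         .window(WindowDefinition.tumbling(5 * 60 * 1000))  // tumbling five-minute window
         .aggregate(AggregateOperations.counting())         // aggregation: count events per window
         .writeTo(Sinks.logger());                          // sink: just log each window result

        JetInstance jet = Jet.newJetInstance();             // embedded node, enough to try this out
        jet.newJob(p).join();                               // runs until you cancel it explicitly
    }
}
```

Note how the stream itself never ends: the job keeps emitting one count per window until you stop it.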
When you do enrichment, you probably want to have the data available nearby, and not only in databases, because if the data is in a database it will take time to fetch; it should be as close as possible to the stream processing engine if you want speed.

This opens a lot of new doors. Real time: let's get back to my example of the supermarket director. The supermarket director wants the data as soon as possible; he doesn't actually want the data hour by hour. That's just how we designed the system, because when you only have batches, you have to say: I want the batch to run every hour. If you think about correctness, there is a first issue: if we run the batch every hour, the batch is not instant, it takes some time to produce the results. You run it on the hour and you get the result at H plus one or two minutes, whatever. So regarding correctness this is already not great, because we want the data as fast as possible. But that's not even the biggest issue: we need to wait the full hour to get the results. Now imagine we could get the results, say, five minutes before the hour is up. Not much can happen in those last five minutes, unless somebody who is crazy about bananas buys 10 kilos of them, but that probably won't happen; let's say the probability is very small. Most of the time you want data as fast as possible, and you don't need to complete the full cycle to act on it. With a stream of events, you get data as soon as it comes in, and you can already get interesting insights: you can watch the trend of your banana sales as it progresses and make decisions as soon as you want. And, as I mentioned, if you have a machine learning model and you push this data to it, you can make predictions, again in close to real time. I believe those are huge benefits.

There are a couple of in-memory stream processing engines on the market. There are on-premises ones, like Flink or Hazelcast, and there are cloud-based ones, such as Kinesis or Dataflow. There is also an interesting project called Apache Beam, which is meant as an abstraction layer over some of the previously named stream processing engines. Of course it's not perfect: it's an abstraction, so it's leaky, and there is no standard for stream processing engines, but I think it's an interesting initiative to mention.

A couple of words about Hazelcast, since I will be using it in the demo; most of what I will say later can be applied to any stream processing engine. First, it's open source. We have a unified batch and streaming API, and of course we also have an enterprise offering, but what I will show you afterwards is entirely open source and free.

There are two concepts, and I will use Hazelcast's terminology, but the concepts themselves are very widespread. The first is the pipeline. The pipeline is code, or configuration, that says where you will read from, what transformation steps you will execute on the data, and where you will write to. Once you've defined that, you have a client that takes this code and sends it to the stream processing engine.
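Concretely, a pipeline and its submission can look like this sketch, again assuming the Jet 4.x API and a cluster already running; the map and list names are made up for the example:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class SubmitPipeline {
    public static void main(String[] args) {
        // The pipeline: where to read from, the transformation steps, where to write to
        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String, String>map("raw-events")) // hypothetical source map
         .map(entry -> entry.getValue().trim())               // some transformation step
         .writeTo(Sinks.list("clean-events"));                // hypothetical sink

        // The client ships the pipeline to the cluster, where it becomes a job
        JetInstance client = Jet.newJetClient();
        client.newJob(p).join();
    }
}
```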
The stream processing engine receives it, knows about the topology and the code, and can distribute the load over the nodes depending on that topology. When that happens, it becomes a job.

Hazelcast has two deployment options. The first one is embedded mode. If you are a Java developer, that's pretty good, because it means you can use Hazelcast as just a library: you add it as a dependency, via Maven or Gradle or whatever, you put it on your classpath, you start the nodes, and an auto-discovery mechanism lets the nodes check whether there are other nodes on the same network and form a cluster. This is good to start with. However, the load of the application plus the load of Hazelcast might not be compatible within a single JVM. So in general, when you start relying on Hazelcast as an infrastructure component, you probably want to go to client-server mode. In client-server mode you have dedicated nodes. You can of course run a shell script, but you will probably want to use containers, or even run it on Kubernetes; we have Helm charts and an operator. Your application then acts as a client of the cluster, and the good thing is that you are no longer bound to Java: we have several client APIs, such as Node.js, C#/.NET, C++, Python and Go.

The platform itself is pretty similar to what I've shown you before. You've got readily available sources and readily available sinks. Because Hazelcast is also an in-memory data grid, you can load data beforehand so that the enrichment process is much, much faster. And something I forgot to mention: we've got an API so that if there is no connector for your system, you can create your own. In the demo, I will show you how to read data from a web service, for example.

For the demo I will be using something called GTFS, because I want a demo about public transportation. It's interesting because it's not a public specification, or let's say it's not provided by a public organization, it's provided by Google, but many transit organizations are using it nonetheless. It's based on two kinds of data. The first one is static data, data that doesn't change often: a bus station, for example, might change sometimes, but not regularly, so for static data you download files at regular intervals. Then you have dynamic data, and dynamic data is of course the position of the public transport vehicles; this changes all the time.

Here is a very, very high-level overview of the dynamic model. There is the message, which is what you get; it has multiple feed entities, and each feed entity can have a vehicle position. That's what is interesting to us, because that's what I want to show on the map. The data provider, as I mentioned, is from California. Before, I used the Swiss one, but they changed the format so I cannot use it anymore; that's why the GitHub repo is called jet-train, and now it's about buses.

The pipeline is the following. First, I read from a web service. Afterwards I need to split: as I mentioned, I get a huge envelope, and I need to split it into individual updates. Then I transform each update into JSON format. Then I need to filter out malformed data. That's very, very important, because if my pipeline fails on a bad record, it stops the stream, and I don't want that.
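Here is a hedged sketch of that defensive step, assuming Jackson for the JSON parsing; the stage wiring is illustrative, not the demo's exact code:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.hazelcast.jet.pipeline.StreamStage;

import java.util.Objects;

public class FilterMalformed {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Parse raw payloads into JSON and drop anything that doesn't parse:
    // one bad record must not take the whole never-ending stream down
    static StreamStage<JsonNode> parseAndFilter(StreamStage<String> rawUpdates) {
        return rawUpdates
                .map(raw -> {
                    try {
                        return MAPPER.readTree(raw); // parse the update
                    } catch (Exception e) {
                        return null;                 // mark malformed payloads
                    }
                })
                .filter(Objects::nonNull);           // and filter them out
    }
}
```

Swallowing the exception and filtering out the null keeps the job alive: one malformed payload costs you one event, not the whole stream.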
Then I do some enrichment, and as I mentioned, I have static files for that, because the event is not self-contained: it references everything by ID, the stop times, the trips, the routes and so on. Then, because of front-end requirements, I transform the hours into proper timestamps. Then I need to flatten the JSON, again a front-end requirement: I don't want nested JSON, I want everything as flat as possible. Then I want to peek, because I want to make sure everything works as expected, and if it doesn't, I want some logging. If I logged everything there would be a lot of logs, so I do some sampling on the logs. Finally, I put everything into an in-memory map. For that I first transform each item into a map entry, saying: hey, this is the key and this is the value. The key is the ID of the event; the value is, of course, the JSON payload.

The architecture overview is the following. I have a stream processing engine; here I'm using Jet, of course. At some point a loader pipeline runs: it reads the static data files that I've previously downloaded, transforms them into JSON format, and puts each of them into a dedicated map. When that is finished, another loader launches a new job. The new job reads the open-data endpoint, goes through the pipeline I've shown you before with all the steps, and stores the data in JSON format. Then a third component subscribes to changes, and every time there is a change, some front-end magic happens. I'm not a front-end developer, so I relied on a colleague to actually display it on the map.

So now the web app is reading from the in-memory data grid, and if we look closer at the map, you can see that it reads on a schedule, because if I pushed updates as fast as possible, I would overwhelm the browser. What happens is that the front end gets the value at time X, it knows the value at time Y, and in between, when it doesn't receive data, it computes where the bus should be. Then, when it receives the new data point, it sets it at the correct place. Here we can see that the schedule is perhaps a bit fast, but there it seems much better.

There is still one issue with this demo: I'm not leveraging one of the files I receive, the route. Actually, sorry, it's the path. When I've got two stops, the route between them might not be a straight line, and it probably isn't; in general you would have a path that tells you where the bus goes through. Here I'm just interpolating with a straight line from one stop to the other. I've been meaning to fix that for a while, but my front-end skills are pretty low, so so far I haven't managed to. It still gives you a nice approximation of what is happening.

Okay, so this is the end of the demo. To recap: compared to batch, streaming has a lot of benefits, as I've tried to show you. If you've got data, leverage it; if you don't have data, just go and look: there are open-data endpoints everywhere, and you can use them to do pretty cool stuff. Thanks a lot for your attention. You can read my blog, you can follow me on Twitter, and we've got documentation on what I've shown you; you can check it here.
We've got a GitHub repository, so if you are interested in the code, you can find everything there.