Thanks for having me here today. I work for Hazelcast as a senior solutions architect, and I've been with Hazelcast for more than five years. Along with me we've got Paul Teh. Paul, are you here? You sponsored the pizzas. Paul is based out of Singapore, I'm based out of Brisbane, and we look after everything Hazelcast in the Asia-Pacific region. In today's session we're going to look into what the real problem of big data is. I'm sure most of you are already aware of it, but we're going to look at it from a different perspective: how to process data for real-time analytics, when the data is flowing in real time as a stream, and where Hazelcast fits in.

All right, opening question. What do we think of when we think of the Concorde? Any volunteers? The Concorde. No one? Fast, supersonic speed, right? Anything else? Anybody else? No? The fastest commercial passenger airliner of its time: it crossed the pond in six hours, London to New York or Paris to New York. Great value for money, not for us, for the rich. Did I take the Concorde from Brisbane? Of course not, they stopped flying it in 2003. We'll come back to the Concorde.

Now, to the more pressing problem of big data. What is big data? We know that data that is just big in size, large volumes and all those things, is big data. But in its true essence, big data is data that you cannot afford to sit on and process later just to extract the value out of it, because time is inversely proportional to value: the more time you spend sitting on the data, the less the value becomes. An example: when I landed in Singapore yesterday and switched on my phone at the airport, my carrier Telstra detected that I'm not in Australia, I'm in Singapore now, and offered me 3 GB of data for $30, valid for seven days. What did I do? I immediately said yes.
Now imagine what would happen if Telstra sent me that message when I'm at the airport on my way back to Australia. It's of no use to me, so no use to Telstra. They had the data that my location had changed from Australia to Singapore, they were able to react to it, and they literally made money out of it. They made $30 out of me. So it becomes important to react to the data in time to get the right value out of it.

Another example: the Y2K problem. Anyone here know about the Y2K problem? Yeah. It meant a lot of transformation of data, mainly data written by COBOL applications, because COBOL always used to store dates in the DDMMYY format, with only two digits for the year. That became a problem on the 31st of December, 1999, which meant a lot of data had to be processed later on to change the date format.

In the 2014 soccer World Cup, there were almost half a million tweets in the minute after Götze's goal. In the 2018 World Cup, any idea how many tweets after Mbappé's goal in the final? Any idea? A few million. A few million, right? Any idea how many tweets over the entire duration of the World Cup? Any guess? A billion? One billion. Any more guesses? 115 billion tweets. That's a lot of data for one tournament. Imagine: people made a lot of money out of this, just by latching onto those tweets and building different applications on top of them.

Now, adding a business perspective to the problem, or the challenge, of big data: 90 percent of the data in the world today was created in the last two years. That's how fast data is growing. It's coming from all different directions, from all different sources: devices, sensors, social media, all those things. What that gives us is opportunity: the opportunity to make a profit, to react to the data, to get the value out of it.
What that leads to is developing high-quality, low-latency infrastructure, so that you can make use of the data and extract the value out of it in time. Talking about traditional sources: things like databases, RDBMS or NoSQL databases, which are quite commonly used today to store big data for processing. A classic example is Spark, which uses data stored in HDFS. Do we really believe that disk-bound databases are built to address these kinds of problems, reacting fast enough to get the value fast enough? No, they are slow. It is their very nature that makes them slow. Time is money; time is of the essence.

But we have all this data, and there are tons of different use cases. For example, in-memory compute use cases, where you want faster access to the data, which is where Hazelcast fits in, and I'll come to that later on. In-memory computing has become essential for building the infrastructure that helps you extract the value. Microservices: you want to build lightweight microservices which are loosely coupled, flexible, and easily deployable. IoT infrastructure. Stream processing: we've got data flowing in, and those 115 billion tweets were a classic example of a source for a stream processing engine. Machine learning: we all know the capabilities and the future of machine learning, but again, it requires us to build high-quality, low-latency infrastructure.

On to the part we developers all love. This is a simple Unix command: `ls | grep rahul | wc -l`. What is this doing? ls lists files. grep is a filter operation. And wc, word count, is literally counting: an aggregation function. So the source is limited, my filter is an intermediate function, and the word-count aggregation is written to an output sink. Now, my source is finite.
In this case, I know I have a limited amount of data, which comes from my ls command. My filter is an open-ended operation: I can have a lot of matching lines in the same data. And then my sink is finite. What if I put grep -c in the middle? It counts the matching lines itself, which basically breaks the stream at that stage and makes it finite, so that I'm now reading it line by line. In other words, the same operation can be described as functions: I've got my source function, which is my ls command; I've got my filter operation, which is one function; and then my aggregation and sink, which is another function. And these pipes are literally denoting what? Feeding into the next function. One function feeds into another, which feeds into another.

What about extending this? The Unix tee command reads one source and feeds it into multiple destinations. So I can take this pipeline, split it, and feed it into two parallel functions: one source, one filter, and now word counting happening in two functions, in parallel. Going one step further, instead of one source I've got two sources, which means two filter functions and then four word-count functions. And even more interesting, I can feed back, where the output of one function becomes the input of one of the previous functions.

So what I've done here is chain different functions together. In one command, I've connected multiple functions with each other. And it's easy. When we look at a Unix command it looks dead easy; we all use it in our daily lives and it's super powerful. We can perform a word-count exercise on a bunch of data with just one command. So that's what stream processing is. Is it? No. There's a lot more to it.
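The shell pipeline above can be sketched in Java as the same chain of functions; a minimal sketch using only the JDK's Files and Stream APIs (the directory and the name being grepped for are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LsGrepWc {
    // Equivalent of: ls | grep rahul | wc -l
    static long countMatching(Path dir, String needle) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {   // source ("ls")
            return files
                .map(p -> p.getFileName().toString())  // map paths to file names
                .filter(name -> name.contains(needle)) // filter ("grep")
                .count();                              // aggregate ("wc -l")
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countMatching(Path.of("."), "rahul"));
    }
}
```

Each stage feeds into the next exactly like the pipe symbol does, which is the point of the analogy.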
So the reason I'm here today is basically to share how Hazelcast fits into this use case: how you can create a powerful stream processing engine in just two simple steps. Before that, a brief introduction to Hazelcast. Hazelcast has three products: IMDG, Hazelcast Jet, and Cloud. All are open-source products with some enterprise features, but I'm not going to talk about enterprise; being a developer, nobody likes to talk about enterprise features. A brief history of Hazelcast: it was founded in 2008 in Istanbul, Turkey, shifted its headquarters to Palo Alto a few years down the line, and now we have offices in New York, London, Istanbul, and Palo Alto. As I said, Hazelcast has always been an open-source project. We are under the Apache 2 license, which means anybody can download the Hazelcast code, customize it to their requirements, and put it in production. Then if you need commercial features, we've got those as well.

Now, what is Hazelcast IMDG? Hazelcast IMDG is basically a distributed in-memory cache, where you store data in the memory of the Hazelcast servers. On the screen you see four Hazelcast instances, four Hazelcast servers forming a cluster, with data stored in their memory. The light green is the primary data; the dark green is the backup. Hazelcast instances are Java instances; Hazelcast is a purely Java-based technology, and we support not only Java clients but also several others: .NET (C#), Python, Go, Node.js, and so on. Another way of looking at Hazelcast IMDG as a data grid is to think of filing cabinets. You put three filing cabinets next to each other, and you put some files in cabinet one, some in two, and some in three, which means no two filing cabinets are going to have the same file. So what does Hazelcast IMDG actually do?
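The filing-cabinet analogy can be sketched in plain Java. This is not Hazelcast's actual partitioning code, just an illustration of the idea that each key hashes to exactly one partition and each partition has exactly one primary owner (the 271 default partition count is the one documented by Hazelcast; the member names are made up):

```java
import java.util.List;

public class CabinetSketch {
    static final int PARTITIONS = 271; // Hazelcast's default partition count

    // Every key deterministically maps to one partition...
    static int partitionOf(Object key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // ...and every partition is owned by exactly one member of the cluster,
    // so no two "filing cabinets" hold the same primary entry.
    static String ownerOf(Object key, List<String> members) {
        return members.get(partitionOf(key) % members.size());
    }

    public static void main(String[] args) {
        List<String> members = List.of("member-1", "member-2", "member-3");
        System.out.println(ownerOf("flight-QF1", members));
    }
}
```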
It's in-memory data storage, fully distributed, and based on Java. Cloud is the fully managed service from Hazelcast; you can deploy Hazelcast on AWS, GCP, or Microsoft Azure. It literally launched yesterday, after three months of beta, so it's all there. Currently it's free for a certain amount of usage; you can go to hazelcast.cloud and start spinning up Hazelcast servers and using them.

Jet. What does it do? Hazelcast Jet is a third-generation stream processing engine: all in memory, highly distributed, highly concurrent, as opposed to several other products out there in the market, and again, Java-based. Hazelcast Jet is built on top of Hazelcast IMDG, which means it has the heritage of Hazelcast IMDG. Traditional processing is based on calculations after storing the data; stream processing is about calculations prior to storage. You have data flowing in as a stream and you want to process it before you store it, or before you write it out to a sink. Streams are normally immutable and they are infinite, so you need some sort of mechanism to perform processing on an infinite stream of data.

The architecture of Jet is built on the pipeline paradigm: you create a pipeline with a source, which is your input, put your processing in the middle, and write the output to your sink. Anyone here who does not use Java 8? Do we all use Java 8 or later versions? So we all know lambdas, right? Pipelines are all lambdas: you can run your lambdas, however complex they are, easily on Jet. Now, what is Jet underneath? It's an implementation of a directed acyclic graph, a DAG. Again, as I said, it is meant for performing stream processing and batch processing calculations. This diagram, does it remind us of something? These black boxes: we just saw them in the Unix word-count example, right? Remember that?
Different functions. Each of these black boxes is basically a function, right? In the DAG world, these black boxes are called vertices, where a vertex is one compute step or function. It could be a filter, a word count, an aggregator, or whatnot. These arrows here, which you might call connectors, are actually called edges. This is all DAG lingo, not specific to Hazelcast, just general DAG terminology.

Jet as an ecosystem: this picture basically describes the entire landscape of Jet. Hazelcast Jet is a distributed technology; you create a Jet cluster by spinning up multiple Jet instances. Jet is able to ingest data from different types of sources, be it IoT, sensors, social media, your actual RDBMS databases, file systems, and whatnot. You perform all your complex computations on the Jet cluster, and then you write your output to different kinds of sinks.

One of the key features of Jet is lossless recovery. In case of cluster failure, if you lose the cluster because of a power failure, or a meteorite falls on the data center, Jet is able to restart the execution from its last recorded state, which makes recovery absolutely lossless. So you're running a complex job across multiple threads on multiple nodes in a large cluster, and you lose power? Not to worry: when you restart the Jet cluster, it will resume execution from the last recorded point. It persists the state of your job execution on the disk of each Jet cluster node. Does that mean the data is saved on disk? Not the data, the state of the job. Now, I was saying that we need some sort of mechanism to process a real-time, flowing stream of data, right?
For that very purpose, Jet provides the concept of windows: sliding, tumbling, and session windows. Do we know what windowing is? Nope. So imagine you have a stream of data, which means you have a pipeline with a lot of data points flowing through it. Now, how do you process that pipeline? You cannot keep reading just one single data entry at a time; that's not real-time execution, and you'd need a really, really large resource infrastructure to calculate each point when you have millions of entries flowing in every second. So how do you do it? You take a snapshot of that stream at regular intervals, and then you process the snapshot. That snapshot is called a frame, and the whole process is called windowing. That's what is described here, and it's a general distributed-computing concept, not Jet-specific. Windowing also allows you to handle unordered data, data arriving late in the sequence, and all those kinds of things. I'll show more on windowing later when we get to the demo.

Another key feature of a Jet cluster is execution guarantees. Hazelcast Jet offers at-least-once, exactly-once, or no-guarantee semantics. Of course, these semantics have their own performance implications: exactly-once and at-least-once are slower than no guarantees, because Jet needs to record the state of the job execution in both of those modes, which makes it a little slower. Jet clusters are highly resilient; they are able to recover from faults and failures automatically, without manual intervention, and with lossless recovery and these execution guarantees they become even more robust. And Jet clusters are extremely easy and simple to deploy. Anyone here use Spark? Nope.
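The idea of cutting an endless stream into frames can be shown with plain Java; a minimal sketch (not Jet's windowing API) that assigns timestamped events to tumbling windows of a fixed size and counts events per window:

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class TumblingWindows {
    // Assign each event timestamp (ms) to the start of its window,
    // then count events per window: the essence of a tumbling window.
    static SortedMap<Long, Integer> countPerWindow(List<Long> timestamps, long windowMs) {
        SortedMap<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = ts - (ts % windowMs); // the frame this event belongs to
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> events = List.of(100L, 900L, 1100L, 1900L, 2500L);
        System.out.println(countPerWindow(events, 1000)); // {0=2, 1000=2, 2000=1}
    }
}
```

A sliding window would differ only in that each event belongs to several overlapping frames instead of exactly one.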
To use a Spark cluster, which is basically meant for batch processing, you need some infrastructure running, most commonly ZooKeeper or YARN; you deploy the Spark cluster on top of that, and then you start loading your applications. So you've got ZooKeeper or YARN, you've got Spark, and then your applications. With Jet, you just need to download Jet as a single jar file, less than 10 MB in size, put it on your application classpath, and off you go. That's it. No fancy infrastructure requirements. We'll see that shortly.

Now, the typical example of word count, which we already saw very briefly with the Unix command. Word count is basically the hello world of stream processing. The general semantics of word count are: you have some source of text, and you want to calculate how many times each word appears in that source. In this example, I've taken a quote from Hamlet and put it in a Java string, basically an array of strings, and I'm going to calculate how many times each word appears in the quote. Back in the days of Java 7, when dinosaurs ruled the earth, this used to be the code: this is as much code as you would need to perform that simple word-count logic. But with Java 8 and lambdas, things changed drastically. All of that fancy coding was confined to just two or three lines of code: we have one map, we stream into it, we apply the filter and perform the aggregation and whatnot.

It's a very similar concept in Jet. Hazelcast Jet's APIs are a direct extension of the Java APIs, so you won't see much difference between the way you write code with Java 8 lambdas and the way you write it with Jet. Here again, based on the pipeline, you've got a source, a mapping, a filter operation, a grouping operation, and an aggregation operation, all in one line of code.
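The Java 8 version on the slide boils down to something like this; a self-contained sketch with the opening of the Hamlet soliloquy standing in for the full quote:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordCount {
    static final Pattern DELIMITER = Pattern.compile("\\W+");

    // Split each line into lowercase words, drop empties, count occurrences.
    static Map<String, Long> count(String... lines) {
        return Arrays.stream(lines)
            .flatMap(line -> DELIMITER.splitAsStream(line.toLowerCase()))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("To be, or not to be, that is the question"));
    }
}
```

The Java 7 equivalent needs an explicit loop, a manual split, and a containsKey/put dance on the map, which is exactly the contrast the slide makes.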
Did anyone spot the mistake in this code? There's a prize for spotting the mistake. Come on, you're all Java programmers. I'm going to give you a hint, and with that hint the value of the prize decreases. Any guesses? No? Time's running out: one, two, three. Time out. Please don't kill me after this: it's just an inefficient ordering. The filtering should come before the mapping operation. Simple. Sorry, I was being cheeky. But again, with Jet you can do this; even if it's inefficient, you're still able to do it. This is the output of the word count, where we see how many times each of these words appeared; it's not the full output.

Now let's look at the pipeline again. The same thing: my functions connected with each other, feeding into each other. Think of these arrows, the edges, as the pipes, the piping symbol of Unix. We've got a source, we've got our filter, which is the tokenizer, and then we've got our accumulate step, which is aggregating into the output. Now, this is all single-threaded execution: one thread is doing all of this. It starts with reading the source, then gets to the next stage, the filter, then the next stage, aggregation. What if I put concurrent queues in here? If I add queues, I now have a queue between one function and the next. The benefit of adding queues is that I can now have multiple consumers, which means multiple instances of a function running the same job on different parts of the data at the same time. It's like the tee command of Unix, right? I'm bifurcating one operation into two. Again, in the DAG world, each of these is called a vertex and these are called edges; not specific to Hazelcast, just general DAG semantics and DAG language. Now, if I put multiple queues on all my connectors, what am I able to achieve? I've achieved totally parallel execution of each of my processes, right?
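Putting a concurrent queue between two stages can be sketched with the JDK's BlockingQueue; a toy version (nothing Jet-specific, and the file names are made up) with one producer stage feeding a consumer stage running on another thread:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class StagedPipeline {
    static final String POISON = "\u0000EOF"; // sentinel marking end of stream

    // Stage 1 (producer) pushes items into the queue; stage 2 (consumer)
    // counts the ones matching the filter. Decoupling the stages with a
    // queue is what lets you later run several consumers in parallel.
    static long run(String[] items, String needle) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        AtomicLong matches = new AtomicLong();

        Thread consumer = new Thread(() -> {
            try {
                for (String item = queue.take(); !item.equals(POISON); item = queue.take()) {
                    if (item.contains(needle)) matches.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        for (String item : items) queue.put(item); // producer stage
        queue.put(POISON);
        consumer.join();
        return matches.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(new String[]{"rahul.txt", "notes.md", "rahul2.txt"}, "rahul"));
    }
}
```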
Absolute parallel execution. Earlier it was all single-threaded; now I've got at least six threads doing the execution in parallel: one thread running the source, two threads for the filter operation, two threads for my aggregation operation, and a sixth thread running the sink. Now, this is all one JVM: multiple threads running in one JVM, one Jet instance. What if I put multiple JVMs together? This is where it gets more interesting. Now I've got multiple JVMs doing the same execution in parallel. Imagine the concurrency you achieve: multiple JVMs, each JVM with multiple threads, and each thread able to do its processing task in parallel. You cannot do that in Spark. You cannot do that fast enough in Kafka Streams; Kafka Streams is all about disk-bound processing, where each and every item has to be persisted to disk before it's processed. This is Jet: it's all in memory. That's what makes Jet ultra fast, and we'll see that in a minute.

So Jet is, in a sense, multiple jet engines, literally jet engines, because they are super fast, and the performance of a Jet cluster directly depends on the infrastructure you give it to run on. In other words, the more cores you provide to each Jet server node, the better the performance becomes.

Now we've seen how we deploy a Jet cluster and what it does. A couple of points on fast processing of streams, on the speed. For the same word-count use case, we did a performance benchmark against Flink, against Spark, and Jet. What you see in blue is the performance of Jet, the red is Spark, and the orange is Flink.
What we did: we generated a lot of text files and put them in the file system, in HDFS for Spark and in a plain file system for Jet and Flink. We used the same infrastructure and the same number of nodes for the cluster formation, with word count as the benchmark. This was the performance. Spark had Tungsten enabled, which is basically the turbocharger for a Spark cluster; all the high-performance hooks for Spark were enabled, all the high-performance hooks for Flink were enabled, and the same for Jet. For the simple word-count use case, Jet, the blue bar (forget the green bar for a minute), was already 2.2 to 2.3 times faster than Spark. Same data source, same infrastructure, same benchmark; just the technology is different. And we were way faster than Flink. We didn't include Kafka Streams for a simple reason: again, Kafka Streams is all disk-bound, they don't do anything in memory, so it would not have been a fair comparison.

Now, about this green bar. What we did here is, instead of keeping the data on disk, we moved the source to an in-memory location: we used Hazelcast IMDG as the in-memory store for the text files, and then we made Jet consume from IMDG. So Jet was now ingesting from IMDG, from a memory-based data store, and look at the performance: it was around four times faster than Jet itself with a disk-bound source. That's the kind of performance you get when you put everything in memory: data in memory, source in memory, computation in memory, and sink in memory. That's what in-memory means. Question from the audience: is it the same sentence repeated, or totally different text? Different text, different text. The whole book, let's put it this way. What was the question again?
So the question was about the difference between 64 GB and 640 GB, and whether the text was repeated or different. It's all different text. So we are fast, right? Hazelcast Jet is fast, but how? What makes it fast? That's the question, especially for those who have been using applications like Spark: why doesn't this all look fake? An important note on the benchmarks: all these benchmarks are published in the public domain. We put the code, the detailed analysis, and the results all out in public, and as a matter of fact we use public benchmarking frameworks, so it's there for everyone to see.

Now, what does Hazelcast actually do? When I said we're going to see how Hazelcast lets you build a fast processing engine in just two simple steps, it's basically this: you put the source into a memory location, which is IMDG, and then let Jet ingest from the IMDG in-memory data source. The secret sauce is in-memory data locality: with the combination of IMDG and Jet, we put everything in memory, which means the data sits close to where the processing happens.

The next bit is SPSC queues: single-producer, single-consumer queues. The local edges are implemented as SPSC bounded queues, which employ a wait-free algorithm and avoid volatile writes by using lazy sets. But the single most differentiating factor between Jet and the other technologies out there is cooperative multithreading using green threads. Do we know the concept of green threads? Basically, we literally do core affinity: we tie our threads to the cores of the CPU so that there is no context switching. How does that happen? Each vertex in Jet, those dots you saw, is basically a processing unit, and each processor is designed in such a way that it uses all the cores of the system.
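The lazy-set trick can be illustrated with a toy single-producer single-consumer ring buffer; a sketch far simpler than Jet's real queues, in which the producer publishes its index with `lazySet` instead of a full volatile write (capacity is assumed to be a power of two):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SpscQueue<T> {
    private final Object[] buffer;
    private final int mask;
    // Producer advances tail, consumer advances head; with exactly one
    // thread on each side, no CAS loops or locks are needed.
    private final AtomicLong head = new AtomicLong();
    private final AtomicLong tail = new AtomicLong();

    public SpscQueue(int capacityPow2) {
        buffer = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    public boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = item;
        tail.lazySet(t + 1); // publish without a full volatile-write fence
        return true;
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        T item = (T) buffer[(int) (h & mask)];
        head.lazySet(h + 1);
        return item;
    }
}
```

The ordered-but-cheaper `lazySet` is safe here precisely because only one thread ever writes each index, which is what "single producer, single consumer" buys you.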
And how are the cores used? Jet creates a thread pool sized to the number of cores, which the individual processors use to execute. These are native threads, threads dealing directly with your CPU cores. Each thread in that executor pool does a small amount of work for each processor in round-robin fashion, and then yields back to the Jet engine. It's purely a library-level feature: we do not use any native APIs, and we do not modify any Java source code. This is purely how our engineers designed Jet's multithreading. And what does that give us? As I said, zero context switching: there's no need for the worker thread to involve OS-level thread schedulers, which means the core that has started processing one thread sticks to that thread for its lifetime. Each thread has exact knowledge of the processing it needs to do for the different processors, for the different vertices. That gives Hazelcast's threads fine-grained control over what needs to be done and how, and this is the single most differentiating factor between Hazelcast and the other technologies in the market. Let's quickly skip this. Okay.

Back to the Concorde. So there's no way to take the Concorde any more, of course, but there are other aircraft in the air; there are other airlines in operation, and they move. Now we're talking about handling a real-world use case with Jet. The use case is that there are many different aircraft in the air at any given point in time; we're looking at five to six thousand aircraft in the air at any moment. And they all provide data: there are different websites which provide tracking information for those aircraft.
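The cooperative, round-robin idea can be sketched in plain Java: one worker thread owns a set of tasklets, gives each a small slice of work in turn, and never hands control to the OS scheduler between them. This is a toy sketch of the pattern, not Jet's engine; the `Tasklet` interface and helper are made up for illustration:

```java
import java.util.List;

public class CooperativeWorker {
    // A tasklet does a tiny, non-blocking chunk of work, then yields.
    // It returns false once it has no work left.
    interface Tasklet {
        boolean runOneStep();
    }

    // One native thread drives all tasklets round-robin: no OS-level
    // context switch is needed to move from one tasklet to the next.
    static void drive(List<Tasklet> tasklets) {
        boolean anyAlive = true;
        while (anyAlive) {
            anyAlive = false;
            for (Tasklet t : tasklets) {
                if (t.runOneStep()) anyAlive = true;
            }
        }
    }

    // Helper: a tasklet that appends its name to the trace n times.
    static Tasklet steps(StringBuilder trace, String name, int n) {
        int[] remaining = {n};
        return () -> {
            if (remaining[0] == 0) return false;
            trace.append(name);
            remaining[0]--;
            return true;
        };
    }

    public static void main(String[] args) {
        StringBuilder trace = new StringBuilder();
        drive(List.of(steps(trace, "a", 3), steps(trace, "b", 2)));
        System.out.println(trace); // interleaved round-robin: ababa
    }
}
```

The crucial property is that each tasklet yields voluntarily after a small step, so one thread can interleave many of them with zero scheduler involvement.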
They provide GPS coordinates, CO2 emissions, noise levels, and a lot of other material for each aircraft. So if we can track these GPS coordinates, we can track the location of each aircraft, and we can predict various things: whether the aircraft is taking off or landing, whether it's on its route and not being hijacked, and different other things. This is an example of one such source, the stream of aircraft coordinates and information about each aircraft, which is provided by many different websites on the internet. For example, this is the stream for one aircraft, where we have a timestamp, latitude, longitude, and several other supporting data points.

Now, if you wanted to process this stream, how would you do it? Remember I mentioned earlier that when you have a pipeline with a lot of data flowing into the system, you process it in chunks: you use windows. Let's say you configure a window of five seconds, or five elements in this context, and you process this window, this frame; then you slide your window forward by one second and process that frame, then slide by another second and process that frame, and so on. This is an example of sliding windows. Using this concept of windowing, we're going to see how a single-node Jet instance (a single node is not really a cluster) is able to ingest a real-time stream of data provided by one of these internet sources, how we process it, and how we visualize the data.
This is the website I'm going to use in this application; it's called ADS-B Exchange, and it provides tracking information about aircraft: their location, GPS coordinates, CO2 emissions, noise emissions, and all those things. The objective of this application is to track all these aircraft around major airports in a geographic area, and then calculate whether they are ascending or descending, what the CO2 emissions in the neighborhood are like, what the noise level in the neighborhood is like, and so on.

That's the flow chart of my processing here. I start with my flight data source, which is basically a URL providing me with the stream of aircraft information. Then I filter the aircraft based on altitude: I'm going to filter out any aircraft flying above 3,000 feet, which makes my life easier for this calculation. Any aircraft flying under 3,000 feet, I'm going to consider: I'll measure its altitude in real time and see whether it is ascending or descending. Then we assign airport info to it. Let's say an aircraft is near London Heathrow; then we know that this aircraft is either ascending from or descending into Heathrow. Same for the other major airports we consider in this example. Then we calculate the altitude trend, plus some further calculations: CO2 level, max noise level, and all those things. At the end, we store the result of these calculations in a Graphite database, and we use Grafana, an open-source visualization tool, to visualize the result of our processing; Grafana reads from the Graphite database. (Inaudible audience question about Grafana.) No, I'm going to show you this.
Now, this is the main code. This is the main method, not the public static void main, but the main part where you build your pipeline; here you're literally building your DAG. Jet provides two different ways of designing your job: one is the Pipeline API, which is what's on the screen right now, and the other is the DAG API. The DAG API is quite low-level, and Pipeline provides an abstraction over DAG. You are free to use DAG if you want to; if it's of interest, feel free to use DAG rather than Pipeline. But you can do pretty much the same thing with Pipeline without writing as much code.

Here we use sliding windows, not tumbling windows, where each window is of size 60: I'm going to consider 60 elements in it, and then slide by 30 seconds, which means we slide over the stream of data every 30 seconds. Here I draw from the source, my source being the source URL, which is this guy here: the URL of my stream, all internet-based. I need an internet connection; I think I do have one. Then I do the filtering, where I exclude everything flying above 3,000 feet. Then I do the mapping operation, where I assign an airport to each aircraft based on its GPS coordinates. I perform my windowing operation, do the grouping by key, and perform the aggregation, where I'm measuring the altitude of each and every aircraft every 30 seconds. Once I've done that, I know which of my flights are taking off and which are landing, and the same with CO2 emissions and max noise, and then I drain the output to a Graphite database. So it's the same pipeline: I had the source, I did my aggregation and calculation, and now I'm draining my output to a Graphite database.
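The ascending-or-descending decision boils down to fitting a linear trend to the altitude samples inside each window. A stdlib sketch of that aggregation (not the Jet operator itself; the sample times and altitudes are made up):

```java
public class AltitudeTrend {
    // Least-squares slope of altitude over time: positive means the
    // aircraft is ascending in this window, negative means descending.
    static double slope(double[] times, double[] altitudes) {
        int n = times.length;
        double meanT = 0, meanA = 0;
        for (int i = 0; i < n; i++) { meanT += times[i]; meanA += altitudes[i]; }
        meanT /= n;
        meanA /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (times[i] - meanT) * (altitudes[i] - meanA);
            den += (times[i] - meanT) * (times[i] - meanT);
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] t = {0, 30, 60, 90};               // seconds within the window
        double[] climbing = {1000, 1400, 1800, 2200}; // altitude samples, feet
        System.out.println(slope(t, climbing) > 0 ? "ascending" : "descending");
    }
}
```

Running this per window, per aircraft, is exactly the kind of per-key aggregation the pipeline's grouping stage performs.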
So basically, these are the components of my Graphite database. Grafana is going to read from these columns in the Graphite database. This is all you need to perform that complex calculation; this is literally all the code you need for flight telemetry. How does it all look? So basically, I'm going to start this application. What's happening on the screen is that one Jet node is starting. You can start multiple of these and they will form a Jet cluster of multiple nodes. There you go. But before that, I want to show you the DAG. What we saw in the code is basically this DAG; this is the DAG for that code. We've got the flight data source, we've got the filtering, we've got the assigning of the airport to the aircraft, then we've got the calculation of the linear trend of altitudes, and then the draining to the sink. Again, this is just what is happening in the code. So this application is basically printing everything in the logs, right? We're constantly receiving the stream of data and printing everything in the log, and then writing the output to the sink, which is the Graphite database. Now I've got this Grafana dashboard running. There you go. Let me start a new one so that you don't think that I cheated. There you go: flight telemetry. This Grafana dashboard is basically reading from the Graphite database. It was showing that literally three aircraft took off from Frankfurt in the last 30 seconds, four landed in Paris, six landed in London, one took off from Paris, now two took off from Paris. Now, this is all real time, literally real time. If I kill the internet now, this data will go away. Let's do that. Now the application will complain that it can't find the source, it can't find the stream. Look at the dashboard: see, nothing is moving. 662, nothing is moving, right? Everything remains where it is.
It doesn't mean that the aircraft have turned into helicopters and are hovering in place; it's just that we don't have the data to calculate, okay? There you go: nothing is moving, right? Back to the slides. Okay, a few more things. What about the World Cup? While the final was being played, one of my colleagues in the Czech Republic was giving a talk on Jet, and he developed an application which was reading Twitter sentiment, and based on that Twitter sentiment, he predicted the result of the match, and that's exactly what happened. He predicted Belgium beating England, and that's exactly what happened, just by latching onto Twitter. Now, I'm not saying that he was, what was the name of that octopus? Paul. He was not Paul the Octopus, but with Jet he was able to create an application which could leverage the sentiment people were expressing on Twitter. All right. So, we saw that the application was able to receive data as a live stream. There it was using a simple socket connection, right? We opened an HTTP connection and started receiving the stream of data. You can build your own connectors: Jet provides a lot of different connectors off the shelf, but if you have any type of source for which you cannot find an off-the-shelf connector, you can build your own connector, your own processor, and all those kinds of things. So, long story short, in summary: everything is a stream, right? In today's world, nobody wants to perform batching; everybody wants to do streaming. I want to process my data in real time, so it's got to be streaming. Your code has an input, your code has an output, and you have control of what you want to do in between, right? It's a function. You can temporarily go to disk to persist, or not go to disk at all, or go to disk after you have finished your processing.
It's your call, your code, okay? Just a small note about lambdas: don't get confused by lambdas if you are still on Java 7, right? Yeah, that's pretty much it. That's one of my ex-colleagues, by the way. Whatever you saw today is all available on our website. Yes, I'm going to show you that now. The source code of this is available on GitHub, plus a lot of other demos, like trading and sentiment analysis; there's one that's written for cryptocurrencies. It's not as hot any longer; I wish I could have sold off all my BTC before it dropped. So, here is my flight telemetry application complaining that the source is not available, because I disconnected the internet. Please. You talked about the more cores you have, the better this performs. Yeah. How about memory, especially with the Java GC? In your performance benchmark you had 64 GB. So, obviously, to process 64 GB on, let's say, a single Jet instance, what kind of impact is there? It's a very good observation. Not many pay attention to that, and I'm glad you did. Can you repeat the question? Just give me a second. You're talking about this part, right? Yeah. So, the question is: I did mention Jet being highly performant if you provide more cores for the Jet instance to run on, right? But what about the memory? In this particular benchmark example, we talked about 64 GB, 73 GB and 640 GB. What was the impact of garbage collection and whatnot? So, with this, we were able to put the data in Hazelcast's memory. Now, to a certain extent, you can control the impact of garbage collection by tuning the heap extensively, and then being able to live with garbage collection. We know that we're going to have GC pauses and we can't do anything about it. Otherwise, you'd have to run your heap size in the region of hundreds of GBs.
You don't want to do that, because it's super dangerous. So, we'll live with the GC pauses, however long they are, and we'll deal with them. This is why these three charts have the green bars in there: there, the data was put in the memory of the Hazelcast servers, on plain Java heaps. We could not do this in the second chart, because there all the data was still being read from the disk; in that chart, the data was read from the disk. Now, Hazelcast has this feature called High-Density Memory Store, where you are able to store lots of data in the memory of Hazelcast servers without running into GC pauses. It's not an open-source feature, which is why I didn't talk about it; but since you asked, it's a commercial feature that allows you to create large caches without running into GC pauses. Call it off-heap if you like: it's off-heap memory storage, outside of the JVM heap. Now, for 640 GB of data, if you want to keep that much data in heap memory, that means running into GC pauses one after another, continuously, and eventually an out-of-memory error. So we wanted to use off-heap memory, but because we didn't want to be partial, we didn't do that. So, generally, on average, what sort of memory does it need? The example that you saw, the flight telemetry, was on a standard Java heap, one GB. So it's not what sort of memory Hazelcast needs; it's basically how much memory you want to use. If you want to store, let's say, 3 GB of data in a server process running with 4 GB or 5 GB, then yes. But if you want to store, let's say, 20 GB or 50 GB of data in a JVM, then it's not recommended to run the JVM with such a large heap, because the larger the heap, the longer the GC pauses. The processing is different, though: your IMDG is the store, the processing is not the store. Correct. Obviously, it can run on small amounts of memory.
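For the "off-heap" idea itself, the underlying JDK mechanism is a direct buffer: memory allocated outside the garbage-collected heap, so the data in it is invisible to the GC. This is only an illustration of that mechanism, not Hazelcast's High-Density Memory Store:

```java
import java.nio.ByteBuffer;

// Bare-bones illustration of off-heap storage: a direct ByteBuffer lives
// outside the JVM heap, so its contents do not contribute to GC pressure.
public class OffHeapDemo {
    public static void main(String[] args) {
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1024); // not on the GC'd heap
        offHeap.putLong(0, 123456789L);        // store a value off-heap
        System.out.println(offHeap.getLong(0)); // prints 123456789
    }
}
```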
So, you can run it basically on the defaults, I would say tens of MBs, just the JVM. If you want a precise answer, the overhead is the number of threads that a Hazelcast cluster creates. I'm not talking about nodes: let's say we have a three or four node cluster; the number of threads each node creates is basically the only overhead. Even if one cluster node creates 10 threads, which are doing different things at different levels, that's just 10 Java objects. So the impact is quite minimal. You also talked about Jet not needing third-party infrastructure. Absolutely. Yeah. Because I know Kafka, and I know it needs something to keep everybody in line, all the nodes in the cluster. So how would Jet do that? And also, how would equal work distribution happen? One node could end up overloaded. Okay. So the question is: how does the distribution of work happen across a cluster? The example that we saw was a single-node Jet instance. If I spin up multiple such instances, that flight telemetry application is now running across multiple nodes in the cluster, which means each node is working on one snapshot or one frame of the windows. Currently, that application has one JVM with multiple threads, and each thread is basically processing one frame of that window of the stream, right? If you do the same thing across multiple nodes in the cluster, the same operation is being performed across multiple nodes, but on different data sets. So the aggregation is running in multiple threads across the cluster, but on different data sets: one thread has picked up one frame, another thread has picked up another frame, and so on throughout the cluster.
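The "same operation, different data" distribution can be sketched as simple key-based partitioning: hash the key to a partition, then map partitions onto nodes. The partition count and the modulo scheme below are simplified assumptions for illustration, not Hazelcast's actual partitioning code:

```java
// Simplified sketch of partitioned work distribution: every key
// deterministically maps to one partition, and partitions are spread
// round-robin over the cluster nodes.
public class Partitioner {

    /** Deterministic partition for a key (floorMod keeps the result non-negative). */
    public static int partitionFor(Object key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Round-robin assignment of partitions to nodes. */
    public static int nodeFor(int partition, int nodeCount) {
        return partition % nodeCount;
    }

    public static void main(String[] args) {
        int partitions = 271; // illustrative partition count
        int nodes = 5;
        int p = partitionFor("FLIGHT-BA123", partitions);
        System.out.println("partition " + p + " -> node " + nodeFor(p, nodes));
    }
}
```

Because the mapping is deterministic, every frame keyed by the same aircraft always lands on the same partition, and therefore on the same node, which is what lets the per-key aggregation run independently on each thread.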
Then eventually they converge and the aggregation happens. So, this was my filtering part; now the aggregation also has to happen, and the aggregation also happens on one of those threads. So, I've got one vertex as my filter, I've got another vertex as my aggregator, and each vertex is also distributed. Each vertex can be processed by multiple threads within the JVM, within one cluster node, right? Then you have multiple such JVMs, multiple such cluster nodes. So, let's say you have five JVMs, five cluster nodes, and each cluster node has five threads; that means you have one vertex being processed by 25 threads, which is basically what makes this chart possible, because we are able to execute a job concurrently, in a highly concurrent environment, with multiple threads. So, in this case, when you mentioned five threads per JVM, the filter is one vertex, so the filtering is happening on 25 threads across five JVMs, right? And you mentioned the windowing concept; how is the windowing happening here in this scenario? So, how it works is, again, internally it's based on partitioning. Jet creates partitions; it takes a window, one frame, creates a key for that frame, stores it in a partition, and then each key is processed there, right? So the windowing is, by its own logic, distributed: it's spread across the cluster. So it's like, let's say, if I have one MB of data, then it is going to take a few hundred KBs and give it to one JVM, then another hundred to another JVM, another hundred to another? Think of it as MapReduce, right? How does MapReduce work? This is like MapReduce, but not really MapReduce; it's a DAG-based implementation, right? So some part of the data is processed by each, yes.
So, here in this case, we're talking about processing a stream, where one frame is one part of the data, right? Another frame is another part of the data, and each frame is processed by one thread. So, imagine the image which you showed: various nodes working together, stage by stage, but some stages could be split across nodes? Each of those stages is part of one node, and you have multiple such stages spanning multiple nodes. Actually, you mentioned that the state of the job is stored on disk. When does that happen? Every second, every five seconds, whatever is configured. And supposing something weird happens, like the cluster goes down, then how is it done? So, that's what is called lossless recovery. Lossless recovery is based on storing the state of the execution on disk, and when the cluster comes back online, it reads the last execution state from the disk and starts executing from that point onwards. It's like I walked two steps here, then I stopped and wrote down my journey from here to there somewhere. When I come back tomorrow, I'll go into my database, see where I was, and start from there. So, is it like a retry happening? No, it's not a retry. Jet consistently and constantly stores the state of its execution, not the data, the execution, on the disk, so that when it comes back online, it knows where it needs to pick up from. Okay, that's one thing. Now, in the case of, let's say, the stream going down, as when you shut down the internet, and suddenly it comes back up, what is Hazelcast going to do? So, there are multiple things you can make Hazelcast do: either wait long enough for the stream to come back online, or give up, give up after trying for several minutes. So, here in this example, did I configure a timeout? No, I didn't configure a timeout here, but this exception occurred.
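A toy version of this "write down where I was, resume from there" idea can be sketched as follows, assuming a made-up single-offset file format (Jet's real snapshots store much richer execution state than this):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy lossless-recovery sketch: periodically persist the last processed
// position, and on restart resume from whatever was last saved.
public class Checkpoint {
    private final Path file;

    public Checkpoint(Path file) {
        this.file = file;
    }

    /** Persist the current position, e.g. on a periodic snapshot timer. */
    public void save(long position) throws IOException {
        Files.writeString(file, Long.toString(position));
    }

    /** Returns the saved position, or 0 if no checkpoint has been written yet. */
    public long restore() throws IOException {
        if (!Files.exists(file)) {
            return 0L;
        }
        String s = Files.readString(file).trim();
        return s.isEmpty() ? 0L : Long.parseLong(s);
    }
}
```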
The application gave up trying to read from the stream after a certain amount of time, right? You can configure all of these things, how much tolerance you want in the cluster for failures. This is a failure, a failure of my application towards my source. The stream was the source; my job running in the Jet cluster is not able to find the source, which means the edge which was connecting my source with my other computation vertex is missing its link, right? So you can define the tolerance: okay, how long am I going to wait for this stream to come back online? Any more questions? Cool. Sorry, one more thing I wanted to show. There are a bunch of other demos that we have created for the community. If you go to jet.hazelcast.org, that's the Jet website homepage, you'll find a lot of details there as to what the technology is. You can download Jet from there: you go to the download page, download a tarball or zip file, and put it on your classpath, right? Or, if you want to customize Jet, go to GitHub, do a clone, and start building the code yourself. Then there are a bunch of demos. What we were looking at was the flight telemetry demo, here. All these demos have all the details as to what the demo does, where you can download the code from, and how you can run it, right? And all these demos are self-executing: all I needed to do was download the code, do a Maven build, and run the Java application, that's it. So, for example, there's a real-time sports betting engine, or Twitter cryptocurrency sentiment analysis. This one we created for cryptocurrency sentiment, but it was derived from our World Cup application, right? Same for market data ingest: again, an example of how you ingest market data and all those kinds of things.
So, yeah, please feel free to take a look and download these demos. They are all free, open source, cost nothing, and are easy to run. Okay? Cool, that's all I had for today. Thank you very much.