Well, hello and welcome to another DevNation Live. I'm excited today to have Galder with us. He's going to take us on a deep dive into big data. I'm hoping you're excited about big data. And of course, big data is one of those overloaded terms that has a lot of different technologies, techniques, and ideas behind it. But Galder has a really awesome presentation and a series of really awesome demonstrations that I think you're going to enjoy. As always for DevNation Live, we run these in very short intervals: about 25 minutes of presentation and demonstration, then a few minutes at the end for questions. Please hit us on the chat with your questions, just type them right in, and when we get towards the end, I'll be asking Galder those same questions. So at this point, let's turn it over to Galder. Are you ready to go? Yeah, I'm here. Awesome. Hello, and welcome everyone. Thank you for attending this presentation on big data in action with Infinispan. Today we're going to start by looking at the problems of real-time big data and the data growth that is happening at the moment, and see how we can solve those with Infinispan, which is an in-memory data grid. Then we'll go through a couple of live-coding demos that use Infinispan, Vert.x, and OpenShift. My objective today is to show you how you can build an infrastructure based on Infinispan to store, search, and process big data in near real time, and how to run analytics on it as well. Now, dealing with real-time streaming big data is quite a unique challenge. Offline processing of big data has been happening for a long time, whether via batching or similar technologies, but you can never really get it to be fully real time. And the reason this is important is that even a delay of a few seconds can mean the difference between keeping or losing a customer, or, for a financial institution, it could be the difference between increased liquidity or big losses. So real-time processing can be really crucial for certain businesses. We've also seen a huge growth in data, and this is the result of smartphones, IoT devices, and so on. So with all this big data, how can we handle it, and how can we make some sense out of it? For these two problems, we think that in-memory data grids are a perfect solution. So what are in-memory data grids? They are essentially a way to manage distributed data in memory. What we have is servers connected in a mesh with a peer-to-peer communication style, so there are no master or slave roles, and hence no single bottleneck or single point of failure. These data grids are designed to run on commodity hardware, so you don't need expensive hardware, and they are linearly scalable. For that we use a smart data distribution technique: even if you have a big cluster, we divide the data so that only a certain number of copies of each entry are maintained. What this gives you is that each node maintains a subset of the data, and by doing that you get this nice, implicit data parallelism, which is going to be very important for us later on in this talk. Data grids are also elastic and handle failures transparently, so they're very well suited for cloud or platform-as-a-service environments. And they can be backed by a database, a file system, or another persistence store, if you want to give your data a longer life.
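To make that data-distribution point concrete, here is a minimal sketch of defining a distributed cache with the embedded Infinispan API, where each entry is kept on only two of the cluster's nodes. This is illustrative only: the cache name and owner count are assumptions, and the demos later in the talk use Infinispan in client/server mode rather than embedded mode.

    import org.infinispan.Cache;
    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.configuration.global.GlobalConfigurationBuilder;
    import org.infinispan.manager.DefaultCacheManager;

    public class DataGridSketch {
        public static void main(String[] args) {
            // Clustered cache manager; nodes discover each other and form a peer-to-peer mesh.
            DefaultCacheManager cacheManager = new DefaultCacheManager(
                    GlobalConfigurationBuilder.defaultClusteredBuilder().build());

            // Distributed cache: every entry is stored on 2 owner nodes, so each
            // node only holds a subset of the overall data set.
            ConfigurationBuilder cfg = new ConfigurationBuilder();
            cfg.clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2);
            cacheManager.defineConfiguration("stations", cfg.build());

            Cache<String, String> cache = cacheManager.getCache("stations");
            cache.put("station-1", "Basel SBB");   // lands on 2 of the N nodes
            cacheManager.stop();
        }
    }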
And they're accessible from any type of application, which gives you the ability to have this kind of transparent sharing of application data. We have connectors available for Java, Node.js, C++, and .NET. Infinispan can be used in many, many different ways. I'm not going to go through all of them here because we don't have the time. You can, for example, use it as a distributed cache, or as a kind of distributed NoSQL database with querying, transactions, et cetera. But for this presentation, the two interesting use cases are how we can do data analytics, using either distributed data streams or the Spark and Hadoop integrations, and how we can do event-driven computations that allow us to do real-time processing of data. The talk today is going to focus on these two use cases. So the first use case, the one about event-driven computation: there are several ways in which you can access Infinispan, whether as a plain dictionary-style key-value store or through the query API. But if we want to do real-time processing, we need something a little bit more advanced, and this is what the continuous query API gives us. It enables you to be reactive to data changes. Basically, the continuous query API is an extension that allows an application to receive the entries that currently match a query and then be continuously notified of any changes that happen to the query's data set. What this gives your application is the ability to receive a steady stream of events, instead of having to execute a query over and over to see whether there is any new data, or whether some data is no longer part of the query. This leads to a more efficient use of resources. The best way to understand how it works is to see it in action. For that, I'm going to show you a demo today which is based around the Swiss rail transport system. What you see on the screen is the domain of objects we're going to be using. We've got a train, which is a physical train, with its ID, type, et cetera. Then we've got the station, with information such as its geographic position. And for each station, we've got a station board: at a particular moment in time, which trains are arriving at which platforms, where they are going, and so on. Each of the entries in the station board we call a stop. The demo that I'm going to show today runs on top of OpenShift. We've done several presentations on it already at DevNation. OpenShift is Red Hat's platform as a service that allows developers to quickly develop, host, and scale applications in a cloud environment, and all the applications are going to be running on top of it. The demo also uses Vert.x, which is a toolkit for building reactive applications on top of the JVM. It's event-driven and non-blocking, which means that your application can handle a lot of concurrency using a very small number of threads. In the first demo, our data grid is going to store the state of the station boards at a particular time, and what we're going to try to do is create a dashboard that shows us all the trains that are delayed in the system. So of all the trains going through the country, which ones are delayed?
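As a reference for the code that follows, the domain might look roughly like this. These shapes and field names are hypothetical, inferred from the spoken description above; the actual demo classes may differ.

    import java.util.List;

    // Hypothetical domain shapes inferred from the description of the demo.
    class Stop {
        String trainName;      // the train identifier shown on the board
        String destination;    // where the train is going
        String platform;       // platform it arrives at
        long delayMin;         // delay in minutes, 0 if on time
        long departureTs;      // scheduled departure time (epoch millis)
    }

    class StationBoard {
        String stationName;    // station this board belongs to
        long timestamp;        // the moment in time the board represents
        List<Stop> entries;    // the stops currently listed on the board
    }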
And that way, say I'm someone in the management of the Swiss transport system, I can see which trains are getting delayed, et cetera. So on the front end, on the right-hand side, we're going to have a dashboard running on my laptop, and that's basically going to be communicating with a real-time component that is already running on OpenShift. Then we're going to have an injector verticle. A verticle is a concept in Vert.x: it's like an actor, a unit of processing. The injector is going to feed our data grid with data about the station boards. On the left we're going to have the data grid itself, three nodes, with the data distributed across them. And then we're going to work on the continuous query verticle, which is the one that allows us to say: of all the station board data, give me the information about those trains that are delayed. So now I'm going to show you all of this in action. I'm just going to switch to my screen, to my IDE. What I have here for the dashboard is a JavaFX application. I'm going to run it as is first, so you can see what it looks like. Okay, you can see right now it's just a little dashboard and it's empty; the code is not complete, and that's what I'm going to fill in right now. From a continuous query perspective, there are two things we need to do: define the query and define the listener. The most important thing from a structural perspective is that we've got a cache, which is of course our base storage, and for each station we keep its station board, its current state. So from our query factory, we want to build a station board query: give us those station boards where there is at least one entry, one stop, that is delayed. The "entries" that appears here comes from the fact that, if you look at StationBoard, it's got a list of stops, and those are the entries. So that's where this particular piece of code comes from. Okay, so we have our query. The next part is the listener. Here what we need to say is: when we get a station board that contains at least one delayed entry, take the entries and pass them over to the front end. So we first take the stops out of each station board, which can have multiple of them, keep those stops whose delay is bigger than zero, so the ones that are delayed, and for each of those we push it onto the event bus. The event bus is the part that, underneath, makes sure this is shipped over a WebSocket to our front end. So we've got the query, and we've got the listener in place. The final thing we need to do is put the two together: on the continuous query, for this query, add this listener. That's it. Now the next step is to publish these changes, so we need to push them over to OpenShift. There are multiple ways to do that. The one I've chosen here just involves building the application locally and then pushing the binary over to OpenShift. So I just build it first. That takes a few minutes.
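Pulling the pieces of that walkthrough together, the continuous-query wiring looks roughly like the sketch below. It is a sketch only: the domain field names follow the hypothetical classes sketched earlier, the event bus address is made up, and it assumes the marshalling and querying setup for the domain classes is already in place on the remote cache.

    import io.vertx.core.eventbus.EventBus;
    import org.infinispan.client.hotrod.RemoteCache;
    import org.infinispan.client.hotrod.Search;
    import org.infinispan.query.api.continuous.ContinuousQuery;
    import org.infinispan.query.api.continuous.ContinuousQueryListener;
    import org.infinispan.query.dsl.Query;
    import org.infinispan.query.dsl.QueryFactory;

    public class DelayedTrainsQuery {

        public void start(RemoteCache<String, StationBoard> stationBoards, EventBus eventBus) {
            // 1. The query: station boards with at least one delayed stop.
            QueryFactory qf = Search.getQueryFactory(stationBoards);
            Query query = qf.from(StationBoard.class)
                    .having("entries.delayMin").gt(0L)
                    .build();

            // 2. The listener: extract the delayed stops and push them to the front end.
            ContinuousQueryListener<String, StationBoard> listener =
                    new ContinuousQueryListener<String, StationBoard>() {
                        @Override
                        public void resultJoining(String station, StationBoard board) {
                            board.entries.stream()
                                    .filter(stop -> stop.delayMin > 0)
                                    .forEach(stop -> eventBus.publish(
                                            "delayed-trains",    // made-up address
                                            stop.trainName + " +" + stop.delayMin + "min"));
                        }
                        // resultUpdated / resultLeaving could be used to update or
                        // remove rows from the dashboard as delays change.
                    };

            // 3. Put the two together.
            ContinuousQuery<String, StationBoard> continuousQuery =
                    Search.getContinuousQuery(stationBoards);
            continuousQuery.addContinuousQueryListener(query, listener);
        }
    }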
Then what I can do is run something called oc start-build, which allows me to say: take this binary and push it over to OpenShift. If we do that, we can now go over to our OpenShift console, and we start to see that a second version of the real-time component is being built; that's the component that has the update. We can see it here, it's updating now. Okay, so version two of our application is up. We go back to the IDE, and from here we run our JavaFX application again. Within just a few seconds, we should start seeing delays appear. And yes, we start to see our dashboard filling up. We've compiled about two or three hours' worth of station board data and we're feeding it into the system; as delays happen, they're pushed from OpenShift over the WebSocket to our JavaFX application. So we can see delays happening. The train system in Switzerland is really good, but as you can see, sometimes delays still happen. So this is the real-time part of the demo that I wanted to show you, and we're going to move on to the analytics part. And Galder, we do have a question on the demo that I think is pertinent right now: why is there a filter and a stream for minimum delay? Min delay? Oh yes, because of how I get the query results. The query is about a station board, so this first query gives me a station board where at least one of the entries is delayed. So if the station board has got ten entries and one of them is delayed, I still get the whole station board in the callback. Then afterwards I need to go over all the entries and extract just the ones that are really delayed. That's based on the structure of the cache; I could have structured it in a different way so that I didn't have to do the two filters, but at the moment, this is the granularity that I get. Understood? Yeah, and that's very cool, the concept of having that continuous stream of data and then being able to filter it and modify it and then react to it as it comes in. Yeah, and at the moment it only handles results joining the query, but you could also say: if a result leaves the query, then remove it from the table as well. That's not yet implemented, but it could be done easily. Okay, so let me just stop this and get back to my presentation. In the next part, I want to show you a little bit about the analytics side: what you can do with big data analytics in Infinispan. In Infinispan, we primarily have two ways of analyzing the data. On one side, we've got the Spark and Hadoop integrations. What that means is that you can use the very powerful APIs of Spark and Hadoop with an Infinispan backend. So you combine the advanced data analytics APIs of Spark and Hadoop with in-memory data grid technology, and you get all the benefits of the data grid: peer-to-peer communication, no master/slave concepts, et cetera. Now, the bigger problem with Spark or Hadoop is that they are big stacks that require their own resource management. They have their own technologies for clustering, using ZooKeeper, et cetera. So they have very powerful APIs, but you should really only use them when you really need them.
Alternatively, what you can do with Infinispan is use distributed data streams, which allow you to transform and analyze data stored in the data grid. This is essentially an extension of the Java Streams API, so that you can do filter and map calls on Java streams in a distributed fashion, which is quite cool. One of the key benefits is that it allows operations to run simultaneously on different nodes. This is because, as I mentioned earlier, one of the key aspects of Infinispan is that the data is distributed: for each particular key-value pair there is always a node that is the owner, and each node is the owner of a certain subset of the keys. So when the lambdas, the functions, are executed on these nodes, they're executed locally, only over the subset that the node owns. That means all the nodes work together, executing these lambdas simultaneously, and this is a very powerful way of doing distributed processing without the need to bring in Spark or Hadoop. So the next demo I'm going to show you uses distributed Java streams to answer this question: which hour of the day has the biggest ratio of delayed trains in Switzerland? This is not a trivial thing to answer, as you'll see in the demo. The way this demo is structured, on the front end we're going to have a Jupyter notebook. Jupyter is a very cool technology for writing data visualizations and doing data-science-related things, and I'm going to take advantage of its plotting capabilities. From Jupyter I'm going to call into our OpenShift environment, into the analytics verticle. This is another verticle, which is going to get the data we need to answer this question. The way it works is that we've again got a data grid of three nodes, but this time we're going to have a server task deployed in there, and that's where we're going to write our distributed Java streams. The analytics verticle makes a call to one of the nodes, and that node takes charge of distributing the stream operations to the other nodes, bringing the results back, and sending them back to the analytics verticle. We also have an injector verticle, which is the one that pushes data into the data grid. Here we're talking about two to two and a half million entries, which is three weeks' worth of data to analyze. So let's see this part in action. The most important part is the server task. A server task is a little bit like a PL/SQL stored procedure: you deploy it into the server and run it there. Now, to calculate the hour with the biggest ratio of delayed trains, we need to get a couple of maps in place. We need a map saying: between midnight and one o'clock we've got maybe a thousand trains going through, between one and two we've got maybe 800, et cetera. And then we need the same map but only for the delayed trains: maybe between midnight and one o'clock we've got 90, between one and two maybe 20, et cetera. So how do we do this with distributed streams?
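A rough sketch of the kind of aggregation walked through next is shown below. It is not the actual demo code: the Stop field names are the hypothetical ones from earlier, the time zone is assumed, and the real demo packages this logic inside a server task deployed to the Infinispan server.

    import java.io.Serializable;
    import java.time.Instant;
    import java.time.ZoneId;
    import java.util.Map;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;
    import org.infinispan.Cache;
    import org.infinispan.stream.CacheCollectors;

    public class DelayRatioSketch {

        // Counts stops per hour of the day; with delayedOnly = true it counts only
        // delayed stops. Run once with each flag and divide to get the ratio per hour.
        public Map<Integer, Long> stopsPerHour(Cache<String, Stop> stops, boolean delayedOnly) {
            return stops.values().stream()
                    // The predicate is declared Serializable so it can be shipped to the
                    // nodes that own each partition of the data and run there locally.
                    .filter((Serializable & Predicate<Stop>)
                            stop -> !delayedOnly || stop.delayMin > 0)
                    // Collectors.groupingBy returns a non-serializable collector, so we
                    // send the *supplier* of the collector and let each node create it.
                    .collect(CacheCollectors.<Stop, Map<Integer, Long>>serializableCollector(
                            () -> Collectors.groupingBy(
                                    DelayRatioSketch::hourOfDay,
                                    Collectors.counting())));
        }

        private static int hourOfDay(Stop stop) {
            return Instant.ofEpochMilli(stop.departureTs)
                    .atZone(ZoneId.of("Europe/Zurich"))
                    .getHour();
        }
    }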
If you look at the cache here, it's structured in a slightly different way: here we're basically recording all the real stops that happened at our stations. So what we're interested in is the stops. What we need to do is collect them with a groupingBy: for each stop, we get the hour of the day, and then we count them. This gives us, per hour, how many stops, how many trains, are going through. Now, the thing that is a little bit tricky at the moment is that this groupingBy returns a collector object that is not serializable. A very easy trick you can use when something is not serializable is to make the function that creates it serializable. So instead of creating the object right here, we send the function that creates the object to the other nodes and let each node execute it locally. This takes advantage of one of the tricks in Java 8: if you take a lambda and pass it to a method whose parameter type, here a supplier, is also declared serializable, the JVM puts the two together and does the hard work for us, without us having to do any casting. For the delayed part, it's pretty much the same, with the difference that we say we're only interested in those stops that are delayed. The cool thing is that we don't have to do any magic here: the predicate that we pass to filter is already forced to be serializable, so it's ready to be shipped to the nodes. And that's all we have to do. The final bit is that I need to rebuild the server task and deploy it. So the first thing is to build it and then deploy it. This is something we're still working through underneath: here we just create the jar and push it into a particular place, and the container picks it up from there. There are other ways we could have done this, with volumes or other approaches, but for this demo I'm just going to do a simple copy and the server picks it up. Okay, so our server now contains the jar with the task that we want. So for the last bit of the demo, I'm going to switch to the screen with the Jupyter notebook. Jupyter allows me to execute individual cells, so I can execute the imports, get the URL, and finally, in this third cell, I make the actual call. You see it takes a few seconds, and once I receive the answer, I can plot it. And we see something that is slightly surprising: the biggest ratio of delayed trains is at two o'clock in the morning. You also see that as the day progresses towards the end, more trains are delayed. This might surprise you, and it surprised me too, but it's one of the nice side effects of the Swiss train system: for the last connections of the day, the connections wait for each other. So if there is one train that is delayed, it has a ripple effect through the entire system for those last connections.
So if you're stuck somewhere and you need to get your last connection, if your train is delayed, you can be sure they're going to wait for it. Just for fun, let's have a look at the data so you can see what we're talking about. Even at rush hour, which is maybe what you would have expected, you see that at seven o'clock in the morning you've got 99,000 trains running per hour; at five o'clock you've got 113,000, but only 6,000 of those are delayed, whereas at two o'clock in the morning there are 2,300, and 321 of those are delayed. So you get a bigger ratio of delayed trains at that time. That's all I had to show you today. Going back to the objective, I just wanted to show you a little bit of how you can use Infinispan to create an infrastructure that can handle real-time data and run analytics: continuous queries for real-time data processing, and Infinispan distributed Java streams for some basic analytics processing. Thanks a lot to the people who allowed me to use their icons here; I think the links are going to be put up. The demos that I've shown today are in a repo, so you can go and try them. There are instructions on how to do that and how to run them locally, et cetera. That's all I had to say today. And we do have a question, in fact we have several questions and more will be coming in, but one in particular I think is important: could you discuss again the relationship between Apache Spark, Hadoop, and Infinispan? Does one include the other, or how do you integrate those different platforms and technologies? Essentially, there is an integration point in both Hadoop and Spark that lets you define where they consume their data from. We've built on that integration point, and we've got modules that you can use in your projects; they make Spark or Hadoop read their data from Infinispan, or store it into Infinispan and consume it again. I think this is particularly interesting if you already have data in Infinispan that you want to consume from an existing Spark or Hadoop project, or if you want to take advantage of the elasticity that Infinispan gives you compared to other in-memory options for those projects. So that's how it normally works. We've got blog posts and documentation on how to use them, and they're quite popular. And they have very interesting APIs as well. And I do think the intersection where these technologies integrate is very interesting: Hadoop of course being more disk-based, it's not in memory per se, it runs off the Hadoop file system, and Spark with, what do you call them, those little mini blocks of streams. Right, so it's still in memory, but it's processed in those small blocks. Yeah, and Hadoop is where you switch: instead of consuming your data out of HDFS, you can consume it out of Infinispan. So for example, if you already have data in Infinispan that you've used for something else, like keeping some live data, then you can hook it into Hadoop and process it with Spark as well. And the APIs are very advanced; we're not trying to compete head to head with those APIs, they're fantastic. So there was an interesting question that came up early on: do you work for, or have you ever worked for, ProfitBricks? No, no. Maybe there's another Galder out there in the world.
Can you give us a little explanation of how to get started? What you showed us is incredibly advanced, incredibly cool. I love the real-time streaming, the continuous query, I love how you compared it to PL/SQL even, and of course using Vert.x to put that reactive programming model on top. But how would I get started with the basics? Just setting up Infinispan, getting started, and then understanding how to apply this continuous query action. On infinispan.org we've got an area with small, focused tutorials which allow you to quickly get started with a particular feature. We've got them for querying and for continuous query; we've got them, for example, for how to get started from a JavaScript application. If you can still see my screen: if you go to the community section, under learn, you'll find these tutorials, and they are the best starting point, because they are very focused examples of how you can do querying, remote scripting and so on, for example to build these kinds of server tasks. In my case I've shown server tasks written in Java, but you can also write them in JavaScript. And from there you've got plenty of other material to keep going, but the tutorials are the first place to go: they're very, very small, so very easy to understand and work through. All right, well, that is fantastic, and thank you very much for that. And we are out of time for today's session. Thank you guys so much for hanging out with us and spending this 30-minute window with us. The recording will be available immediately, as soon as we get disconnected here. But do look forward to other DevNation Live sessions. The next one coming up will be more focused on reactive programming, another deep dive into that reactive programming world with Vert.x. And do look at developers.redhat.com slash DevNation Live for the other archives. We're going to be scheduling many more of these into October and November, with other exciting content like you saw today. Galder, thank you so much. Awesome demonstration, great introduction to those concepts. Thank you so much. Thanks, Burr. Thanks everyone for attending. And yeah, thank you. Yeah, absolutely.
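As a footnote to the getting-started question above, the very first steps with Infinispan from Java typically look something like the sketch below: run an Infinispan server, then connect over Hot Rod and put and get a value. The host, port, and cache name here are placeholders, not something taken from the talk.

    import org.infinispan.client.hotrod.RemoteCache;
    import org.infinispan.client.hotrod.RemoteCacheManager;
    import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

    public class HelloInfinispan {
        public static void main(String[] args) {
            // Connect to a running Infinispan server over the Hot Rod protocol.
            RemoteCacheManager manager = new RemoteCacheManager(
                    new ConfigurationBuilder()
                            .addServer().host("localhost").port(11222)
                            .build());

            // Grab a cache and use it like a map (cache name is a placeholder).
            RemoteCache<String, String> cache = manager.getCache("default");
            cache.put("hello", "world");
            System.out.println(cache.get("hello"));

            manager.stop();
        }
    }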