So, welcome to this session. I actually changed the title for it. Originally, the title was "High-Performance Apache Spark with an In-Memory Data Grid," but I changed it, since this conference is almost all about microservices. Everybody talks about microservices, and everybody has to have microservices in the title, so I changed the title a bit. The title today is "High-Performance Data Storage in a Microservices Environment," and I'm going to talk a bit about that as well. What is it with microservices, and why is it interesting to talk about data storage? Well, if you've been at this conference, you may have noticed that very few people are actually talking about data storage for microservices, because it's one of the hardest problems to solve. In many cases, what we want to do with data in microservices is, for each service we provide, store the data locally, close to the service.

To give an example of how that would look: this is an example application we have called CoolStore. CoolStore used to be a monolith, but we have broken it up and delivered it as microservices as well. And when you break an application up and build it as microservices, you have to do certain things. You have to define, for example, what your different services are. One of them could be a product catalog, because we have a lot of product data coming out here, with products showing on this page. So, the product catalog is definitely one of the microservices we want to define. Another one would be pricing, because pricing changes depending on which locale you are in. You might have advanced pricing rules depending on taxes, et cetera. If you log in, your price might change as well. You might have discounts, et cetera. So, pricing is a service in itself.
We also have inventory here, a bit hard to see, but inventory shows how many of each item are left in store, so you know you can still buy them. Inventory would be a microservice as well. On top of that, we also have features like the shopping cart and authentication, and these might also be microservices. I define these as functional microservices: we have the product catalog here, the inventory service, the pricing service, and we also have something called the UI service, which might be responsible for serving the whole page. Then we have the shopping cart, authentication, et cetera. And on top of that, we probably have another set of microservices that help us with things like fault tolerance, et cetera, but that is not what this talk is about.

For each of these, you would typically have one data store. And when you have one data store per service, let's say you use Postgres or MySQL for the product catalog service here, the problem is that when you want to scale the service, you typically also have to scale your database layer. And you're still limited. It's better than in the monolith, but you're still limited by how much you can scale your database. So, that's actually a perfect use case for JBoss Data Grid. JBoss Data Grid is a distributed data store that can keep data both in memory and on disk, and by doing that, we can scale and meet performance requirements much better, specifically for microservices.

So, a bit of explanation of JBoss Data Grid. As I said, it's an in-memory data store. It can be used to accelerate big data analytics, but it's also great to use, for example, as microservices storage. In that case, you would have something like WildFly Swarm or Spring Boot over here, or JBoss EAP, writing directly to the data grid instead, which is a distributed data grid running here. We can scale the data grid independently of the services.
So, this becomes like a storage layer for the microservices that we're building. Services can just request more space if they want to store more data, so we can grow this as our microservices environment grows. The good thing is that it also provides different layers, so we can connect things like Spark, or other applications like rule engines, et cetera, to operate on that data. Also, if we want to, we can overflow data to a persistent store. Obviously, in-memory is good, and we can have fault tolerance in memory: if one service goes down, or even one rack goes down, we still have replicated data in memory. But you might also want to save the actual data to some file storage as well, and we can do that in the data grid, too.

[Audience question.] Yes, exactly. This example is using Cassandra and Red Hat Storage, but we support databases here as well, so we can overflow data into a physical database if we want to. So, I'm going to repeat your question for the audience: the question was whether we can use the data grid as a buffer in front of databases. And yes, one of the configurations we can use for the data grid is to persist asynchronously, as a write-behind. That means it's not writing as part of your transaction or your initial write, but it will guard against overloading your database. So you get better performance than your database gives you, and it will still eventually write to the data store. And it will be extremely fast for reads, et cetera.

But one of the problems you'll see, even in a distributed setup like this one, and it doesn't show in this picture, is that you will typically have different islands of data.
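The write-behind persistence just described can be sketched in Infinispan's XML configuration. This is a minimal sketch, assuming a file-based store; the cache name, path, and queue size are made-up illustration values, not from the talk:

```xml
<!-- Hypothetical sketch: a distributed cache with two copies of each entry
     that persists asynchronously (write-behind) to a file-based store. -->
<distributed-cache name="product-catalog" owners="2">
  <persistence>
    <file-store path="/data/product-catalog">
      <!-- writes are queued and flushed to the store in the background,
           outside the caller's transaction -->
      <write-behind modification-queue-size="1024"/>
    </file-store>
  </persistence>
</distributed-cache>
```

The same idea applies with the Cassandra or JDBC stores mentioned in the talk; only the store element changes.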
You will have different instances of what we call caches, different stores in the data grid where you keep the data, and we'll see examples of this later. These are not the same store, so your data effectively becomes islands that are defined by your microservices architecture. In our earlier example, the product catalog would be one island, while inventory is another one, et cetera. If you want to run a query against both, that's actually hard to do, because you have data in two different data stores. So, this talk is about how to unlock your data, and how to use, for example, Apache Spark to do that.

To get value out of the data we have in the different microservices, we could do things like ETL, but ETL has limits, it can affect performance, and the problem is that it's typically batch. With Apache Spark, we can do batch jobs but also real-time queries, where we read the data from the microservices, operate on it, and create, for example, business reports combining data from different microservices.

So, I thought I would demonstrate this. Let's see how that would look. I looked around for big data sets, and one of the big data sets available is actually Stack Overflow. You're probably familiar with Stack Overflow. It's a great resource if you have issues and want help with programming languages or other things: you post a question on Stack Overflow, and there's a big community responding to questions. So, you have questions here, and people post answers up here. But you have to be registered, so we have users, and we have something called posts.
If this were implemented as microservices, and I don't know if it is, it would typically have a user store and a post store. So, I'm going to use this data, and one nice thing is that they actually publish the data. Let me show you here. This is Stack Overflow: somebody posted a question, for example, put in some code, and somebody posted an answer. And the good thing is that Stack Overflow stores all that data, and not only from Stack Overflow itself: it's part of a bigger network called Stack Exchange, with a lot of different groups for posting, et cetera. So we have a lot of data. You can choose a big data set or a small one. For this example, I chose a small data set, but we have actually run this with the biggest data set in here, which I think is 44 gigs of data, zipped. So, that's quite a lot of data once you unpack it. The Stack Overflow set is the biggest one. It works great, but it's more than I can run on this laptop.

So, let's get back to... actually, let's look at the first example here. I'm not a front-end programmer, I have to say that first, so this might not look as good as it should, but technically it will work, hopefully. What you see at the bottom here is an illustration of the data grid I'm running. Currently, I'm running three data grid processes, locally on my machine. So, three data grid instances, but they could be distributed across a network; they could be running in different pods and be part of a microservice, for example. And what we have here, you can see, is that each store currently shows zero entries, and I have a set of different caches, or stores, here. So, I have the post store, for example.
Exactly, these are the separate data spaces that we have. Posts will go into the post store, and users will go into the user store, but we still have nothing here. So, the first thing we have to do is load in data. I'm going to do this now, and see if I can be quick and switch back to my window. Now we can see it being populated with data, and that was quite quick, actually. You can see I'm using a small data set for the purpose of the demo. If you add those numbers together, you can see it's somewhere around 6,000 entries. If you're really good at math, you could probably calculate it quite quickly, but I'm not that good at calculating out of the blue. This is the data in the user store, and we also have data in the post store. So, this illustrates the data that we have in here from our microservices.

What we want to do now is make use of that data. One thing we could do is run a query in Apache Spark that calculates the highest-ranking users, the users with the highest reputation, because that's part of Stack Overflow: when you respond to things, you get points. You get reputation points, and the higher you are, the better you are, and that's definitely something they want to promote. Well, I don't have the highest-ranking users on the front page; we're missing that, actually. Let's create it.

To do that, I run an Apache Spark query. I'm going to show you the code for this later on, but it's actually quite easy. I hooked up to Apache Spark, and Spark has something called RDDs, resilient distributed datasets. We support using Infinispan, or JBoss Data Grid, as an RDD, so we can run Spark queries directly against this data.
When I load the data, I transform it into Java objects. The Java objects have different values, with getters and setters. One of them is the display name, which is the name the user has, and another is the reputation. That's the data we want: we select it from the users, order by reputation, descending, and take the top ten. Actually, the interface only shows five, but the query returns ten.

So, to run that... just to show you, this is the highest-ranking query. This is a bit more code, as you can see. I create a JavaPairRDD, which I get by connecting to the data grid, and I create an SQLContext. This is all covered in the documentation for Apache Spark. Then I define the query, run it, and collect the result as a list. The other thing I want to do is store the result somewhere. I don't just want to print the result to the output, that would be boring. So, I store the result in another data island, in this case called the highest-ranking store. If we go back to the demo, we have a data island here called the highest-ranking analytics store. It doesn't have any data in it right now, but when we run this, it will be populated with data, and then the UI can use that data. And it's going to be extremely fast, because all the data for that report is just stored in memory in JBoss Data Grid.

So, let's run that. What it does is send this Java code to a Spark instance running locally on this machine. It sends it to Spark, and we can see that later on as well, and Spark executes it. And as you can see, we already have data in here. If I refresh the browser now, you can see we have the user ranking in there. So, that's kind of nice.
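The core of that highest-reputation query — order users by reputation, keep the top ten — can be sketched in plain Java, outside of Spark. The `User` record and its field names here are hypothetical stand-ins for the mapped Java objects from the demo, not the actual classes:

```java
import java.util.*;
import java.util.stream.*;

public class TopUsers {
    // Hypothetical stand-in for the user objects loaded into the data grid.
    record User(String displayName, int reputation) {}

    // Equivalent of: SELECT displayName, reputation FROM users
    //                ORDER BY reputation DESC LIMIT n
    static List<User> highestRanking(Collection<User> users, int n) {
        return users.stream()
                .sorted(Comparator.comparingInt(User::reputation).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = List.of(
                new User("alice", 1200),
                new User("bob", 4500),
                new User("carol", 300));
        System.out.println(highestRanking(users, 2)); // bob first, then alice
    }
}
```

In the demo this same SELECT runs distributed, with Spark pulling the partitions from the data grid nodes instead of a local collection.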
We can get the user ranking quite quickly from that. Let's do something a bit more advanced. As I said, we have different data islands. You have a question? Yes, so this actually shows that on this particular node in the data grid, we have nine entries; on this node, we have four; and on this one, seven. Twenty in total. And that's because, since we keep the data in memory, we want backup copies of it. Currently, I'm using a number-of-owners setting of two, so every entry that comes in ends up on one node and on one other node. That's the reason we see this, and because the numbers are so low, the distribution is uneven. The higher the numbers, the better the distribution gets.

Now, if you want to do something more advanced: one of the things I thought about was, I want to know the most popular location for the users who post here. Is it US users who post? Or Swedish users? Or are other countries more active? That might be interesting, specifically if you do targeted advertising in different countries or regions and want to find out whether it has any effect. So, we want to know who is actually using this and posting these kinds of things. But the location is stored in the user object, and the posts are, of course, stored in the post store. So, I want to count how many posts each user has made, group that by location, sum it up, and compare the results. That's something that's actually quite hard to do in plain Java, but very easy to do in SQL; we do it all the time. In SQL, we would use something called an inner join, so we would use an inner join between the two different data stores here.
I use both the post data store, which is one of the data islands in the data grid, and the user store. They don't even have to be in the same data grid; they could be in different data grids, on different nodes, and we just configure how Spark talks to each one. So, we have posts in one data grid and users in another. Then we do an inner join from the users to the posts by comparing IDs: we match the ID of the user to the owner ID of the posts. We also filter on a certain post type, because only certain types are actual questions. Then we group them by location, order by the count, descending, and limit it to ten entries. Yes, exactly: the comment here was that it looks very much like normal SQL, so you don't have to learn yet another thing. It's plain SQL. I'm not sure which version, but it's ANSI SQL, I think.

So, let's run that query. That should be the run-location query. Now we should see data coming into the location store. This is going to take a bit longer to run. The interesting thing is that this data set is only a couple of thousand entries here and a couple of thousand there, and it still takes a bit of time, but I have run this against the big database, the 44 gigs of posts and more. I think it was 300 million posts and something like two or three million users, and it took roughly the same time to run as it takes here, on big machines. I'll talk more about that later on.

So now we have data here, and if I reload, you can see the pie chart. Somebody said, do not use pie charts, and to be honest, I'm not a good front-end developer, so as you can see, it's a bit hard to read. But it turns out that the data here is kind of dirty: a lot of people don't have a location set, so a lot of them show up as undefined.
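What that inner join plus GROUP BY does can be sketched in plain Java. The `User` and `Post` records, the field names, and the post-type code are assumptions for illustration, not the demo's actual schema:

```java
import java.util.*;
import java.util.stream.*;

public class LocationQuery {
    // Hypothetical stand-ins for the entries in the user and post stores.
    record User(int id, String location) {}
    record Post(int ownerId, int postType) {}

    // Sketch of: SELECT u.location, COUNT(*) FROM posts p
    //            INNER JOIN users u ON u.id = p.ownerId
    //            WHERE p.postType = ? GROUP BY u.location
    //            ORDER BY COUNT(*) DESC LIMIT 10
    static Map<String, Long> postsByLocation(Collection<User> users,
                                             Collection<Post> posts,
                                             int questionType) {
        Map<Integer, String> locationById = users.stream()
                .collect(Collectors.toMap(User::id, User::location));
        return posts.stream()
                .filter(p -> p.postType() == questionType)   // only actual questions
                .map(p -> locationById.get(p.ownerId()))     // the inner join on ids
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(loc -> loc, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)                                   // top ten locations
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }
}
```

Spark's SQL engine does the same join and aggregation, but partitioned across the worker nodes rather than in one in-memory map.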
And then we have Singapore, it says here, before the United States, actually. The reason for that is that the data set I'm using is a smaller one. I showed you the list earlier, and I like to brew beer at home, so I took one of the forums here, which is about home-brewing beer: how you reach a certain type of bitterness, et cetera.

So I thought, what else can I do? What is even more powerful here? Well, one of the things we can do is use MapReduce in Apache Spark. I wanted to use MapReduce to see which type of beer is discussed most on this forum. So I defined a set of keywords; in this case it was ale, stout, pale, IPA, and lager. And I wanted to run those keywords against the data set. What it actually does is go into the body of each post and break the body up into words by splitting on spaces. If a word is equal to one of the keywords, we store it in another RDD, and we then map that and reduce it to calculate which keyword is the most popular. That's a rather complex thing, and we're not going to spend too much time looking at the code. But in essence, as I said, the first thing we do is map to words, then map each word to a pair of the word and an integer, so we can count it, though we haven't counted it yet. Then we reduce by counting how often these words occur. So we're mapping it into different data sets here. And then we store the result, in the keyword analytics store here, which is empty now.

So let's run that. Oh, I got an exception. Actually, that's an exception while sending messages back from the Spark server to me, so I don't really have to care about it. So, keyword MapReduce, yes. Let's run that.
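The map and reduce steps just described — split each body into words, keep only the keywords, then count occurrences — can be sketched in plain Java. The keyword set and method names are illustrative, not from the demo code:

```java
import java.util.*;
import java.util.stream.*;

public class KeywordCount {
    // Map each post body to words (split on whitespace), keep only the
    // keywords, then reduce by counting how often each keyword occurred.
    static Map<String, Long> countKeywords(List<String> bodies, Set<String> keywords) {
        return bodies.stream()
                .flatMap(body -> Arrays.stream(body.toLowerCase().split("\\s+"))) // map: body -> words
                .filter(keywords::contains)                                        // keep keyword hits
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));    // reduce: word -> count
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("ale", "stout", "pale", "ipa", "lager");
        List<String> bodies = List.of(
                "I brewed a hoppy ipa last week",
                "this ale is darker than my usual ale");
        System.out.println(countKeywords(bodies, keywords));
    }
}
```

In Spark the same shape appears as a `flatMap` into words, a `mapToPair` into (word, 1) pairs, and a `reduceByKey` that sums the counts per keyword.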
Again, it builds a fat jar, sends it up to Spark, and Spark executes it. And it's already done, actually, so that was fast. Now we have data in there, and from that data, we've calculated the most popular type of beer. Ale seems to be the most popular, with something like 250 or 220 occurrences, then IPA, then lager, then stout. So that basically concludes the demo, but it's nice and it's powerful.

Let's go back to the presentation and recap a bit of what we did. We had data in two different data stores, a post store and a user store, which could be microservice-style stores. Then we ran different queries through Spark, like the highest reputation, and stored the results back into the data grid. They could be stored back into a database if you want, or anywhere else: anything Spark connects to and writes to, you could use. But here we write it back into Infinispan. You could also use something like Zeppelin on top of this, which shows nice graphs, et cetera, but I chose to implement the front end myself. So, this is how it looks: as I mentioned before, this is a distributed cache with two owners.

So at scale, how would this look? Well, I ran this in a lab environment that we have, where I have some very large servers: a couple of machines with about 150 gigabytes of memory each. I started up five of them, and I loaded the big data set. So now we're talking something like over 50 gigs of data. Is the whole data set in RAM now? Yes, and I wanted that because I wanted the speed. I could have configured it to keep only 10% in RAM and the rest in an external data store, but since I wanted the speed for the Spark executions, I kept it all in memory.
And then I configured a Spark worker on each of those machines, so that Spark could quickly connect to the data and run against it in distributed mode as well. This way, with the big data set, I got proportionally the same performance as I got here. The heaviest query, the one that joins the two data sets, took a couple of seconds to run. So, that's really powerful, and it gives us very good results.

So, before we leave here today, any more questions around this? Yes. The first reason is that in this case we were using JBoss Data Grid as the microservices store, so we wanted the data from there; that was one thing we could use it for. But the other reason is, of course, that with large data sets, especially if you're using something file-based like Hadoop as the backend store for the RDDs, the resilient distributed datasets, the query will run kind of slowly. It will still be fast, and you can distribute it across many, many nodes, but compared to running it in memory, we're talking about a big, big performance difference. I haven't done any real performance comparison yet; we are looking to do that. But we should see something in the range of 10 to 100 times faster in memory.

Yes, that's a very good point, exactly. The question here was actually more of a statement: the good thing is that Java developers can use what they're used to. They can write things into a JMS queue or something like that, it ends up in the data grid, and then we can use Spark on it. That means the people responsible for creating these reports, who are knowledgeable about Spark, don't have to know about JMS and all that, and the Java developers don't have to know about Spark. That's 100% correct. So it's layered at that level. Cool. Any other questions? Okay, then that concludes my demo.
Thank you very much.