Hello everyone, thanks for having me. This conference is amazing. Today I will talk to you about a bunch of data I got from the World of Warcraft API and how I tackled the surprising amount of data. My name is Vincent. I'm a data guy at a company called GoDataDriven; we do big data and Hadoop clusters in Amsterdam. I'm a very avid Python user, I love R, I also really like JavaScript, and I often give open training sessions in Amsterdam in case you're interested. I have started using Scala and Julia because they're interesting for data, but moreover I am a hobbyist gamer and a huge Blizzard fanboy. It should be said for this talk, though, that I'm in no way affiliated with Blizzard. Everything that I've done here is because I like gaming and I like analyzing data. Again, it should be said that I'm in no way affiliated with Blizzard whatsoever.

Today I'll give a brief description of the task and the data that I have. I will also give an explanation of why this task is a big technical challenge. Then I will explain why Spark is an excellent solution to this challenge. I'll show you how the code works and how to get set up to make your own Spark cluster. After that I will show you some surprising conclusions that I drew from the data. Then, if there are no more questions, I should also have enough time to give you a brief demo of how you can do stuff with Spark on the live cluster that I have running right now. If you wanna move on to a different talk, that's completely fine; just remember the main gist of the talk: Spark is a very worthwhile and open tool. If you know Python, it's a preferable way to do big data in the cloud. It performs, it scales, and it plays well with the current Python data science stack. Note that the Python API is currently a little bit limited, but the project has gained enormous traction, so you can expect way more in the future.

So for those of you that haven't heard about it yet, there's this awesome game called World of Warcraft. It's amazing. Millions of people play it, and there are undead heroes, butchers, orcs, humans, and lots of fighting. Basically, the game always looks something like this, which is epic. However, the most epic part of this game isn't necessarily the fighting itself, because that's not the reason why people are playing. The reason why people are playing the game is the epic loot. So it's not so much that this human will probably kill this orc opponent; it's more that he has a very shiny sword, and that's the whole reason why you're playing. The game of Warcraft is very simple: you keep getting stronger, which means that you can fight stronger monsters, which means that you can get stronger equipment that you can then use to fight stronger monsters, and so on. It becomes sort of a recursive thing. All of which is fine, but again, it's all about the items. The items are one of the main parts of the game; in essence they are the main reward for killing the epic evil boss. What you can also do, which is an interesting part of the game, is collect raw materials and then make gear from them. So you collect a bunch of flowers and you make a potion, or you collect a bunch of iron and you make a good sword. You can use it, or you can choose to sell it, because the World of Warcraft game has a huge auction house.
You can use it to collect virtual goods, you can trade for virtual gold, you can buy virtual swag, you can use this virtual swag to get better, faster, stronger, which you can then use to collect better virtual goods, and then we see the same recursion happening. So this auction house was one of the things that Blizzard opened up. If you go to the Blizzard API, which nowadays works a little bit differently due to the Warlords of Draenor expansion, but about a year ago when I was looking into it, I could see every single auction that was open at that moment. So the data that I have is not the actual auctions being sold, but a snapshot of all the listed prices at that moment. Sort of like eBay, but then for King's Blood and other items.

It's also worth knowing that this data set is extremely cool, because I double-checked last weekend and we still have about 10 million people playing this game. There are about a hundred-plus identical instances of this game. Every server that you play this game on, and in Europe we have about 30, is an exact copy of another world, but people will behave slightly differently on each of them. If you're from an economics background, this is kind of interesting, because all the economic laws that we have in our normal life should also apply in World of Warcraft. And it's very hard to get a measurement of a real economy as perfect as the one you can get in World of Warcraft. There are different prices for a packet of milk and it's very hard for me to track that in real life. Here I have a perfect API, so I have perfect measurements. This experiment is actually very interesting.

The slides will go online, so you'll also have access to the API description, but basically, for every auction that I have, I have an ID of a product, which you can just type into Google to get an actual picture of the product. I have the current bid price and a buyout price. It is an auction house, so just like on eBay it's very possible that the price currently being bid is not the same as the price at which you can just buy the item right away. I have the quantity of the product: it's possible that you don't sell just one leaf of a plant but maybe three. I also know the owner of the product and the server the product was listed on.

So when you have a data set like this, you can start thinking of nice hypotheses you could test, because it's a cool data set. Do basic economic laws make sense? That's something you can test with data. Is there such a thing as an equilibrium price? There are all these different servers, and technically, if the laws of economics make sense, you might argue that if the price for a super sword on this server is 10 gold, it should be the same on that server, because everyone is playing the same game essentially. And is there a relationship between how many pieces of a certain product are made and the price? It's very hard to do any of this research in real life because you get the problem of collecting the data, but the data is already there in World of Warcraft. So this is a very nice experiment. Here comes the downside: the Blizzard API gives you snapshots every two hours. For what I intended to do with this data, I figured getting one snapshot per day would be good enough. However, each snapshot is a two-gigabyte JSON blob, which is pretty big.
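To give you an idea of the shape of the data, a single listing in such a snapshot looks roughly like this. This is a sketch from memory; the field names and values are illustrative rather than the exact Blizzard API payload:

```python
# One auction listing from a snapshot (illustrative values, not real data).
auction = {
    "item": 36901,               # product ID; searching this online shows the item
    "owner": "Gromnak",          # character currently selling it (hypothetical name)
    "ownerRealm": "Silvermoon",  # the server the listing lives on
    "bid": 15000,                # current bid price
    "buyout": 20000,             # price to buy the item right away
    "quantity": 5,               # stack size: you might sell five leaves, not one
}
```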
If I were to analyze one day, this would probably fit in memory on one machine, so I could use something like pandas, but if I wanna do weeks' worth of data, then we start hitting bottlenecks. So what to do? It's not trivial; we can't just throw Excel at it, and even pandas will have trouble with it. A possible approach, if you're sticking to one machine, is to use better file formats. An easy one is to drop the JSON and turn it into CSV, which will save you a whole bunch of megabytes. You can try things like HDF5, and all these compression techniques will work, but they won't scale. It works today, and then a week from now it won't work anymore, because this approach scales vertically: you can buy a bigger server every time, but at some point it's gonna get too expensive. This type of problem tends to occur fairly often; it's what people call a big data problem. The best description I've ever heard of a big data problem came from a guy who works at Cloudera: you are dealing with a big data problem whenever your data is simply too big to fit on a single computer. That seems to be the case here.

The best analogy for how I decided to tackle this problem is to think about what I would do if I wanted to blow up a building. I would usually use a bomb. However, if I wanna blow up a bigger building, the non-scaling solution would be to buy a bigger, more expensive bomb. An alternative is to just use many small ones. The idea with big data in the Hadoop sphere is not to say, hey, we're gonna get one big server and put all the data on that. Instead, we're gonna split the data into small bits, distribute those among a cluster of computers, and analyze all this data in parallel.

So let's take the many-small-bombs approach. What would that look like? There is this thing called Hadoop, which I'm sure you've heard of before. The idea is that you have a distributed disk. In layman's terms, and I'm not gonna go into detail here, on a computer cluster you have one name node, or master node, that basically keeps track of where all your files are. For every file, it makes chunks, and each chunk is replicated across the entire cluster. So should a machine go down, you will always have some data left that you can still do your analysis with. The only job that this master node has is to keep track of where all the files are. You can connect as many of these slave nodes, or data nodes, as you wish, and these can all work in parallel. This is sort of the old-school way to scale: every time I have more data, I can just add more slave nodes. That's the basic scaling gist of the paradigm. The idea is also that we're probably gonna want to write some sort of MapReduce code, so we can bring the code, the analysis, to the data instead of bringing all the data to the analysis. That's sort of the crux of the idea here.

So why Spark? We have Hadoop and we can write MapReduce jobs, and now there's this new technology called Spark, so why should I spend any effort on it? Basically, it's like Hadoop and like all the MapReduce queries we've had in the past, but the main idea is that it tries to do all this computation in memory. If you go to their website, they have these very nice performance benchmarks. This was, I think, a linear regression benchmark they just did out of the box.
At times it can be 100 times faster than Hadoop MapReduce if the data fits in the memory of the cluster, and about 10 times faster if it's on disk. It does a lot of performance optimization for you. You can try this out very easily: download Spark, install it locally, and run some Spark jobs on it. Spark is, in the end, written in Scala; it's built upon the JVM. If you run a query against it, you'll suddenly notice that there's a Java process taking up all the CPUs that you have. So even if you just run Spark locally on your machine, it will try to do as much parallelization for you as possible. This comes out of the box. If you follow the API that it gives you, you don't have to think about parallelism anymore; Spark does this for you. That gives me a very nice abstraction layer, because I don't really have time to figure out how Spark works exactly, but if they give me an API that's kind of like pandas, sure, I can work with it and I can do lovely, lovely analysis on all the World of Warcraft data.

And the API, as it turns out, is not too difficult. You declare a variable called text_file, and you tell Spark that there is a text file located somewhere. It can be on HDFS, which is the Hadoop file system; it can also be on a local disk; it can also be on S3. All of these approaches will work and are supported. From then on, all the commands that you run are basically done in a functional style. So suppose this were a text file and I wanted to do a word count; this is what the code would look like, and I'll sketch it in a moment. I take the text file, I do a flatMap so that every line I have gets split, which means that I now have a list of words. Then I turn each word into a tuple, which is sort of like a key-value pair. And then I reduce by key, basically counting everything up. Spark has tons and tons of examples doing the word count, and I think at the moment they are the record holder for the fastest word count as a technology; they beat Hadoop like two years ago, I think.

So this is nice. I don't mind the API; it's actually a little bit functional. It's not quite how I always write Python, but I can definitely live with this. What are some other nice Spark features? We're not gonna go fully in depth on what you can do here, but it's super fast because it uses distributed memory; only when it needs to will it use the disk, and it does this across the entire cluster. So if you have a huge cluster, you suddenly have distributed memory. It scales linearly, just like Hadoop. It has very good Python bindings. And as of recently it also has support for SQL statements and things called data frames. What they have done is build a distributed data frame: basically like pandas, but clustered. It plays well with other technologies, so you can run it on Mesos, you can run it on top of Hadoop, and it has a good connection with S3. If you wanna do this and you're a huge web store, you can use Cassandra to also keep your model state in it for each user. It's even got machine learning libraries that work in parallel. So even if you have a machine learning algorithm like linear regression, which usually only works on one machine because you're doing the (X'X)-inverse calculation, Spark will give you gradient descent methods that just work in parallel. It even has micro-batching, so if you wanna do streaming-ish things, it'll just work. The machine learning libraries also just work. It can work on top of Hadoop.
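Sketched out, the word count I just described looks roughly like this in PySpark. The file path is a placeholder; HDFS, a local path, or an S3 URL would all work here:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")  # even locally, Spark uses all your cores

text_file = sc.textFile("hdfs:///data/some_text_file.txt")  # placeholder path

counts = (text_file
          .flatMap(lambda line: line.split(" "))  # split every line into words
          .map(lambda word: (word, 1))            # turn each word into a key-value tuple
          .reduceByKey(lambda a, b: a + b))       # count everything up per word

print(counts.take(10))
```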
So if your company has already invested in Hadoop, it's fairly easy to get started with Spark. One of the coolest things Spark does: it's lazy. You have all these commands you're applying to your immutable data, and you first say, I wanna map this to word tuples and then I wanna reduce it. Before it runs the operation, it creates a DAG, a directed acyclic graph of all the commands you're applying to the data, and then tries to internally optimize the query, which you don't have to do either. There's also multi-language support; there are even bindings for R nowadays. I have been able to play with RStudio in the web browser connected to Spark. If you like Scala, Spark is written in Scala, so there's lots of stuff there. And of course Python users can enjoy a pretty decent amount of the API.

So how do we get started? The cool thing about Spark is that there's a company called Databricks which is heavily invested in building most of the tools. If you go to the Spark repository on GitHub and look in the ec2 folder in the root directory, there is a command line app where you can just supply the keys that you have for Amazon. So here, for example, I have this permissions file, the .pem key, and I'm saying: here are my credentials, in this region I wanna have a Spark cluster, I wanna have eight machines and I want them to be of this type, go and launch it. This command will, out of the box, start up a whole cluster for you, which takes about 10 minutes. With the same command line app you can then turn it off, so destroying your cluster can be done just as easily. Do remember that this also loses any data that you didn't save back to S3. And if you just wanna log into the machines, this command line app will do that for you as well. In terms of getting started, this is way easier than anything you usually have to do with Hadoop. Sure, there are Ansible scripts and other ways of automating it, but Spark does this for you out of the box.

And then if you start a notebook, connecting to Spark is also very easy. You basically point to the PySpark library, which comes on the master node that is provisioned for you, and then you say, well, my Spark master IP is here and on this port we can connect to Spark. That URL goes into something called a SparkContext, which makes sure that all the Spark commands work normally, and from there you can also run these SQL-like commands. All of this, again, will be online. I'm just going through it very quickly, but starting a Spark cluster is something basically anyone could do; it's just a one-liner. It's very easy to get started.

You can also just read from S3. If you have your AWS key and secret available, you can just say: hey, this is the file path of the huge file that I have, for example a 40-gigabyte blob of World of Warcraft auction house data. And then you say, well, take this SparkContext, I have a file here, make 30 partitions of it. That takes the entire file, cuts it up into 30 bits and distributes those across the cluster. I can then say: cache it. Again, because Spark is lazily evaluated, what will happen is that the first time I say data.count, it will only then pull all this data in from S3, which is why, if I run it a second time, it's actually a lot quicker. So what Spark will do, if you tell it to, is keep data in memory that you will use later on.
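Pieced together, the notebook side looks roughly like this. The master IP, bucket name and credentials are placeholders, how the AWS keys are passed can vary with your setup, and the s3n:// URL scheme is the one that was common at the time:

```python
import os
from pyspark import SparkContext

# AWS credentials so Spark can read straight from S3 (placeholder values).
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET"

# Point at the master node that the launch script provisioned for you.
sc = SparkContext("spark://your-master-ip:7077", "wow-auctions")

# Read the big JSON blob from S3 and cut it into 30 partitions across the cluster.
data = sc.textFile("s3n://your-bucket/wow/auctions.json", 30)
data.cache()   # keep it in distributed memory once it has been read

data.count()   # first run actually pulls the data in from S3
data.count()   # second run hits memory and comes back much faster
```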
This also gives you a huge performance boost over something like Hadoop, which really requires you to write everything to disk first before you can reload it again. Also note that if you just run this, it doesn't really do anything yet; it just creates the operation graph that you're about to run. These are what in Spark terms are called transformations, and this would be an action. Only when you run the action does data actually flow through memory and does it actually start to compute.

But yeah, this is the old-school Spark stuff. We can do data frames nowadays, which is even more incredible. You can take any text file you want and write a lambda to parse it. For example, I had a huge JSON file, and I can declare that there is a row structure in it, sort of like a CSV, and then I can have Spark infer a schema from that. The moment I do this, I have a data frame that's distributed across the gigantic cluster, which means that any operations I do will be suitable for big data sets. So I'm just gonna go through a couple of very simple queries right now, just to get you up to speed on how easy the API can be. If you're used to pandas, it should feel very, very similar.

Just to set the stage: the commands I'm about to show were run on a cluster of eight slave nodes on AWS, where each machine had about 7.5 gigabytes of RAM. The total JSON file that we'll be analyzing with these queries is about 20 gigabytes. I'm just gonna give you a moment to figure out what this does, but again, if you're used to pandas, this should feel very familiar. This is a distributed data frame, and this is a column in that data frame that I want to group by. Spark comes with a couple of basic functions like sum and mean, and what you can then specify in a dictionary is: per group, take the buyout value and sum over it per realm. The API can then tell you not just to collect the data, but to also convert it to a pandas data frame. So if you're doing a huge computation here and the result is small enough to fit on one machine, this is actually a fairly simple way to do big data on the cluster and small data again on your own machine, which is a very common use case. Pandas is also nice because it allows us to do plotting, something Spark at the moment doesn't really do out of the box for you, but there are easy ways to translate into that.

This is a slightly more complex query, but again, if you're used to pandas, it should feel rather familiar. Suppose that I have an item, and this is the item ID, and I only wanna have these items; I just say data frame filter. Then I can say, well, group by realm, and then I can say, well, I'm gonna have two new columns that I aggregate over: I can take all the rows and just count them, and I can calculate the mean of every buyout value. So this means that I'm selecting only this item across all realms, and then I'm trying to see how many of these items there are on each server, and what the average price is. And then calling show will just give you the first 10 results. There is even support for doing slightly more complicated things: you can import the functions that Spark already has and put those into the aggregate function.
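Put together, those data frame steps look roughly like this in the Spark 1.3-era API. The parsing lambda, the column names and the item ID are my assumptions about the payload rather than the exact code from the slides, and they build on the earlier sketches:

```python
import json
from pyspark.sql import SQLContext, Row
from pyspark.sql import functions as F

sqlContext = SQLContext(sc)

# Parse each JSON line into a Row and let Spark infer the schema.
rows = (data.map(json.loads)
            .map(lambda d: Row(item=d["item"], owner=d["owner"],
                               ownerRealm=d["ownerRealm"],
                               buyout=d["buyout"], quantity=d["quantity"])))
df = sqlContext.createDataFrame(rows)

# Sum the buyout value per realm and pull the small result back as pandas.
per_realm = df.groupBy("ownerRealm").agg({"buyout": "sum"}).toPandas()

# Count one item across all realms and compute its average buyout price.
(df.filter(df.item == 72092)              # 72092 is a placeholder item ID
   .groupBy("ownerRealm")
   .agg(F.count("*").alias("n_listings"),
        F.mean("buyout").alias("mean_buyout"))
   .show(10))
```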
This way you will also be able to give these columns your own custom names, which is useful if you wanna do more pipelining after this aggregation. This is what the DAG looks like. So if you take this query, notice that I'm grouping, I'm aggregating here, then I'm filtering, and at the very end I'm saying: take only the first five rows. This is the actual DAG that I explained to you before. These are all operations that we're applying: you probably can't read it from here, but it says map partitions, aggregate, exchange, filter, limit, map, all of the operations that have to happen. Spark figures out the best way to run this with as little computation in memory as necessary. And this DAG is also given to you through the Spark UI, which automatically comes along if you install through the one-liner I showed you before.

There's even some support for user-defined functions at the moment. Because Python is not a statically typed language but Scala is, there are some tricks you have to do. The trick right now is that you define a user-defined function by passing in a lambda and then declaring what type should come out of it; I'll sketch the pattern in a moment. This function can then be used to create a new column, and this is, for now, the Python way in Spark to use user-defined functions and add new columns to your data frame. It feels a bit verbose, but if you wanna get the performance boost out of it, you do kind of want to have types, so it makes sense that the API forces you to do this.

So, okay, this is cool, but clusters cost more money. It's a common argument I sometimes hear from clients, because we're not just buying one computer while doing this, we're actually buying a few. To give you an impression of why this is not necessarily the case: big data, super expensive, right? Not really. If you put all of your data on S3 and you then transfer data from S3 to a computer within the same region, it's free; you don't pay any transfer costs for that. Having 40 or so gigabytes on S3, times the price per gigabyte per month, leaves me with about a euro or so that I have to pay per month just to keep 40 gigabytes of data on S3. This is nothing. Then if I check the actual CPU cost of the EC2 machines that I buy: basically I'm renting just the couple of machines for Spark that I need for the analysis now, and I can throw them away later, so I only pay for the hours that I use them. Let's be a little bit conservative and say I'm an analyst who's only gonna use this cluster for six hours in a work day. Let's say I have nine of these machines, and the machines that I've been using, with seven and a half gigs of RAM, cost this amount per hour. Then after one day of hardcore data pumping, it will have cost me a total of about $15 max. Being able to throw away machines when you don't need them and just get them back online again makes a whole lot of sense, and the start-up time is about 15 minutes. So if you have a lot of data, gigabytes of it, and you only need to analyze it, say, once a week for your recommendation batch script or something like that, there's no need to have a standing Hadoop system anymore. Economically speaking, this is a way better option. The only downside is that you have to be willing to put all of your data on S3, which is not a likely solution for, say, a bank. But for a lot of you hackers out there who just wanna get big data sets analyzed, this is a very worthwhile venture.
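Before moving on to the results, here's roughly what that user-defined-function pattern from a moment ago looks like. It's a sketch; the column names are carried over from my earlier assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Wrap a lambda, declare the return type, and use it to derive a new column.
per_unit = F.udf(lambda buyout, quantity: float(buyout) / quantity, DoubleType())

df = df.withColumn("buyout_per_unit", per_unit(df.buyout, df.quantity))
df.select("item", "quantity", "buyout_per_unit").show(5)
```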
I mean, I'm willing to spend 15 euros per day on my hobby; it's nothing. Let's talk about a few results, because I think that's the reason most of you are here anyway. All of these queries have been done with some form of Spark. I should apologize a little bit, because recently the R version of Spark came out and I had to contribute to and check out that whole sphere, so the charts that you will see are made with ggplot. These are the most popular items about a year ago, so this was before Warlords of Draenor. What was the expansion then? Mists of Pandaria, with all the pandas, that's the one. There's stuff like Netherweave Cloth, which has something like a million listings across the entire World of Warcraft sphere. There's an item called Golden Lotus and there's something called Spirit Dust. These are all items that you can collect: some of these you collect by being a herbalist, some of these you collect by being an alchemist.

This got me thinking: if you are a World of Warcraft player, what profession should you pick? There are a couple of professions you can take where you collect items, and the main use case for collecting these items is that you can then sell them on the auction house and get gold back. Doing this analysis properly is actually a little bit tricky, because at the lower levels there are different items to collect than at the higher levels, so if you wanted to do this completely, you'd have to analyze every single bracket of levels. However, if I just look at the level 10 to 20 items, the mean gold that you can get for something from skinning is 2.6, the mean for herbalism is 2.3, and for mining it's only 1.5. Again, these are early-level items, so it might be a little bit skewed to consider this a reason to pick skinning, but these are some very quick things you can just do in Spark.

Another fun thing: it turns out you can look at the buyout value of all products and relate that to an owner. So you can say, hey, this owner of products on the auction house has about 8,000 gold worth of listings. You do that for all the users, and then you can check what the 1% of Warcraft owns. As it turns out, the 1% of Warcraft owns about 25% of the auction house value, which is an interesting thing. These are simple queries, because you can sort of see the chain in front of you: per user on World of Warcraft, calculate the amount of gold they have listed, so that's a sum; then you order those, and then you basically bucket them. In a way, this is basically the two lines of Spark code I'll sketch in a moment, even though you're handling gigabytes of data.

Another thing I looked at, because it seemed interesting: if any of you are World of Warcraft players, you know there's a stack size. You can have one item, or five, or 10, or 20 items together in the same listing on the auction house. So I was wondering, let's take Spirit Dust as an example and just look at the stack sizes: which stack sizes occur more often? Back then it could be anything from one to 20, and it turns out that 20 is kind of popular, one is kind of popular, and five and 10 are kind of popular, but anything else doesn't really happen as much.
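Going back a step, the "1% of Warcraft" chain and the stack-size count are roughly this in Spark, with the bucketing done in pandas after pulling the small result back. This is a sketch under my earlier column-name assumptions, and the item ID is a placeholder:

```python
from pyspark.sql import functions as F

# Total listed buyout value per owner, richest first, pulled back to pandas.
wealth = (df.groupBy("owner")
            .agg(F.sum("buyout").alias("total_buyout"))
            .orderBy(F.desc("total_buyout"))
            .toPandas())

top = wealth.head(max(1, int(len(wealth) * 0.01)))          # the "1% of Warcraft"
share = top["total_buyout"].sum() / wealth["total_buyout"].sum()

# How often each stack size is listed for one item (placeholder item ID).
df.filter(df.item == 89112).groupBy("quantity").count().show()
```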
What I was wondering, though, is: does it matter, as a consumer or as a seller, whether I'm listing a big stack or a single item? Will that affect the average price per piece of Spirit Dust? Say this is a stack of 20 Spirit Dust and this is a stack of one: if I look at the stack of 20, I divide the price by 20; if I look at a stack of 10, I divide it by 10. How will the per-piece price be distributed? And also, does it matter whether this is on the Alliance auction house or on the Horde auction house? Turns out it doesn't matter too much. These are box plots, and the average does shift a little bit from stack size to stack size, but if I just look at the means, they all seem to be centered around two gold, a little bit less sometimes, but nothing too significant just from looking at it. And it doesn't really matter whether it's on the Alliance or the Horde side. This is not something that shocks you if you're an economist, but if you're a World of Warcraft player, this is sort of useful knowledge: it doesn't really matter too much whether you sell a stack of 20 or a stack of one. Again, a thing to keep in mind: I couldn't check whether these things actually got sold. The only thing I could see is that they were listed at this price, for this quantity.

Another interesting thing, and it was my main interest when I was looking at this data: for every server, I can calculate the mean buyout price and the number of items that are on the auction house. Logically, if there isn't a lot of Spirit Dust around, you would expect the price to go up, because Spirit Dust became rare; whereas if there are thousands of suppliers of Spirit Dust, it becomes more of a price game. Turns out it's very hard to find an item that actually does this. If you look at this graph, every red dot denotes a Horde auction house and every blue dot denotes an Alliance auction house, and for every server we have one of these dots. This axis is the market quantity and this one is the mean buyout price. It seems like there are a couple of situations where there isn't a lot in the market but there are very high prices; however, these are only three points, so they might as well just be outliers.

So, okay, now it gets a little bit more complicated, because I do wanna know: does economics hold in World of Warcraft? Can we make sense of this? A metric I figured might be useful is the beta-one regression coefficient. If you have a linear regression, let's say that my Y variable, the thing I wanna predict, is the price, and the market quantity is then the X. I'm not really interested in the constant, but what I am interested in is the direction of the slope. If the slope is positive, it means that if the market is bigger, the price goes up. If it's negative, it means that if the market goes up, the price should go down. So if I can calculate this number for every product, for every server, for every faction, both Horde and Alliance, then I should be able to filter out the beta-one values that are negative, because a negative value means a downward slope, and a downward slope means that the quantity on a server influences the price the way you'd expect. I didn't find any. Not one that did this. Not one single item that had this characteristic.
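The slope check can be sketched like this. I'm assuming a per-item regression of mean price on market quantity across auction houses (realm and faction), with the slope itself computed in pandas; the faction column and the rest of the layout are assumptions carried over from the earlier sketches:

```python
import numpy as np
from pyspark.sql import functions as F

# Mean price and total quantity per item, per auction house (realm x faction).
per_house = (df.groupBy("item", "ownerRealm", "faction")
               .agg(F.mean("buyout").alias("mean_price"),
                    F.sum("quantity").alias("market_quantity"))
               .toPandas())

def beta1(group):
    # Slope of a simple linear regression: cov(x, y) / var(x).
    x, y = group["market_quantity"], group["mean_price"]
    return np.cov(x, y)[0, 1] / x.var()

slopes = per_house.groupby("item").apply(beta1)
negative = slopes[slopes < 0]   # items where more supply comes with lower prices
```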
Now I might have made a mistake in the code, that's entirely possible, but you can also start to wonder what this actually means, because this is World of Warcraft we're talking about, this is millions of players. So yeah, there's still lots of work I should be doing with this, but all in all, being able to handle this amount of data is something that is already kind of cool.

So, quick, quick, quick conclusion. Spark is a worthwhile tool. It's kind of like pandas: if you know pandas already, it feels very familiar. It's easy to just get started, and there are way more things I haven't talked about. There are some real-time tools; they're not available in Python yet, but they will be soon. There are graph analysis tools, so you can have distributed graph algorithms working for you; it's like Neo4j, but distributed across many machines. There are machine learning algorithms that just work, even on bigger and bigger datasets, so there will be less need to sample your data before you can stick it into a machine learning algorithm.

Some final hints if you're gonna start playing with this. Don't forget to turn your machines off. The whole economic benefit does assume that you're gonna turn the machines off when you're done with them. Don't be like me and leave the machines on for over a week because you've gone on holiday to Turkey. My boss didn't hear that, right? No, it's not recorded. This setup is also not really meant for multiple users. If you have a Hadoop cluster with many analysts on it, what Spark will do is go to YARN and say: hey, give me all of these resources, and those are mine until I say that I'm done. So if you have a notebook open and it's connected to Spark, you're just claiming resources. If you leave it open over the weekend, your colleagues who come back on Monday don't have resources they can use. That's something you wanna keep in mind when you start playing with Spark. For a single user, all of this is fine. And then the main thing: only bother with this if your data set is simply too big. The tools we have right now, like pandas, or dplyr if you're in the R sphere, are way more flexible than Spark is at the moment. Spark is getting there, but it's still mainly a big data technology. So please save yourself the effort and only use Spark if you have a data set of, I don't know, four gigabytes or bigger and you have a very small machine. Then Spark has a very good use case.

We can do two things now: I can take questions, or I can give you a quick demo of what Spark looks like in real time. Demo, demo, yay, demo. Okay, cool. So what I have here is a distributed data frame that's already preloaded, and this is the World of Warcraft data set that I have right now. What I also have is a UI; this is my Spark UI. As of right now, this was a job I just ran, and this is the DAG being visualized. What I can do here is just call count. It's taking the distributed data frame, and I should be able to see that some of these executors are working; you can see that there's memory being used. Oh, and let's just do it again. This is gigabytes of data being handled in real time; just a count like that comes right back. And if I wanna do something that's a little bit more interesting, I think I have an example ready here somewhere: I'm grouping by realm and I'm summing over the buyout values.
So this obviously takes a little bit longer, but then again, for gigabytes of data this is fine. This is very cool to play with. And again, you can cast this result to pandas and do all the cool plotting things as well. Also, I'm doing this through a PySpark REPL; you can also just set up a notebook, and I'm probably gonna try to get some commits in so the provisioning script does that for you as well. I think that is mainly it. I am using a couple of images in my presentation that come from the NAMM project and I do wanna give credit where credit is due. So thanks for having me. Any questions, you can ask them now, and this is where I'll be afterwards; just meet me there and I will happily answer questions from anyone.

We have a few minutes for questions. So again, I think we can fit one or two in there.

Hey, the example is really cool and all, but I'd like to know: 20 gigabytes, that's a pretty small data set, right? Oh yeah, it definitely is. The thing is, I would love to use my clients' data to show you when you would wanna use this; however, I am kind of obliged not to. This seemed like a cool example. Okay, so you're not actually advocating this; I mean, for four gigabytes or 20 gigabytes you can do that with Postgres, you don't need such a replicated setup. Again, the point is in the example, right? Yes. So again, only bother if your data set is definitely, definitely too big. You don't have to. However, if you're into this stuff like me and wanna try out new things all the time, then this is the easiest way to do it, and this was an example with a cool data set, but you're right.

Did you try to compare it with SparkR? Yes, I have. And in fact, well, this is down right now, but I recently committed to the Spark project so you can get RStudio working out of the box if you provision it. The downside of SparkR at the moment is that you only have data frame support, so no support for the machine learning libraries as of yet. Okay, and in PySpark, is there already support for Spark Streaming, or not yet? I think they're working on it, but I think it's more something you can expect from 1.5. You can double-check the JIRA; it's online, you can just check it. I don't think it's there right now, but it might be as of 1.5. Again, I'm not entirely sure, but I do know that people are working on it. Okay, thank you.

Well, just one question. Since you have a SQL-like syntax, do you also support joins? Oh yeah, definitely. Just like in pandas, I think it's called join or merge; I always mix up R and Python, but you can also do joins with the data frames. Yeah. Does it scale? It depends on the thing you wanna do, of course. If you're doing a very nasty left outer join that makes the data explode, then at some point the memory's gonna overflow and it will have to start spilling to disk. So it scales in the sense that if your data gets bigger, you can add more machines, and that's the linear scaling. And it does all sorts of clever things to keep the problem as small as possible, but you can think of problems that will simply always be a bother if you don't have enough machines for them. Again, Spark is a brute force approach: add more machines if you need to, and then it'll probably still work. The only thing it's contrasting with is buying a very big, more expensive machine.
The reason why a lot of companies find Spark interesting has more to do with the economics of it than with the technology per se. It is more expensive to buy a single machine with 600 gigabytes of RAM than to buy maybe sixty 10-gigabyte machines. Thanks. I'll be right over there to answer your questions in person. So thanks, guys.