Hi everyone, I'm Simran. I'm a data scientist at Math Street, an AI startup, where we work on a lot of interesting problems. While re-architecting one of our projects, we analyzed all the tools we should be considering, and that gave us the idea for this talk.

What I'm going to do in this talk is help you develop an idea of which tools you should consider for which kind of dataset. That's what the title says: data comes in different shapes and sizes, so you should pick your tools wisely. By "data tool" I simply mean a software library or framework that helps you explore and work with your data.

There's a lot of content, so I'll be moving through the slides quickly. To help you follow along, here's the outline: we'll look at a few in-memory data tools, then their benchmarks, then benchmarks for big data tools, and finally the key takeaways, that is, when to consider each tool. The benchmark plots all have the number of data points on the x-axis, in powers of 10, and time taken on the y-axis.

To help you appreciate the performance advantage NumPy gives you, I first have to explain how Python works under the hood. As you all know, Python keeps everything as an object: a function is an object, a string is an object, an integer is an object. So when you create a list, you're not actually holding the values in that list; you're holding references to objects, and the values themselves are scattered all across memory. You can see there's a performance overhead with that.
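A minimal sketch of the memory difference described above: a Python list of a million integers carries per-object overhead plus a pointer per element, while a NumPy array holds the raw values in one contiguous buffer. The exact byte counts vary slightly by CPython version, so the comments only give rough figures.

```python
import sys
import numpy as np

# A Python list stores pointers to boxed int objects scattered across
# the heap; a NumPy array stores the raw values in one contiguous C buffer.
values = list(range(1_000_000))
arr = np.arange(1_000_000, dtype=np.int64)

# List cost: the pointer array itself plus ~28 bytes per boxed int.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# Array cost: a flat 8 bytes per int64 element.
array_bytes = arr.nbytes

print(f"list:  ~{list_bytes / 1e6:.0f} MB")   # roughly 35+ MB
print(f"array:  {array_bytes / 1e6:.0f} MB")  # 8 MB
```

The contiguous layout is also what lets NumPy loop over the data in C instead of dereferencing a Python object per element.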
What NumPy gives you instead is a C array implementation, which you can immediately see is going to be much better for performance.

Now let's look at Pandas. Pandas is basically a wrapper over NumPy that gives you a lot more flexibility: it lets you work with different data types, unlike NumPy, which is essentially numerical. The way it achieves this is through the concept of blocks. Say I have three columns, C1, C2, and C3, of different data types. Internally there will be a float block, an integer block, and an object block; each block stores the data of one particular data type. You get more features this way, but you should also understand that you're scattering your data across different blocks, so there's a bit of overhead. It's a trade-off.

This slide shows exactly what I've just said. In this part I'm showing you that the block at index two stores the string values and is of type ndarray. The next screenshot shows that a block contains all the data of a single data type — here, the integer columns C1 and C2.

So what are the implications of this architecture? When you slice and dice across columns of different types, you're asking for data from different blocks, so Pandas has to gather the data from all those blocks and hand you a copy. When you then make changes on that copy, it warns you — that's the SettingWithCopyWarning. Now do the same operation on homogeneous data: C1 and C2 are both of integer type.
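One observable consequence of the block layout described above: pulling a homogeneous slice out of a DataFrame keeps the native dtype, while a slice spanning blocks of different types has to be stitched together into a single `object` array. A small sketch (column names C1–C3 mirror the slide's example):

```python
import pandas as pd

df = pd.DataFrame({
    "C1": [1, 2, 3],        # lives in the int64 block
    "C2": [4, 5, 6],        # same int64 block as C1
    "C3": ["a", "b", "c"],  # lives in the object block
})

# Homogeneous slice: both columns come from one int64 block,
# so the underlying array keeps its native dtype.
print(df[["C1", "C2"]].to_numpy().dtype)  # int64

# Mixed slice: pandas must pull values out of two different blocks
# and stitch them into one array, falling back to dtype=object.
print(df[["C1", "C3"]].to_numpy().dtype)  # object
```

The `object` fallback is exactly the "scattered across blocks" overhead: every element is boxed back into a Python object.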
In that case, Pandas leverages the performance optimization that NumPy has under the hood, which is working with references — a copy is an expensive operation. So if you slice and ask for some data, it just gives you a reference, and when you make changes, they reflect in your original DataFrame.

Now we move on to R. R has long been the favorite of data analysts working with in-memory data, but the native data structure that comes with R, the data.frame, is not that good performance-wise. The good news is there's a library called data.table, which is optimized in a lot of ways and gives you a very good performance boost: it works with references and avoids copies, avoids unnecessary function calls, and avoids repeated variable lookups.

Now we'll look at the benchmarks. Broadly, I would classify data operations into two kinds: SQL-like and non-SQL. I'm going to move through a lot of plots, so it'll be better if you follow along with me rather than just reading the slides.

First, reading data. The blue line is R's data.table, and it's doing very well, followed by the green line, which is Pandas. Now, what's doing the worst? The pink line — that's NumPy. NumPy is very slow when reading from raw text files. But NumPy gives you an option: if you use NumPy's serialized formats, you get a big advantage. If you have a NumPy object, you can save it as an .npy or .npz file, and look at the speed-up you get with that.
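The text-versus-serialized gap is easy to reproduce: parse a text file number by number with `np.loadtxt`, then load the same data from an `.npy` binary dump with `np.load`. A minimal sketch (array size and file names are arbitrary choices for illustration):

```python
import os
import tempfile
import time

import numpy as np

arr = np.random.rand(200_000)

with tempfile.TemporaryDirectory() as d:
    txt_path = os.path.join(d, "data.txt")
    npy_path = os.path.join(d, "data.npy")

    np.savetxt(txt_path, arr)   # raw text: one formatted number per line
    np.save(npy_path, arr)      # .npy: near-raw binary dump of the buffer

    t0 = time.perf_counter()
    a = np.loadtxt(txt_path)    # must parse every number from text
    t_txt = time.perf_counter() - t0

    t0 = time.perf_counter()
    b = np.load(npy_path)       # reads the bytes straight into an array
    t_npy = time.perf_counter() - t0

print(f"loadtxt: {t_txt:.4f}s, load(.npy): {t_npy:.4f}s")
```

Both round-trips preserve the data; the binary load skips parsing entirely, which is where the speed-up in the plot comes from.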
But if you don't have serialized files to start with, you might want to read the data through Pandas, convert it to NumPy, and save it as NumPy serialized files from then on.

Now we look at the group-by operation. You can see two kinds of lines: the solid lines are a group-by on a string column, and the dotted lines are a group-by on a numerical column. Again, the blue line, data.table, is performing very well, followed by the green, which is Pandas. You also see a pink line here, but NumPy does not support SQL-like operations such as group-by, so what's actually being benchmarked there is itertools. And data.frame is not doing so well.

Next is merge, which is an inner join. The blue line — again data.table — is consistently performing well. As an exception, data.frame has performed well here on a string column, but it is just that: an exception.

Now this one is very interesting. You've seen that R's data.frame is not well optimized, yet you see something strange here. To make the plot readable I've converted the x-axis to a logarithmic scale, so now you can actually see who stands where. Very surprisingly, R's data.frame has done very well here, but if you look at consistency, data.table does best overall.

Now let's look at core utilization. If you're working with R, you can clearly see you're under-utilizing your resources. In another example, I'm doing a huge NumPy matrix multiplication, and as you can see, I've maxed out my resources. So what does this mean? Basically, you're hitting the limit of what your in-memory data tools can do for you.
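Since NumPy itself has no group-by, the pink line stands in `itertools.groupby`, which only groups *consecutive* equal keys and therefore needs a sort first. A small sketch contrasting it with the hash-based Pandas group-by (the toy data is made up for illustration):

```python
import itertools

import pandas as pd

data = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# pandas: hash-based group-by, no pre-sorting needed.
df = pd.DataFrame(data, columns=["key", "val"])
print(df.groupby("key")["val"].sum().to_dict())  # {'a': 4, 'b': 6}

# itertools.groupby only groups consecutive keys, so the data must be
# sorted first -- this is what stands in for "NumPy group-by" in the plot.
sums = {
    k: sum(v for _, v in grp)
    for k, grp in itertools.groupby(sorted(data), key=lambda t: t[0])
}
print(sums)  # {'a': 4, 'b': 6}
```

The mandatory sort (O(n log n)) plus the per-element Python iteration is why the pink line trails the real group-by implementations.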
After a certain point, your returns diminish: your programs run slower, more disk swaps happen, and eventually you're simply out of RAM. You're hitting the limit. And when you're using just one core, you're not parallelizing your work at all. And if you look at real-life data — what you actually collect in production — it's a lot; it's not something that fits in memory easily. So moving to big data tools is inevitable.

As I've just said, you start hitting limitations with these tools. Of course they're your favorites, and they perform very well, but they're good only for data that can fit in memory. R, for instance, has its own limit of 2^31 − 1 on vector indices. You start hitting limitations like that.

Now we look at big data tools. Here I have benchmarked Spark and Redshift. Redshift is the red line; the yellow is the Spark DataFrame (DF), and the black is the RDD. They're roughly comparable. That's because Spark is a distributed framework, so it needs to do a lot of work distributing its data, and that behavior comes through in this graph very clearly.

Next we look at aggregation — group-by again — and you can see the red line, Redshift, is doing very well, followed by a blue line. What is the blue line? That's Spark, but with the data materialized into Hive. So when the data is in a table, the two are comparable. Now, what are these other two lines? Those are data held in memory: the yellow is the DF and the black is the RDD. You can see the DF does better than the RDD when the data is in memory.
What about the inner join? Here as well, Redshift does well, followed by Spark with Hive. But let's look at what's really interesting here: the difference between the RDD and the DF. Through these slides you should be getting the feeling that the DF consistently performs better than the RDD when you're working with in-memory operations. If you're going against a table, you have to compare the two Hive-backed lines; otherwise it would not be a fair comparison, because that is data materialized in Hive for Spark.

Moving on to sort. Here again you see comparable performance: the red is Redshift, and the blue is Spark on Hive. What's very interesting is that the DF has done well here too.

So, what are the key takeaways? When should you consider R? R is very good when your data fits in memory and you want to do data analysis on it, because data.table is a brilliant package that gives you a lot of functionality and is optimized to work very well. Definitely try it out if you haven't; you'll fall in love with it.

When should you consider NumPy? When you have numerical operations at hand. That's because NumPy is a wrapper over optimized C and Fortran libraries, and these have been optimized over years. If you want to do numerical work in Python, NumPy is the way to go; it also gives you utilities like linear algebra and Fourier transforms.

When should you consider Pandas? Think of Pandas as your Swiss Army knife. It's a wrapper over NumPy, but it gives you so much more: it lets you work with heterogeneous data and gives you SQL-like and non-SQL operations. But there's a bit of a trade-off — because it's a wrapper, you pay the wrapper overhead.
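As a small illustration of the "linear algebra and Fourier transforms" point: both live in NumPy's `np.linalg` and `np.fft` submodules, which delegate to the optimized C/Fortran routines mentioned above. The toy inputs here are chosen so the answers are known in advance:

```python
import numpy as np

# Linear algebra: the eigenvalues of a diagonal matrix are its diagonal.
a = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals = np.linalg.eigvals(a)
print(sorted(eigvals.tolist()))  # [2.0, 3.0]

# Fourier transform: the FFT of a constant signal puts all its
# energy in the zero-frequency bin (the plain sum of the samples).
spectrum = np.fft.fft(np.ones(8))
print(spectrum[0].real)  # 8.0
```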
As you can see, I've done a matrix multiplication here, and there's a constant overhead from the wrapper itself. It's a trade-off: you get a lot more functionality, but you lose a bit of speed. Still, if you're working on a production product, you might want to go with Pandas because it gives you so much more.

When should you consider Redshift? Redshift is an OLAP engine. What it does is take your complex, slow-running BI queries, break them into smaller chunks, and distribute them — so you're parallelizing and speeding up a very slow process. And it scales very well, up to petabytes of data with ease; there are a lot of other benchmarks you can go through. So you should consider Redshift when you're working with huge, complex BI queries.

When should you consider Spark? Spark is another distributed framework, but of a different kind than Redshift. It gives you ease of programmability, and it gives you the advantage of cheap cluster costs. How? It lets you build your clusters on spot nodes, and spot nodes are very cheap. You can go to the AWS console, look at the current spot pricing, and bid accordingly — you'll be amazed at how much you save on your cluster. If you do a cost comparison between the on-demand nodes used for Redshift and similar nodes used for Spark, the Redshift nodes are roughly double the price.

So I hope I've given you a good understanding of when to consider each of these: Spark, Redshift, Pandas, and NumPy. I've put all the code we benchmarked up on GitHub. It's going to be a very active repo — there's a lot more work in progress that will be uploaded soon. Thank you. Any questions? And you're always welcome to go back to the repo and give suggestions.
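The constant wrapper overhead from the matrix-multiplication slide can be sketched like this: the same multiply through `DataFrame.dot` pays for index alignment and result construction on every call, on top of the identical BLAS work underneath. The matrix size and iteration count here are arbitrary choices for illustration:

```python
import time

import numpy as np
import pandas as pd

n = 300
a = np.random.rand(n, n)
df = pd.DataFrame(a)  # same data, wrapped with an index and columns

t0 = time.perf_counter()
for _ in range(100):
    a @ a                      # straight BLAS call
np_time = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    df.dot(df)                 # BLAS call + alignment + DataFrame build
pd_time = time.perf_counter() - t0

print(f"numpy: {np_time:.3f}s, pandas: {pd_time:.3f}s")
```

The results are numerically identical; the per-call difference is the wrapper cost, which stays roughly constant as a fraction of small operations and fades for very large ones.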
If there's something you feel can be optimized further, please feel free to get in touch.

Q: Hi, very good talk. One question about Spark: for the graphs you showed for Spark DataFrames and RDDs, did you do memory caching before running those operations?
A: Yes, there was caching involved, and it was in memory.

Q: One more question: does Amazon Redshift work on Hadoop? I'd never heard of Redshift, that's why I'm asking.
A: No, it's a different kind of framework in itself; it's not Hadoop-based.

Q: Do you see a difference in Spark when you're running on RC files versus Parquet files?
A: I'm not a Spark expert myself, so I haven't worked with Parquet and haven't considered those formats in the benchmarks. But several members of my team work with them a lot; if you want, we can talk to them after the talk.

Q: Have you explored SparkR? What's your takeaway from that?
A: Oh, I have not, but I would love to. Thank you.