Hola, buenas tardes. Good afternoon. I know, it's 3:30 in the afternoon, still a few more talks to go until cocktail hour. So I'm really, really excited to be here. As I said, I speak Spanish, so I'm happy to have a discussion, although my technical terms in Spanish are, eh.

All right, so I'm here to talk to you about options. There's a lot of talk in this conference, and in the big data world in general, about Spark and the whole ecosystem around it. But I'm here to give you some options, and the reason is that sometimes you just need them. So, show of hands, but since it's dark I want you to scream: who has a big data problem? Scream, I can't see you. OK, actually not that many of you, interesting. Who has experience with Hadoop? Quite a few of you, of course. Who has experience with Spark? OK. Scala? But who prefers R or Python? Most of you raised your hands, which is what I was expecting. Awesome, that's great. This talk is for you.

So I'm here to talk about this idea of medium, or medium-ish, data. The premise is the following: a lot of times you're working with data sets that are just too big to fit on a single machine, bigger than your memory, but the scale is not so big that you necessarily have to go onto a fully distributed system or something along those lines. Here's a rough scale. If you have two to four gigs, it fits in RAM or on your local disk, and you can probably do most of that on your local machine; you really don't need the cloud, or Spark, or any other distributed framework. On the other end, if it's greater than two terabytes, it's really hard to fit on a single machine unless you have a really, really large server, which is really expensive. I mean, you could fit it on a local disk, there are definitely large hard drives today. But I want to focus on that middle tier, that multi-gigabyte range: maybe dozens of gigabytes, hundreds of gigabytes, maybe a terabyte. That's the size of data set you need to do some work with, and you want to do it, perhaps, locally. So that's really what I want to talk about here.

As they introduced me, they said I teach, and one of the things I teach my students, and I'll talk about that in a second, is how to use the right tool for the job. There's really a whole array of tools, and most of us are going to default to what we know, but there's no need to use complex tools when a simple tool is good enough. You've probably heard this before, and I've been trying to think of the saying in Spanish, I don't know what it is, but in English: if all you have is a hammer, everything looks like a nail. You're not going to build a house with just a hammer; you need a whole set of tools.

A little bit about me, just for context. The ideas for this presentation really spun out of me teaching big data tools for the last few years. I've been using Hadoop since roughly 2010; I did my first simulation using Hadoop streaming with R back around 2010 on Amazon EC2, and a lot has progressed since then. I've been teaching a big data class at Georgetown University and at George Washington University for the last few years. I teach students how to use cloud computing, how to use Hadoop, and also Spark. And I really like Spark, I think it's a great tool, but teaching it is actually hard.
And even though most of the students prefer to interface with Spark through PySpark, it's still a lot to take in while you're doing analytics. And think about the use case here: it's not a production use case. If you have a production system and a whole data engineering team that takes care of it and manages it, great. But if you're in a small to medium business, or you're in a large organization and you're just doing exploratory work, you're thinking about a new idea, and you're working with a data set that's just too big: bigger than the 16 gigs that will fit, or even 16 gigs that perhaps won't fit, on a single laptop, but a couple of dozen gigs, something like that. You really just want to be able to work with that, without necessarily going all the way to interfacing with a cluster, using Spark, using Java. I've never coded a line of Java in my life, so.

So again, I work for Microsoft. I'm a technical specialist for data and AI. I focus on our Azure cloud, anything related to data science, artificial intelligence, machine learning, and I work with the United States federal government; the US government is my customer. I work with different US government agencies, helping them transition their workloads into the cloud. My background is in mechanical engineering plus an MBA, and I've been working in data science and analytics for many, many years. Separately, for the last eight years or so, I've been running a group of meetups in the DC area, people who get together around data science. So if you're ever in Washington, DC, say hi, send me an email, and we'll bring you to a meetup. We have a community of over 20,000 people, so if you want advice on how to start a community or a meetup, I'm happy to talk to you about that as well.

OK, so one thing is working with this sort of medium-sized data set. But I would argue that even when you're working with larger data sets, in a business context the vast majority of the time you're working with what you'd call long data: a lot of transactions, on the order of billions, maybe trillions. The truth is, you're never going to work with all of those transactions at a given point in time. You're always going to aggregate. Even when you're building a data set for modeling, for predictive modeling, for machine learning, and you're building features, you're doing all sorts of pre-aggregation, joins, all that stuff. So at the end of the day, you're going to end up with a data set that probably fits into this medium size we're talking about.

This is a chart from a presentation that Hadley Wickham, who is a great person in the R community, gave some time ago, and I think it's right: 90% of data problems can be reduced to small data. 9%, and I don't know if these numbers are exact, but just as an order of magnitude, can be reduced to multiple small problems, so it lends itself to parallelization pretty easily. And 1% is irreducibly big, which means you really have to work with your data in a distributed fashion, comb through it, process it. For that there are other sets of tools, Spark and things like that; I'm going to talk a little bit about Spark, but in a different context. So think of it this way.
You can take these data sets and work with them on a single machine, or you can work with them in a parallel fashion, using a cluster of machines, the cloud, or multiple cores. And again, I would argue that a lot of the operations you do when you're starting to do data science and analytics fall under one of these five verbs. These five verbs come from the dplyr website; dplyr is an R package, for those who are familiar with R. I hope you're familiar with dplyr, and if you're not, I really encourage you to look at it. The idea is that there are essentially five operations, and even if you're working in databases, you're mostly doing these five types of operations over your data. The first one is what they call mutate, basically adding new variables. The second is select, down-selecting variables. Then of course filter, filtering your records based on some criteria. Summarize is a group-by: taking a whole group of records and summarizing it down to a single level. And arrange is reordering. I think that even before you do any machine learning, a lot of your time is spent doing this. And yes, I see a lot of people nodding their heads, awesome, good. The one thing that's missing here is joins, and those can be a little tricky depending on the sizes of your data sets. But regardless, you can get 90% of your work done with these types of operations before doing anything else, and these types of operations really lend themselves to parallelization.

But again, if we go with Spark, how many people have seen this in Spark? How does this make you feel? Not good. I stole this slide from somebody else, but I thought it was great. Enough said. So the idea of embarrassingly parallel problems, and you've probably heard this if you've been working with big data, is that most of the time, if you're just aggregating, counting, doing this and that, you can really parallelize it. And by parallelizing I mean you have a big problem, you break it up into small pieces, you work on each piece individually, and then you aggregate the results. Anybody see any connections here? What does this sound like? MapReduce, right? It sounds like Hadoop MapReduce. It also sounds, if you're from the R world, and I am an R lover, an R advocate, I do everything I can in R, I also know some Python, I'm not as great with Python, but that's OK, so I tend to think about things in R terms. If you've ever heard the term split-apply-combine in the R community, it's the same idea: split a big problem into small pieces, apply something over each piece, and then combine the results back together.

So here are a couple of examples. On the left-hand side are things that are embarrassingly parallel, and "embarrassingly" is a handful in English; in Spanish, muy paralelos, demasiado paralelos (very parallel, too parallel). So group-by analysis, reporting, simulations, bootstrapping, optimization, prediction, et cetera. And there's stuff that is not embarrassingly parallel. Just to highlight: if you're working with the stuff on the left, then you definitely have options, especially if you prefer to work in R or Python.

So what are we going to look at today? A couple of different things. The first thing, I'm going to spend a little bit of time talking about Dask. Dask is a project out of the Python ecosystem. Is anyone familiar with Dask here? OK, I see a few hands, all right.
And then in the second part, we're going to talk a little bit about some examples of doing parallel computing with R, if you're not familiar with that possibility. The ones in red, Dask, sparklyr, and foreach, are the focus of the different examples we're going to talk about today.

All right, so Dask. Dask is open source, and I'll start by saying I've been interested in Dask since I first saw it. I saw a presentation about Dask about two years ago at a PyData conference in Washington; I saw Matt Rocklin, the developer who created Dask, give a talk, and I was really intrigued by it. Personally, I still haven't used it much myself, but I've been reading a lot about it, and I'm really trying to incorporate it, because I believe this is something I need to bring into my teaching for next year, for example. Again, it's just an option. So I've been reading a lot about it and learning as much as I can. I still haven't had a practical application, but as I say, the best way to learn something is to teach it, so I'm learning it so that I can teach it.

Dask is an ecosystem that comes from the SciPy, the scientific Python, side of the community, and it's essentially an extension to NumPy arrays, lists, and pandas data frames. Unlike Spark, which is a general distributed computing platform, Dask is not; Dask is a parallelizing engine for these types of data structures and operations. So one thing I'm going to do is compare and contrast Spark versus Dask, again just for education purposes.

Languages: Spark is written in Scala, and of course it has R and Python APIs, and you can use it interchangeably with Java code. Dask is pure Python, with C, C++, and Fortran extensions under the hood. The ecosystem: Spark is an all-in-one project, and if you use Spark you know what I'm talking about, you can do your entire pipeline in it. Dask is not a full all-in-one thing; it's just part of the analytics stack, the scientific computing stack, but it's growing. It's still early. And actually, I see that's missing on the slide: Spark has been around since 2010, Dask since around 2014, so it's a little less mature, but it has a lot of active development. There are not a lot of examples out there on Stack Overflow or on the internet in general, but I'm starting to see more and more of them as we go along. They're both very scalable: both Spark and Dask work from a single node up to multi-node compute. This is actually coming from the documentation website for Dask, and you can read a lot more there; this is just a quick summary.

This is a little bit about the ecosystem. You have these data structures: the arrays, the bags, the data frames. And you can actually parallelize scikit-learn with Dask, which is probably something you really want to consider. It has low-level APIs, and it uses pandas-like syntax. I'm not going to go too deep into Dask; my purpose is really just to present it to you, because I heard there's a lot of interest in it, and to encourage you to go explore it, bring it in, and see if it actually works for you.
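Just to give you a flavor of that pandas-like syntax, here's a minimal sketch; the file pattern and column names are purely illustrative, not from my slides.

```python
import dask.dataframe as dd

# Read CSVs that together may be bigger than memory; Dask splits them
# into partitions and only materializes chunks as they are needed.
df = dd.read_csv("transactions-*.csv")

# Same look and feel as pandas: filter, group by, aggregate.
summary = (
    df[df.amount > 0]
    .groupby("customer_id")
    .amount.sum()
)

# Nothing has actually run yet; .compute() triggers the task graph.
result = summary.compute()
print(result.head())
```

The point is that if you already think in pandas, the mental model carries over almost directly, and the laziness plus partitioning is what lets it handle data bigger than your RAM.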
I think it has a lot of applications, and if you work in the R and Python ecosystems you can probably use Dask for some tasks and interchange them. Also, Dask has its own task scheduling system; unlike Spark, which is more of a general distributed computing platform, Dask does its own task scheduling for the processes it's going to run. Like Spark, Dask also builds a DAG, a directed acyclic graph, of the tasks you're doing, and it runs it and scales it appropriately. Dask actually has a really nice visualization engine as well. The way it works, without getting into too much detail, is that it can work on data on disk; you can work with relatively large files, files that are bigger than your memory. It breaks them up into small chunks and iterates over them.

So I saw a talk on YouTube from a recent Python conference, and the speaker has a whole talk on this; I have the YouTube link at the end of the slides if you want to go watch it, and I really encourage it. I took some of those slides just to highlight the speed aspect of Dask. She created some synthetic data, on the order of 100 million records up to a billion records, and did some comparisons between Spark and Dask. So look at this: Spark took about two and a half minutes, and this is on a laptop, a MacBook Pro, and Dask takes 37 seconds. For a billion records you still see a good order-of-magnitude increase in speed. And again, this is on a single machine. So the purpose is really to highlight that it can be a viable option, especially if you're working with large data sets and you need to parallelize operations. Because maybe you have small data, and the other side of the equation is not so much big data, but having to do something small many, many, many times; I'm going to show you an example of that in a few minutes.

Here's some of the way they set it up: both were in local mode, with different options for the Spark driver and the executor memory. So what happens, at least comparing Spark to Dask, and I'm not arguing one is better than the other by any means, I'm just giving you data from comparisons people have done, and this is just one data point, but I've seen other comparisons, is that Dask is faster for certain things. Spark is heavy: Spark uses a JVM, Spark has the driver, Spark has all these things, and you see that overhead in these numbers. Dask doesn't. Matt Rocklin has spoken a lot about Dask, he's the developer, and certain folks from the Spark community, even Matei Zaharia, who's here today, and Holden Karau, have spoken about Dask at some point as well. I told Matei earlier today that I was going to do a little bit of Spark bashing, just for fun. Anyway, you can see some more performance metrics. The premise is that if you're using Dask, especially if you're transitioning from Spark to Dask, you don't have to deal with Java heap space errors, because there's no Java in the equation. You don't have any containers managed by whatever orchestrator you're using. There are some slides about error messages; the Java error messages you get from using Hadoop MapReduce or Spark are pretty cryptic.
You can't really understand what those are, whereas on the Python side you get used to friendlier error messages. And there are the profiling tools. Those are the kinds of promises Dask makes compared to Spark. This is just an example of the visual profiler; you can run it while a job is running, and it's all native, all built into the package. You can run this on your local machine.

Again, these are opinions. The good: it's all Python. It uses a pandas-like API, if you're familiar with that. It's fast. You can install it on your local machine with pip, or actually pip or conda; you don't need to install any additional libraries. And yes, it's still growing, but like any open source project, that's what happens: people start using it, people start adding to it, people start talking about it, and it gets better, and so on. So personally, I'm going to start using this more, and I'm going to start teaching it, because I think it's a good tool for certain use cases, and I believe it will be a viable option for certain types of tasks, at least on the Python side.

On the flip side: Dask doesn't support SQL-like syntax the way Spark does with Spark SQL. So if you come from a SQL background and you don't know Python, you might want to rethink this. And Parquet: a lot of times when you're reading from distributed systems, your data might be in Parquet, ORC, or some other file format. Dask doesn't work with those types of files very well yet, but I would expect that to change sometime soon.

All right, so enough about Python, and now let's talk about the love. My love, R. So who's an R user in the room? Raise your hand, or scream. About half of you? OK, awesome. R, R, R. Oh, come on guys, you've got to laugh. Thank you. All right, so we're going to start with a little exercise. Do you know, and I'm sure you've seen this example before, what is the probability that two people in this room have the same birthday? Not the year, just the same day and month. You actually need about 23 people in a room for there to be a 50% likelihood of at least two people having the same birthday. So we're going to talk about this example with data; it's fitting. As we move to the R side, I'm switching gears a little bit and talking more about the parallelization aspect. But the idea is that with R, with R's parallelization and some of the other libraries you can use, you can work with this sort of medium-ish data. You can do both: medium-ish data processing and parallelization. When you take those two things together, I think you get the best of both worlds.

OK, so this is a function that is just randomly sampling and estimating the probability that two people share a birthday. First things first, we're just going to replicate this: we set up 100,000 trials, we have this function, and we sapply it, so we're just iterating it a hundred different times and collecting the results, something like the sketch below. I ran this code earlier today on my laptop, again a MacBook Pro, and it took a little less than three minutes. This is linear processing: we're running one iteration after the other after the other.
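Here's a minimal sketch of what that serial version might look like; the function name pbirthdaysim and the exact simulation are my reconstruction of the idea, not the actual code from the slides.

```r
# Estimate the probability that, in a room of n people,
# at least two share a birthday, using ntests random trials.
pbirthdaysim <- function(n, ntests = 100000) {
  shared <- replicate(ntests, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(shared)
}

# Serial version: estimate the probability for room sizes 1 to 100,
# one after the other.
system.time(
  probs <- sapply(1:100, pbirthdaysim)
)
```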
Now granted, this is pretty simple code, not big by any means, but every iteration takes some time, so this is a really good example for parallelization. So let's talk about foreach. If you're familiar with foreach, great; if you're not, foreach is a library in the R ecosystem that allows you to do all sorts of parallel processing, and you can specify different kinds of engines on the backend. You can do multi-core processing, you can do distributed processing across multiple machines, you can set up a cloud infrastructure, you can set up a Spark infrastructure, and we'll talk about that in a little bit.

So the next thing is we're actually going to use foreach, and the way you use it is you register the different backends. doMC does multi-core, doParallel is another; there are a couple of different libraries in the R ecosystem that do parallel processing, so you can use any of these. And at the end I'm going to talk a little bit about registerDoAzureParallel, which shifts the parallelization off into the cloud. Essentially what you're doing is replicating whatever it is that you're doing, over and over, using whatever backend you've registered.

To do this with doParallel, we call the libraries and we create the cluster with a makeCluster command; here it's makeCluster(2), which just means use two local cores. You make the cluster, then you register it, and what you see here is the foreach call: I'm passing in the parameters of my simulation, then %dopar%, and my function. What happens is it starts different R processes on different cores, iterates over them, and then collects the results.

I will caution you, though, when you're working with parallel processing, even on a local machine, because it depends on how you structure your code and on the type of task. One case is something like this: something small that doesn't consume a lot of memory but takes a long time to compute. The other case is maybe it's still this, but your result is very big. If your result is very big, you're going to break your parallelization, because you're going to run out of memory. Remember, R runs in memory, so even though you can spread things out and parallel process, you still have to think about the memory implications of what you're working with, even in the cloud, even when using Spark, for example. That's not necessarily new.

So here we do the same thing. We run it locally, and actually, is this two cores? These are not my benchmarks, I didn't get to run this part on my laptop, this is somebody else's benchmark, but there was a decrease in runtime. Actually, never mind, this is with 16 cores; the example before was registering two, but this is using 16 cores. You don't get a 16x decrease in time, because there is some overhead and things like that, but it's about 11 times faster, 20 seconds versus 220. And you have to be careful to disable multithreading in the BLAS, especially if you're using Microsoft R Open, because it uses the Intel MKL math libraries under the hood.
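Putting those pieces together, here's a minimal sketch of the foreach plus doParallel pattern, assuming the same pbirthdaysim function from the serial sketch above; the parameters are just illustrative.

```r
library(foreach)
library(doParallel)

# Register a local backend with two cores.
cl <- makeCluster(2)
registerDoParallel(cl)

# Same simulation, but each room size goes to whichever worker is free;
# .combine = c stitches the results back into one vector.
probs <- foreach(n = 1:100, .combine = c) %dopar% {
  pbirthdaysim(n)
}

stopCluster(cl)
```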
So something like cross-validation, for example. If you've seen this before, you're probably familiar with the caret package in R, which allows you to do all kinds of predictive modeling and specify different models and parameters and things like that. Again, this is just illustrative, but you basically specify the parameters and then you can run it against different backends. I don't have the code here, but there's a whole blog post; Max Kuhn, the author of the caret package, has a post with benchmarks of using caret with a parallel backend on different kinds of platforms, so you can go see that. Again, we're still talking about training a model: you might have a small data set, but you might be training that model multiple times, or bootstrapping, or whatever it is where you're sampling with replacement and running something over and over. So again, still no need to scale up; within reason, you can do this locally.

The other option, again by registering a backend, is to go to the cloud. As you know, Microsoft is a strong backer of the open source community and has a lot of support for R built in, and one thing you can do is start up a cluster. This is not a Hadoop cluster; it's basically just a cluster of machines that run R, and you're just farming out your computations. The nice thing is that with these packages you can do all of that from a single machine: you use your laptop, or whatever terminal you're using, as your console, specify your operations, and then create and register your cluster on Azure. There are examples of how to do this in the slides coming up: you set your parameters, you set your credentials, and basically you just farm the computations out into the cloud. You can also use a combination of types of virtual machines in the cloud: standard ones, or low-priority machines, which come at a lower cost.

Basically, what you do is something like this. The only thing we're doing differently here is specifying the backend. We still have our birthday simulation function; we load the doAzureParallel library; we have our credentials JSON file, which you get from your Azure account; and then you specify the cluster. This is what a cluster definition looks like: you can specify the type of machine you're going to use, the size of the machine, and so on. And then you register it with registerDoAzureParallel. So look, the only thing that changed: I'm running the same foreach loop, which could easily be one to a million, with %dopar% and my function. All it's doing is shipping the data and the function off into separate processes, and you can scale out. Again, this is not big data; in this case, maybe not even medium data, but it's embarrassingly parallel, meaning it's something you can do over and over. And the caveat for embarrassingly parallel is that every operation is independent. If you're working with something where you need interconnectivity, or you need to exchange information between the processes, then it's not embarrassingly parallel.
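Roughly what that looks like with doAzureParallel, as I understand the package; treat the file names and configuration as placeholders, this is a sketch rather than the exact code on the slide.

```r
library(doAzureParallel)

# Credentials and cluster definition live in JSON files generated from your
# Azure account; the file names here are placeholders.
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")   # VM size, node count, etc. come from the JSON

# Point foreach at the cloud backend instead of local cores.
registerDoAzureParallel(cluster)

# Same foreach loop as before; only the backend changed.
probs <- foreach(n = 1:100, .combine = c) %dopar% {
  pbirthdaysim(n)
}

stopCluster(cluster)
```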
So, running this on the cloud with an eight-node cluster, it was, well, actually a bit slower, but you see the scalability. Long story short: you can use the foreach package with different kinds of backends, whether local or not, and again, this is with R. And with Dask you can do roughly the same thing, because you can either run things locally, multi-core, or scale out to other Python processes, whether they're in the cloud, in a cluster, or wherever. I'll share these slides and make sure they're shareable; there are a couple of links to blog posts that link to GitHub, with examples you can follow to get this going.

All right, so now let's talk about the other part of this, which is distributed data processing. Yes, there are times where you just have to go to something like Spark, because your data may already be distributed. If your data already resides in the cloud in a distributed fashion, then it probably makes sense to use something like Spark. Whereas in the earlier examples, maybe your data's already in the cloud, maybe not, but a lot of times you're pulling data from databases, or you have flat files lying around, or whatever the source is. What I'll say is that you want to minimize the movement of data as much as you possibly can. So think about where your data is, how it's stored, whether it's distributed or not, and pick the right tool.

But in the case where you actually need to run Spark, we're going to talk about sparklyr. sparklyr is a package written by the RStudio team, who I personally think are great, with everything they've done to make R such a great ecosystem, with the RStudio IDE and all the other packages they've written. What sparklyr allows you to do is, essentially, as we'll see, register a Spark cluster instead of a set of R processes. It lets you shift computations into Spark from R, using the dplyr syntax. So it fits very well into the tidyverse, and it provides very, very easy integration. The nice thing is, if you've never done anything in Spark, it really lowers the barrier to entry for using Spark that first time in a distributed mode; I'm not talking about using it in standalone mode on your laptop. It interfaces to Spark, it interfaces to Spark ML, and you can run Spark SQL, all from within R. Microsoft has a utility called AZTK which allows you to start a Spark cluster, and there are examples on its GitHub; you can also do this on AWS, and also with H2O.

So this is what it looks like. The assumption here is that you know the cluster URL: perhaps you started the cluster yourself and you know it, or maybe you're working with your organization's Spark cluster. But the idea is the following. Using sparklyr, you connect to the cluster with the cluster URL; you create your Spark context with spark_connect. And then this uses dplyr syntax: we have a data set and we're just going to do some summarization and filtering with dplyr. As you see here, we start with copy_to: the flights data set, from the nycflights13 package you see in the dplyr examples, and copy_to copies that into the cluster. That's just for the purposes of the example.
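A minimal sketch of that flow, assuming a known Spark master URL; the URL and the particular columns are placeholders, not my exact setup, but the shape is the same as what's on the slide: connect, copy the data up, then run dplyr verbs against it.

```r
library(sparklyr)
library(dplyr)
library(nycflights13)

# Connect to an existing Spark cluster (the URL is a placeholder).
sc <- spark_connect(master = "spark://my-cluster:7077")

# Copy the flights data into Spark, just for the example.
flights_tbl <- copy_to(sc, flights, "flights")

# dplyr verbs get translated into Spark SQL and run on the cluster;
# nothing comes back locally until collect().
delays <- flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarize(mean_delay = mean(dep_delay)) %>%
  collect()

spark_disconnect(sc)
```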
And then we actually run some simple operations with dplyr: filtering, group by, summarize. All of this gets translated into SQL, Spark SQL code that runs on the cluster. So your data's already on the cluster, and all the operations you're doing happen on the cluster itself; all the computation takes place there. And if you use Spark, you know that Spark is lazy: you don't really get anything back until you either collect your data set or save to disk. Spark keeps track of all the different operations and then runs them, so everything stays in the cluster until you actually collect it. But also be careful, because even when you're using sparklyr, you still have a Spark driver, so keep that in mind. You're a little abstracted from the more core Spark stuff, the Java side of Spark, but remember, there's still a JVM driver process under the hood, and if you're going to collect many gigabytes of data from your Spark cluster but you only have a few gigabytes of memory on your machine, you're kind of out of luck. I was going to say an expression in Spanish, but I'm not going to; you probably know what I was going to say. And again, this is dplyr syntax.

One thing, and I'm just going to close with this: we at Microsoft have been really big proponents of the open source community, as you know Microsoft just acquired GitHub, and also of the R ecosystem. So I would really encourage you to use some of these different products and services, especially if you're an R user. We just published this about two months ago; it's called the R Developer's Guide to Azure, and it links to all of our different offerings that support R natively. The first one is the Data Science Virtual Machine: that's just a virtual machine that comes preloaded with a lot of the open source software you would use to get started with any data science exercise. It has R, it has Python, it has TensorFlow; you name it, it's in there, and it's already pre-configured to work, so you don't have to do much with it. I would say, though, use it more for development; I wouldn't necessarily use it for production, because there's so much stuff in it that things can break. Then HDInsight, and ML Services on HDInsight: HDInsight is our cluster compute offering for Hadoop, so you can start up a Hadoop cluster, and you can start a Hadoop cluster running Spark on it as well. Azure Databricks: hopefully you were here earlier today listening to Matei. Databricks is now what's called a first-party service, so it's natively built into the Azure platform, and you can start up a Databricks cluster pretty easily with Azure. And of course Databricks has the notebooks, where you can do different kinds of cells. But unlike Spark on Hadoop, which is the HDInsight offering, Databricks is a pure Spark cluster that's been optimized to run in the cloud, so you might see fewer issues with things like memory management, which are taken care of for you.
We have Machine Learning Studio, and some of the things we've talked about before. Azure Batch, for batch processing out in the cloud. Azure Notebooks is a new thing, the same idea as Jupyter notebooks. And of course our SQL products. With SQL Server, and I don't know if you knew this, but if you're running SQL Server, especially the more recent editions, you can actually run R code embedded in your SQL processes as a stored procedure. So if you already have data in a database, you can create an R model, load it into SQL, and do the scoring within SQL; you don't have to move the data around.

So, in summary. We've talked about embarrassingly parallel problems locally: you use foreach and register a bunch of cores. We talked about embarrassingly parallel with bigger data, where you can use multiple cores or register a cloud backend. And distributed data, non-embarrassingly parallel: for those use cases, use the tools that are meant for them, so use Spark in that particular case. As I said before, leverage databases when you can; don't extract data from a database, load it into R, do your analysis, and shift it back. Try to be more conscious about that. A lot of the operations we've talked about today you can do in a database as well, so leverage your database when you can: do your aggregations, do your summarization, then extract that smaller data set. And again, just use the right tool for the job. So, muchas gracias.

Now it's time for questions, and I know you're going to have many, many questions. OK, I'm actually doing, I think, an Ask the Experts session after this. Yes, but they can ask you here. Yeah, yeah, of course, but if we don't have time. So is there a hand over there? I can't really see. There's one hand here, one hand over there.

Q: Hi, thank you for your presentation. I have a question that unfortunately is more related to Python than to R. You were talking about parallelization, and my understanding, especially when talking about R, is that it's good to have a cluster in order to get the most out of parallelization. But you haven't mentioned anything about that with Dask. Do you know if it's easy or complex to use a cluster with Dask?

A: No, it's actually, again, you can register a cluster of machines as your backend. So Dask scales wherever: locally, on a cluster, in the cloud. There are other Python packages, part of the Dask ecosystem, that you use to register your backend and shift off those computations.

Q: They are part of the Dask ecosystem?

A: I believe so, yes.

Q: Thanks.

We keep going. Is there any further question? There's a hand over there.

Q: Hi. I have another question, frankly also related to Dask. I've been playing with it a little bit over the last few months, and I would like to know, from your perspective, to what extent it fits well with certain formats. Parquet, I saw on the slide you said it works with Parquet, Avro, or maybe other data storage formats that we can extract the info from.

A: I'm not sure. I don't know the extent of all the formats. It supports sort of your typical ones, right?
Your typical text-based formats, delimited files, JSON. I'm not sure; it's probably in the documentation, but I'm not fully familiar with it.

Last question, because afterwards you can go to the Ask the Experts session and keep asking him. Yeah, of course. But we're going to give a chance for one last one here. OK. The microphone is moving... no one? Three, two, one. OK. Muchas gracias.