Good afternoon, everyone. Can everyone hear me, even in the back? All right, cool. Let's do a quick level set; I want to get an idea of what everyone's skill set is. Who's familiar with pandas? Oh wow, everybody, okay. How about machine learning fundamentals, like gradient descent, error metrics, things like that? All right, cool, so we're in good shape.

So yes, I'll be talking to you today about rapid prototyping in data science with big data and Python.

A little bit about me and the company I work for. I'm a senior data scientist and an instructor at Metis. We have a couple of different offerings; all we do is data science training. We offer a full-time bootcamp, a three-month bootcamp, and that's what I teach. We have evening courses for people who want to learn a specific skill or just want to dabble and aren't ready for the full thing, we have online courses with the same kind of setup, and we do corporate training as well. So a number of offerings.

A couple of key assumptions about this talk. Ideally you have some machine learning experience and familiarity specifically with gradient descent. If you don't, that's okay; you'll still get a lot out of this talk, but there might be some aspects of the pipeline I talk about that seem like magic. Don't worry, it's just math. I'm also going to assume that you're familiar with scikit-learn and pandas, so when I get to my code, it'll look exactly like the stuff you've seen and modeled with. And we're going to assume that all prototyping, from beginning to end, is done with this thing called the PyData stack: your typical NumPy, pandas, SciPy, scikit-learn.

All right, I want to parse this title up a little bit. Let's talk about this thing called rapid prototyping. Basically, the process always starts the same way: I have some sort of problem I'm trying to solve, I generate some ideas, and then, you know, we all need to build some sort of prototype, some sort of software, to answer that question. Well, this can be very difficult in data science, because we don't always know the right solution, so it's an iterative process. We build our first prototype, we create some metrics or constraints to test it with, and then we analyze those metrics and ask: did we answer the problem or not? Usually that leads to some sort of refinement; rarely do we get it right on the first go. That leads to new ideas, we build another prototype, and we do this as rapidly as we can. We don't want anything to bog down our process.

So how do we apply this rapid prototyping scheme to data science? The first question is always: what is the problem? In this case, maybe we're some sort of online advertiser, and up to this point we've been showing everyone the same ad. We have a hypothesis that if we can segment markets and show each one a targeted ad, we'll get a better response rate. So we collect some data. What age group are these people in? What are their interests: are they into cars, or, I don't know, outdoor adventures? And then location, geographical location. Are you in the Midwest? Are you in Ohio? Are you in California?
Then we use that to try to separate these different groups. You can do that a number of different ways: you can use a supervised approach like classification, or an unsupervised one like clustering. So pretend we had some sort of oracle that let us know there were actually two groups, and we tried one of these out, say classification, and we got two groups. We're on the right track, but we misclassified one of the red X's. We would iterate and, hopefully, by making some adjustments, capture the correct groups, or at least a better version. So again, it's an iterative process: rapid prototyping with regards to data science.

All right, now let's talk about the other part of the title: big data. A couple of common definitions. Gartner, if I distill it down for you, talks about high-volume and/or high-velocity and/or high-variety information: large, coming in fast, and coming from disparate data sources or with some sort of nested structure. Wikipedia, a little more simply, just says data that's so large or complex that traditional techniques completely break down. I'm going to talk about large data, and the solution I'll show you at the end will also let you handle high-velocity data. What am I not going to talk about? This idea of complex data; that's a completely separate issue that I won't discuss today. So, to wrap that up: data that's large in size, yes; coming in fast, yes; complex, no.

So we have this pyramid here, data at different scales, and the question is: fundamentally, where do we draw the line? What is big data? It seems like this changes every day. Do we draw it when we get into megabytes or gigabytes and above? Somewhere beyond that? Why don't we take a quick poll. Who thinks it's gigabytes and beyond? One, two; okay, two brave souls. How about terabytes and above? More people. And petabytes and above?

What if I told you you're all right and wrong? It's a trick question. Let me define it for you: big data is data that is too big to fit in RAM. That's fundamentally what it is. Let me give you an example. It could be at the kilobyte scale: this particular Arduino only has 256 kilobytes of memory. My laptop here has 16 gigs, so for me a big data problem starts at the gigabyte scale. It really all depends.

At this point you're probably asking yourself: why does it have to fit in RAM in the first place? Anybody? That answer is half the equation. So let's talk about it. If I were looking purely at speed, look at the memory hierarchy: I go from hard drive or solid state, move up a little in speed, and I've got RAM, cache, CPU registers. If I wanted pure speed, where would I be? Yes, registers, so RAM is not the answer on speed alone. If we look at it a different way, at capacity, how much data can I hold? Hard drive and solid state is where you get maximum capacity; RAM, a little bit less, maybe on the gigabyte scale typically.
Cache, you're looking at maybe 12 megabytes, and CPU registers, like half a megabyte, typically, from an order-of-magnitude perspective. But if I look at speed, it's the complete opposite relationship: CPU registers are wicked fast, and speed drops as I go down the pyramid. I also pick up cost as I go up, so I can't just have an array of CPU registers; that's not going to be cost-effective. To capture this in a really simplistic diagram: cache and registers have great speed but awful capacity; disk has awesome capacity but no speed; RAM is this nice balance of capacity and speed, and that's why it's the sweet spot. That's where we want to be.

All right, a couple of common solutions to this problem of big data. Number one, probably the most obvious: we can just sample our data. If I can't ingest the whole thing, let me ingest a subset. A couple of issues, though. One: if I'm going to sample, how do I know that my sample is representative? Let me give you an example. Say reality is some sort of cubic function, but we take a random sample, maybe somewhere in the middle, maybe somewhere else, and it looks linear. If I build a model on that, I'm going to be way off. Two: what happens as my data continues to grow? Early on, when I don't have a lot of data, the sample might pick up the trends, might do a good job capturing the signal. But over time, because my RAM is not increasing, that sample captures a smaller and smaller percentage of the overall data. Am I really capturing the trends or signal in that data? Probably not, and it gets worse over time. And three: how do you quickly capture trends that change over time? A simplistic view of this: pretend we push out some product that's new to the market, it generates a lot of interest, and we see a spike in sales; after it's been out for a while, sales tail off; then we introduce a new feature or a new model and it jumps up again. Actually, a better example is stock prices, where massive changes happen very quickly. How do we capture that with sampling? Probably not effectively.

So, a summary of sampling. You can use it to get data to fit into RAM. The pros: I can use my PyData stack, so thumbs up there, and I don't have to recode anything, so I keep my rapid prototyping. But everything comes with a but. The cons: you have to make some extremely strong assumptions that usually don't hold. One, that your sample is representative of the population. Two, that you're somehow capturing the signal in this mountain of data using a tiny little piece of it. Three, that your population is not changing over time (think stock prices again).
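To make that cubic example concrete, here's a quick hedged illustration with synthetic data (the scale and window are made up): reality is cubic, but a narrow sample looks linear, and the linear model is way off outside the sampled window.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-10, 10, 100_000)
y = x**3 + rng.normal(scale=50, size=x.size)   # "reality" is a noisy cubic

# Sample only a narrow slice, as if RAM forced us to subsample
mask = (x > -1) & (x < 1)
slope, intercept = np.polyfit(x[mask], y[mask], deg=1)

# The fit looks fine inside the window but fails badly outside it
print(np.mean((slope * x + intercept - x**3) ** 2))  # huge out-of-sample error
```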
Another very common approach is to completely move away from the PyData stack, or at least portions of it, and use Apache Spark. Apache Spark could easily be its own talk, and I'm not going to go into a lot of detail. For our purposes, all you need to know is that it takes the data that's too big to fit into RAM and breaks it into chunks, so each chunk can be fed into memory, processed, pushed out, then the next one processed, and so on. We get to use all the data. That's great.

So what are the pros here? You can continue using Python. Spark is built in Scala, so it's not native to Python, but we can still use Python via the PySpark API. And we can use all the data for modeling, so this addresses some of the issues we saw with sampling. That's awesome.

What are the cons? You have to say goodbye to scikit-learn; you have to completely change your machine learning library. So if you've put in a month, or even a couple of weeks, of code and work, you've spun up a number of Jupyter notebooks, and you feel comfortable with your models: too bad, you have to start all over. You have to switch from scikit-learn to Spark MLlib, and it's not a one-for-one translation. There are a lot of scikit-learn algorithms not included in MLlib; MLlib is pretty feature-rich, but not to the extent that scikit-learn is. And of course there's a learning curve. So if you're in the middle of your project, coming up on a deadline, needing to push a model into production: "oh wait, time out, guys, I need to revamp for the next two months and redo everything I just did." Not so great to tell your manager, right? So that's the big problem here: Spark addresses all the issues we saw with sampling, but it kills our rapid prototyping process.

So here's the big question: what if there were a way to take our PyData stack and somehow get the same capabilities we saw with Apache Spark? That'd be awesome; who wouldn't be for that? Guess what: there's a way. That's why I'm here today. Enter Dask, in combination with scikit-learn. These are both native Python tools, they work really nicely together, and they provide some really cool capabilities, only some of which I'm going to talk about today. If you've never seen Dask before, you should definitely go home immediately and read about it, because it's a really powerful tool.

Let me give you a high-level overview of what's going to happen. Imagine this rectangular box is the PyData stack. I start out with data that's too big for RAM, so I need to split it into chunks, such that each chunk fits in RAM, and then I push each chunk into a model. Everything on the left-hand side is controlled by Dask: Dask splits the data into chunks and controls the flow of those chunks in and out of the modeling piece. And the modeling, all the machine learning heavy lifting, is done by scikit-learn. So I can put these two pieces together and they work as one.
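Before going deeper, here's the shape of that pattern in a minimal, hedged sketch; the file and column names are placeholders, not from the talk, and the detailed version comes later.

```python
import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

df = dd.read_csv('too_big_for_ram_*.csv')    # Dask splits the file(s) into chunks
model = SGDClassifier(loss='log')            # an online learner from scikit-learn

for i in range(df.npartitions):              # one chunk in RAM at a time
    chunk = df.get_partition(i).compute()    # materialize this chunk as pandas
    X, y = chunk.drop('target', axis=1), chunk['target']
    model.partial_fit(X, y, classes=[0, 1])  # update the weights, release the chunk
```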
That was the high-level overview; now let's go down a level. We split our data just like before, and we instantiate our model: some sort of stochastic gradient descent regressor or classifier, more on that later. Dask pushes the first chunk into the model, and the model updates to become a better model; I'm feeding it training data.

If we zoom in on that piece, we can see what's happening. Because I'm doing this stochastically, I'm feeding in one observation at a time. If you're familiar with gradient descent, the whole point is that instead of doing batch gradient descent, where I feed in the whole partition and then update the weights once, I update the weights after every observation, so I converge much more quickly to what the weights actually should be. I push the first observation into the model, the model uses the gradient to adjust the weights, the observation gets pushed back out, then the second observation comes in, we update because of the gradient, it gets pushed back out, and we continue until we've exhausted all the observations. I should mention that, in the language of online learning or stochastic gradient descent, this is a single pass: it happens only once. You could definitely make multiple passes; you would just want to shuffle the observations rather than sending them in the exact same order as before, otherwise you run into some localization issues.

So we've read through chunk number one and updated our model. Dask pulls that chunk out and feeds in chunk number two, the exact same process happens, our model updates, and chunk number two is taken back out. Chunk number three goes in, the model updates, and so on.

All right, the big question: how do we actually implement this? You've seen theoretically how it works, so here we go. For the pandas users out there, the beautiful thing is that Dask is actually built on top of pandas, so the calls are virtually all the same. All I do is call dask.dataframe, and here I'm reading in with HDF5. You don't have to use HDF5; I happen to like it because of its on-the-fly compression capabilities, specifically with Blosc, if you're familiar with compression. But you can use CSVs, Parquet file formats, SQL tables, or bcolz.

Like I mentioned earlier, on the top we have Dask, and on the bottom we have scikit-learn: SGDClassifier and SGDRegressor, which are what are called online learning algorithms. The model takes a number of parameters, way more than what I have listed here, but these are some of the key ones you should be aware of. We have to provide some sort of loss function, some way for the model to know whether it's moving in the right direction, what it's chasing. Penalty is regularization: whether you want it or not, and in which flavor. The learning rate: think step size, for those of you familiar with gradient descent. And power_t, which I should probably explain.
This might be new to some of you. Think of power_t as how you learn. If power_t were one, this would work great on an i.i.d. data set; in other words, if you were playing a game like checkers, where the rules never change, a power_t of one works really well: very quickly you figure out the rules and then you can essentially stop learning. A power_t of zero would be for playing against an adversary that's constantly changing the rules, so you have to be very adaptive to what's happening in the world. A power_t of 0.5 is a great balance, because you don't want to be overly eager from a modeling perspective and chase noise, but you still want to be adaptive to changes and trends.

There are a number of loss functions; I haven't listed them all here, but these are some key ones. Say you want to do classification: you can do an on-the-fly linear SVM or logistic regression, so you can use hinge, squared hinge, or log loss. If you want to do regression, there's good old OLS, or you can use Huber loss, which is sort of a robust version of mean squared error; it's much better about not chasing outliers. And then there's epsilon-insensitive, if you want support vector machine regression. Like I said, reference the docs; there are even more loss functions, but these are the key ones to be aware of.

Some regularization options. Obviously, if you don't want regularization, you can set the penalty to none. L1 typically leads to sparse models; there's L2; and elastic net is just some combination of L1 and L2, in a proportion I can set with the l1_ratio parameter. And alpha is my regularization strength.

All right, we've talked about a lot of concepts at this point, so let's motivate with a real example; I'll show you real code and output. Has anybody worked with the HIGGS data set before, just out of curiosity? Nobody, okay. Who's familiar with the UCI Machine Learning Repository? Okay, just a few hands. Well, this is an excellent resource if you want to find curated data sets, kind of like Kaggle, for when you just want to try out some classification algorithms or some clustering; there's a whole host of data sets for the taking.

So I chose this pretty famous one called HIGGS. A bit about the details: it has 28 features. The first 21 are just sensor measurements, and the remaining seven are derived features, think feature engineering, designed to help you discriminate between the two classes. It's a binary classification problem: zero means the sensors are just picking up noise, one means something interesting actually happened. For the purposes of this talk, to keep it brief, I'm not going to go through the entire data science pipeline, show you EDA, and split into training and test sets; just know the data has been split already. For reference, the training set has 8.8 million examples and the test set has 2.2 million.

All right, let's go ahead and get our data. We import dask.dataframe and pull in our training and test sets. You may have noticed that I'm reading both from the exact same file, higgs_data.h5. What's cool about HDF5, if you haven't used it before, is that it works kind of like a dictionary: you provide a key, so I can have multiple data sets all within one master file, which is really cool.
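In code, that loading step looks roughly like this; the file name follows the talk, but the HDF5 key names are my guess at the slide, so treat them as placeholders.

```python
import dask.dataframe as dd

# One HDF5 file, multiple data sets keyed like a dictionary
train = dd.read_hdf('higgs_data.h5', key='/train')
test = dd.read_hdf('higgs_data.h5', key='/test')
```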
All right, I need a couple of other key libraries. Because I'm using gradient descent, I need to scale my features, and I need to be able to do that on the fly, so I import StandardScaler; this gives a mean of zero and unit standard deviation. I of course pull in my SGDClassifier, and since I'll be looking at the log loss on the test set (more on that to come), I need to import that as well.

Then I instantiate my StandardScaler object and create an empty list for log loss values. Again, this will tell me, every time I send in a chunk and update my model, how that model predicts on the test set over time, so we'll eventually see a nice smooth curve. I also set a counter to zero and grab the total number of partitions. When you tell Dask about a data set that's too big to fit into RAM, it splits it up, parses it into what are called partitions, or chunks, and the npartitions attribute is just Dask telling you there are 10, or 15, or 9 of them. I'm going to use it to keep track of which partition number we're on and show my progress over time.

Next we instantiate our model. This particular one is logistic regression; you can tell by the loss being log. I'm using elastic net as my regularization, with an l1_ratio of 0.5, a perfect split, and an alpha of 0.01; you can tweak all these things. And I set the random state to 42. Anybody know why I set the random state? Looking at you, Ali. That's actually a good answer; I should probably leave it at that one. But okay. Right, reproducibility: if I run this on my machine and then Ali runs it on his machine, we'll get the exact same answer. You can imagine there might be a little bit of shuffling that happens, so we'd get very similar answers, but they might be just a little different if we don't set the seed. I like that, though.
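Here's a hedged reconstruction of that setup. The hyperparameters are the ones called out above; the learning_rate and eta0 settings are my assumptions, since invscaling is how power_t comes into play in scikit-learn's SGDClassifier.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

scaler = StandardScaler()           # mean 0, unit standard deviation, fit on the fly
test_loss_history = []              # test log loss after each chunk
counter = 0
n_partitions = train.npartitions    # Dask reports how many chunks it created

sgd = SGDClassifier(loss='log',             # logistic regression ('log_loss' in newer scikit-learn)
                    penalty='elasticnet',   # mix of L1 and L2
                    l1_ratio=0.5,           # a perfect split between the two
                    alpha=0.01,             # regularization strength
                    learning_rate='invscaling',  # assumed: this is how power_t applies
                    eta0=0.01,              # assumed initial step size (required for invscaling)
                    power_t=0.5,            # the balanced setting discussed earlier
                    random_state=42)        # reproducibility
```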
All right, let's actually run through the different partitions now: a simple for loop using range on the number of partitions. This data set has the features and the target all in one data set, so I need to split those, as those of you familiar with scikit-learn (which is pretty much everybody) will recognize. Then I need to standardize, using this partial_fit function. Has anyone used it before? No? It's like fit, except I can do it iteratively. Not all of the algorithms support partial_fit, just a small subset, but you can with standardization, so I fit on the training set.

A little aside here: a lot of people make the mistake of taking the data set at the beginning, standardizing all the features, and then splitting into training and test. Don't do that; it's awful practice, because you get information leakage. You really should split first, fit on your training set, and then use that to transform your test set. That's the proper way to do it. So here I'm fitting the scaler on the training chunk.

Then, for the model fit, there's a little bit going on. Notice that I have to provide the ground truth, the actual targets, and, because I'm doing this on the fly, I have to tell the algorithm up front how many classes I have. You can imagine if I didn't: say my first partition had only two of the classes when there were really three; the model updates, then the next partition comes in with all three, and it's going to break. So you just have to let it know up front how many classes there are. So I'm fitting the model on the transformed, standardized training chunk, providing the ground truth, and letting it know how many classes to look for; that's an update on the first partition.

Then I use predict_proba on the transformed test set to give me back probabilities. Logistic regression will say there's, say, a 70% chance that this is class one versus a 30% chance it's class zero. That's actually a pretty cool thing, a big differentiator between logistic regression and SVMs: with SVMs you just get the class back, not probabilities. Something to keep in mind.

Has anyone used log loss before? I guess I shouldn't gloss over that, so let me explain it a little. You're all familiar with accuracy: did I get the class right or wrong? Log loss is a better measurement. Your algorithm has to output probabilities, and essentially log loss looks at the probability your algorithm assigns to each class: is it skewing in the right direction, and how confident is the model? It penalizes accordingly, so if the model is very confident in the wrong class, it gets hammered. I append that value to my log loss list. And of course, while I'm training, I want a printout telling me how I'm doing on the test log loss and which partition number out of the total we're at, and then I increment my counter.
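Putting that loop together, here's a hedged sketch. I'm assuming the target is the first column (as in the UCI HIGGS layout) and that the 2.2-million-row test set fits in memory once materialized; neither detail is shown verbatim in the talk.

```python
# Materialize the test set once (assumed to fit in RAM)
test_df = test.compute()
X_test, y_test = test_df.iloc[:, 1:], test_df.iloc[:, 0]

for i in range(n_partitions):
    chunk = train.get_partition(i).compute()       # one chunk into RAM
    X, y = chunk.iloc[:, 1:], chunk.iloc[:, 0]     # split features and target

    scaler.partial_fit(X)                          # fit the scaler incrementally, train only
    X_scaled = scaler.transform(X)

    sgd.partial_fit(X_scaled, y, classes=[0, 1])   # declare the classes up front

    probs = sgd.predict_proba(scaler.transform(X_test))
    loss = log_loss(y_test, probs)                 # test log loss after this chunk
    test_loss_history.append(loss)

    print(f'partition {counter} of {n_partitions}: test log loss = {loss:.3f}')
    counter += 1
```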
All right, here are the results; this is what it spits out. This is epoch one. I'm not going to explain all of these terms, but I'll hit most of them. This t value over here is the number of observations in this partition: remember, the training set has 8.8 million observations, and here it's brought in a million. This is the average log loss on the training set; you can see the training time was about half a second; and the test log loss, the crucial one to keep in mind, is about 1.75, which is pretty awful. But it's the first partition, so it's not surprising that it hasn't done a very good job yet.

Now I bring in the next partition. The top readout is the one we just saw; the bottom one is what we want to focus on. You can see the test log loss has dropped from 1.75 to 1.14, and I'm on partition 1 of 8, so I'm moving in the right direction. I keep doing this; by the third partition I've dropped below 1 on the test log loss, so my model is getting better and better over time. I'll skip to the punch line rather than take you through each one. We get to the 8th partition, and note the t value here: 800,000, so you can see how Dask split that up. The test log loss is about 0.7, so we've dropped a whole point; now we're getting into a much better range. And this whole training run took just over a minute on 8.8 million examples. Not too shabby.

What's that, the size of the data set? So this data set isn't particularly big, about two gigabytes, but this whole process can scale to whatever you need. This is on 28 features and, yeah, 8.8 million observations.

No data science presentation would be complete without at least one graph, so let me orient you. The x-axis is the partition number: each time the model read in a chunk, it trained, and then how did it do on the test set, looking at log loss? You can see, no surprise, at the beginning it does a pretty bad job; over time everything moves in the right direction, and we actually start to converge asymptotically by the 8th partition.

All right, let's recap what we've talked about. We split, or in other words chunked, the data so we could consume all of it; we did that with Dask. We built a model by ingesting one observation at a time, stochastically, with scikit-learn. And then, really the crucial piece from a rapid prototyping perspective: even if we didn't start out using this, maybe we tried random forests and a couple of other methods first, we don't need to completely change all the code we've written. We essentially just swap out certain pieces; we're still in the PyData stack. Just minor tweaks, and we did it with a combination of Dask and scikit-learn. In other words, if I combine these two: rapid prototyping, check. We're in good shape.

I've gone through a lot of material today, so I want to step back and ask: where have we come from, and where are we now? We started with the idea that we want to follow this rapid prototyping process in data science, because solving problems in data science is hard; it's almost impossible to get it right on the first go. It's a series of trial and error: there are stock algorithms, stock approaches we take, and sometimes they work and sometimes they don't, but it's this rapid iterative process that gets us toward an answer. From a scaling perspective, usually we start with something small, maybe a small data set and a couple of stock algorithms, and build on that, but ultimately we want to ingest larger and larger volumes of data, and we can get bogged down. If I'm working on my laptop and I've maxed out my RAM, what do I do? Do I pivot? Do I use a subset? The two common, and what I'm arguing are suboptimal, solutions are sampling, where we saw the issues, and pivoting to Apache Spark, which is an amazing framework; if you're not familiar with it and you're scaling into big data, it's great. But if we make the assumption that we're trying to stay within the PyData stack, that pivot is going to cause massive problems. Both come with baggage.

And here's the big key takeaway: this newer library, Dask, in combination with the already familiar scikit-learn, gives us really powerful tools to find better solutions faster, and allows us to tackle problems that before were just completely impossible.

If you want to connect with me or have questions, you can find me on LinkedIn, you can find me on Twitter, or you can email me directly.
I've got a bunch of stuff up on GitHub, a bunch of repos I'm always adding to. If you're interested in code you can rip right from today's talk, check out the out-of-core computation repo; just know it's a work in progress, I have a lot of other things in the works, and I'm going to keep adding to it. I also have a blog where I write about all things data science and machine learning; if you're interested in things like online learning, deep learning, or how to ace a data science interview, check out my blog. Big shout-out to Metis, which gives me time to work on projects like this and supports me coming out to speak to all of you. And of course, thank you all for being here at the end of the conference, on a Sunday afternoon. That's about it for me; why don't we open it up for questions?

[Audience question about dimensionality reduction.] Dimensionality reduction is a great tool. The one problem you'll run into is that if your data set is too big to get into memory to begin with, how do you process it to even do the reduction? So that's one issue. There are other things you can do, too. Say you have a sparse data set: maybe you're doing NLP, with a bag-of-words model or something, where you have some columns with some ones but mostly zeros. There are approaches you can use, like the CSR (compressed sparse row) format, to compress all that information: you basically say, I'm just going to memorize where the ones are, and I don't care about the zeros. So I can compress a big data set into a small one, and sometimes that actually gets your data set into memory. And once the data is in memory, there are advantages to using some sort of dimensionality reduction: usually processing time, and sometimes removing noise, because you don't want your model chasing noise. So it can be a really good approach.

[Audience question about how this relates to neural networks.] Yeah, the underlying theme between both of them is this idea of gradient descent. In a neural net you push your data forward, just like we saw with Dask: we push the data in and the model updates. You have backpropagation, which is just a fancy term for gradient descent; it could be stochastic or mini-batch, but fundamentally it's gradient descent. So if anyone here is new to machine learning, the best algorithm you can ever learn is gradient descent, because it drives pretty much every algorithm except things like the analytic version of linear regression. Did I answer your question? It is the exact same process. There are different flavors of gradient descent now, like momentum and some others, but from a principles perspective it's the exact same thing.
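Since gradient descent keeps coming up, here's the single update everything above is built on, as a toy sketch for squared-error loss (the function name and learning rate are illustrative, not from the talk):

```python
import numpy as np

def sgd_step(w, x_i, y_i, eta=0.01):
    """One stochastic update: w <- w - eta * gradient of the loss
    at a single observation (squared-error loss here)."""
    residual = x_i @ w - y_i     # prediction error for this one observation
    grad = 2 * residual * x_i    # gradient of (x_i . w - y_i)^2 w.r.t. w
    return w - eta * grad

# e.g. inside a pass over the data: w = sgd_step(w, X[i], y[i])
```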
[Audience question about running Dask distributed.] Yes, I'm glad you asked that question, because there's a whole other side of Dask I didn't touch on, just for time constraints: Dask allows you to move into distributed environments. It scales to hundreds of machines with thousands of cores. It has a task scheduler, and under the hood it uses DAGs, if you're familiar with those. So really, Dask takes the same code I create on my laptop here and, when I'm ready, lets me scale it to a cluster. I can take that exact same code, my whole code base in the PyData stack, and now I can do things that Apache Spark does.

[Audience question about where the data lives.] Everything's kept on disk; Dask partitions it on disk, pulls into memory what it needs, pushes that out, and brings in the next partition. You can think of each partition as really just a pandas DataFrame. The Dask implementation supports the vast majority of the pandas calls you're used to, like group-bys and such. There are some operations that are just truly hard to parallelize, and those are the capabilities Dask doesn't have, for obvious reasons. But basically, anything you can do with pandas you can, for the most part, do with Dask.

[Audience question about Dask versus Spark.] The main thing you have to remember is that Dask and Apache Spark were designed for two different scenarios. Apache Spark really, really shines when you have massive data sets, when you get to the petabyte or even terabyte level, just by the way it breaks down tasks: it breaks them into big tasks, whereas Dask breaks everything down into really tiny tasks, and it's all built natively in Python. So first of all, if you're not a Python developer, you don't need Dask, right? There are trade-offs, of course; depending on your use case, Dask may not be the right thing. But if you're a Python developer, you're familiar with scikit-learn and pandas, and you're not dealing with truly massive data sets, consider it first.

[Audience question about chunk sizes.] Yeah, that's an area I didn't talk about. There are ways to optimize. For example, HDFS in Hadoop typically breaks data down into 128-megabyte chunks, which may or may not be useful for you. Dask, I think, and don't quote me on this, breaks it down into roughly 100-megabyte chunks, somewhere along those lines. You can set that, you can play with those parameters, you can repartition, you can do all of these things. I don't have any good rules of thumb there; it's a trial-and-error process. Maybe somebody here knows. That starts to move into the data engineering realm, which is already pushing into gray area for me.
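For concreteness, here's what those two knobs look like in Dask; the values are illustrative, not recommendations.

```python
import dask.dataframe as dd

df = dd.read_csv('big_*.csv', blocksize='100MB')  # control chunk size at read time
df = df.repartition(npartitions=50)               # re-chunk later; can be expensive
```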
Really, the only limitation is the capacity of your disk and how quickly you need the processing to happen. Again, you can move to a distributed environment to fix the capacity issue, and if you're bound by CPUs, maybe you want to think about GPUs or something. Other questions?

Can I use a SAS data set? Yeah, as long as you push the data out in some format Dask can handle, whether it's a CSV or a SQL table or something along those lines. I don't know if there's a direct connection; there may be.

[Audience question about Docker Swarm.] I'm not up on Docker Swarm, so I don't know that I can answer that. What are you pushing to the different nodes: the model, or the data? The data. Yeah, you could do it that way; that's one approach. I don't know how it performs against Dask, which can do the exact same thing; I haven't run a performance test on that. It's an interesting concept.

[Audience question about one-hot encoding.] Okay, shoot. So you're talking about one-hot encoding: going from, say, a long format to a wide format, using Dask, on disk. I'm trying to think if there's any reason why you wouldn't be able to do that, and I don't think there is; the only thing is it might be a slow process. No, it's a great question; these are all great questions.

[Audience question about how the chunking is set when reading data in.] It kind of depends how you read it in. I used HDF5, which has its own chunking operation under the hood. If you read in a CSV or a Parquet file or something, you can actually set the number of chunks you want, and I think there are some other parameters where you set the size you want to pull in. But yeah, you're starting to stretch into the limits of my knowledge, for sure.

[Audience question about repartitioning.] Yes, you can; there's a repartition function. I believe it's an expensive operation, but it's definitely a capability.

Thanks, everybody.