Hello everyone. Okay, this works. So today I'm going to talk about a new approach we've been developing for handling relatively uncomfortable data sets. I don't really want to use the term big data, because everybody has their own definition of big data: sometimes it's volume, sometimes it's velocity, sometimes it's complex structures where you need to join maybe 100 tables together to make one cohesive unit. I'm going to talk about volume, handling volume on a fairly ordinary machine. If you work at, I don't know, Google or Facebook, you probably have a different idea of what big data is, but I mean data that makes you uncomfortable as a data scientist when you want to work on your local laptop or on your local workstation at your employer. I assume most of the people here are data scientists or engineers, or interact with people like that. So you have a few options here that I've listed. I think everybody can handle a data set that has about a million or so samples: we can use the standard tools out of the box, and we can even be sloppy with our programming and just not care. How about if we move to 10 million samples? I think modern machines can handle this; we have plenty of RAM now, and RAM is getting cheaper every day. How about 100 million samples? This maybe starts to get a little uncomfortable. Maybe we need to be really careful with how we handle the categoricals, maybe we need to not load all the columns, maybe we need to use a binary format. And how about a billion samples? I'm talking about not going to the cloud, just using your normal laptop or workstation, either at home or at work. This becomes a bit of a problem, especially if you have more than two columns, let's say 50, 60, maybe 100 columns, and you want to work, explore, build models, and maybe even deploy these models in production. And how about even larger data sets? So today I'm going to talk about a solution that we are developing and actively using that enables us to work with these kinds of data sets without being uncomfortable. I mean, I know there are tools where you can go to the cloud and just rent a beastly machine or spin up a cluster, but that incurs additional costs and time management, and if you just want to explore your idea and see whether it's worth anything, you don't always want to go to the cloud. And, I came from academia, you cannot always justify to your employer: oh, give me this much time on some Amazon server, maybe this will work, but I'm not sure. So, I'm Jovan. Hi. I used to be an astrophysicist in Groningen, which is a small town in the north of the Netherlands. Now I'm a data scientist at CVL Labs, where I work on DevOps pipelines. When I was an astronomer I met Maarten Breddels; we were postdocs together in the same group, and we were developing this Vaex package because we needed it for our work, as I'll show in a bit. Then we met Jonathan, who is a more experienced data scientist, and he thought that what we were doing was really cool and helped us transition from academia to industry: making our package more standardized, following the pandas and scikit-learn APIs and conventions, to basically enable data scientists to use it, not just a bunch of scientists in some little corner of the world.
And Mario is really using our tools to create cool dashboards without caring about scalability and data set size. So Maarten and I were really lucky when we started working together: we worked on this data set that came from the Gaia satellite that the European Space Agency launched. It was 20 years in the making and we were really, really fortunate to be the first ones to explore it. But the thing is, it was kind of big for a bunch of academics: it had over one billion stars, and when you give stars to astronomers, we want to plot them and see what the sky looks like. We tried to do that, and this is what happened. Yeah, our boss was not very happy, because she had spent 20 years trying to produce this data set and we couldn't even visualize it. So what we thought was: okay, instead of just plotting points that will end up overlapping each other, how about making histograms? And when we started making histograms, we started to see structure, and the more data we added, the more structure we could see. And we figured out that you can actually build histograms really fast, and you don't really need that much memory. So this is the potential of the Vaex library: really fast computations on aggregated data, which now enables us to use data as big as we can fit on our hard drive. So this is where my presentation ends, and I will switch to doing an actual live demo, because I think the best way to showcase something is to show how it works in practice. Is this visible enough? Yes? No? Okay. So we took great inspiration from pandas; pandas really showed us what a DataFrame library should look like, and we tried to follow their example in the API design. As we developed it and needed more things to do our jobs, we started to add machine learning capabilities, and there we followed scikit-learn's conventions, so we're almost fully scikit-learn compatible. So let me try to demo Vaex for you. Vaex is built on a few concepts: memory-mappable storage, an expression system, delayed computations, and efficient algorithms. As I go through this demo I will try to explain these concepts, and later on we will see how they work in practice. For this demo I will use this very large data set from the New York yellow cab taxi company. It basically stores all the trips that all the yellow cabs made in New York City between 2009 and 2015. I mean, you can get the data until, I think, 2018, but something changed in their data system after 2015, so I decided to ignore that portion of the data. But this is big enough. So let me just show you how large this file is: when I merge all this data into one file, it is about 100 gigabytes. And even though I have a very nice MacBook, it's kind of off the shelf, and I definitely don't have 100 gigabytes of RAM to play around with. But I will play around with this file anyway. So let me open this data set. Because this is HDF5, it allows memory mapping, and all that means is that when I open the file, it doesn't really do anything: it just points the operating system to it and says, look, the file is here and this is the structure, there are a bunch of columns and a bunch of rows. And when I want to display it, all it does is give me a preview: by default, the first five and the last five rows of the entire thing. So even though I have over a billion rows, I can see them instantaneously, because I actually do not need to see all of them.
I just want to see a preview. So you can open, in quotes, a 100-gigabyte file, well, instantly. You can also do DataFrame.info, which gives you a little bit more information: you can put in custom metadata, and you get information about the column structure, the types, and again a preview. So Vaex follows the standard DataFrame API, as I said, showcased by pandas. For example, you can look at a specific column in this file, let's say the trip distance. And again, even though I have a billion columns, billion rows, sorry, I see this instantly, because thanks to the memory mapping all I see is a preview, the first and the last five examples. We support all the standard data types. For example, this is a datetime, the pickup datetime of each taxi trip in the DataFrame. But what happens when I do certain operations? These are also instantaneous, and this leads me to expressions. When I do a certain expression, and by expression I mean a really mathematical expression, so here I take the log of something and add the square root of something else, I do this on the fly. What Vaex does is memorize the expression, the mathematical formula that is required to generate the result, and only evaluates it when needed. This is kind of like a computational graph, something neural network libraries have been doing for quite some time. So this is instant, because all I need to do is actually calculate 10 values, not the whole billion. I will only calculate the whole billion when I actually need it, for example when calculating the mean or some other aggregate statistic. And this expression system is really the core of Vaex, where Vaex thrives. So for example, let's put ourselves in the position of a taxi driver or the CEO of a taxi company, and we want to know the total amount divided by the trip distance. This is nearly instantaneous because, again, I'm just calculating 10 values. I can save this just as I would with a normal pandas DataFrame. And here, let's say, fare over distance is what I call a virtual column, because it doesn't store the entire output, it only stores the mathematical expression needed to generate it. So now if I scroll to the far right, I see it here. So what happens if I want to calculate the mean? Well, then I actually do use the full billion-row data set. And because we use fast algorithms and everything is fully parallelized, I can calculate the mean of the distance in, I'd say, three seconds. How about the mean of fare over distance, which is now a virtual expression? Now it does two things at once: first it needs to compute fare over distance, and then from that calculate the mean. But there is a NaN. And because this is fare divided by distance, if I see a NaN there was probably a zero somewhere in the distance. So let's do some filtering. Just as you would do in pandas, let's select all the trip distances that are bigger than zero and compute this. And now everything is done for a billion rows, and we get the result in a super fast four seconds: a billion rows, no memory used whatsoever. We can also do named selections. Named selections are really cool; this is what hooked me on Vaex. You can specify a filter, let's say a trip distance bigger than some value, say bigger than 10, give it a name, and then for every method that Vaex supports you can basically say what the selection is and get the result for that particular filter. So you can imagine the flexibility.
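As a rough illustration of what these steps look like in code, here is a minimal sketch, not the exact notebook: the file name is hypothetical, the column names (`trip_distance`, `total_amount`) follow the taxi data set used in the demo, and the calls shown are the usual Vaex pattern of expressions, virtual columns, filters, and named selections (exact signatures may vary between versions).

```python
import vaex

# Opening a memory-mapped HDF5 file is (nearly) instant: nothing is read yet.
df = vaex.open('yellow_taxi_2009_2015.hdf5')  # hypothetical file name

# An expression / virtual column: only the formula is stored, and only the
# handful of values needed for the preview are ever computed eagerly.
df['fare_over_distance'] = df.total_amount / df.trip_distance

# Aggregations touch the full data set, parallelized and out of core.
print(df.mean(df.trip_distance))

# Filtering, pandas-style; still lazy.
df_clean = df[df.trip_distance > 0]
print(df_clean.mean(df_clean.fare_over_distance))

# A named selection: any aggregation can then be evaluated just for it.
df.select(df.trip_distance > 10, name='long_trips')
print(df.mean(df.trip_distance, selection='long_trips'))
```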
Where Vaex really shines, as I showed you with that sky plot slide at the beginning, is binned statistics. We can count really, really efficiently, as we saw; that's just the number of rows. But we can also count on a grid. Counting on a one-dimensional grid is basically creating a histogram: we can say what the limits are and the shape of the grid, basically the number of bins, and then we can use standard plotting tools, your favorite tool, matplotlib or whatever, to plot it. We can do this in two dimensions as well, a two-dimensional grid, and I'm doing this in real time for the whole billion points. It took about seven seconds, and now I can visualize it, and you can kind of see New York, and this is roughly Manhattan. We also offer wrappers around matplotlib, so you can do this nicely and get a histogram really quickly with all the correct axes, labels and shapes, and also in two dimensions. Now you can see New York much more nicely, and you can see where people took their taxis from: these are all the pickup locations, and you can see all the streets. It's a really nice graphic. You can also pass in the selections that I made earlier. And how long does it take to make two images of a billion points with different filters? Around three seconds. So this really enables interactive filtering and selections with basically a billion points. Okay, so I'm going to stop there; this gives you the overall idea behind Vaex and how one would use it. What I would like to do in the next part is an actual data science example, because maybe this only works really fast in a very curated setting, since I was at home preparing and rehearsing this presentation. But then I thought it would be more convincing and informative if we actually do a quick data science project together, on a big data set, on my local machine, in real time. So now the actual scary demo starts: well, scary for me, fun for you. And this is going to be a disclaimer: this is what I call an honest demo. I'm going to do a data science project with this taxi data set in front of you, and it's going to take as long as it takes, hopefully, oops, hopefully within the time I have allocated. Some steps may take a minute or two, because I'm going to be using the real, full data set in real time. And because I'm partly doing this to show you a nice data science project and partly to show Vaex, I'm not always going to make the best data science decisions in terms of cross-validation, building the right features, and so on. Okay, so let's start. By the way, if you have any questions in the meantime, you can ask at the end, but you can also stop me in the middle. So I'm going to load the same data set again; there are 1.1 billion samples. I can print the head and tail and see what the DataFrame contains; we're going to come back to that later. The DataFrame, as you can see here, is roughly ordered by pickup date. I'm going to try to make a machine learning model out of this, so I'm going to split it by time, and here I've been really lazy: I know the index at which 2015 starts, so I'm going to take that as the test set and use everything before 2015 as the train set. Then I call describe on this data set, and this is the longest part of the presentation: it goes over all the columns, I think there are about 11 or 12, of different data types.
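A rough sketch of this setup step. In the talk the split is done by row index; a datetime filter is shown here instead for clarity, assuming the pickup timestamp column is called `pickup_datetime` as in the preview, and the file name is hypothetical. The exact `describe` output may differ between Vaex versions.

```python
import numpy as np
import vaex

df = vaex.open('yellow_taxi_2009_2015.hdf5')  # hypothetical file name

# Lazy, time-based split: everything before 2015 is train, the rest is test.
cutoff = np.datetime64('2015-01-01')
df_train = df[df.pickup_datetime < cutoff]
df_test = df[df.pickup_datetime >= cutoff]

# describe() touches every row of every column, out of core and in chunks:
# counts, missing values, mean, min, max per column.
print(df_train.describe())
```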
So we have the vendor ID, the pickup and drop-off datetimes, the pickup and drop-off dates, the number of passengers, how the passengers paid, the trip distance, where they were picked up from as longitude and latitude. I'm actually not sure what this rate code is. This store-and-forward flag is, if they paid with a credit card, whether the card was charged right away or whether the taxi had to go to a place with Wi-Fi before the transaction could occur. Then the drop-off longitude and latitude, the fare amount, additional charges, taxes, whether there was any tip, whether they had any tolls to pay, and the total amount paid. So for all of these columns, describe computes the mean, the median, whether there are any missing values, and the min and max; it really gives you a nice overview of the DataFrame. This takes a bit of time because I'm doing it for a billion points, but at the same time I can use this as an opportunity to show you how much RAM I'm actually using. So far I'm only using two gigabytes, even though my full file is over a hundred gigabytes in size. From time to time I will switch over here, and you can see that my CPU is running at full throttle, everything is fully parallelized, and it's going to take a bit of time. Maybe we can use this time if anybody has any questions so far, or is everything clear, or is everybody fully lost? Yes, please, microphone please. Right at the start you said HDF5 allows you to memory map it; you can memory map any file, so is HDF5 specially designed to work well with memory mapping? Well, I don't know if that was the plan initially, but it allows us to do this. I forgot to say, we also support Arrow: the Apache Arrow format also allows this, and we're fully compatible with it. It's just that I had this file from a while ago, so I used HDF5, but you can equally use Apache Arrow, and actually, as Apache Arrow develops, it allows us to do more interesting things, like storing lists and dicts, which are kind of tricky to store in HDF5 in a memory-mappable way. So HDF5 is not the only option, and we would actually like to move more and more to Arrow as it gets better and more stable. Okay, so this is done: we calculated all the statistics for all 11 columns in two and a half minutes, which is kind of impressive given that we have a billion-point DataFrame and everything was done out of core. So now we can do a little bit of data science by exploring what we have here. We don't really know much about this data set, so let's go over it slowly. First, let's see if there are any missing values; this is what pandas defines as a missing value, so either NumPy NaN or None. And we see there is almost nothing: there is one pickup latitude missing out of all of this. The rate code, I really don't know what it is, so I'm going to ignore that column.
For people that pay with credit cards, I guess it's not important whether they were charged right away or a bit later. Some of the drop-off longitudes and latitudes and some of the tolls are missing. Okay, so I'm going to drop the missing values for the pickup latitude and the drop-off longitude and latitude, and because every operation I do is a virtual column, or an expression, nothing gets done right away. As you see, I've dropped the rows immediately, but really I'm just storing the information, storing the command that says drop missing values; it only gets applied when I actually need to do some calculation. For now, it just saves the command. Okay, so now I look at the number of passengers, and I see the minimum number of passengers is zero, so maybe somebody was shipping something rather than people traveling, but the maximum is 255, so that's kind of suspicious. So, if you're a data scientist and you're using pandas all the time, probably your favorite method is value_counts, and I know it's mine, so it's really important to have a fast value_counts. Let's do that. And because Vaex is really aimed at working with really big data, we've implemented this progress bar, so you know more or less how long you're going to wait if you have ridiculously big data on a single computer. So this is what we're doing now. Okay, 20 seconds, and we see that mostly, oh sorry, this is the trip distance, wrong thing, let's do it again. All right, so mostly we had taxi trips with a small number of passengers, two, five, three, four, six, and after six we have many with zero, so maybe taxis were also used to ship things, and after that we have these spurious random numbers, either mistakes or the taxi drivers having some fun. So what I'm going to do is just take the taxi trips that had, well, some passengers, and fewer than seven. Then we want to look at the trip distances. I already calculated this; this is just to show you the performance when we don't have a kind of categorical variable, since the trip distances are just floats, and it also works really fast: it took 20 seconds. And we notice that there are lots of trip distances that are zero, which is kind of weird, so let's make a plot. We want to see the distribution of distances on a log scale, and, well, there are even some negative distances and some very large distances. In fact, if I just take the maximum, it is a ridiculously big number in terms of miles, since this data comes from the US; in fact it's 67 times the distance to the Moon, or half the distance to Mars. So, kind of large, and probably fake. So let's set a limit: let's go from 0 to 20, and this is how the histogram looks now. And here I decided to take all the trips within 10 miles, because that's roughly where the histogram flattens out; I really want to focus on the core of the taxi data set.
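A minimal sketch of this cleaning and exploration step, continuing with the `df` opened earlier. The column names (`passenger_count`, `trip_distance`, `dropoff_longitude`, ...) follow the data set, and `dropmissing`, `value_counts` and the 1-D histogram wrapper are the Vaex calls referred to above; exact argument names may differ between versions.

```python
# Lazily drop rows with missing coordinates; nothing is materialized yet.
df = df.dropmissing(column_names=['pickup_latitude',
                                  'dropoff_longitude',
                                  'dropoff_latitude'])

# Fast, out-of-core value counts on a billion rows.
print(df.passenger_count.value_counts())

# Keep plausible trips: at least one passenger and fewer than seven.
df = df[(df.passenger_count > 0) & (df.passenger_count < 7)]

# 1-D binned histogram of the trip distance within chosen limits.
df.plot1d(df.trip_distance, limits=[0, 20], shape=100)

# Focus on the core of the data: positive distances under ten miles.
df = df[(df.trip_distance > 0) & (df.trip_distance < 10)]
```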
So what is the extent of New York? These are the New York City cabs, but do they really cover the whole of New York? I want to see, so I'm going to plot the pickup locations of the taxi trips, but I really want to be able to play around, so we have this widget, with the help of ipywidgets and Material Design. This should be New York, and there is nothing much to be seen, because we're really affected by some outliers. So I can interactively zoom, oops, I don't have a mouse, so I have to do this a bunch of times, and this is not pre-computed: as I'm zooming, the grid, the 2D histogram effectively, is being computed on the fly, and you can see a nice progress bar. So I'm zooming, zooming, zooming, and you can start to see New York, and there it is. If you're familiar with New York, and I personally have not been, this is Manhattan and this is the rest of New York, so the city cabs really cover mostly Manhattan rather than the whole of New York, but that's fine. Now we can zoom in even more if we want to explore these hotspots, and if you look at Google Maps or some other map provider you can see these hotspots are actually either major hotels, or train stations, or intersections between buses and subways, and so on. So this really allows you to interactively look at plots, even for data this big, without using much memory. Okay, so I played around with this and decided these will be the edges of my bounding box for New York, and I'm going to make a filter, just like I normally would. Now I'm going to create some features. I'm going to look at the mean speed, which is the trip distance divided by the time it took, in hours, kind of the natural unit. Then I'm going to look at the trip duration in minutes, which is just the difference between the drop-off and pickup times, in minutes. And I'm going to look at that metric we looked at before, the fare divided by distance, because sometimes you might think, more money is great, yeah, but maybe you need to travel somewhere very far and then you have expenses, you lose time, you lose petrol, and so on, so fare divided by distance seems like a good metric for profit. Okay, so now let's look at the histogram of trip durations, also on a log scale; I'm computing on the fly the expression I've just defined above and binning it. Let's check. Oh, okay, it's done. Wow, there are some really, really long trips that make no sense. I'm going to make another plot now, zooming in between zero and 100 minutes, and this is how it looks. There is not really a clear cut-off here where I can say, okay, from here on the spurious results start, but it's New York, and I highly doubt somebody would sit a thousand minutes in a taxi. I honestly don't know how patient New Yorkers are, but let's be generous and say between zero and two hours: if somebody really wants to spend more than two hours in a taxi in New York, then yeah, my model will probably not take that into account. So I'll do that filtering as well. Now, let's say I'm the CEO of a taxi company and I want to see when the most profitable times are. I'm going to just execute this so we don't wait while I explain: I'm going to extract the hour of pickup from the datetime, the same as you would do in pandas; I'm going to extract the day of the week, and the month, where I subtract one from the default so I get an ordinal encoding in which January is not one but zero. Then I will make a simple feature just to check whether a taxi trip occurred on a weekend or not: basically, if the pickup day was Saturday or Sunday, make it a weekend, otherwise zero. Then I'm using this categorize method: basically, even though I have integers, 0 to 6 for the days and 0 to 11 for the months, I tell Vaex these are categories, so it can bin directly on the integers rather than trying to find splits and so on. And then I can quickly plot, because Vaex knows these are now categories, a map of pickup hour versus day of week. We can see that in the early hours of the working days there are not that many trips, while toward the evening there are more and more taxi trips.
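A sketch of this feature-engineering step, under the same assumptions about column names; the `.dt` datetime accessor and `categorize` follow Vaex's pandas-like API described above, though exact signatures differ a bit between versions.

```python
import numpy as np

# Virtual columns: only the expressions are stored, nothing is materialized.
df['trip_duration_min'] = (df.dropoff_datetime - df.pickup_datetime) / np.timedelta64(1, 'm')
df['trip_speed_mph'] = df.trip_distance / (df.trip_duration_min / 60.0)
df['fare_over_distance'] = df.total_amount / df.trip_distance

# Datetime features, extracted with the pandas-like .dt accessor.
df['pickup_hour'] = df.pickup_datetime.dt.hour
df['pickup_day'] = df.pickup_datetime.dt.dayofweek      # 0 = Monday ... 6 = Sunday
df['pickup_month'] = df.pickup_datetime.dt.month - 1    # 0 = January ... 11 = December
df['is_weekend'] = (df.pickup_day >= 5).astype('int')

# Mark the integer features as categorical so Vaex bins directly on them
# (the exact categorize signature varies between versions).
df.categorize(df.pickup_hour)
df.categorize(df.pickup_day)

# 2-D grid of counts: pickup hour versus day of week.
df.plot(df.pickup_hour, df.pickup_day, shape=(24, 7))
```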
On Saturday and Sunday, well, this is actually Friday and Saturday night, because these bins are around midnight, so Friday night and Saturday night is when we have the most taxi trips. We can do the same per month, and for some reason Saturdays in March are really popular for taxis, and in October too, which I have not figured out why, and in July and August there are not that many taxi trips; maybe people are on holiday, although I thought there would be tourists. Anyway, you can do these kinds of grids really quickly: it took about 30 seconds to bin this, in real time, basically. Okay, so now I want to do some more drill-down work. I want to do a group-by, a standard thing in pandas, by hour, and see: maybe I'm a part-time taxi driver and I want to see when it's most worth it for me to do this work. At which hours of the day can I get the most tips, when do I maximize this fare-over-distance metric, when are the shortest or longest trips, when can I drive the fastest or slowest, some ideas like this. So this is the same way you would do a group-by in pandas, and this here is just some default plotting code. We're able to do this in just over a minute for a billion points. We can see the mean tip amount: people in the morning tend to tip a bit more, as do people who travel just after midnight. Fare by distance is probably affected by traffic jams around three or four, when people leave work, which kind of makes sense. Mean trip duration is kind of the inverse of the trip speed, which makes sense: if you drive fast, you get there faster. And in the early hours of the day, four or five, the taxis drive fastest because there is no traffic, while in the peak hours, the traffic-jam hours of the day, they earn more money but they drive slower. We can do the same per day of the week, and we see the speed is highest on Sunday, I guess because there is less traffic, and if you're a taxi driver it's most worth it for you to work on Thursday. However, I've not computed the error or the standard deviation, so I don't know how significant this is, but keep in mind that it is computed for more than a billion points, so maybe it averages out. Density maps, we already saw we can do these. I'm just going to plot another density map; it takes a bit longer now because I've actually generated a lot of features and applied a lot of filtering, so we can see how the pickup density of New York looks after all of this. It looks like this. I'm going to skip this one. But as taxi drivers, maybe we want to know where it's most worth it to pick people up. So what I'm doing here is again binning the pickup longitude and latitude, but instead of simply displaying the counts, I'm displaying the mean of fare by distance, because where this value is largest, it's most worth it for me. So now I really see the streets where I get the maximum fare divided by distance: if I try to pick up passengers along these routes, and this is one of the airports, this is another airport, and this is kind of the main vein of the city, I get the most profit. So this was kind of the exploration phase. Now let's do a little bit of machine learning, it's fun. We have a module in Vaex called vaex.ml where we provide machine learning APIs that are fully out of core and can help us pre-process data very quickly. I've taken some inspiration from Kaggle: there was a similar competition to predict taxi trip durations, or taxi tip amounts, I can't remember.
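A sketch of the two aggregation patterns used here: a group-by over an integer feature, and a 2-D grid where the statistic shown per pixel is a mean rather than a count. `vaex.agg` and the `what=` style of the plotting wrapper are the usual Vaex idioms, but treat the argument names as approximate.

```python
import vaex

# Group-by over the pickup hour, aggregating several statistics out of core.
per_hour = df.groupby('pickup_hour', agg={
    'mean_tip': vaex.agg.mean('tip_amount'),
    'mean_fare_over_distance': vaex.agg.mean('fare_over_distance'),
    'mean_duration': vaex.agg.mean('trip_duration_min'),
})
print(per_hour)

# A 2-D map where each pixel shows the mean fare-per-mile instead of a count.
df.plot(df.pickup_longitude, df.pickup_latitude,
        what='mean(fare_over_distance)',
        shape=512, limits='98%')
```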
The idea is that I'm going to make a custom function that calculates the arc distance between two points, and to test it out we'll calculate the distance between Basel and Utrecht, where I work. This is in miles, because it's an American data set, and it's about right, as you will see if you check Google, but you have to be careful: it's not the road distance, it's the actual arc distance. Now I can add this as a virtual column in Vaex, and if I scroll to the far right, this is a complicated expression: it has sines, cosines, a bunch of arithmetic, so it can take a while when we actually compute it. So Vaex also supports JIT-ting, which stands for just-in-time compiling, and we can use Numba: basically this expression is translated into low-level compiled code, so when you execute it you get almost two times faster evaluation, because all these low-level computations are done in compiled code. We actually support CUDA as well: if instead of Numba you use CUDA, and you have an Nvidia GPU, you can use the GPU for the parallelized low-level computations too. I cannot demo that because I have a Mac, unfortunately. So that's one virtual column I'm adding; I'm going to add another one, just to calculate the angle, another complicated expression that can also be JIT-ted to speed it up. Now I can display the head of the DataFrame to make sure I have all these new columns I've created, and I can check how much memory I'm using: 3 gigabytes so far, nothing much, any standard laptop can handle it. Now I want to do PCA, and the reason, I'm going to execute this as well, is that if you look at the streets, you can see that New York, like most American cities, is kind of built on a grid. This is a tabular data set, and we learned from the scikit-learn talk earlier that boosted trees really work well with tabular data sets, so we kind of know that the first thing we're going to try is some sort of tree model. And this is a grid, so if the grid were aligned like a chessboard, it would really help the tree-splitting algorithm decide on the best splits faster. PCA does this: it transforms the data, trying to align it along the principal axes, and we're hoping that because the map of New York is kind of like a grid, it will align the city onto a more regular grid. Everything here is done out of core; it took about 30 seconds to compute and apply the PCA for the pickup location and the drop-off location. And now I'm just spending a little bit of time generating plots: I'm actually generating four plots, all with a billion points, in real time, so it might take a little while, but I think it will really illustrate the idea behind doing this.
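A sketch of this step. The haversine-style arc distance is written directly as a lazy expression on the columns, and `jit_numba()` plus the `vaex.ml` PCA transformer are the mechanisms described above; treat the names and signatures as approximate, since they have shifted between Vaex versions.

```python
import numpy as np
import vaex.ml

# Great-circle (arc) distance in miles, built as a lazy expression on the columns.
def arc_distance(lon1, lat1, lon2, lat2):
    k = np.pi / 180.0                      # degrees to radians
    dlat = (lat2 - lat1) * k
    dlon = (lon2 - lon1) * k
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1 * k) * np.cos(lat2 * k) * np.sin(dlon / 2) ** 2
    return 2 * 3959.0 * np.arcsin(np.sqrt(a))   # Earth radius of ~3959 miles

df['arc_distance'] = arc_distance(df.pickup_longitude, df.pickup_latitude,
                                  df.dropoff_longitude, df.dropoff_latitude)

# Optionally JIT the heavy expression with Numba (or CUDA on an Nvidia GPU).
df['arc_distance_jit'] = df.arc_distance.jit_numba()

# Out-of-core PCA on the pickup coordinates, to align the street grid with the axes.
pca = vaex.ml.PCA(features=['pickup_longitude', 'pickup_latitude'], n_components=2)
df = pca.fit_transform(df)
```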
Any questions at this point? Yes, please. So I have a question: maybe with the PCA you would get the same result on a much smaller sub-sample; do you have an efficient way to extract, or run the algorithm on, a sub-sample of the data? Yeah, so you can just slice your DataFrame as you normally would, instead of sampling uniformly at random, although maybe the data has some time dependency. Oh, you mean random sub-samples? We don't support random sub-sampling at the moment, because it means generating an index in memory that you need to track, and at the moment we really try to avoid doing that. So we don't have a built-in method, but I can add a description of how you would do it; you would just need to pay the price of having one in-memory index column. Okay, so this is how the PCA result looks, and you see the city is now transformed: the grid is more aligned along the principal axes. If I were doing this for, let's say, optimal performance, I would probably isolate Manhattan and do the PCA just on that, so I could get those streets straightened out even more. Then, for some reason, I want to use the payment type as a feature, and I see that it's basically text, and sometimes it's all capital letters and sometimes it's a mix of capital and lowercase letters. So we're going to do value_counts on this column, but first I'm going to lowercase the strings. Vaex also supports string operations in a very fast way, because we use C bindings to do all these operations fully in parallel and extremely fast. So I'm just trying to get all the possible values for the payment type, and from the documentation website of the taxi data set I saw the meanings behind these categories that they had already defined: there is cash, card, I guess this is cash as well, credit, credit, unknown, and so on. And because there is a finite number of these values in the data set and very few documented types, I can just make a map, just like you would in pandas, and apply a map transformation really quickly. This uses a hash table in the background, so the mapping goes extremely fast, and it's also fully parallelized and fully out of core. So now if I scroll to the right, I see the PCA columns and this transformed payment type. And once again, I'm not really using much memory; of these two monitors, this one is actually from the previous notebook that I showed, and this is the current one.
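A sketch of this string-cleaning and mapping step. The `str` accessor and `map` follow the pandas-like API mentioned above; the mapping dictionary is purely illustrative, not the real category table from the taxi documentation.

```python
# Fast, parallel string lowering; still a lazy expression.
df['payment_type_clean'] = df.payment_type.str.lower()

# Inspect the distinct values, out of core.
print(df.payment_type_clean.value_counts())

# Map the raw strings onto a small set of documented categories
# (illustrative mapping; the real table comes from the dataset documentation).
payment_map = {
    'cas': 'cash', 'csh': 'cash', 'cash': 'cash',
    'crd': 'card', 'cre': 'card', 'credit': 'card',
    'unk': 'unknown', 'noc': 'unknown',
}
df['payment_type_mapped'] = df.payment_type_clean.map(payment_map)
```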
Just because the question fits here: how do you restrict memory usage, and can you still run out of memory? Well, I'm not really using a lot of memory, because everything is done in chunks: all these operations support chunking, so I don't have to hold everything in memory. I hold as much as the RAM I have, calculate the statistics that I need, select the next chunk, update the statistics, and so on. So if I had eight gigabytes of RAM it would actually make no difference; most of these things are CPU-limited, or limited by the SSD read speed. Okay, so now I want to use some estimator. I'm going to use LightGBM; in Vaex we provide bindings to various libraries. This message is just because I've not installed it properly or something; it's not a fork, it's just a binding to LightGBM, and if you install it using conda or pip or from source it will just work. So I'm using some default parameters and instantiating the wrapper; it supports all the default parameters that you would normally use, or, if you prefer the scikit-learn API, you don't have to use the wrapper, you can just use the scikit-learn interface. And because LightGBM is still not out of core, I'm just going to take the first million samples, so I can train this in real time and show it to you. This is me training it; it takes a second or two, it's really fast for a million samples, and then I'm going to do a prediction. The booster has a .predict method, as standard, and you get an in-memory output of your predictions; here I'm trying to predict the duration of the taxi trips. But there is also transform, which scikit-learn usually uses for data transformations: with Vaex, what it means is that it adds a virtual column to the end of the DataFrame, so you don't actually have to have the predictions in memory, you only have them when you need them. And this is really cool, because if you're a Kaggler, or if you're suspicious, maybe you want to add additional models on top of this, and it's quite simple, because all we're storing is basically the model that generates the prediction, and we can execute it at any time. So we can, let's say, estimate the error; well, these numbers are more or less meaningless because we've not done cross-validation, but we can easily add a second estimator. Let's say we want to try a linear model, just for fun. For a linear model we cannot use plain label encoding; trees work really well with label encoding and integer values, but those have an ordinality in them, so let's one-hot encode all of this, the payment type, the hour, the day, the month, very quickly. We can also standard-scale, let's say, the distance, the direction, the trip duration, and so on. So we support a number of transformers, not the full scikit-learn list, because they have a big list and they've done a great job, but we support a subset of that which is fully out of core, so it doesn't matter how many samples you have. Oh, and you can see here, well, this is the computer crunching numbers, and this is the memory I'm using: only three gigabytes for the entire data set. Once this is done I will print the head of the DataFrame, and you can see all the columns that were added. Basically, when I do the fitting, all it does is memorize the command and the meta-parameters, like the mean, the median, the number of unique categories you have, and only when you do transform does the actual transformation happen, in chunks; that's how you get around the memory problem. As I said, this is an honest demo, so sometimes we have to wait, but you get a really good sense of the performance when you work with a billion-row data set, and of how long typical operations will take.
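A sketch of this modelling step. The wrapper class and argument names (`LightGBMModel`, `Predictor`, the vaex.ml `OneHotEncoder` and `StandardScaler`) are hedged approximations of the vaex.ml bindings described here, their signatures vary between versions, and the feature list is illustrative.

```python
import vaex.ml
from vaex.ml.lightgbm import LightGBMModel   # name and signature may differ by version
from vaex.ml.sklearn import Predictor        # generic scikit-learn wrapper
from sklearn.linear_model import LinearRegression

features = ['trip_distance', 'arc_distance', 'pickup_hour', 'pickup_day']  # illustrative

# Gradient-boosted trees, trained on only the first million rows (LightGBM is in-memory).
booster = LightGBMModel(features=features, target='trip_duration_min',
                        params={'objective': 'regression'}, num_boost_round=100)
booster.fit(df_train[:1_000_000])
df_train = booster.transform(df_train)   # adds the prediction as a virtual column

# Out-of-core preprocessing for a linear model.
onehot = vaex.ml.OneHotEncoder(features=['payment_type_mapped', 'pickup_hour'])
scaler = vaex.ml.StandardScaler(features=['trip_distance', 'arc_distance'])
df_train = scaler.fit_transform(onehot.fit_transform(df_train))

# Any scikit-learn estimator can be wrapped: it trains in memory on a subset,
# and transform then adds its prediction lazily to the full DataFrame.
linear = Predictor(model=LinearRegression(), features=features,
                   target='trip_duration_min', prediction_name='pred_linear')
linear.fit(df_train[:1_000_000])
df_train = linear.transform(df_train)
```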
And here it is, the computations are finished. If I scroll to the right, because I'm keeping everything in the same DataFrame, I have nice labels for all the transformations I did, so you always know what is what, and here are the standard-scaled columns. We also support, not the transformers, because they are quite complicated, but a wrapper for all scikit-learn estimators. So you can just install scikit-learn through conda or pip or your favorite way of installing things, and use any scikit-learn estimator with Vaex. They will not be out of core yet, but you can use them once you have done your aggregation. So here I'm importing the linear model, I'm going to select the features I want, and again I'm training only on one million samples. But when I do the transformation, I'm actually applying this trained model to my full one-billion-row DataFrame, and when I scroll far to the right I have the latest linear prediction, but I also have the other prediction from LightGBM, without wasting any memory; well, lots of columns, right here. What this allows me to do is quick ensembles: because all of these are virtual columns, I can just take a simple mean of the predictions, like this. So this is the final part of the demo. If you were bored until now, yeah, lots of DataFrame expressions and so on, here is the point: if I want to put this in production, normally I would have to go back, leave the notebook that we all know and love, and create real production-ready code with pipelines and transformers and all the rest, right? But because what we're basically doing is creating a computational graph, every operation we've saved so far is written down as an expression, a mathematical formula, and that's what we call a state. So we can simply save the state into a JSON file, like this. Now, this is only going to take a second, I'm going to open a new notebook, a new kernel, fully independent of the previous one. I'm going to import vaex, just vaex, and open just the test set, and you can see here that nothing is applied: it's just the raw data set that we started with, the test set. And all I need to do, this is just LightGBM loading, is load the state, and all the operations we've done so far are transferred onto the test set. So now everything is computed for the first time on the test set. I'm printing the first 21 elements, and we see all the transformations, the PCA, the one-hot encoding, all the way to the final ensemble prediction. We can also serve this with a REST API. So this is another way of creating a pipeline without explicitly creating a pipeline; in principle you don't even have to leave the Jupyter notebook, where you can just experiment, get immediate feedback, and try out more models and parameters. I hope you enjoyed it, thank you very much. Thank you; now a few questions and then we'll move on to the next talk. Oh, by the way, this is all open source, for research, personal and commercial use. We also run a consultancy, but the software itself is fully open source and you can use it and experiment with it.
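A sketch of this deployment pattern. `state_write` and `state_load` are the state mechanism described here, using the names from recent Vaex releases, so treat them as approximate for the exact version used in the talk; the test file name is hypothetical.

```python
# In the training notebook: every virtual column, filter, and fitted
# transformation is part of the DataFrame's state. Save it to JSON.
df_train.state_write('taxi_model_state.json')

# In a completely fresh kernel, process, or service:
import vaex

df_test = vaex.open('yellow_taxi_2015_test.hdf5')   # hypothetical test file
df_test.state_load('taxi_model_state.json')         # replays all transformations lazily

# The test set now has the engineered features and model predictions
# as virtual columns, evaluated on demand.
print(df_test.head(10))
```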
Okay, great talk, thank you. I have, actually, just one question, about this last piece with the pipelines. You say you've stored all the computational steps, and one of those steps was, for example, to fit and transform with a PCA. So if I apply this same state to a new data set, am I retraining a new PCA? No, because you basically do fit as you would with a scikit-learn fit: the eigenvalues, the eigenvectors, all those things are memorized and stored, and you're just applying the transformation to a new data set, as if you were using scikit-learn. So does the export then contain the scikit-learn pipeline? Yes, the export is just a JSON file that contains the order of what you need to do, but also these meta-values: the means, the eigenvectors, and, for example when we did the encoding, the unique categories, everything you need to transform the data set. That's great, thank you. How does it compare to Dask? Do you mean Dask or Dask DataFrame? Where are you? Yeah, here, just in front. So, do you mean Dask or Dask DataFrame? Both, I mean, how does it compare to Dask, because it can also do this. Well, Dask is like NumPy but distributed, so maybe Vaex will use Dask one day to become distributed; then they're not actually competing. So far we're using NumPy for most of the computations, but we have plans to maybe support Dask, so you can run this on a cluster if you really want to go to hundreds of terabytes of data. Dask DataFrame is really nice, but it's a pain to install and it's a bit, well, I found it hard to manage and benchmark. From my tests on this data set, and on another big data set from airline companies, I think we were a bit faster, without any clusters, I mean, on a single machine. Against a cluster you can't compare, because on a single machine you just wouldn't be able to run all the stuff that I ran, and with a cluster we're just talking about different things, right? Yeah, thank you. Hey, nice talk, thanks. Do you do some sort of predicate pushdown on Vaex operations? Sorry? You have a lazy graph, right, for execution; so do you do some predicate pushdown or query optimization at some point? Yes. All the operations, because I'm interactively going along and deciding I want to calculate this, I want to calculate that: there is an algorithm in the background, which I did not mention because I only have so much time, that evaluates the computational graph and tries to go over the data in as few passes as possible. So we're basically trying to optimize everything; sometimes it's not possible, depending on the steps you want to take, but for simpler computational graphs we try to do it in one pass over the data, and if that's not possible, with as few passes over the data as possible. You can even visualize the computational graph, although this is still experimental and I didn't show it, and you can see what is dependent on what, like a standard computational graph in a neural network, let's say. Okay, maybe I missed it, but do you also have joins? Yes, we have basic joins, kind of a SQL-like join. Proper joins are currently in a PR, where we're trying to use hash tables to make them really fast and also out of core. It's coming over the next few months, it's planned, and you can check out the branch and play around with it if you want. Cool, very nice, thanks. Excuse me, is the next speaker here? You can also catch me later, I'm here today and tomorrow. We are looking for the next speaker, oh, okay, in that case, if he's around here, please come up and set up your computer. Okay, I guess we have time for one more question. Well, my question goes in the same direction as the previous one: do you support indexes on the data side to speed up, for example, filtering?
So, indexes: it sort of depends on the type of filtering. The way that pandas does it, with an explicit index, we don't support yet, at least partly because we didn't want to add another column into memory; we want to keep this as memory-efficient as possible. That comes with some drawbacks, because you cannot use indexes, but really, if you have big data sets of the size I just showed, in our experience an explicit index tends to be more detrimental than helpful, because we're really focusing on aggregations, group-bys, value counts, histograms, and so on. But we are looking into having very fast ways to just find the data you're looking for without an explicit index. There are lots of methods that use hash maps in the background; pandas does this as well for value counts, and we've taken inspiration from there. So we're trying to follow their lead, but not with an explicit index, just to avoid the cost: if you have a billion rows, an index adds eight gigabytes of RAM just to hold it in memory, and we really, really don't want to do that unless absolutely necessary. But you can add one yourself if you want, if you have the RAM to pay for it. We have to go to the next speaker, yes. Thank you very much. Thank you very much for your talk.