Hello, good afternoon, and welcome to my talk: "Scale your data, not your process. Welcome to the Blaze ecosystem."

A brief introduction: I'm a data scientist at Continuum Analytics. We have a booth out there, so any time during the week you want to come talk to me, I'll be there most of the time. I'm from Barcelona, but I'm currently living in Austin, Texas. So if you're from Spain, you can talk to me in Spanish or Catalan; if you're not, you can also talk to me in English or German. This is my website, where I have a couple of the talks I've given at other Python conferences, so you can check those out too.

Just a brief thing about Continuum Analytics: Guido mentioned the company in his keynote today. We offer a free Python distribution called Anaconda. It's very popular in the SciPy community because, for libraries that have C and Fortran bindings, it makes them very easy to install. We're very integrated in the open source community: we sponsor several projects, like conda, Blaze, Dask, Bokeh and Numba, and we're a proud sponsor of a lot of Python conferences: EuroPython, PyCon, SciPy, PyData. We're also hiring, so we're going to be at the hiring event tomorrow; if anyone is interested, come talk to us at the booth. That's our website.

A little bit about this talk. I'm going to organize it in three different areas. First, a little bit about what data science is and what the stack I'm presenting today brings to the data science community. Then a little bit about what I call the data science triangle, and we'll hear more about that later. And then, inside the Blaze ecosystem there are many projects, and I'm going to be mainly talking about four: Blaze, Datashape, Odo and Dask, and how each one relates to the others.

You can follow the slides online if for some reason you're not able to see them from back there, and there's also a GitHub repository with the guide and README for reproducing the examples I'm going to show in my slides, so you can try that as well.

So first, the five areas of data science. Many people have their own definition of what data science means. For me, data science is more than just machine learning and stats. Actually, data science is just a rebranding of five fields coming together to solve data problems. A lot of people in the scientific computing community have already been solving large-scale analytics problems; scientists deal with large amounts of data.
So they have already worked with it. Then there's the group of machine learning and stats people; the analytics community, with the databases and queries; the web, which is where we find a lot of the data nowadays; and then there's the distributed systems community, with all the Hadoop and Spark projects that are trying to scale those problems too.

If we try to find what personas are working in each of the different fields, there are some familiar terms. The people we call data scientists are the machine learning and stats people who are mostly concerned with modeling. In analytics we have the business analysts, on the web the web developers, in distributed systems a lot of architects and data engineers, and in scientific computing all the research and computational scientists.

If we have to find one word for what each of these personas cares about, here is a proposal of words that identify them. Machine learning and stats people care about models: finding the right models to solve their problems. People in the analytics community are mainly concerned with reporting: building reports and metrics. In web development it's about building an application; in relation to data science, an application that portrays your data problem accurately. In distributed systems you're concerned with your pipeline, the architecture you're building. And in the scientific computing world, with the algorithms.

If we use more than one word, what's the vocabulary of the people in those areas? We see data scientists use words like models, supervised, unsupervised, clustering, dimensionality reduction, cross validation. In the analytics world, people are concerned with joining, with databases, with finding, filtering, getting summary statistics. On the web we have scraping and crawling to gather data and information, and things like interactive data visualizations. In distributed systems we have the whole Hadoop and Spark ecosystem, working with clusters, stream processing, etc. In scientific computing, people are concerned with GPUs, with graphs, with algorithms, with computational power.

What are the tools that each of these personas and fields are working with? In machine learning we find Theano and scikit-learn. In the analytics community, all the databases and SQL, and a community of people working with Excel too. In the web community we have all the web frameworks; we have Bokeh for interactive visualizations, we have scrapers, and we have a way to share our code with Jupyter notebooks. In distributed systems, as I mentioned, we have Spark, we have Hadoop, we have Luigi, we have Kafka, and all the tools being built around them. In scientific computing we have the core of a lot of the libraries that are used by the machine learning and stats people: libraries like NumPy, SciPy, xray, PyTables, Cython, Numba, etc.

So this is to provide a general picture of what the status of the data science ecosystem is right now. If we take a look at those tools, what are the three edges that they're bringing together? There are three things: there's data, there's the computational engine behind it, and then there's the expression. Expression is how you ask for what you want. Data is all about metadata, the information on that data, and how you store it: containers, meaning how the data is stored either in memory or on disk. Then we have the engine, the computational power: what gets executed?
And then we have expression, meaning the API, the syntax, the language: how rich is it, to let you express what you want to compute?

So what are we looking for in each of these edges? In metadata, we're looking for semantics. In storage and containers, compression and accessibility to your data. In the engine we're looking for performance, being able to compute as fast as possible. And in expressions we want simplicity: we want to be able to express what we want to do in a language that's very close to our human language.

Just to give an example of what all those things mean: in terms of containers, we have different file formats, right? HDF5, NetCDF, JSON, CSV, SQL databases. But we also have in-memory containers like pandas DataFrames or NumPy arrays. In terms of semantics, we have types, we have fields, we have names, we have descriptions of your data, we have relationships between the fields of your data. In terms of computation, we have different engines that perform those computations, like Spark, like Cython, Fortran, Python itself, or the libraries built on top of them. And in terms of the API, the syntax, the language, I'm talking about things like the NumPy API, the pandas API, the bindings we have to other libraries that allow us to express what we want in an easy way. We also have the many SQL dialects, etc.

So at the core, all those libraries (NumPy, pandas, databases, Spark) somehow have each of those three edges in this triangle. Let's take a simple example: imagine NumPy. In NumPy we have dtypes, right, which allow us to express the types of the fields in our data. We also have a way to contain the data, with the ndarrays. But NumPy itself needs to compute things, needs to compute what the user is asking for, and it has, you know, bindings to C and Fortran, and also Python. And in terms of the API, we have the NumPy API, right? That's how you express what you want; how you express the fact that you want to create a NumPy array.

In all of these systems there are happy and, you know, sad faces. NumPy and pandas are mainly limited by the memory of your laptop or your device, but scientists like to express their problems with arrays; it's an API that has attracted a lot of attention in the scientific community. Data scientists and analysts also really like the pandas API, the fact that they can deal with data in a tabular manner with DataFrames. In the database world, well, we have a lot of SQL dialects, and there's a lot of overhead to set things up. And in the Spark world: yes, it has expanded the Hadoop ecosystem to more of the data scientists, to people further away from the engineering side, but it hasn't quite bridged the gap to help you in all the cases of your data; on smaller data sets you still have a lot of overhead.

So let's take a look at what the Blaze ecosystem brings to this picture I've just described. Blaze started out as a question: how do we expand NumPy and pandas to out-of-core computing, to not be limited by the RAM that your laptop has? And from there, several spin-off projects have come along the way with things that we've learned. First, we needed to expand some of the limitations NumPy and pandas have in expressing the metadata that's in them, and that's where Datashape came into play.
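To make that metadata edge concrete, here's a minimal sketch, assuming NumPy and the datashape package are installed; the array contents are made up:

```python
import numpy as np
import datashape  # the Blaze ecosystem's data description library

# A NumPy structured dtype: field names and types, fixed length
arr = np.array([(1, 5.1), (2, 4.9)],
               dtype=[('id', 'i4'), ('sepal_length', 'f8')])
print(arr.dtype)  # [('id', '<i4'), ('sepal_length', '<f8')]

# Datashape can discover a more general description of the same data
print(datashape.discover(arr))  # e.g. 2 * {id: int32, sepal_length: float64}
```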
Datashape is a data description language that's more general than what NumPy and pandas implement. Then we had DyND, which is a dynamic multidimensional array: a library written in C++ with Python bindings. We also found there was a lot of need to move data around; data scientists were working with different file formats, and there wasn't always an easy way to move from one format to another, from one place to another. That's where Odo, a spin-off project that came out of the Blaze repository, comes in. We then have Numba, which is a code-optimizing just-in-time compiler. Inside Blaze we have what we call Blaze the project, which has been kind of the core: an interface to query data on different backends. We have Dask, which allows us to do easy parallel computing. And there's Castra, which is a partitioned column store, and bcolz, also a column store, with a query language that allows us to get data out of it.

So if we place all these projects in the table we had before, this is roughly where each one fits. Datashape is the metadata, a way of expressing your data in different formats. We have DyND, which stores data in a multidimensional array. We have Odo, which allows us to switch from one container to another: from NumPy arrays to pandas to a lot of the backends that Blaze uses. We have Numba, which allows us to optimize the code, Dask for parallel computing, and Blaze being this common interface that allows us to query everything in a unified manner, without having to learn each of the different APIs.

If we put those packages in our triangle, we find Datashape in the metadata section. We have Castra in storage, as one of the containers, and Odo, which allows us to switch from one to another. On the engine side, the computational side, we have the power of parallelizing with Dask and of optimizing the code with Numba, and we express everything with Blaze. And then DyND and bcolz are in part data containers and also have computational power to resolve whatever you need to compute.

If we now place those projects per persona, I'm kind of giving you the overall picture of where they all fit. A lot of the analytics people are, you know, interested in tabular data formats, like pandas DataFrames, so the dask DataFrame is there for them, with Blaze as a unified query interface. We have Odo, which can be used by pretty much everyone.
It's just like a utility function to move data around. We have big support in the scientific computing world, and we're solving a lot of the underlying problems that are then used by many of the libraries in the Python ecosystem for machine learning and stats; we're kind of solving the underlying problems for them. We're also engaging with the distributed systems world with what's called dask.distributed. And then we're also going to see Blaze Server, which allows us to serve all this data in different formats through a unified API. So the idea of Blaze was kind of to be the connector to all these different fields in the data science community, bringing everything together in a unified manner.

If we now remove the rest and just focus on what we're going to talk about in this talk, we're mainly going to be talking about four of our projects: Odo, Blaze, Dask and Datashape. And here's just where each of them sits.

The first one is going to be Blaze. Blaze is just an interface to query data on different storage systems. From Blaze you import `data`, and you load data the same way whether it's CSV, SQL databases, MongoDB, JSON, S3, Hive, whatever there is: you just call `data` and pass it a URI. And then you can run all these queries against all those different backends: select columns, filter, operate, reduce, do split-apply-combine operations like group by, add new columns, relabel columns, do text matching.

One of the features we've just added to Blaze is Blaze Server. It's a way of building a uniform interface that allows you to host data in all of these backends through a JSON web API that is the same for all those databases. You write your YAML files specifying all the data that you want to serve and where it's located; you can also pass what we're going to see next, a datashape. You spin up the Blaze server with all of them there, and you have an endpoint that allows you to perform all the computations we've just mentioned through the API.

This looks something like this: we have the data available through the API, and then we can query it. We can get the fields, we can get all the different data sets, and inside each of the data sets we can compute the same queries that I've just mentioned. As you see, you can just use things like curl: we have an expressive language, we compute something, and it returns the result. There's also the option to query the Blaze server from Python with something like requests.
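As a minimal sketch of those queries, assuming the iris CSV from the talk's example repository and a Blaze of that era (the entry point is `data` in later releases, `Data` in older ones):

```python
from blaze import data, by

# Load a CSV by URI; the same call works for SQL, MongoDB, S3, etc.
d = data('iris.csv')

d.species.distinct()            # select a column and deduplicate
d[d.sepal_length > 7.0]         # filter rows
by(d.species,                   # split-apply-combine (group by)
   avg=d.petal_length.mean())
```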
But before we look at that, let me explain Datashape first. Datashape is just a way to describe structured data; there's the URL to the docs. It's basically built out of what are called unit types, and a unit type is just a dimension and a dtype; that's what forms a datashape. We can also combine those into what's called an ordered, structured dtype, which is a record. And it's a very extensible language: we can use it, and Blaze actually uses it, to express tabular data formats, but you can also combine it to express more unstructured or semi-structured data, with nested fields and things like that. So for example, in `var * {x: ..., y: ...}`, x and y are going to be our fields; `var` is the length of our table, which can be known or unknown; and then come the types of those fields.

So in our previous case, where we had the several iris data sets in different formats, we can get the datashape, and it will look something like this: we have a database; inside the database we have a table; the table has a datashape with the different fields and types; and the same for all of them.

So what's the connection between the Blaze query mechanism and Datashape? Well, Blaze uses Datashape as its type system. When we call `data('iris.json')`, we have access to the datashape and we can explore it. So now we can go back to the Blaze server: if I know the datashape of whatever I've put in the Blaze server, which I can get because, as we just saw, I can just ask for the shape, then I can express my query and just use requests to send it. The return is going to be JSON with whatever I asked the computation to do: it returns the data, the datashape, and the names of the fields.

Next, Odo. Odo is data migration; it's like cp with types, for data. It has a very simple API: `from odo import odo`, and I just have to pass my source and target. If I want to get JSON from a CSV file, I just do `odo('iris.csv', 'iris.json')`, and that creates the JSON for me. That's a pretty simple case, but you know, it can get more complex: putting things into MongoDB or HDFS, or moving things from a Hive CSV to Parquet, things like that.

How does Odo do that under the hood? It's just a network of different formats and conversions. If I want to go from x to y, Odo computes the most efficient path, executes it for you, and returns the final container, which is your target. So imagine you wanted to put something that you have in S3 into Postgres. You would maybe have to fetch the file, read it, turn it into a CSV, and load it into Postgres. Odo just simplifies that: you have these URIs where you can specify your S3 location, wherever it lives, and your Postgres database. Blaze depends on Odo, because it uses it to handle the URIs; the same URIs that are valid for Blaze are also the ones valid for Odo.
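A hedged sketch of what that looks like with Odo; the S3 bucket, the Postgres credentials and the `::iris` table name here are made up for illustration:

```python
from odo import odo

# Simple conversion: CSV -> JSON
odo('iris.csv', 'iris.json')

# The same call shape moves data across systems; odo picks the
# cheapest path through its conversion network under the hood.
# (Hypothetical bucket and database; adjust to your own setup.)
odo('s3://my-bucket/iris.csv',
    'postgresql://user:password@localhost/db::iris')
```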
Now, Dask. As we've mentioned, Dask enables parallel computing. In data science we have different sizes of data, right? We have things that are around a gigabyte, which fit in memory, fit on your laptop. Then we move to the scale of a terabyte, which does not fit in your memory but can fit on your disk, and you still want to be able to compute on it; it does fit on your laptop, so why couldn't you just compute with it? And then we have things at the petabyte scale, where the data fits on many disks.

With single-core computing we can handle things at the gigabyte scale. With parallel computing, if we use shared memory, we can compute things that fit on disk. And if we use a distributed cluster, we can compute things that fit on many disks. NumPy and pandas have solved single-core computing, and Dask is bringing the parallel computation power to the users of NumPy and pandas. We have the shared-memory schedulers, and dask.distributed for the distributed cluster; inside shared memory we have two ways of scheduling, multi-threading and multi-processing.

So what does Dask look like for an end user? We have NumPy, which looks like the image on the left: we create a NumPy array of ones and we return some kind of computation. Dask does lazy evaluation, so you have to call compute on it to get your result, and you also have to specify the chunk size of the arrays, how you want to partition them. There's more information on the documentation page about what good numbers are, in terms of, you know, chunk sizes in the megabytes that you should target. So in this case there are those two changes: you need to specify the chunks, and you need to call compute to actually perform the computation. Then you have two kinds of output. Your result can fit in memory, so you can just get a NumPy array back and keep treating it like that; or, if your result doesn't fit in memory, you can store it to disk in an HDF5 file.

If you're more of a pandas user, a dask DataFrame looks a lot like a pandas DataFrame, but it allows you to compute things that don't fit in memory. That's the big change, without you having to change much of the workflow you're already used to. So with pandas we load the iris dataset, we use head, and we query something. Now imagine that Dask cannot just load one CSV, but can load multiple CSVs that don't fit in your memory. You can still do head, you can still do the queries, but you also have to call compute, because it does lazy evaluation.

Then we have another Dask collection, called dask bag, which allows you to work with semi-structured data like JSON blobs or log files. Imagine we have tweets that we want to load as a dask bag: you call from_filenames with a pattern like '*.json.gz', map json.loads to load the JSON files, and then query it: take the first two, compute user location frequencies, and turn that into a DataFrame, because you know the result will fit in your memory. So it feels a lot like what users are already used to, and it does the parallelism under the hood without you having to worry about it.
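A minimal sketch of those three collections, assuming a Dask of that era (dask.bag's `from_filenames` is spelled `read_text` in later releases); the file patterns and the `amount` and `user_location` fields are made up:

```python
import json
import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# dask.array: NumPy-like, but chunked and lazy
x = da.ones((10000, 10000), chunks=(1000, 1000))
total = x.sum().compute()          # nothing runs until .compute()

# dask.dataframe: pandas-like, over many CSVs at once
df = dd.read_csv('2015-*.csv')     # hypothetical file pattern
df[df.amount > 0].amount.mean().compute()

# dask.bag: semi-structured data such as JSON blobs
tweets = db.from_filenames('tweets-*.json.gz').map(json.loads)
tweets.take(2)
tweets.pluck('user_location').frequencies().compute()
```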
Now imagine that you're at the scale of petabytes, or things that don't fit on your disk, and you actually want to use a cluster of computers. We have dask.distributed for that. The only difference between using Dask in the multi-threaded or multi-processing, single-node manner versus the distributed manner is that you have to import the client, tell the client where the cluster is located, and then, when you call compute, you pass it the client's get; the dask client's get is what makes the computation run on your cluster of computers.

As for the relationship between Dask and Blaze: Dask can actually be a backend, an engine, for Blaze. So you can use Blaze as your query language and have Dask drive those computations. You make a dask array, then you wrap it in the same `data` object that we mentioned in the Blaze section, perform the computation, and get the result with compute.

So far my talk has been mainly focused on users. You can also learn more as a developer: what if you don't want to just use these tools as an end user, but actually want to develop with them? There are some good resources that I'm going to mention. On Blaze, there's a good talk, "Blaze in the real world", at PyData Dallas by Phillip Cloud, which goes more into the internals of how Blaze works under the hood. If you want to know how to build your own converter, if you have one that's not already built as a backend in the system, how can you create one? That's also explained, for Odo, in the Blaze and Odo talk at SciPy last week by Phillip Cloud, and in another one that Ben Zaitlen gave at PyData Dallas. There are a lot of good talks on Dask; it's a very young, you know, six-or-eight-month-old project that spun off of Blaze, and there are also very good resources from Jim Crist at SciPy and Matt Rocklin at PyData Berlin, and those talks go more in depth into the implementation details of those libraries.

There are many libraries in this ecosystem that I have mentioned but have not gone into the details of explaining what they do, and the developers of those libraries have given good talks too, if you're interested. On DyND, Mark Wiebe gave a talk at SciPy, also in Austin, two weeks ago. Stanley Seibert gave one on Numba, on accelerating Python with the Numba JIT compiler. And here at EuroPython we have Antoine, Oscar and Graham, who are part of the Numba team, and they're going to be here all week; so if you have questions on Numba or how to use it, they'll be happy to help you. And on bcolz, Francesc Alted is also here; he gave a talk yesterday and is going to give a tutorial tomorrow, so if you're interested in the internals of storing data, data containers, memory, disk, he's going to be giving that tutorial and you can check it out.

So, just to summarize, the goals of this talk were: to have you rethink the term data science, not as just machine learning and models, but as building the connections between those five areas and how we can bring everything together moving forward; to think in terms of not just one library, but of how inside each library we have data, we have engines and we have expressions; and to encourage you to start using any of these Blaze projects if it's something that, you know, you can benefit from.
All of these Blaze projects are possible thanks to a very talented team working on them: Mark and Irwin on DyND and on Datashape; Ben, who has done a lot of work on Odo and Blaze, and Eric too; and Phillip Cloud, who's also a pandas core developer, working on Blaze. The Dask team is Matt, Jim and Blake, and we also have connections with the bcolz and Blosc team, with Valentin and Francesc. So reach out to any of them if you have an interest in any particular library. And I think I have five minutes for questions, five or ten minutes for questions.

[Audience] Yes, my question is on the relationship between these projects and the other projects in the scientific Python community, like xray or pandas. Do you see these as part of them, replacing them, merging with them, or complementing them?

I don't have a good view of the future, okay? But there's a lot of work on connecting those libraries inside the other ones. We already have several success stories, with things like scikit-image: Dask had a pull request in scikit-image to speed up some of the computations they were doing. So there are different layers, right? There's the user layer, which is kind of extending the use cases past the limitations that some of the end-user-facing libraries have, like pandas and NumPy. Right now pandas and NumPy cannot solve some of the problems they're faced with, because of how they're built. So in that case I would say: if you are at the size of a terabyte or a petabyte, then dask DataFrame and dask array are going to be an alternative to pandas and NumPy. On the other side there's the developer layer, improving computations that already exist in other libraries; there, the connection is going to be in making Dask, for example, a dependency of those libraries and improving their performance. Does that answer your question?

Then there's also a good write-up in the Dask documentation that compares Dask to things in the distributed systems world, and whether it's an alternative or not. I would say that they're targeting, I think, different users, so one of the benefits is having a low overhead to perform distributed computations; it's kind of the good alternative to things that exist in that world. But still, for other people, you know, it might not solve all the problems. I would encourage you to read that comparison.

[Audience] A short question: dask.distributed looks a bit like Spark RDDs or DataFrames. What is the advantage of using one over the other? Why should I choose dask.distributed over Spark?

Okay, so that question has been asked a lot, actually. Matt Rocklin extended the Dask documentation because we were asked so many times about the comparison between Spark and Dask. And of course Spark is a more mature project; it uses the JVM, and it has a higher overhead to set up, right? Dask is just a Python library: you can pip install dask,
you can conda install dask. And it has brought benefits to the core scientific and machine learning libraries that can use it. As an end user, I would say it brings much lower overhead, especially for people in the Python community who don't want to mess with setting up a Spark cluster and dealing with all of that. That being said, you know, you can also integrate them well: Blaze, for example, can use Spark; Spark is one of the backends used by Blaze. So if you want to do a performance comparison between Dask and Spark for your specific use case, it's very easy to do that with Blaze, because your code is going to look pretty much identical: you're just going to change the string that you pass to your `data` class. But again, there's a very extensive section in the Dask documentation that goes into all the details of that comparison.

Any more questions? No? So, thanks. Thank you.