Good afternoon, everyone. Today I'll be talking about SFrame and SGraph: scalable external memory data structures for tables and graphs. My name is Jay Gu. I'm a co-founder and engineer at Dato. Prior to that, I earned a master's degree in machine learning from Carnegie Mellon University, studying under Professor Carlos Guestrin. There I worked on distributed graph partitioning and distributed graph algorithms, and I was a co-author of the PowerGraph paper at OSDI 2012. A few words about Dato: we were formerly known as GraphLab and changed our name this January. The company was founded in 2013, two years ago, in Seattle, by Carlos and his grad students, including me. Our mission is to accelerate the creation of intelligent applications.

A little bit of history for the GraphLab project. The first version of GraphLab started in 2010 with the goal of parallelizing graph analytics algorithms and graphical model propagation algorithms on a single machine using a shared memory architecture. Two years later, we extended the shared memory architecture to a distributed setting, with a particular focus on natural graphs with power-law degree distributions. At the same time, another student from the lab, Aapo Kyrola, developed GraphChi, which takes the same ideas as PowerGraph but implements them on a single Mac mini with external memory support. GraphChi became extremely popular because it is so easy to use: no cluster setup, just run on a single Mac mini, slightly slower, but scaling to the same problem sizes. After that, Joey (Joseph Gonzalez) came here to the AMPLab and developed the PowerGraph abstraction on top of Spark, which is known as GraphX today. And today at Dato, our product is called GraphLab Create, which goes way beyond graphs; that's also part of the reason we changed the name from GraphLab to Dato. We deal with scalable tables, graphs, images, and text data, we provide high-level machine learning algorithms, and we provide tools for deploying machine learning in production.

Okay, that's the history. Let me begin the rest of the talk with a tweet from the Dato conference, the Data Science Summit. The tweet says that data scientists spend most of their time cleaning data and the rest of their time complaining about cleaning data. There are clear pain points in dealing with the size of data today. The first pain point is that the data is often bigger than memory. When you run out of memory, your machine freezes, swapping data back and forth, and there's absolutely nothing you can do about it. Some people deal with this problem by sub-sampling the data, but that introduces unnecessary uncertainty into your model. It also introduces data management overhead: how do I manage this version of the sample, that version of the sample? What do I call the version where I average everything at the end? Another approach is to use big data systems, distributed systems. However, they come with big overhead. Every time you do hadoop fs -ls, you waste three seconds. Every time you submit a Hive query, you know it's time to go for a long walk and take a break. The cluster is also a shared resource with a limited container size, so if your job exceeds its memory, it gets killed. When you come back from that long walk, you find your job got killed or failed, which is extremely annoying and unproductive.
So I think these pain points come from the lack of fast, scalable, easy-to-use external memory data structures. Ask yourself this question: would I be more productive if I could analyze lots of data, without sub-sampling, using just my laptop, at interactive speed? Is that even possible? If we want to make it possible, we need to push the limits of a single machine in terms of performance, scalability, and usability. That's the theme of today's talk.

To understand single-machine scalability, it's important to understand the storage hierarchy. Here we have spinning disk, SSD, and memory. In memory you can get about 10 gigabytes per second of throughput; on disk it's usually on the order of hundreds of megabytes per second. However, capacity goes in the reverse order: on disk you can store up to about 10 terabytes of data, which is enough for a lot of datasets on a single machine. The caveat is that random access on disk is extremely slow, so we want to avoid it. A good external memory data structure for machine learning needs to incorporate this knowledge of the storage hierarchy.

The rest of the talk is divided into three parts. I'll first talk about SFrame, which is for scalable tabular manipulation, then SGraph, which extends the table architecture to deal with graph data. Lastly, I'll talk about how we extend these single-machine external memory data structures to a distributed setting.

Your data usually begins as tables, rows inside tables. For example, in the Netflix data, a user-movie rating is a record. However, when you do data engineering or cleaning, you're typically doing columnar transformations. For instance, we can take the rating column and divide by the total sum to normalize it. We can create a new column called rating squared, which is the square of each rating. We can also sub-select two columns, throw the rest away, and call it a new dataset. These are all columnar operations. So in SFrame, the column is the first-class citizen; we call it an SArray, a scalable array. It represents a single typed column, it's backed by disk, and it's immutable. By immutable I mean that once the SArray is generated, you write it to disk once and never change it. This is extremely important, and I'll come back to it later.

Today we live in an exciting moment where data has become extremely rich. We have text data, image data, voice data, all kinds. And data can come in structured formats like CSV or semi-structured formats like JSON. So SArray supports the traditional numerical types, integer and float; arrays of floats, for when you want to analyze a chunk of numerical values together; strings for text data; datetime for time series; and images for image analysis or deep learning. In addition to those single types, it supports a list type holding arbitrary combinations of the types above, so you can have a list-typed SArray where each cell is a list of, say, integers, strings, or integers and strings, et cetera. And we have dictionaries, which are mainly used to support bag-of-words, sparse representations, and JSON.
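To make these types concrete, here is a small sketch of constructing such columns with the GraphLab Create Python API. The values and column names are invented for illustration, and the calls reflect my understanding of the API rather than anything shown in the talk.

```python
import datetime
import graphlab as gl

# Plain typed columns: integers, strings, datetimes.
rating = gl.SArray([4, 5, 3, 1])
text = gl.SArray(["great pad thai", "too salty", "would go again", "meh"])
date = gl.SArray([datetime.datetime(2015, 1, d) for d in (3, 7, 9, 21)])

# A dict column for bag-of-words / sparse data, and a list column
# for variable-length collections.
votes = gl.SArray([{"funny": 2, "useful": 5}, {"cool": 1}, {}, {"useful": 3}])
categories = gl.SArray([["thai", "casual"], ["bars"], ["pizza"], ["cafes"]])

# An SFrame is just a named collection of these disk-backed columns.
sf = gl.SFrame({"rating": rating, "text": text, "date": date,
                "votes": votes, "categories": categories})
print(sf.column_names())
print(sf.column_types())
```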
Here's an example: the Yelp review dataset. The first column is the business ID, a hash string; we have a datetime column; we have integer values; there's the review text; and we have the votes column, which is a dictionary type with strings as keys and integers as values, counting how many funny, useful, and other kinds of votes each review received. And here we have a list column, a list of strings, which tells you which categories a restaurant falls into. Having an SArray that supports all these data types reduces the need for data engineering and type conversion: you can store the data as it is and perform operations directly on it.

Knowing what an SArray is, an SFrame is simply a dictionary that maps column names to SArrays. This is the internal representation: you have an SFrame with three columns, where the user column points to the SArray storing the user IDs, and similarly for movie and rating. Say we have another SFrame with three similar columns. If I want to assign a column from the second SFrame into the first, all I do is create a new entry in the dictionary that points to the other SArray on disk; it's essentially free. If I want to take the difference of two SArrays, I perform a vectorized minus over the two SArrays and generate a new anonymous SArray, call it diff. I can assign this diff back to the first SFrame, which again just adds a new dictionary entry pointing to that anonymous SArray. This is where the immutability of SArrays becomes very important: once an SArray is written, you know it will never change, so these pointer operations are enough to do the assignment.

What's the scalability of SFrame? The largest synthetic SFrame we have created so far has 950 columns and about 10 billion rows; that amounts to roughly 10 trillion dense numerical values. In fact there's really no limit on the number of rows in an SFrame; you can have as many rows as you want as long as you have enough disk space. There is a soft limit on columns, somewhere between 100 and 1,000, because each column, or group of columns, corresponds to a file; SArrays are backed by files, and the OS limits the number of open file handles, so we can't have too many columns. However, when you have more than 1,000 columns, the data is usually machine generated, and it can be stored as a list or sparse dictionary column instead; collapsing 1,000 columns into one column works for most problems.

Okay, let me show you this biggest SFrame for real. Is the font okay? This is on a single EC2 instance, an r3.8xlarge; it has a lot of memory and disk space for demo purposes. If you only had, say, one billion rows, you could run this demo on a laptop. We begin by loading the SFrame, which is a file, or really a directory, containing a big synthetic biological dataset, and assigning it to data. It finishes instantly because it doesn't perform any operation other than opening a file handle. It has 950 columns and roughly 10 billion rows. Let's take a quick look at the contents. It's a bunch of numerical columns; Y is the response column, these are different kinds of annotations, and we have integer columns, float columns, et cetera. When you type data, all it does is print the first 10 rows, so it doesn't read the entire dataset at all. We can calculate the variance of a particular column over those billions of rows; it finishes quickly because the data is stored column-wise, so we only scan that one column on disk to compute the variance.
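Roughly, the session looks something like the following sketch; the file path and column name are placeholders, and the method names are from memory, so treat it as an approximation rather than a verbatim capture of the demo.

```python
import graphlab as gl

# Opening an SFrame directory is instant: only a file handle is opened,
# and nothing is scanned until a computation actually needs the data.
data = gl.SFrame("big_synthetic_dataset/")   # placeholder path

print(data.num_rows())    # billions of rows
print(data.num_cols())    # ~950 columns
print(data)               # prints only the first rows, so it stays fast

# Variance of one column: a sequential scan of just that column on disk.
print(data["X1"].var())   # "X1" is a placeholder column name
```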
We can create a new feature by combining two columns and performing a sum over the result, and we can remove a column, which is essentially free. Okay, so that's the scalability of SFrame. Our chief architect, Yucheng Low, actually did a crazy thing, which is to scale an ordinary NumPy array to the same scale as an SFrame. This is what it looks like for real. Import numpy as np; this is the NumPy you would normally pip install. Then all we do is import graphlab.numpy. This line doesn't seem to do much; I'll tell you about the magic it does later. Next we dump the entire SFrame, the data with its roughly 10 billion rows and 900-plus columns, into a NumPy array. When we print it out, it really is a NumPy array, and its shape is billions of rows by nearly a thousand columns. This is the largest NumPy array ever created. If you calculate the amount of memory a NumPy array would need to be this big, it would take about six terabytes of memory. We can do sub-selection and indexing of the NumPy array as you normally would. We can change its values; this changes all the values in the first row to one. And we can run scikit-learn models on it. For demo purposes we sub-selected part of the NumPy array, but it actually runs on the full data as well; it just takes a bit longer. Remember this option, shuffle=False; it's very important, and I will cover it later. All right, so we automatically scale a NumPy array to the size of the SFrame, and we can run scikit-learn models on it; this entire NumPy array is backed by an SFrame on disk.

Going back to the presentation, let's do a deep dive into some technical details of SFrame and the secret sauces behind why it's fast, scalable, and easy to use. The first secret sauce is lazy evaluation and query optimization. Essentially, when your data is that big and an SArray is an immutable file, you want to minimize the number of times you create a new physical SArray on disk; you don't want to write SArrays any more often than you have to. This is different from having a data structure in memory. What lazy evaluation does is this: when you call a series of transformations on the external memory data structure, it remembers the operations but does not evaluate them immediately, not until you actually want to consume the output. And even when you ask for the result at the end, it only does as much compute as you asked for. For instance, if I take an SArray and do plus two, plus three, plus five, it builds a chain graph of these operations. If at the end I only ask for the first 10 rows of the SArray, it only applies plus two, plus three, plus five to the first 10 elements. There's essentially no materialization of SArrays during this process. That's what we call lazy evaluation.

Here's a more sophisticated example. We load an SFrame, which corresponds to an operator called the SFrame source. If you select the rating column, that's a projection operator, and if you do rating times two, that's a transformation operator on an SArray. Finally, if we assign it back, that corresponds to a projection and a union, a binary operator. This is our SFrame now; nothing has been created or written yet. Finally we ask: what's the sum of this rating column? That corresponds to another projection and a reduction.
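Before going further, here is a toy sketch of the deferred-evaluation idea in plain Python. It is only a conceptual illustration of recording operations and materializing them on demand, not GraphLab Create's actual query engine.

```python
class LazyArray:
    """A toy lazily-evaluated array: operations are recorded, not executed."""

    def __init__(self, source, ops=None):
        self.source = source      # underlying data (stands in for a file on disk)
        self.ops = ops or []      # recorded chain of element-wise operations

    def __add__(self, k):
        # Record the operation; nothing is computed or written yet.
        return LazyArray(self.source, self.ops + [lambda x, k=k: x + k])

    def _apply(self, x):
        for op in self.ops:
            x = op(x)
        return x

    def head(self, n):
        # Materialize only the first n elements through the recorded chain.
        return [self._apply(x) for x in self.source[:n]]

    def sum(self):
        # Stream through the whole source without building intermediates.
        return sum(self._apply(x) for x in self.source)


sa = LazyArray(range(1_000_000))
out = sa + 2 + 3 + 5        # builds a chain of three recorded ops; no work done
print(out.head(10))         # applies the chain to only the first 10 elements
print(out.sum())            # a single streaming pass over the data
```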
Coming back to that query plan: what we can do is stream all the data from the SFrame source, pump it through all these operators, and the final result is just a single value. During this process, no SArray is created on disk. Furthermore, we can simplify this query graph, or query plan, like database people do. We can combine the projection with the union; we can combine the projection with the source node; and this is the simplified graph. It avoids a lot of unnecessary computation, because all you asked for is the sum of one SArray after it has been transformed. There is a very rich set of query optimizations we can borrow from the database community. For example, if you have a filter operator, like selecting a few rows by some criterion, we want to push that filter as far up toward the source as possible, so that downstream operators see less data. We also want to push projection operators upward so you end up with fewer columns, reorder joins, et cetera. Here's another crazy example of an unoptimized query plan next to its optimized version; this one was generated for stress-testing our query planner. Compared to other systems, the SFrame in GraphLab Create is actually quite fast. This is one of the benchmarks from the AMPLab: it selects two columns with a filter applied, and we perform on par with those in-memory systems. And that's not the end of the story: we run on one machine, whereas all of them run on five machines.

The second secret sauce is what we call type-aware compression. The benefit of storing your data in columnar order is that all the values in a column are of the same type, so you can use compression algorithms specific to that type. For integer values we use frame-of-reference encoding, basically storing each value's difference from a minimum reference value, and delta encoding, which stores the incremental differences; if you have a binary column, we only store ones and zeros. For floating point values, we cast to integers where possible and reuse the integer encodings. For string values, if the number of unique values in the column is small, we use dictionary encoding: you assign each unique value a number, and the column becomes integers. And for image columns we can use JPEG or PNG encoding, which is much more efficient than storing each image as a raw pixel array. These compressions bring the data size down dramatically. Look at the Netflix dataset from a few years ago, which used to count as really big data: 100 million rows and three integer columns. Stored in raw text format, it's about 1.4 gigabytes. Loaded into a pandas DataFrame, it's actually 3 gigabytes, because a pandas DataFrame carries all that indexing overhead. In SFrame it only costs 160 megabytes, about one tenth of the raw size, and the SFrame lives on disk, so it consumes essentially zero memory. SFrame's compression actually does better than gzip here, because gzip is not type-aware; it just treats your data as rows. This Netflix dataset happens to be sorted by movie ID, so the movie column compresses extremely well.
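As a rough illustration of two of these encodings (a toy sketch, not Dato's actual codecs), here is frame-of-reference encoding for integers and dictionary encoding for low-cardinality strings in plain Python:

```python
from typing import List, Tuple

def frame_of_reference_encode(values: List[int]) -> Tuple[int, List[int]]:
    """Store a reference (the minimum) plus small non-negative offsets.
    The offsets usually need far fewer bits than the raw values."""
    ref = min(values)
    return ref, [v - ref for v in values]

def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Map each distinct string to a small integer code."""
    codes = {}
    encoded = []
    for v in values:
        encoded.append(codes.setdefault(v, len(codes)))
    vocab = sorted(codes, key=codes.get)   # the dictionary, ordered by code
    return vocab, encoded

ref, offsets = frame_of_reference_encode([100003, 100007, 100001, 100010])
print(ref, offsets)      # 100001 [2, 6, 0, 9] -> offsets fit in a few bits

vocab, ints = dictionary_encode(["WA", "CA", "WA", "WA", "NY"])
print(vocab, ints)       # ['WA', 'CA', 'NY'] [0, 1, 0, 0, 2]
```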
Now let's look back at the storage hierarchy picture from two slides ago. Those were the raw throughputs, but with compression shrinking your data roughly 10x, the effective throughput increases by about an order of magnitude, which narrows the gap between external memory speed and in-memory speed.

However, an external memory data structure alone is not enough. Remember that shuffle=False option in the demo: random access on disk is very expensive, and existing machine learning algorithms usually contain a lot of random access, which kills performance even when you have an efficient external memory data structure. This calls for rethinking a lot of machine learning algorithms and data engineering operators. We don't want random access; we prefer sequential access. Do you really need to shuffle or sample your data? If you do need sampling, how do you implement the sampling algorithm using sequential access only? We want to avoid sort and shuffle operators as much as possible; try very hard not to sort and shuffle. What about space-decomposition trees, external memory tree data structures, how do you implement those? That's a very interesting research question.

For machine learning tasks on a single machine, we actually perform better than one of the Berkeley projects, BIDMach, which uses a GPU node. For the Kaggle competition on ad click-through rate prediction, on the Criteo dataset with 46 million rows and 34 million sparse coefficients, BIDMach runs 10 iterations of SGD in about 800 seconds, whereas we run 10 iterations of L-BFGS, which is more expensive than SGD, in only about 500 seconds. This is not a compute-bound task, so the GPU doesn't help much here; the throughput of an efficient external memory data structure is the key to performance.

So we've talked about performance and scalability; we also built SFrame with usability in mind. It has a Python interface, and an R interface is coming soon. It comes with a lightning-fast CSV parser to bring your data in as quickly as possible, with flexible JSON support: if a CSV column contains a JSON string, it's automatically parsed into a dictionary type. It integrates nicely with NumPy and pandas, as you've seen. You can also bring data in from databases through an ODBC driver, and from Spark RDDs. We also implement a lot of the common operations data scientists want, for example the mean, max, and quantiles of a column, or the number of unique values of a categorical column, using a one-pass streaming sketch: you make one pass over the column and compute all these summaries at once. And it comes with nice visualization.

In summary, SFrame is a compressed columnar table built to be fast, scalable, and easy to use. It supports rich data types for ease of use, and technically it uses a columnar architecture with the benefits of type-aware compression and lazy evaluation, which also makes feature engineering easy. Let me take a quick pause here; if you have any questions about SFrame, I can try to answer them before I move on to SGraph.

Can you read data from ODBC into an SFrame and dump it into... you mean the Python binding? Yes, in the Python binding, yes. The entire SFrame backend is implemented in C++; Python is our main front-end language. No. Yes. Not yet. Supporting different languages is always a common request, but we have a prioritization queue; also, Java is a bit different from R or Julia, which are easier to bind to than Java.
Oh, the backend language is C++. Yeah. Okay, let's talk about graphs. Tables are not the only data format you'll encounter in analysis. Graphs are becoming more and more popular, thanks to social networks. Graphs encode relationships between people, facts, products, interests, and ideas. Graphs can be really big: in 2012, Facebook had one billion users and 144 billion friendships, and Twitter had 15 billion follower edges. SGraph is an immutable, disk-backed graph representation. It can store arbitrary attributes on vertices and edges. It is optimized for bulk access, not for fine-grained queries. Compare SGraph to graph databases like Neo4j or Titan, which support fine-grained queries like "give me all the neighbors of this single vertex": SGraph is not built for that, and it is not optimized for small, fine-grained queries. It is, however, very efficient for batch queries. A query like "get the neighborhoods of five million vertices", instead of one vertex, will be very efficient with SGraph.

The layout of SGraph is very simple; it's built using SFrames. The vertex data is partitioned into a number of segments, call them partitions, in this case four partitions. Each partition is essentially an SFrame. Remember that an SFrame is columnar and lives on disk, so the SGraph is also columnar and on disk. In the vertex data we store IDs, and each vertex ID can be associated with different kinds of metadata, like address or zip code; all the column types supported by SArray are automatically supported by SGraph. The edges are similarly partitioned, into p-squared SFrames. Each edge partition stores source IDs, destination IDs, and extra metadata for each edge. The benefit of this partitioning is locality: for example, edge partition (2, 4) only contains vertices from vertex partitions two and four. This allows, for instance, computing on a sub-graph: any computation on that sub-graph only requires loading at most two vertex partitions. Similarly, if I want to look at the neighborhood of the vertices in partition two, I only need the edge partitions in its row and column of the grid; the rest can be ignored.

Finally, because SGraph is built on top of SFrame, a direct concatenation of all the vertex partitions gives us the same view as tabular data. We can do g.vertices['pagerank'], essentially selecting the pagerank column from all these SFrames, and perform a sum. So you have this seamless transition between graphs and tables. You can do feature engineering on graphs, on vertex data and edge data, just like you would on a single table, with zero cost, because concatenating or appending two SFrames is essentially free. And we built a nice API that minimizes the friction of graph feature engineering; as you can see, these APIs look exactly like tables. g.vertices, although it's a collection of four SFrames, gives you the same view as a single SFrame.
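Going back to the grid layout for a moment, here is a small sketch of the bookkeeping it implies: with p vertex partitions, edges land in a p-by-p grid of edge partitions, and the neighborhood of one vertex partition touches only its row and column of the grid. The hash-based partition function below is a stand-in; it is not necessarily how SGraph actually assigns partitions.

```python
p = 4  # number of vertex partitions; edge partitions form a p x p grid

def vertex_partition(vertex_id):
    # Stand-in partition function; SGraph's real assignment may differ.
    return hash(vertex_id) % p

def edge_partition(src_id, dst_id):
    # An edge lives in the grid cell indexed by its endpoints' partitions.
    return (vertex_partition(src_id), vertex_partition(dst_id))

def neighborhood_edge_partitions(vpart):
    # All edges touching one vertex partition sit in its row plus its column,
    # i.e. at most 2*p - 1 of the p*p grid cells.
    return {(vpart, j) for j in range(p)} | {(i, vpart) for i in range(p)}

print(edge_partition("Alice", "Bob"))           # some cell, e.g. (1, 3)
print(sorted(neighborhood_edge_partitions(2)))  # 7 of the 16 cells when p = 4
```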
However, this layout is not the most efficient yet; it's still wasteful, and we can see why. All the IDs on the edge data are still strings. Strings can be arbitrarily complex, they're expensive to work with and hard to compress, and they duplicate information we already have in the vertex data. So what we do instead is implicit numbering of the source and destination IDs. Each vertex partition has an implicit row ID for each record, so instead of the actual ID value we use its row number, local to its own partition. In this case Alice is row zero of partition one and Bob is row one of partition one, so in edge partition (1, 1) Alice and Bob are replaced by zero and one. We do the same for the rest of the records, and now the source ID and destination ID columns of the entire edge table become integers, which are much easier to compress.

With this encoding, plus SFrame, the efficient external memory table structure we already have, we can compress the largest publicly available graph, the Common Crawl hyperlink graph with 3.5 billion vertices and 128 billion edges, from a two-terabyte dataset down to about 200 gigabytes, a similar compression ratio of roughly ten to one. This dataset is just hyperlinks between pages, with no metadata. It also gives us very high throughput over the edges. To compute PageRank on this big graph on a single machine, we achieve about nine minutes per iteration, and we can compute connected components in less than an hour on a single machine using SSDs. There is no general-purpose library today that is capable of doing this.

Okay, that's SGraph, yes? Oh, you're asking how we express computation on the graph. I actually skipped a few slides about graph computation in the interest of time, but essentially you do a full sweep over the edges. For each edge we load its two vertices, so we have the two vertex records in memory; the vertex data holds, say, the current PageRank value, and as you iterate through the edges you propagate, you send a message or perform an operation that changes the value of the target vertex. This is what we call the triple apply abstraction. In Python you define a user-defined function for triple_apply that takes three arguments, the source vertex, the edge data, and the destination vertex, and returns a triplet like that; that's the computation you define for one part of the graph, and you iterate this triple apply operation until, say, the PageRank values converge. We have native implementations in C++, but for quick prototyping the Python interface is the easiest to use.
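As a rough sketch of what this looks like with the GraphLab Create API (the field names are invented and the exact signature is from memory, so it may differ slightly), here is a degree-counting sweep expressed with triple_apply:

```python
import graphlab as gl

# Toy graph: a few edges between named vertices.
edges = gl.SFrame({'src': ['Alice', 'Alice', 'Bob'],
                   'dst': ['Bob', 'Carol', 'Carol']})
g = gl.SGraph().add_edges(edges, src_field='src', dst_field='dst')

# Give every vertex a counter field to be updated during the sweep.
g.vertices['degree'] = 0

def count_degree(src, edge, dst):
    # Called once per edge with the source vertex, edge, and destination
    # vertex records; mutate them and return the triplet.
    src['degree'] += 1
    dst['degree'] += 1
    return (src, edge, dst)

# One full sweep over the edges, updating the 'degree' field on vertices.
g = g.triple_apply(count_degree, mutated_fields=['degree'])
print(g.vertices)
```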
We have a bunch of built-in toolkits for graph analytics, including PageRank, label propagation, connected components, triangle counting, k-core, and shortest path, all implemented in C++ and running at native speed. For user-defined computation you have two choices: you can write a Python function, which is easy to prototype but runs slower, or you can use our SDK, which exposes essentially the same abstraction and the same function signature but in C++; you compile it and import it as a shared-library module. And since the SFrame and SGraph backend will all be open source, if you want to write the most efficient code, you're welcome to contribute it to the open source library.

The overhead of calling a Python function is an interesting question. We all know Python has this annoying GIL, so within one process you cannot run a Python function in parallel. What we do is serialize your Python function, and in the backend we spawn a bunch of worker processes, which we call pylambda workers. Each worker receives the serialized function and deserializes it, and then we send data from the main GraphLab Create process to the Python workers; they run the computation and return the results. So the overhead is actually quite large; I'd say it's about 100x slower than native C++. But for a graph with fewer than, say, 100,000 edges, one iteration of triple apply, even implemented in Python, can finish in less than 10 seconds.

Yes? It's really sort of a hybrid thing; it's not a graph database at all. Is there something like this in Python already? Ah, NetworkX, is it? NetworkX is a nice library that implements a lot of graph algorithms and graph visualizations. It essentially keeps the entire edge and vertex data structures in memory, so it can do a lot of in-core computation efficiently, whereas we are dealing with external memory and have to be very careful about how we partition the graph and how we access it; random access to vertices and edges would be as bad as random access on disk. So for getting a single neighborhood, if you wanted to implement a graph visualization library on top of SGraph, you'd have to do additional caching yourself, because jumping around the on-disk tables is usually not the best thing.

Okay, we've talked about efficient data structures for a single machine; how do we extend them to a distributed setting? Say that on a single machine, given a dataset with two columns, X and Y, we can do one pass over it in a hundred seconds. When you have multiple machines and each machine gets half of the data, you essentially have parallel disks, and the cost of one pass is cut in half. So having a good external memory data structure for a single machine is still very helpful when you go distributed. But a lot of machine learning algorithms are not embarrassingly parallel like this. For example, in distributed convex optimization, Newton's method, L-BFGS, gradient descent, what you do is a parallel sweep over the data to compute a gradient, then you exchange parameters to synchronize, and then you repeat this over and over. So in addition to having efficient external memory data structures to speed up the first part, we also need to make sure the data is evenly split, and we need to push the performance of the network communication layer so that the machines talk as little and as quickly as possible. We begin with the data on HDFS, and each machine first saves its part of the data onto its local SSD for more efficient iterative passes. Then we implemented a very high performance RPC layer to do the communication quickly. On a benchmark over the full Criteo dataset, with 4.4 billion rows, we achieve nearly linear speedup. This is run to convergence, by the way, not the time for one iteration; on 16 machines we can run to convergence in just a few minutes.
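The pattern described above, a parallel sweep over local shards followed by a parameter exchange, can be sketched in plain Python as below. This is only a schematic of the loop structure, simulating machines with a process pool and using plain gradient descent on least squares; it is not how Dato's distributed engine is implemented.

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # Each "machine" computes the squared-error gradient over its local shard.
    w, X, y = args
    return X.T @ (X @ w - y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100_000)

    n_machines = 4
    shards = list(zip(np.array_split(X, n_machines), np.array_split(y, n_machines)))
    w = np.zeros(5)
    lr = 1.0 / len(X)

    with Pool(n_machines) as pool:
        for _ in range(50):
            # 1) parallel sweep: each shard computes its partial gradient
            grads = pool.map(partial_gradient, [(w, Xs, ys) for Xs, ys in shards])
            # 2) synchronize: sum the partial gradients, update the parameters
            w -= lr * np.sum(grads, axis=0)

    print(w)   # close to the true coefficients
```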
How do we distribute the graphs? This is even more interesting, because it's even less embarrassingly parallel. Graphs have these constraints between vertices and edges, and we have to partition the graph. The objective of graph partitioning is to minimize communication: essentially, the number of machines that a vertex spans determines how much communication you need for that vertex, so we want as few replicas of each vertex as possible. But graph partitioning is a really, really difficult problem, and doing it well may by itself be more expensive than your original task. So we use a simple but effective heuristic partitioning strategy with zero cost. Remember that SGraph's edges are already laid out in a grid of P edge partitions, and each machine takes just one edge partition, or some fraction of them, say a quarter. Because of this grid constraint, each vertex can span no more than 2 times the square root of P edge partitions. For example, the vertices in vertex partition one only exist in these edge partitions, so they show up on no more than 2 times the square root of P, minus one, machines, assuming you have as many machines as edge partitions. We found that in practice this is good enough, compared with running a complicated distributed graph partitioning optimization just to bring the replication factor a little lower.

This is a slide from two or three years ago, where PowerGraph reported PageRank on the Yahoo Altavista web graph, with about six billion edges, at under seven seconds per iteration; that's about one billion edges processed per second. Today, with the new external memory data structures and these graph partitioning ideas, we can run on the even bigger Common Crawl graph, with its 128 billion edges, on 16 machines at only 45 seconds per iteration; that's about three billion edges per second.

So in summary, we've talked about external memory data structures as the key to productivity in data science, about ways to scale to terabytes of data on just a single machine, and about how, when you extend these lessons from a single machine to a distributed setting, you can run even faster and scale to even larger datasets. People often ask me what you learn from systems that applies to machine learning, and what you learn from machine learning that applies to systems. This is a very good example I wanted to share with you: we want to understand the storage hierarchy, which comes from the systems side, and we want to understand the memory access patterns of the algorithms, which comes from the machine learning side, and we want to combine the two so that we can build systems designed for machine learning.

Finally, my last slide. We have gone from distributed graph computation five years ago, back to a single machine with SFrame, and now back to distributed. What have we learned? Speeding up performance on a single machine is what we call scaling up, and adding more machines in a distributed setting is scaling out. It's not just about speed or scale: it is easy to throw more and more machines at bigger and bigger problems, or to run faster on some benchmark. At Dato we are not just doing that; it's about doing more with what you already have. Okay, thanks.

The data structures, SFrame, SGraph, and SArray, will be open source. The machine learning algorithms we build on top of them, for example the recommender system and the deep learning neural nets, are not going to be open source. Distributed? Distributed will not be open source. NumPy is a good question; we're still heavily debating it internally. I hope it will be open source, but I can't say right now. With these good data structures open source, we are looking to the machine learning community to develop more and more external-memory-friendly algorithms.
There are a few of them already. For instance, XGBoost by Tianqi Chen, which is a super fast open source library for gradient boosted trees, actually has an external memory version; I'm talking to him about integrating it with SFrame. We're looking for more and more things like that. Also scikit-learn: an interesting story about this NumPy hack is that after we did it, we had this very large NumPy array and wanted to run some experiments on it with scikit-learn. We had a hard time finding algorithms that could actually scale on that NumPy array. We found that SGD without shuffling works, and PCA actually works, but for the other algorithms either the complexity is too high to begin with, like N-squared algorithms we're not even going to try, or, even for linear-time algorithms, some of them do heavy random access, which just doesn't work even when you have a good external memory data structure. Oh, and if you want to learn about that crazy NumPy hack and how it works, I can talk about it afterward. All right. Thanks.