The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real: find out how best to keep it real at stevenmoyfoundation.org. We're excited today to have Ehsan Totoni, who is the CEO and co-founder of Bodo, a new parallel computing platform that supports Python and SQL, which he's going to talk about today. Prior to starting Bodo, he was a researcher with Intel Labs, and actually was sitting on site here at Carnegie Mellon as part of a collaboration between Intel and CMU. Before that, he did his PhD at UIUC. And before that, he got his undergrad degree in computer engineering at the best school in Iran, which is Sharif University. So with that, Ehsan, the floor is yours. Again, for everyone else in the audience, if you have any questions for Ehsan while he's giving the talk, unmute yourself, say who you are, and ask your question. And feel free to do this anytime. Please interrupt him. That way, he's not talking to himself the whole time. Okay. All right. Thanks so much for being here, man. The floor is yours. Thank you, Andy. Thanks everyone for taking the time, excited to be here. And thanks for the kind introduction. Let me change the slide. So just a little bit about the story of how Bodo and this work came about. As Andy said, I got my PhD from UIUC, which is sort of the hub for HPC. I was working on parallel programming systems, energy efficiency of supercomputers, working with national labs on supercomputers. Everything was good. But the codes we were working on were MPI, Fortran, C++ — very low-level codes that are not accessible to anyone other than the experts at national labs. Meanwhile, everyone in the science domains, all the data scientists we saw, were writing Python, MATLAB, these kinds of codes.
So I joined Intel Labs, and for some portion of that time I was at CMU, to work on democratizing HPC and compute for data analytics and the average, everyday developer. We built various things in Julia and Python. After four years of research, when it was clearly successful and worked on real applications, I started Bodo in May 2019, building a parallel computing platform for data analytics which supports Python and SQL. There's a very interesting story of why SQL, which I will discuss in a little bit. So the general outline of the talk: a little bit of background and motivation for what we do, how we solve the problem, the Bodo approach, what Bodo SQL is and how it works internally using Bodo, a little bit about optimizations, a little bit about resiliency — where there are a lot of misconceptions I would like to discuss — and the conclusion of the talk. So let's get going. At a high level, there's a lot of data in the world, and all organizations, small and big, have some sort of data and would like to take advantage of it to solve new problems, improve efficiency, gain competitive advantage, so on and so forth. Therefore, they are hiring a lot of data scientists. However, according to Gartner, 86% of data science projects fail and don't go to production. From our point of view, the challenge is that the applications data scientists develop on their laptops on small data don't scale to production easily, and there are a lot of barriers to getting from that prototype to production. So we think data science is really a programming problem: how can we enable data scientists and data engineers to write code that works on large data and scales seamlessly? That's what we are focused on. In terms of languages, Python is really dominant today and is the language of data science, because Python works for data scientists with domain expertise. They are not necessarily software engineers. A lot of them don't have CS degrees. They know programming through maybe an online course or online material.
So they are not really focused on the code-writing aspect of it. But Python allows them to write complex code quickly. And this data is from Stack Overflow: the portion of questions per month over the past couple of years, and Python is by far the most dominant. You can compare to R and SQL and Spark, and you will see that Python is pretty popular — 15% of all Stack Overflow questions are about Python. Python is used not only for data and machine learning but for other things as well, but the main use case these days is data applications. So at a high level, the problem goes back to the simplicity-performance gap in computer science: the fact that high-level scripting languages like Python, MATLAB, Julia are easy to write, but they are slow and not scalable — single core, typically on some interpreter, not even machine code. On the other hand, HPC codes are low-level: MPI, Fortran, C++. They are fast and scalable to hundreds of thousands of cores, but they are very complicated, and very few people in the world can write those codes and get them to work. So there's a huge simplicity-performance gap that a lot of CS researchers have worked on for several decades. The challenge is to come up with a programming system that provides simplicity, performance, and generality at the same time. Today, you have to pick two; you can't have all three at once. For example, the Python data libraries, pandas and NumPy, are pretty general for data problems — you can solve a lot of problems in different domains with them — and they are very simple, but they are not fast: they are single core and not scalable. On the other hand, low-level MPI C++ code — by the way, MPI stands for Message Passing Interface, the dominant tool for writing parallel programs in the HPC domain — is very fast, scalable, and very general, but it's not simple. So those are the two corners you get with this approach.
The other approach is building domain-specific languages like Halide. Halide is a domain-specific language (DSL) for writing image processing pipelines. It's very fast and simple to use, but it's not general: it's a very rigid structure for the programmer, and it's only focused on image processing pipelines. So the holy grail solution is a compiler that provides automatic parallelization for general data problems, and this is what we believe we have achieved at Bodo. Before Bodo, the main approach to scaling data problems was the so-called big data framework approach — things like MapReduce, Hadoop, Spark. The idea is that you create a library with MapReduce APIs that is implemented as a distributed-system backend. The structure is: there's a driver, a single process, and there are some executors. The driver runs the program. The program is sequential, because the language is sequential, but throughout the execution the driver extracts tasks, schedules those tasks onto the executors, and they return the results. This approach gained a lot of traction because it's much simpler than writing HPC code. HPC code is not practical in any kind of commercial setting, other than some scientific applications and supercomputing centers, because developing those codes takes a long time and needs a lot of expertise. So this approach is not as complicated — but it's still very complicated, not comparable to simple Python code. Also, these frameworks are much slower than HPC and don't really scale as well. Just Google "Spark versus MPI" and you will see so many papers and examples showing the massive performance gap between Spark and MPI. So from an approach point of view, we believe the distributed systems approach is not a good fit for parallel computing.
The reason is that when you are building a distributed system — let's say an internet portal, a client-server application — the underlying assumption is that you have heterogeneous hardware and software components connected by some unreliable network, possibly far apart. But this assumption is wrong for parallel computing, because the hardware and software components are homogeneous — the same CPUs in a cluster, the same kind of compute running at the same time, often a bulk-synchronous algorithm — and the network is fast and reliable. So a lot of the assumptions are wrong, and later in the talk I will come back to this, show performance results, and discuss further why the distributed systems approach is wrong. But first, the most important aspect is having simple code and programmability, because programmability in this area comes before performance. In this big data approach, the front end today is mostly SQL or SQL-like things. In Spark, the MapReduce APIs and some of the other APIs are either very similar to SQL at a high level or run on the same SQL engine. And the problem with that is, compared to Python, Python code is very simple, expressive, imperative code, and it's easy to express these data transformations. But if you write it in PySpark, you are essentially assembling a compiler intermediate representation (IR) that runs on the SQL backend. All the constructs are lazy-evaluation wrappers; the data frame is not really a Python data frame; you have to manage data partitions. So it's pretty low-level code. That's one option, the PySpark API, with these problems. Alternatively, a lot of programmers are forced to use SQL for these data applications. SQL is good for queries and a lot of things, but not everything. If the application is complex and has complicated patterns in it, the SQL queries become very long. SQL is declarative, and it becomes hard to understand, maintain, and further develop the code.
So we believe Python is better than SQL for a lot of data processing and big data applications as well. And we can discuss that. But the ideal solution, we think, is Python plus SQL, which we have built and I will discuss in a bit. So here's an example of Bodo code and how Bodo works. The code is standard Python pandas — standard APIs, read_parquet, which can read terabytes of data; these data frame and table operations and mathematical operations are all available. The programmer adds this bodo.jit decorator to the function, and the workflow is just Python from their point of view. But the compiler replaces the function with an optimized and parallelized version, as if an HPC expert wrote the code — except that it happens automatically and transparently. This use case is actually something real. It's very simple, but used in practice in the financial domain. They were able to run it on a small cluster and get over 100x speedup, much faster than the Spark alternative. Also, this describe function provides a lot of information — means, standard deviations, quantiles, so on and so forth. But the Spark version of it cannot do things like quantiles, because the parallel algorithm for quantiles doesn't fit the MapReduce pattern very well, so they don't provide it. So it's important to have a proper parallel architecture to be more general than things like MapReduce libraries. Based on that, we believe we have closed the simplicity-performance gap for data applications. The development simplicity is similar to Python, because it is a Python API, but the performance is close to MPI. Things like Spark are somewhere in the middle. And we think Bodo is a step-function improvement over previous approaches. So let's see how it actually works. It's a little bit of compiler stuff that I will go through, but I'll try to get to Bodo SQL — the thing that probably this crowd is more interested in — quickly. Do you have any questions, Andy?
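To make the example concrete, here is a minimal pandas-only sketch of the kind of workload described (the data and column names are made up; with Bodo installed, the only change would be adding the `@bodo.jit` decorator to the function and reading real Parquet data):

```python
import numpy as np
import pandas as pd

def summarize():
    # Stand-in for pd.read_parquet(...) on terabytes of data:
    # generate a small frame in memory instead.
    df = pd.DataFrame({"A": np.arange(1000) % 7,
                       "B": np.arange(1000) * 0.5})
    # describe() computes count, mean, std, min, quantiles, max.
    # The quantiles are the part that MapReduce-style engines
    # struggle to parallelize.
    return df.describe()

stats = summarize()
```

The point of the example is that nothing in the function mentions parallelism: the parallel semantics live entirely in the pandas API calls.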
I was just going to say, I'm interested in compiler stuff too, right? We use compilers in databases, so yes, go for it. Okay, sounds good. Great. So Bodo is a different kind of compiler, and we use this term "inferential compiler" because it infers a lot of program properties. In a regular compiler, the program code in C — something human-readable — is translated to machine binary. There are some optimizations, register allocation, a lot of things that are done, but the program structure stays the same. It's not parallelized; it stays sequential. Today you have to manually parallelize your code in C. But Bodo understands the program structure and is able to optimize it at a high level. Not scalar operations, like optimizing the addition of two integers — it's about join and group-by and that kind of SQL-level understanding of the Python program that Bodo is able to optimize, parallelize, and turn into a fully parallel binary. So it's a very different kind of compiler, and I'll discuss why. Automatic parallelization has been explored for decades, and it failed — it's not used in practice today. The reason is that those efforts were trying to analyze loops and the memory access patterns of loops in C or Fortran programs. One example is the transformation on the right. It's Fortran code — an array privatization transformation, meaning that there are temporary data structures in loops, and when you want to parallelize a loop, it's better for each processor to have its own copy of those data structures. You don't want to go across processors and exchange data for those little things; it would kill performance. The parentheses here are actually array accesses in Fortran, the equivalent of brackets in C and Python. So this work array is being privatized and copied on different processors. That's one transformation.
So basically, the previous approaches would create a big decision search space, use things like integer programming to make decisions, and come up with some approximate solution, which didn't really work in practice. In our case, it's actually harder in some respects, because we are working on Python: Python doesn't have static data types — it's dynamically typed — and it has more complicated data structures than just arrays, because we have things like data frames, which are actually the main targets. The way we solve this problem is that we don't focus on loops. We focus on high-level APIs, because the way programs are written today is in terms of the high-level APIs of pandas and NumPy and things like that. So that's where we focus. We treat these high-level APIs as deeply embedded DSLs in the general program. They are not just function calls; they are native operators of the compiler that are optimized and transformed along the way, through the whole pipeline of the compiler. And these APIs have parallelization semantics that the compiler exploits to parallelize the program. So there is no loop memory-access analysis as such, and none of those loop transformations are necessary, which makes the problem something completely different — and manageable to solve. The way the compiler pipeline works is: the Python function comes in as bytecode, which in Python is a stack-based representation. We transform it to Numba IR — this is done in Numba, an open-source package. Then we transform these DSLs in the compiler: some transformations for data frames, to convert them to a representation the compiler understands and can optimize, and we transform the Series operators (Series are the columns of these data frames in Python) into arrays as much as possible. Then we have this parallel-accelerator component that is able to understand, optimize, and take care of arrays. So we do the array transformations.
At this point, the program is in a structure that, in terms of the operators we understand, is fully optimized and ready for parallelization. So there is distributed analysis of how to lay out the data structures and computations. Once that's done, the program is transformed to a parallel version — we adjust allocations and loops, insert the MPI calls for parallelization, and so on. And at the end there is code generation, with MPI calls — just a binary, as if you had written the code in C. So that's the compiler pipeline. We have a paper if you are interested in more details. But the key piece of it is — you have a question? Yeah. So maybe that's actually the next slide. So the basic idea is, you're assuming this Python function that you're going to optimize and compile is primarily making calls to pandas or NumPy, and those you pull out into the DSL. I understand how that all works. But what if there's arbitrary Python in it? Say there are two pandas calls, and you take the output of one and make some decision about the next. You can parallelize the first one, then you have to coalesce the result to a single node, then you parallelize again — you're hitting all those data movement problems. Or do you assume that the function only contains pandas and NumPy calls? — So, there is data flow in the program. I understand where this question comes from, because in SQL there are no loops; it's like a tree, so it's easier to analyze. But for us, we have loops and conditionals and things like that too. So there is no restriction on that side. In terms of APIs, the package APIs used should be things the compiler understands — pandas, NumPy, and we have a list of others, the common things used for data programs, scikit-learn and some more.
But also, you can write other JIT code and mix and match JIT and regular Python, based on a certain structure. So it's quite general. It's not super magic — stick the decorator on, forget about it, and it will run — you will hit some issues. But with a little bit of understanding of what's happening, our goal is that with half an hour of training, any developer should be able to pick it up and write their code. — Okay, so the decorator example was the toy example, but a lot of times people come along with some random Python function, they add the Bodo decorator, and it doesn't always work, because there might be some constructs in there that you don't support. — Right, and I actually have a slide on the limitations a little bit later. All right. Awesome. Keep going. All right. So the automatic parallelization piece is the main secret sauce. As I mentioned, it's about exploiting the parallel semantics of these pandas, NumPy, and other operators, which are implicitly parallel because they work on arrays. We also know that for these applications, the parallel patterns are map-reduce — not to be confused with the MapReduce system, just the parallel pattern — and relational operators, join and group-by and things like that. So we know the parallel patterns, and we know that a lot of the data should be distributed in one dimension, so-called one-dimensional block distribution, which means you just divide, let's say, the rows of a data frame into blocks, and each processor owns a block of the data. It's much simpler than some scientific physical-simulation applications that may have three-dimensional and four-dimensional arrays and so on. So we have all of that information in the program, and we know that for map-reduce, the main thing we need to decide is which are the big collections or big data structures that need to be distributed, and which are the small temporary ones that need to be replicated across processors.
This is one of the key decisions the compiler makes. Once we make those decisions, the compiler generates an efficient MPI binary, which is single program, multiple data (SPMD) — each processor owns a chunk of the data, and there is no master/executor approach here. So that's how to think about it. The implementation is a data-flow compiler algorithm, kind of like liveness analysis or reaching definitions — all the standard compiler algorithms — so all the theoretical goodness applies here. For all the operations the compiler has, there are transfer functions that apply constraints on the parallelization of the arrays and operators, and the whole algorithm is a fixed-point iteration that converges to an optimal solution. To your point, Andy, if you want to handle control flow and things like loops, you have to have a data-flow compiler algorithm that is able to do that. So in terms of the theoretical setup of the algorithm: to do data flow, you have to have a set of properties you want to infer. In this case, as I mentioned: 1D block distribution — basically, is this data frame distributed or not; 2D block, if you have two-dimensional arrays, which is used rarely; and replicated, which means this data structure should be replicated. Replicated is the bottom of the lattice and 1D block is the top, because we start from the top — we want to make everything distributed as much as possible, but some constraints make some things replicated. That's how it's set up. And we have D(A), the distribution of arrays, and D(P), the distribution of parallel for loops (parfors), which I'll show in a minute.
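As an illustration of the 1D block distribution just described, here is a hypothetical helper (my own sketch, not Bodo's actual code) that computes which contiguous chunk of rows each SPMD process would own:

```python
def block_bounds(n_rows, n_ranks, rank):
    """Return the [start, end) row range owned by `rank` under a
    1D block distribution; early ranks absorb the remainder."""
    chunk, rem = divmod(n_rows, n_ranks)
    start = rank * chunk + min(rank, rem)
    end = start + chunk + (1 if rank < rem else 0)
    return start, end
```

For example, 10 rows on 3 processes gives ranks the row ranges [0, 4), [4, 7), and [7, 10) — every row owned by exactly one process, with no driver coordinating anything at runtime.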
So we have this setup, and we have transfer functions for these IR nodes, and each iteration of the algorithm applies this big transfer function to the distributions of arrays and computations, gets a new version, and continues. So it's a fixed-point iteration algorithm, just like other data-flow algorithms. And for a fixed-point iteration to converge, you have to make sure your transfer functions are monotone, which is very critical — otherwise you'll hit infinite loops. You can change a solution to replicated, but you can't go from replicated back to distributed. It's very important to keep this in mind in the implementation so that it converges — and it converges very quickly, to the optimal strategy that a parallel programmer would choose. Very strong guarantees compared to the approximate integer-programming algorithms of the past. A few examples of transfer functions, just to see what they look like: if you have an assignment, the left-hand side and right-hand side should have the same distribution. So it's the meet of the two distributions, meaning that if either of them is replicated, they are both replicated. You apply this constraint. That's one example. If you have binary operators on arrays — which we translate into parfors most of the time, but sometimes we don't — the output array and the two argument arrays have to have the same distribution. So it's the meet of the three, applied to all three. If you have a join of tables, all the columns of a table should have the same distribution — semantically it doesn't make sense to have one column replicated and the other columns distributed. So it's the meet of the different columns. And for function calls or internal operators that are not some IR node, we have a table: you pass in the distributions of all the arguments and get the distributions of all the arguments out.
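Here is a toy sketch of this analysis (my own simplification, not Bodo's implementation): distributions form a two-point lattice with `1D_Block` on top and `Replicated` at the bottom, each constraint says "these arrays must share a distribution," and the solver iterates monotonically down the lattice to a fixed point:

```python
ONE_D, REP = "1D_Block", "Replicated"

def meet(a, b):
    # Meet on the two-point lattice: anything met with Replicated
    # (the bottom element) is Replicated.
    return ONE_D if a == ONE_D and b == ONE_D else REP

def infer(arrays, constraints, replicated=()):
    # Start optimistic: everything at the top of the lattice.
    dist = {a: ONE_D for a in arrays}
    for a in replicated:          # seeds, e.g. outputs of reductions
        dist[a] = REP
    changed = True
    while changed:                # fixed-point iteration
        changed = False
        for group in constraints:
            m = dist[group[0]]
            for a in group[1:]:
                m = meet(m, dist[a])
            for a in group:
                if dist[a] != m:  # monotone: only ever moves down
                    dist[a] = m
                    changed = True
    return dist
```

For instance, with arrays X, Y, Z, W, constraints `[("X", "Y"), ("Z", "W")]`, and W seeded as replicated, X and Y stay distributed while Z is forced to replicated — the monotonicity of `meet` is what guarantees termination.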
And if there's some unknown API call, all the arguments become replicated, because we don't know — you have to be conservative in a compiler. So that's the theoretical setup, but practically, here's one example. We have a data frame, df, with three columns, on the left, read from a Parquet file, for example. So the rows can be distributed across processors — 1D block distribution, the technical term. But if you do df.mean(), which gives the mean of every column, the output in pandas is a Series object with three values. A Series object is like an array, so it could be distributed, but semantically it's very clear that it's like a reduction: it's a small, scalar-like thing and has to be replicated across processors. This is an example of something the algorithm infers automatically. Those solutions that say "change your pandas import to something else and that's it" — in a lot of complicated applications, the programmer doesn't know to manually make things replicated across processors, and we have seen them get wrong results because of these issues. But Bodo can take care of this. And in terms of machine learning algorithms: if you have the sample matrix and your labels, obviously these arrays, the way they interact, can stay distributed — in this case we have three processors and we divide the data across them. But the weights of the machine learning algorithm, let's say logistic regression, have to be replicated, because they are on the reduction side. And it's very interesting that you write NumPy code for these algorithms and all these semantics fit — the compiler finds these distributions automatically. With the human eye it's not as easy to figure out, but the semantics of the APIs are very clear. So that's about data distribution, but a lot of the compute is in terms of data-parallel operators, not join and those sorts of things.
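The logistic-regression case can be sketched in plain NumPy (a hypothetical example, not from the talk's slides). The row-wise arrays `X` and `y` are what the analysis would keep 1D-block distributed; the small weight vector `w` would be marked replicated, and the `X.T @ ...` product is the reduction that would turn into an allreduce across processors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((300, 4))    # sample matrix: 1D-block distributed by rows
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(float)  # labels: like X
w = np.zeros(4)             # weights: small, replicated on every rank

for _ in range(50):         # gradient-descent steps
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # local, element-wise on each chunk
    grad = X.T @ (p - y) / len(y)        # reduction -> MPI allreduce
    w -= 1.0 * grad
```

The semantics fall out of the shapes: anything sized like the data stays distributed, anything sized like the model is replicated.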
For those kinds of things — let's say this element-wise array operation here, A times B plus C — we transform them into parallel for loops, these parfors, which we can parallelize, and we fuse the loops together for cache efficiency. In regular Python or other scripting languages, the backend of these operations is usually C, so each one is fast. What makes it slow is that you do A times B, store to some temporary array, and then go load that array again for the next operation, because these are library calls — so cache efficiency is very low. With parfor fusion we gain a lot of cache efficiency; in a lot of cases, 10x speedup. That's the sequential optimization side, but these parfors are very good for distributed computing as well. I have it on this slide — I'm not going to go through all the details of how we distribute compute in parfors, but basically the idea is that we look at what array accesses the parfor has and how each array is accessed. The index of the access in the first dimension is what matters — we do distribution on the first dimension — and we meet all the arrays in this parfor together; they have to have the same distribution. And we know we generated these parfors, so there is no cross-iteration dependency — we are not analyzing Fortran code — which makes it much simpler to do. And if there are cases we don't understand, we can be conservative, assign replicated, and throw a warning to the user, which happens very rarely. So in terms of results: here is one of the real use cases, a large retailer we worked with. The original setting was one of the well-known cloud services running Spark on Azure, but it was taking more than an hour for a single data pipeline to run on a large cluster — in this case 16 nodes and 432 cores.
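The fusion idea can be shown with a conceptual illustration (made-up arrays; Bodo's fused loop is compiled native code, whereas the Python loop here is only to show the single-pass access pattern):

```python
import numpy as np

n = 1_000
A = np.full(n, 1.0); B = np.full(n, 2.0); C = np.full(n, 3.0)

# Library evaluation: A * B materializes a temporary array that is
# written out and re-read for the "+ C" step -- poor cache behavior.
tmp = A * B
out_unfused = tmp + C

# What parfor fusion generates conceptually: one loop, one pass over
# the data, no temporary array between the two operations.
out_fused = np.empty(n)
for i in range(n):
    out_fused[i] = A[i] * B[i] + C[i]
```

Both versions compute the same result; the fused form touches each element once, which is where the cache-efficiency speedup comes from.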
So they couldn't run it in real time if someone needed some analytics; they had to pre-compute. And even then, these clusters are not cheap, costing them a lot of money per month. With Bodo, the same application was 12 times faster — just a few minutes to run — so they could run it in real time and save over 90% of their cloud bill, which is very significant in their case. So that's a real workload where this approach actually works, but there are also a lot of benchmarks you can analyze to see how the Bodo HPC approach compares to things like Spark. This example is from our biggest customer, a large tech company, Fortune 10, and their IT is very much into comparing benchmarks and things like that. So they created a cluster on AWS — 125 nodes, 4,500 cores. AWS would never give me that much quota, but they give that company, of course, a big company, anything they want. They created this large cluster and took the TPCx-BB (BigBench) benchmark. BigBench is the TPC benchmark that is a little more complicated and apparently more useful for machine learning. It's still a little too simple for the machine learning applications we have seen, but that's the best we've got. And they compared their optimized, tuned Spark setup with Bodo without any tuning: over 20x speedup at that scale, which is very significant both in terms of performance and cost. So why are Bodo and the HPC approach so much faster than Spark? I touched on this in terms of the distributed systems approach versus the parallel computing approach. In the distributed systems approach, the Python code is written in terms of these high-level APIs, and they run on this driver library. The program is on a single driver process, and the driver is essentially interpreting the code; when it hits a data-parallel API, it creates tasks that go across the cluster and come back.
We call this "waves of tiny tasks," and these tasks have a lot of overhead, because the program is not really parallel. Whereas with the Bodo approach, the program is compiled to a native binary that runs on bare metal, and there is no concept of a driver. All the processes own their own chunk of the data — single program, multiple data — and they run efficiently, doing collective communication when necessary, until they finish the computation. There are no task overheads and no driver bottlenecks, and we think this is the right way to do parallel computing. It's made possible by Bodo because if you don't have a compiler to parallelize end to end, you would have to go through the library approach. Beyond performance, these MapReduce libraries have a lot of limitations in the kinds of things they can do, because the tasks are asynchronous, idempotent, and so on — they can't communicate directly. So you can't do things like moving averages, for example: a very common operation that requires a near-neighbor exchange, but you can't do that in a MapReduce library like Spark. Or you can't do cumulative sum, which is the prefix-scan operation — difficult to parallelize, has a lot of communication, doesn't fit MapReduce. In MPI, a scan operation is already available, and the classic algorithm is that you create a tree of partial sums of the values. Even though cumulative sum looks sequential, you can do it in parallel in a good way: create a tree of partial sums, go up the tree, then come down the tree distributing the partial sums on the left, and you have the prefix sums of the values in the output. Something like Spark can't do this — it's a complicated communication pattern that doesn't match MapReduce very well. — So what are the limitations of Bodo? You mentioned this seems very difficult to do. — Python in general creates a lot of challenges, because any kind of compiler needs data types.
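The up-the-tree, down-the-tree algorithm described here can be sketched sequentially (this is the classic up-sweep/down-sweep exclusive prefix sum; the real MPI scan runs the same tree across processes, and I assume a power-of-two length for simplicity):

```python
def exclusive_scan(vals):
    """Exclusive prefix sum via the tree of partial sums:
    up-sweep builds the tree in place, down-sweep distributes
    the partial sums back down."""
    a = list(vals)
    n = len(a)              # assumed to be a power of two here
    d = 1
    while d < n:            # up-sweep: combine pairs at each level
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2
    a[n - 1] = 0            # clear the root before coming down
    d = n // 2
    while d >= 1:           # down-sweep: push partial sums leftward
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a
```

Each level of the tree is fully parallel, so the whole scan takes O(log n) parallel steps — which is exactly the structured communication pattern that a MapReduce-style task model cannot express.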
Without data types you can't do anything in a compiler, and in Python you can do things that eliminate any possibility of type inference. I have a couple of examples. In the example on the left, variable A could be a scalar or an array, and the user could do a type check and do some work based on the dynamic type of the value — that's possible in Python. You can't assign a single type to A, and that's a crazy thing for a compiler. You can even change the function that you're calling in Python — a function call is just a so-called callable object, and in the middle example you can change it dynamically. And something closer to the data processing side: you can even change the schema of the data frame. Your table schema can change through control flow, which creates problems because the schema is part of the data type. So all of these problems exist. However, in practice these are very rare in the analytics applications we have seen, and if there is a corner case where this happens, the compiler just needs to throw the right error, and the programmer can do those things in standard Python and then pass the result into the JIT context, or refactor to avoid the issue. We think this is doable for almost all programmers, and our job is to throw the right error to guide them in the right direction. Another practical aspect of implementing a system like this is compilation time. You're creating all this new compiler technology — and we write it in Python — and any time you create new compiler technology, initially compilation time is not great. The same was true, I'm told, for C++ templates and a lot of other things. So a lot of our effort goes into bringing down compilation time and getting rid of compiler inefficiencies. So with that, I just want to show you a little bit of how Bodo works.
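A minimal example in the spirit of that slide (my own, not the slide's exact code) shows why this defeats static type inference — the type of `a` depends on a runtime value:

```python
import numpy as np

def unstable(flag):
    # `a` is a float on one path and an ndarray on the other, so no
    # single static type can be assigned to it.
    if flag:
        a = 1.0
    else:
        a = np.zeros(3)
    # Dispatch on the *dynamic* type: fine for the interpreter,
    # a dead end for a type-inference compiler.
    if isinstance(a, np.ndarray):
        return a.sum()
    return a
```

The interpreter happily runs both paths; a typed compiler would have to reject the function (with a clear error, ideally) or force a refactor so each variable has one type.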
We don't have too much time, but here's a notebook I'm running on my own laptop, and my quote-unquote cluster is my own cores; I'm attaching to four cores on my laptop. I generated a small data set, just one column of datetime values in a Parquet file. The regular Python code loads the Parquet file and uses DataFrame.apply to do some custom transformation, like UDFs in SQL, a typical ETL operation, and creates two new columns. I'm not going to run this; it takes five minutes, and by the way, this dataframe has 10 million elements. With Bodo on a single core, these sorts of things are compiled to binary and are orders of magnitude faster than standard Python: here it took about one second, down from 300 seconds. So you gain the sequential optimization benefits, especially for these user-defined functions. Then, adding this %px magic, Jupyter syntax, to attach to my little cluster, it runs in 0.25 seconds because it's running on four cores. In the same way you can run on 4,000 cores; we ran the same setup I showed you at a big company, and you can watch some of our demos online. All right, so we are building a platform based on this idea: the open-source APIs of Python are supported, but the platform takes care of connectors, automatic parallelization, and optimization. You can run it on-prem or in the cloud, we have a SaaS service, and it can work with any kind of storage; it's storage agnostic. Feel free to look at our website for more details. So now, SQL. Why SQL? Initially, when we built this first prototype, we went to a big tech company and said, hey, we can solve your machine learning problems. They said, great, take a look at this code. It was data transformation in SQL and Python, running on high-performance GPU SQL engines, and they were not happy with the complexity of the code or the performance.
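The demo's workload can be sketched in plain pandas. The column and file details here are illustrative, not from the talk; under Bodo, the talk's point is that the same function would simply get the JIT decorator (`@bodo.jit` in Bodo's public API) and be compiled and parallelized without code changes.

```python
# A small stand-in for the demo: a DataFrame of datetime values,
# transformed with DataFrame.apply (a per-row UDF, like a SQL UDF)
# into two new columns. Column names ("ts", "month", "weekday") are
# assumptions for illustration.

import pandas as pd

def etl(df):
    def transform(row):
        ts = row["ts"]
        return pd.Series({"month": ts.month, "weekday": ts.dayofweek})
    out = df.apply(transform, axis=1)   # the slow, per-row UDF part
    return pd.concat([df, out], axis=1)

df = pd.DataFrame({"ts": pd.date_range("2021-01-01", periods=5, freq="D")})
result = etl(df)
```

Per-row `apply` UDFs like this are exactly where interpreted Python is slowest, which is why compiling them to a binary gives the orders-of-magnitude sequential speedup described above.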
So we showed them that with regular pandas code, we could run that code over 10 times faster; with regular CPUs it was much easier to manage the infrastructure and much more scalable, and they could beat the complicated GPU setup. That's how we got started, and we learned that data processing is the problem, not the fancy machine learning algorithms that get all the hype, even at this kind of big tech company. So we did that, but a lot of groups told us: we have all this SQL code, and what can we do about it? We use Spark SQL or some other SQL engine, and it's slow and not easy to manage. You guys seem to be doing very well in Python; can you do SQL, so we don't have to rewrite our SQL code in Python? We thought SQL was a solved problem; there are so many SQL engines out there, it should be fast and easy. But that was not true. So we built SQL support on top of Bodo, which I will explain in a bit, and we are getting very encouraging results. Right now it's in beta, but it's already used in a few deployments, which is interesting. In general, the problem today is that data applications are split: Python is the dominant language for data science, machine learning, and AI, but data processing is mostly done in SQL, so the application becomes a mix of the two, and the developers are a mix as well. Some prefer SQL, some prefer Python. This makes it very difficult to develop and deploy these applications, to do error checking across the two languages, and to scale the two together, and there are skill mismatches. You have to use SQL for some things and Python for others; it's not an easy choice. So we came up with Bodo SQL to solve this two-language problem. Here's example code to make it more clear: this function is a JIT function that uses Bodo SQL. The data is read in Python into dataframes, and it could be terabytes of data.
The dataframes are passed to a SQL context, and this bc.sql call just runs a SQL query, and the output is dataframes again. So it's frictionless: dataframe in, dataframe out, and you can do SQL and Python together and scale them together. All of the two-language problems go away, and you also get error checking end to end. So if I take this SS_CUSTOMER_SK column on this slide and change it to SI, just the name, in the last GROUP BY, then in regular Python-plus-SQL setups you would have to run the SQL code, which could be something very long, before you realize you have an error in your program, and it could happen in production. But here Bodo catches the error at compile time and throws the right error for you. Before, you could just add the decorator and you didn't have to change any of your Python code, modulo some corner cases. But now this looks like you're bringing some of the Bodo-specific pieces into Python for the SQL context. Yes. Right, so what was the code like before? Was it something like psycopg with Postgres, or Snowflake? Is what you've shown here exactly how it was written before, or is this a new thing that you're providing? This is close to how these things are written. Typically, there is some Spark SQL context, or some other SQL context that's been created: Snowflake, or, I don't know, Teradata or whatnot. We try to create APIs that look very similar to those, but still fit the Bodo compilation model. For example, we very much want the context object to be immutable, so we can do our optimizations and the data type is clear. Some of those things we change, but we want to make sure it's familiar while fitting our compiler model. What I'm saying is, sorry, that you can't automatically convert the Snowflake Python code to your code. What is Snowflake Python code? Yes, we can.
It's just, you know, the connection object; it's basically the same thing you have. Instead of saying Bodo, I would say Snowflake or whatever. Yes, we could do that, but you can think of Bodo SQL as closer to Spark SQL: we want to make sure you can load Parquet files from S3, and we are storage agnostic. It's a single layer that can do other things. The Snowflake-style objects are attached to particular storage, so they are not as storage agnostic and portable. But we could do that too. All right, so here's how it works today with this function. The SQL portion goes through Calcite, and we get an optimized logical plan. We translate the logical plan to Python code that Bodo understands. We started by generating regular Python code that we just compiled recursively, but we are moving towards a more internal representation of the compiler to make it faster and more flexible. The rest of the Python code also comes into the Bodo compiler, so these two are compiled together and a binary comes out, and that's why you get optimization across them. In this case, the Parquet files may have, typically in practice, 30 or 40 columns, but this program uses four or five of them. So it's important to optimize out the rest and not load all the data. That's the kind of optimization you can do in this setup that wasn't possible before. In terms of performance, Bodo SQL is in beta, early stages, but it's quite promising. This is one of the TPC-H queries, and we compared with Spark SQL and regular Bodo. It's not as fast as Bodo yet; there's a little inefficiency in the code we generate, but we hope to close the gap soon and get the same kind of performance that Bodo provides. Still, we are much faster than something like Spark, even for this simple query.
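The dataframe-in, dataframe-out interface described above can be illustrated with stdlib sqlite3 plus pandas as a stand-in for the SQL-context API. Unlike Bodo SQL, nothing here is compiled or parallel, and the class name and its shape are assumptions for illustration only; it only shows the frictionless interface pattern: register DataFrames under table names, run SQL, get a DataFrame back.

```python
# Toy dataframe-in / dataframe-out SQL context (illustrative only; this
# is sqlite3 under the hood, not Bodo SQL's Calcite-plus-compiler path).

import sqlite3
import pandas as pd

class ToySQLContext:
    def __init__(self, tables):
        # tables: {name: DataFrame}; kept immutable by convention,
        # mirroring the talk's point that an immutable context helps
        # optimization and typing.
        self._conn = sqlite3.connect(":memory:")
        for name, df in tables.items():
            df.to_sql(name, self._conn, index=False)

    def sql(self, query):
        return pd.read_sql_query(query, self._conn)  # DataFrame out

customers = pd.DataFrame({"c_custkey": [1, 2, 3],
                          "c_nation": ["US", "US", "DE"]})
bc = ToySQLContext({"customer": customers})
out = bc.sql("SELECT c_nation, COUNT(*) AS n FROM customer "
             "GROUP BY c_nation ORDER BY c_nation")
```

The point of the real system is that the query above would be planned by Calcite, lowered to Python, and compiled together with the surrounding Python code, so column-name errors surface at compile time.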
All right, a little bit about some of the optimizations we do, some interesting things applicable to both Bodo and Bodo SQL. One is, obviously, that we use a lot of compiler optimizations and compiler techniques. The program on the left loads a dataframe from a Parquet data set, changes one of the column names, and returns two of the columns. The problem here is that the column name changes in place, so the data type changes: your table schema changes. We have an iterative typing algorithm that is able to fix these issues in common cases and continue typing. That's a key piece of the system which really needs its own talk; there are interesting transformations to make typing possible in these cases. Also critical is getting rid of the dataframe wrappers and objects, to be able to do more optimizations. When we read this dataframe object, if the columns are used in some operator, we break it up and use just the columns, the arrays, that are necessary, so we can get rid of the dataframe object and the other columns in the compiler pipeline. In the example on the right, the IR that's generated reads Parquet, but only columns A and B, which are actually used later, and at the output it creates the dataframe with those arrays. So we only read two arrays, get rid of all the other junk in the IR, and optimize out the unnecessary columns. Some transformations that are easy in SQL, or in SQL-like systems such as Spark, because everything is just an expression tree, are harder in Bodo because it's a Python program, but still doable. For example, for filter pushdown we do pattern matching: if there's a Parquet read with some filtering right after, and the dataframe object read from Parquet is otherwise dead (we have to do that check), then we can do filter pushdown as a transformation. It has a bunch of elements to it.
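To make the column-pruning transformation above concrete, here is a toy before/after, with plain Python dicts standing in for dataframes and a fake reader standing in for the Parquet read. This is not Bodo's IR, just an illustration of what the compiler effectively rewrites the read into.

```python
# Toy column pruning: the user program appears to load the whole table,
# but only columns A and B are live, so the optimized form pushes the
# column list into the read itself. All names here are illustrative.

STORAGE = {"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]}
read_log = []  # records which columns each read actually touched

def read_table(columns):
    cols = list(STORAGE) if columns is None else columns
    read_log.append(cols)
    return {c: STORAGE[c] for c in cols}

# What the user writes (conceptually): load everything, rename a column
# in place, return two columns.
def user_program(read):
    df = read(None)              # None = read all columns
    df["B2"] = df.pop("B")       # in-place rename: schema (type!) changes
    return {"A": df["A"], "B2": df["B2"]}

# What the optimized IR does: read only the live columns A and B and
# build the output directly from those arrays.
def optimized_program(read):
    df = read(["A", "B"])
    return {"A": df["A"], "B2": df["B"]}
```

Same result, but the optimized form never materializes columns C and D, which is the whole point when real tables have 30 or 40 columns and the program touches four or five.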
For example, if some computation is used to get the filter value (in this case pd.to_datetime of S is used as part of the filter), we have to move it above the Parquet read node. Some of those issues of Python have to be solved by the compiler; it's more complicated than the SQL setting. There are other parallel computing optimizations that can be done. One example is doing an efficient parallel join, a very classic problem. The main bottleneck is shuffling data; even with MPI and fast RDMA networking, shuffle is still the bottleneck for joining tables. We use Bloom filters to reduce the shuffled data before shuffling a table: from one side of the join, the smaller table, you create a global Bloom filter, and you filter the other table before shuffling the data, which saves a lot of communication. We found that in implementing this in practice there's obviously a trade-off between the communication cost you save and the cost of creating this global Bloom filter, so implementation efficiency is very critical; otherwise it just doesn't make sense. First of all, your Bloom filter implementation should be efficient and cache friendly: when you look up a key, the bits should be in the same cache line to avoid cache misses, and it should be a SIMD implementation and so on to be fast. Also, creating the global Bloom filter is a reduction, an all-reduce operation, and you need a very good topology-aware reduction algorithm for it to make sense; otherwise the cost is just too high and the communication savings are not worth it. We have a lot of heuristics to decide when to use Bloom filters and how to set the parameters, what the size should be, things like that. A key data point is table cardinality, how many unique keys you have, and we use HyperLogLog to estimate that globally. Again, that also needs very efficient parallel communication.
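As an aside, the Bloom-filter join optimization can be sketched in a few lines. This is a deliberately naive single-process toy: a real implementation would build the filter collectively (all-reduce of the bit arrays across ranks) and use a cache-friendly, SIMD-friendly blocked layout, as discussed above.

```python
# Toy Bloom filter used to drop rows of the big table that cannot
# possibly join, before paying to shuffle them. Sizes and hash choice
# are illustrative, not tuned.

import hashlib

class BloomFilter:
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for seed in range(self.nhashes):
            h = hashlib.blake2b(repr((seed, key)).encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # No false negatives, some false positives: safe for prefiltering.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# Build the filter from the smaller side of the join...
small_keys = {10, 20, 30}
bf = BloomFilter()
for k in small_keys:
    bf.add(k)

# ...then filter the bigger table before the shuffle.
big_table = [(k, f"row{k}") for k in range(100)]
to_shuffle = [row for row in big_table if bf.might_contain(row[0])]
```

Here almost all of the 100 rows are dropped before the (simulated) shuffle; the trade-off the talk describes is whether that saved communication outweighs the all-reduce cost of building the global filter.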
Otherwise, it doesn't make sense. We gain a lot from efficient MPI communication; a lot of the assumptions behind our optimizations are based on it. Once you have it, some other techniques don't really make sense anymore. A key technique for avoiding shuffle is broadcasting the smaller table so that you don't need to shuffle. But we found that with good Bloom filter communication, broadcast join is rarely faster. Broadcast join in general is a non-scalable algorithm: if you look at the parallel work and memory, they increase with the number of processors you have, which is not normal for parallel algorithms. W_P, the parallel work, increases with the number of processors and the table size, which is not good; it should be constant. The memory M_P increases with the number of processors as well. Today a server has around 50 cores, so you're duplicating your table 50 times on each server, and imagine 1,000 or 10,000 cores; it's crazy. It's so unusual that in the parallel computing literature the subscript P is usually dropped, because no parallel algorithm behaves that way. So we don't think this is good; Bloom filters are scalable and nice. The other algorithm that is very dependent on efficient communication is parallel sort. For parallel sorting, you basically take some samples and create global partitions: processor one takes keys from 10 to 20, for example. You redistribute the data and then sort locally. The trade-off is that the more samples you have, the more balanced your partitions are, but the sampling takes more time. So it's sample work versus sort work at the end, and you want to strike the right balance. Theoretically, the maximum load is (1 + epsilon) times the average load; say 10% more than average, so some processor may have 10% more data.
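The sample-based parallel sort just described can be sketched in pure Python, simulating P processors in one process (illustrative only, not Bodo's code): sample the data, pick P-1 splitters to form global key ranges, route each element to its partition (the "shuffle"), then sort locally. The sample count is the tuning knob that trades sampling work against load balance.

```python
# Toy sample sort: splitters from a random sample define the global
# partitions; concatenating the locally sorted partitions in order
# yields the fully sorted data.

import bisect
import random

def sample_sort(data, nprocs, nsamples):
    rng = random.Random(0)
    samples = sorted(rng.sample(data, nsamples))
    # P-1 splitters at evenly spaced positions in the sorted sample.
    splitters = [samples[(i * nsamples) // nprocs]
                 for i in range(1, nprocs)]
    parts = [[] for _ in range(nprocs)]
    for x in data:  # the "shuffle": route each key to its range owner
        parts[bisect.bisect_right(splitters, x)].append(x)
    return [sorted(p) for p in parts]  # local sorts

rng = random.Random(42)
data = [rng.randrange(10**6) for _ in range(20000)]
parts = sample_sort(data, nprocs=8, nsamples=512)
flat = [x for p in parts for x in p]        # globally sorted
max_load = max(len(p) for p in parts)       # vs. average of 2500
```

With more samples, `max_load` creeps closer to the average load of n/P; the formula discussed next tells you how many samples are needed for a given imbalance bound, and the real cost lies in the gather/broadcast of samples and the all-to-all redistribution.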
So if you want to achieve a 10% maximum load difference, theoretically there's a formula: the total number of samples should be P times the log of n, the total number of elements, divided by epsilon squared. That's the theory; in practice, we found that if we optimize the whole process, the sampling and the all-to-all shuffle of the data, this formula actually works, which was very encouraging. But it's very critical to have an efficient gather of the samples, broadcast of the sorted samples, and all-to-all for shuffling the data; otherwise the formula doesn't hold up. And on the same tech-company setup I mentioned, their team really likes TeraSort, a benchmark for big data systems, and on 4,500 cores they took a four-terabyte data set and sorted it. Bodo was eight times faster than their optimized Spark, even though there's no optimization for Bodo to exploit here, except that parallel sort and things like that should be implemented efficiently. So I wanted to bring out the fact that a lot of these operations depend on efficient shuffle, and we use MPI for this efficient shuffle. Ishaan, we're short on time, and I think somebody in the audience has a question. And can I ask a question, then? Yeah, go for it. So, Ishaan, this is a good insight. I'm a pure Spark guy; I've been playing with huge data, and this is a big eye opener, but I have a few questions. How do you manage resource management? You showed the TPC benchmark; the thing is that you still need that many nodes to do the operation, you just do it in less time. Do we still need that many nodes, or can we have fewer nodes and optimize it? Well, Bodo is scalable: it will work efficiently from a single core all the way to 10,000. It doesn't need a large number of nodes.
I'm just trying to show that it's scalable on a large number of nodes, but we have a community edition on four cores, and you can take advantage of the benefits even on four cores. So you don't have to use a lot of nodes. In terms of resource management, that's orthogonal to the parallel computing problem; it's a scheduling problem. There are schedulers, from Slurm to YARN to others, that you can use to manage your resources as jobs come into the system, and we think we shouldn't bundle those things with Bodo. Spark bundles a bunch of things that, we believe, actually hurt performance. You can use Kubernetes and things like that. Okay. And one last thing: in Spark, we see GC pauses and so on. Here in Bodo, is there nothing like that? In Spark, what do you have? Sorry? The garbage collection stuff, you know, GC. There is no JVM issue here; we compile to native code through LLVM, so there is nothing like that to be avoided. Okay. Yes. Okay, thank you. All right, so I'll finish quickly. I think we're already over time, so I'll skip the resiliency section; we have a blog post coming online soon, and we'll discuss it there. Basically, the point of this slide is that shuffle doesn't fit the MapReduce model. The Spark way of creating files and spilling and things like that is just a hack, while MPI does it in memory with direct messages, so it's much faster, and all data processing depends on this. With that, I would like to conclude: we are very excited about what we have achieved here. If students here are looking for a deep-tech kind of startup opportunity, we have offices in Pittsburgh too, so please take a look. And in general, the compiler approach is very promising if you are looking for research topics for your PhD, and the parallel computing approach is much better than the existing systems for a lot of these problems.
And efficient communication with things like MPI is so critical; a lot of your optimization assumptions may not be valid if you are not using efficient communication. So with that, thank you, guys. Sorry for running over. That's okay. Let's all applaud. So we have time for one or two questions; I know there's a bunch of people with questions in the chat. Doug, do you want to go first? Hi, I'm Doug Baylock, based in Pittsburgh; I work for Valor Equity Partners. I was just wondering how Bodo handles losing a node, say a spot instance, in the middle of a long computation. Spark has RDDs, which take care of that; just wondering how Bodo handles it. That was actually my resiliency portion. Basically, in Spark, yes, there are RDDs, but for any real application there is communication across processors, and Spark goes back to some checkpoint. So the Spark model is that you can't change the parallel algorithm, and you have to use checkpoint-restart in Spark anyway. In Bodo, you can use checkpoint-restart as well, but the restart middleware is something else, like Kubernetes doing the restart and things like that. So that's one aspect. The other aspect is that in practice, because Bodo is not on the JVM and doesn't have the shuffle files and all these issues of Spark, there are essentially no software failures. You only need to worry about hardware failures, which are much more rare in practice; a node fails every 10 years or something. If you have spot instances and your workload takes that long, then you have to checkpoint. That's correct; I was just thinking of the case where you're using spot instances and you can lose one at any time. But typically a node is supposed to be up, sometimes for weeks, while spot instances can be taken away at any time. I know, but practically, if you want to not lose computation, you have to checkpoint; otherwise, you have to have just a restart setup.
But something to keep in mind is that if you are 20 times faster than Spark, then even if, magically, Spark were able to restart without losing any computation (which it cannot), you could run the whole job 20 times over and still only match Spark.