Hi everybody. Thank you so much for joining us for today's Protocol Labs Research Seminar. Today we are joined by Aleksandar Makelov, who is a final-year PhD student at MIT EECS. He's worked on research in theoretical computer science and machine learning, and he is broadly interested in the design and implementation of systems that simplify how we compute and organize data. Today he will be presenting Mandala, a high-level data management language implemented in Python. So Alex, I'm going to let you take it from here, and thank you so much for joining us today.

All right, thank you for introducing me. Today I'm going to tell you about this sort of programming language, which is, broadly speaking, for managing data produced by computations. And while it was originally inspired by things like machine learning and data science applications, I believe the same abstractions can be very useful in, and a natural fit for, things like content-addressable computation, which is probably what you're more interested in here. To dive right into it, I'm first going to give you a short overview of the problem of data management, and then sketch where we're going with this talk. To motivate things a little bit, here's a file name; you may have seen something like this if you've done some machine learning or data science experiments. This file name tells a story, in a way: presumably the story of how the file came to be. And this story doesn't look like a very happy one for whoever has to deal with this file. To unpack a little bit the different participants in this story, if you will: first of all, we have the kind of quantity being represented by this file, which actually makes a lot of sense; it's a pretty reasonable thing to include in the file name. Then we have some parameters which presumably participate in the computation of the contents of this file in some way. However, it's not completely clear how they do so.
Then we have some versioning information; again, it's unclear what this versioning refers to at all. And finally, worst of all, we have these subjective annotations that people just like sprinkling around when they're in a hurry. And as you can see, as usual, "final" is not the last item on the list. And good luck figuring out what any of these mean a week down the line, or a month, or whatever amount of time. So this is, in some sense, maybe the worst of data management. In general, the problem of scientific data management is to deal with things like this, meaning save, load, query, delete, and otherwise organize quantities of this sort. This file name, as we said, is a very naive and probably one of the worst approaches to data management. And the starting observation for this whole project is that there is actually something else that tells the story of how this file came to be, namely the code that you used to generate its contents. I'm going to argue that this code is in many ways a better telling of the story of the contents of this file. And where we're going with this is that we're basically going to replace this file name, as a way to refer to this file's contents, with the code itself. This probably sounds very weird at this point, so I have some explaining to do. First of all, let's briefly see how code tells the story in a better way; there's actually a bunch of things to note here. First of all, the code producing this final quantity is unambiguous in a very strong sense: relative to all the functions that you've defined here, the way these functions are composed to produce this quantity is perfectly clear from the code. Another nice thing about thinking of code as the name for whatever data you're creating is that code is very expressive. Especially in imperative programming languages, you can tell all sorts of very different stories for how your data comes to exist.
And code allows you to do so very nicely. Another, maybe more subtle, thing is that code also implicitly encodes computational relations between the quantities being computed. And in things like data science and machine learning, and surely not only in these domains, what we typically care about when we have a bunch of data like this is some sort of query that asks: how do my final metrics, at the bottom of this piece of code, depend on some initial parameters? And since code implicitly contains precisely the relationship from these initial parameters to these final metrics, it is also in a good position to somehow automatically keep track of this and allow you to query things. Another nice thing about code is that it has a very naturally editable structure. Here we have the story of a certain piece of data, and if we want to pass to another sort of story, for example if we want to remove the pre-processing, we can just delete these two lines of code in the middle here and very easily switch to this different story for a different piece of data. Or if we want to change, for example, what alpha is, we can just go to the part of the code where alpha is defined and change that. And the final thing: code is also refactorable, which is especially easy to do in modern IDEs. This refactorability, if you adopt code as a name for your data, allows you to evolve the story of your data as your code evolves; you evolve these two things jointly, in some sense. All right, so this has been a lot of advantages of code. However, the real question is: if code is so great, why not just use code itself instead of file names? And more generally speaking, why not use code itself as the principal interface to the storage of its results? This probably sounds a little bit radical at this point, maybe even a little bit hopeless.
But what I'm going to argue in the rest of this talk is that in fact it's not only doable, it is actually very fruitful to do so, and it can dramatically simplify both the code that you need to write for data management and also the concepts that you need to work with in your mind. Just to give you a quick sketch of how we're going to actually implement this: there are going to be two main components to the system that have to work together, in some sense. The first component is a very radical, or I guess aggressive, form of memoization. The idea here is that you write some experimental primitives as Python functions, in this case, and then you go and combine these primitives into experiments by using whatever control flow and data structures you want, and each of the calls to these functions is going to be memoized. This is what you can refer to as a composable sort of memoization. And as we're going to see, this memoization is very tightly integrated with core features of Python, like, as I mentioned, data structures, control flow, and subroutines, in a way that prevents or avoids data duplication, and also keeps track of all the relationships between the things that you're computing behind the scenes. Because if all the function calls that you're doing are memoized, this means that you can actually recover, in storage, the chain of events that led up to a certain thing existing. Another powerful thing about this sort of end-to-end memoization approach is that once you have a piece of code like this that you've already executed, so everything's been memoized, everything's been computed, then if you want to interact with the results of this code, you can simply re-execute, or retrace, this code. You can step through this code again, except now that everything's been computed, you're not computing anything new, so you're not doing any heavy work.
Instead, you're just traversing the storage, in some sense. And by combining this with some imperative control flow, you can do very expressive imperative queries against your storage. You can take a piece of code like this, rearrange some parts of it, maybe add some more logic, and this gives you a very powerful way to interact with storage directly, using the tool you're most familiar with, namely the programming language you're working in. So this is the first component. The second component is a declarative query interface, which is in some sense complementary to the imperative query interface I've been talking about so far. Here the idea is that if you look at this code, it's actually very similar to the code we had before for computing things or traversing storage, except that because of this query context manager here, the code is interpreted in a very different way. What it actually does behind the scenes is build a graph of computational relations instead of computing anything. So this piece of code defines some sort of combinatorial representation of your workflow, and it's then able to compile this to SQL. You don't really need to know anything about SQL to be using this: you're using things that are completely familiar to you, like these functions that you're working with, but behind the scenes this is compiled to SQL, so you're using a very powerful query language. And you use this sort of structure to ask questions of the sort: over all the experiments that I've recorded in this storage, give me a table of all the things that satisfy the computational relations specified by this piece of code. Just to quickly mention, this idea also appears in a Julia project, which has a bit of a different focus; the underlying database concept is called a conjunctive query, so you can look this up if you're interested. All right.
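To make the compile-to-SQL idea concrete, here is a toy sketch, not the real Mandala implementation: the table layout and names (`vals`, `inc_calls`, `add_calls`) are invented, and an in-memory SQLite database stands in for the storage. The computational relations j = inc(i) and final = add(i, j) become self-joins of a value table, one copy per placeholder:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vals (uid INTEGER PRIMARY KEY, val INTEGER);
CREATE TABLE inc_calls (x INTEGER, out INTEGER);
CREATE TABLE add_calls (x INTEGER, y INTEGER, out INTEGER);
""")

uids = {}
def store(v):
    # Store each distinct value once, keyed by a UID.
    if v not in uids:
        uids[v] = len(uids)
        con.execute("INSERT INTO vals VALUES (?, ?)", (uids[v], v))
    return uids[v]

# Record the calls made by the workflow: j = inc(i); final = add(i, j).
for i in range(3):
    j, final = i + 1, i + (i + 1)
    con.execute("INSERT INTO inc_calls VALUES (?, ?)", (store(i), store(j)))
    con.execute("INSERT INTO add_calls VALUES (?, ?, ?)",
                (store(i), store(j), store(final)))

# The query context compiles to a conjunctive query: one copy of the
# value table per placeholder, joined through the recorded calls.
rows = con.execute("""
    SELECT I.val, J.val, F.val
    FROM vals AS I, vals AS J, vals AS F, inc_calls AS inc, add_calls AS a
    WHERE inc.x = I.uid AND inc.out = J.uid
      AND a.x = I.uid AND a.y = J.uid AND a.out = F.uid
""").fetchall()

print(rows)  # each row satisfies j == i + 1 and final == i + j
```

Every row that comes back is a tuple (i, j, final) witnessed by actual recorded calls, which is exactly the "table of all things satisfying the computational relations" described above.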
So these are the main components. Now, just to give you a quick idea of why this would be a useful thing to do at all: the first big reason is just a massive code reduction. If you have some experiment that you're doing and it has a bunch of moving parts, then depending on which parts you've already computed, which parts you want to query, or which parts you want to save, you end up writing all these different pieces of code that are all about the same workflow and the same logic, but each with additional stuff that you have to deal with to accommodate a specific data management use case. With Mandala, you can essentially remove all this complexity and go back to the simplest, in some sense canonical, expression of your workflow, of the logic of what you're doing, and you can repurpose this code for all sorts of different goals that you would otherwise need to write extra code for. So the first benefit is a massive reduction in the amount of code that you have to write and maintain. The other big advantage, maybe a bit more subtle but still quite useful in practice, is that we have all these things like data structures, control flow, subroutines, and refactoring that we've traditionally used for managing the complexity of software, and they've been very successful at that. With Mandala, you can leverage the same concepts to manage, in a very natural way, the complexity of the data associated with the code that you're working on. All right, so that's the overview. Now, a quick plan of where we're going with the rest of this. First, I'm going to show you a demo; this is actually a working system, so I'm going to show you how these things work in a very artificial, minimal context.
Then I'm going to discuss different ways in which these programming patterns can be scaled up to more complex projects that have more moving parts and that keep changing as you add more parts to them. In particular, I'm going to talk about how things like data structures and subroutines seamlessly integrate with the programming patterns that I'm describing. Then I'm also going to talk a little bit about refactoring, which is a necessity if you're doing any sort of computational experiments, because you're always adjusting things and coming up with new ways of doing them. And if you have a bunch of data already sitting in your storage that's connected to the old version of your code, it may be very, very painful to adjust to a new code base; the refactoring patterns that I'm going to show you are aimed exactly at streamlining this process. And finally, I'm going to conclude with a broader vision for where this is hopefully going. Now we're actually on to the demo. I'm going to pull up a Jupyter notebook here, where I'm going to show you the very, very basic form of these programming patterns that I've been talking about. First I'm going to do some imports and set up the storage. As you can see, this says that it's in memory; this is just to make the example simpler. Typically, for a large project, you would put this on disk. So we've created a storage. And now, to demo the memoization, we're going to start in the simplest possible way. I have this increment function here that prints out a message, and I'm decorating it with this op decorator, which stands for operation, and I'm pointing it to the storage. When I define this, what I'm essentially doing is connecting this function to the storage. So now the storage knows that there's this function called increment; it knows about its signature, and other things like that.
And now that I've defined this, the way I write to storage in Mandala is primarily by making function calls. This is the whole memoization business that I've been talking about. The simplest example of this, the hello world, would be to create this run context with the storage and call the function inside. As you would expect, you get this message printed out. So we ran a function, and we put its result in storage. Now, the first important thing to understand is what happens when I do this again; or rather, what's not going to happen is that I don't get this message printed out anymore. This is because the storage already knows that I've done this work, that I've called this function on this input. So it bypasses the function execution and directly loads the result from storage. We can actually look at what this result is; I'm just going to print it. As you would expect, this is 24, so it's 23 plus one. However, there's also this wrapper around it. The reason this 24 is wrapped inside an object is so that the system can keep track of things. In particular, the most important property of this object is the UID, which is some meaningless identifier that serves as a sort of pointer to a storage location for this result. All right, so far so good; this is pretty basic memoization stuff. Where this gets more interesting is when you start creating more and more functions and composing them in more and more expressive ways. As a step towards this, I'm defining here this add function, which again prints out a message, and adds two numbers. And now I can create this mini workflow, or mini experiment if you will, on adding numbers. What I'm doing in this piece of code is ranging over some values of i, and then I'm saying j is going to be the inc of i, and final is going to be the add of i and j.
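As a rough sketch of what the op decorator is doing under the hood, here is a minimal, invented version of the mechanism; the real system wraps results in reference objects carrying UIDs, while this toy just caches raw values in two dictionaries:

```python
import hashlib
import pickle

class Storage:
    """Toy stand-in for Mandala's storage: maps call UIDs to result UIDs."""
    def __init__(self):
        self.calls = {}   # call UID -> result UID
        self.values = {}  # result UID -> stored value

    def op(self, f):
        # Memoizing decorator: a repeated call bypasses execution entirely
        # and loads the saved result from storage instead.
        def wrapped(*args):
            call_uid = hashlib.sha256(
                pickle.dumps((f.__name__, args))).hexdigest()
            if call_uid not in self.calls:
                result = f(*args)
                result_uid = hashlib.sha256(pickle.dumps(result)).hexdigest()
                self.values[result_uid] = result
                self.calls[call_uid] = result_uid
            return self.values[self.calls[call_uid]]
        return wrapped

storage = Storage()

@storage.op
def inc(x):
    print(f"incrementing {x}")  # only printed when the body actually runs
    return x + 1

inc(23)  # executes the body and saves the result
inc(23)  # storage hit: nothing printed, 24 loaded from storage
```

The second call finds the call UID in storage, so the function body, and its print statement, never runs, which is exactly the behavior shown in the notebook.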
So I can run this, and as you can see, a bunch of stuff gets printed out. By the way, one of these increments that we're doing is incrementing 23, which we've already done, so as you can see, that one is not showing up; this is why we have these two consecutive calls to add here. All right, so we've computed all these things, and we've put all these things in storage. And now the same thing applies as before: if I run this again, nothing actually gets computed. This is the retracing part that I was talking about. When you have a piece of code like this and you've already executed it, just running through it again doesn't do any computation; it just retraces its steps. So this is the retracing pattern. And expanding a little bit on this retracing pattern, what it's very good for is not simply revisiting code you've already executed; what it's really good for is making your code very open to extension. And this openness can hold in many different ways. For example, I can take this code and extend this range of parameters, and I can also add some logic here. When I run this, I am basically adding more computation on top of what I've already done before. So this is very useful if you're doing things like exploratory data analysis or machine learning: this retracing pattern makes it very easy to iterate on a piece of code in the simplest way possible, without really having to think about how you're organizing your data or how to avoid computing things you've already computed. So that's a very nice pattern. Another closely related pattern is using this retracing as a sort of query interface. For example, if I'm interested in some values of i here, I can just edit this code to go over those values and collect the results.
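The retracing pattern can be illustrated with the same toy memoization idea (a plain dict standing in for persistent storage; all names here are invented): rerunning the loop costs no computation, and extending the range only computes the genuinely new calls.

```python
import hashlib
import pickle

cache = {}     # call UID -> result (stand-in for persistent storage)
executed = []  # records which calls actually ran their body

def op(f):
    def wrapped(*args):
        key = hashlib.sha256(pickle.dumps((f.__name__, args))).hexdigest()
        if key not in cache:
            executed.append((f.__name__, args))
            cache[key] = f(*args)
        return cache[key]
    return wrapped

@op
def inc(x):
    return x + 1

@op
def add(x, y):
    return x + y

# First run of the mini workflow: 6 calls actually execute.
for i in range(3):
    j = inc(i)
    final = add(i, j)

first = len(executed)

# Retrace with an extended range: i = 0, 1, 2 are pure storage hits;
# only the calls for i = 3, 4 actually compute.
for i in range(5):
    j = inc(i)
    final = add(i, j)

assert len(executed) == first + 4  # 2 new i values x 2 ops each
```

The same mechanism gives resumability for free: if the first loop dies halfway, rerunning it retraces the completed calls and only computes from the failure point onward.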
So you can, even more flexibly, modify a piece of code to use it directly as a way to traverse the storage, as an imperative query interface. And the final thing to mention about this retracing pattern, which is very important if you're doing long-running experiments or computations, is that if you have a workflow like this and it fails at some point, which tends to happen sometimes with long-running computations, restarting the computation is trivial: you don't really need to write any extra code to do this, and you don't need to prepare your code in advance for resumability. You simply run this piece of code again, and it's going to retrace its steps up to the first point where it failed, and then just continue computing from there. All right, so this was the memoization demo. Any questions about the code so far?

I have just one question. If you have two different developers basically interacting with the system, and maybe they don't even know what each other is doing, if they both define the same increment function, would they be benefiting from each other's cache, basically?

Right, yeah, so it depends. Certainly the focus of the implementation until now has been on a single developer, but you could totally imagine this sort of caching taking place. The thing is, they would have to really be sure that they're pointing to the same function behind the scenes, because when you define one of these functions, like this inc, for example, it's going to be assigned some permanent identifier behind the scenes. So if the other person makes sure that they're pointing to the same identifier, then yes, they're going to be sharing their work, if they point to the same storage. Yes.
I imagine it's probably possible to generalize this even if they're not pointing to the same function in the sense that you're describing, because you could presumably derive the identity of the function based on its internal structure, so that if two people implemented the same function, it would have the same ID.

Right, yeah. It's definitely a very interesting question. It also ties into a lot of things, for example versioning: if you had a structured representation that completely specifies the semantics of your function, that would be wonderful. The thing is, this is also a bit rough around the edges to do. For example, here you have this print statement, right? It adds nothing to the semantics, so you'd have to exclude it somehow. And you could imagine all sorts of other things that look semantic but don't actually matter, like some dead code, for example, or something like this. So it's a little bit tricky, even though it would be wonderful; if you could do it, it would unlock a whole lot of other things too. Even for a single developer, it would be very useful to have this sort of introspection. And of course, you could pass to a more restricted DSL, and then you would be able to do something like this. Yeah. Cool.

All right, so this has been the demo of how the memoization works from a user's point of view. Maybe what's more interesting is how it works behind the scenes, so let me unpack a little bit what goes on when I call one of these functions. For example, take a function f with two arguments, x and y. As I already mentioned, it also has some permanent UID throughout the life of this function.
So even if you change things about this function, it's always going to have this UID during the course of its life. Then, if you pass in some arguments, the first step is to assign UIDs to these arguments as well. You do this very naively, by just hashing these inputs by content, which works well enough when your inputs are simple enough things, and you assign UIDs based on this content hash. Once you have this, the next step is to compute a UID for the call itself. The way you do this is you combine all the UIDs that you see here, pass them through a hash function, and arrive at some fixed-length UID that describes the entire call. The final step, now that you have this call UID, is to decide whether you're going to compute or load from storage. Based on the call UID, you do a storage lookup. If you don't find a call, there's no choice: you have to compute. So you compute, you save the results, and then you return this value reference, which has its own UID that you can pass to future functions. And if the call is found, which is the easier case, you just load the results and return something identical to what you computed the first time.

Now for the demo of the second component that I mentioned, namely the declarative query interface. This is going to refer to the previous workflow we had, so, yeah, this workflow. I'm going to make a relational query against this workflow, and this is best understood by going through it, maybe line by line. First we have this query context, which tells us that we're doing a query, we're not computing anything. Then, where before we had a loop over i, some variation over i, we're now replacing this variation with this query object, which basically tells the system that i is going to be a placeholder that, at this point, can match anything in the storage.
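Before moving on: the call-UID pipeline just described, content-hash the inputs, combine them with the function's permanent UID, then look up the call, can be sketched in a few lines. The UID formats, the combination scheme, and the names here are all invented for illustration:

```python
import hashlib
import pickle

def content_uid(value):
    # Step 1: assign a UID to an input by hashing its content.
    return hashlib.sha256(pickle.dumps(value)).hexdigest()

def call_uid(func_uid, arg_uids):
    # Step 2: combine the function's permanent UID with the argument
    # UIDs into one fixed-length UID describing the entire call.
    h = hashlib.sha256(func_uid.encode())
    for name, uid in sorted(arg_uids.items()):
        h.update(name.encode())
        h.update(uid.encode())
    return h.hexdigest()

F_UID = "func-f-v0"  # made-up permanent UID assigned when f was defined

u1 = call_uid(F_UID, {"x": content_uid(23), "y": content_uid(5)})
u2 = call_uid(F_UID, {"x": content_uid(23), "y": content_uid(5)})
u3 = call_uid(F_UID, {"x": content_uid(23), "y": content_uid(6)})

assert u1 == u2  # same function, same inputs: the storage lookup will hit
assert u1 != u3  # any change to an input changes the call UID
```

Step 3, the storage lookup keyed by this call UID, is then just a dictionary or database lookup that decides between computing and loading.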
Then, to make this more interesting, we have the same line that was in the computation: when you say j is inc of i, you are saying that whatever i can match, j can only match the inc of it. And when I add the last line of the computation, which again is identical to what we had before, I'm saying that whatever i and j can match, final must match the add of i and j. So I specify this combinatorial structure expressing these computational relations; that's what this part is for. Then, to run the query, I call this get_table function and pass some local variables that I've defined up here, and I get out a table. Just to make this a little bit more meaningful, I can name these things here. As you can see, we get out this table, and if you look more closely, you notice that in each row of this table, the value in the second column is one plus the value in the first column, and this is because we have this constraint; and the value in the last column is the sum of the values in the first two columns, and this is because we have this constraint. So hopefully this example makes it a bit clearer. And just to go a little bit deeper into this, I'm also going to show you, line by line, how what you're doing with this query interface corresponds to a SQL query, or SQL-ish, I guess, in this case. When you get to the first line, if you translate it to SQL, what it means is essentially that you only know you're going to be selecting some stuff. Then, when you have this first line saying i equals this query placeholder, what this tells the SQL compiler is that there is a table of all the values in storage, and you're going to be selecting from a copy of this table named i. Then, when you add the second line, j equals inc of i, this is interpreted in a very similar way: there's going to be another copy of the table of values.
This time it's going to be called j, and you're also adding the constraint that j is the inc of i. The last line, the last part of these computational relations, has a very similar effect on the query: you are conjoining this new constraint, that final is the add of i and j, on top of the constraints you had before. And finally, when you call this get_table function, it tells you what it is you're selecting from all this. So this was just a basic demo, and now I'm going to talk about how to actually scale this up to more interesting use cases. So far, I've been talking about this retracing pattern, which tells you that if you want to get to something in storage, you just walk over the code that computes this thing. And this sounds like it could take ages for certain things: if you have lots and lots of code, just walking through it again could take a long time, even if you're not computing anything. So this pattern may sound a little bit hopeless, but what I'm about to show you is how you can, in very natural ways, overcome these sorts of problems. The second part of the demo is going to be based on this random forest example. You don't really have to know anything about random forests beyond what I'm going to explain here: it's a machine learning classification algorithm, and as is typical in machine learning, you have some examples, you have some labels, and you have some randomized decision tree training algorithm. You pass these examples and labels to this algorithm many, many times to get a bunch of random decision trees, and these trees together comprise your random forest. Then, when you want to make a prediction on some new example, you take the predictions from all these trees, you do a majority vote on them, and that's your final prediction.
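The majority-vote step just described can be sketched in a few lines; the "trees" here are made-up threshold rules standing in for trained decision trees:

```python
from collections import Counter

def forest_predict(trees, x):
    # Final prediction of the forest: majority vote over the
    # predictions of the individual trees.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three toy "trees", each just a threshold rule on the input.
trees = [lambda x: x > 2, lambda x: x > 4, lambda x: x > 3]

print(forest_predict(trees, 5))  # all three trees vote True
print(forest_predict(trees, 3))  # votes True, False, False -> False
```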
And for this example, we're going to have just a few functions that basically implement what I just explained. The first function takes no inputs and just gives you the data set that you're going to use, so it gives you examples and labels. Then there's a function to train a single decision tree: it takes in some examples, labels, and a random state, just to keep track of the randomness, and it returns a tree. And then there's a function to evaluate the entire forest: it takes a list of trees, representing the random forest, along with the examples and labels, and gives you back the accuracy of this forest on this data set. One notable thing here is that you have this list data structure appearing in one of the arguments of our functions, and part of this example is to demonstrate how these sorts of data structures seamlessly combine with the patterns that I've been talking about. All right, so these are the functions. And the workflow that we're going to start with is basically what you would expect given these functions: we're going to create this data set, we're going to train 50 random decision trees on it and collect them in a list, and finally we're going to evaluate the random forest consisting of these trees on this very same data set. So going back to the notebook, I'm just going to set up what I've been talking about here. We have these functions that I talked about; you don't even need to read this code. And here we have, maybe a little bit larger, the initial workflow that I've described. We can run this, and that's pretty great: we get an accuracy of 0.94. And now, despite this list here and things like that, running this code again has the property that running code twice with Mandala always has, which is that it's not going to compute anything. So, running this again, if you notice, it took less time.
This is because it's not really training anything; it's just retracing the steps, and again we get to this number. And if you think about this workflow a little bit, you can actually notice some problems with it. The problems are clearest to understand if you think about what work you have to do to get to the last thing in the workflow, this forest accuracy. The accuracy of the random forest is just a single number, so it's a very small thing in storage and very fast to load. However, by using this retracing pattern, getting to this number causes you to do some fairly heavy work. One unfortunate thing you have to do is go through this loop again: you have to loop through each of these function calls, and as you do, you're going, oh, I've done this, I've done this. You've done all these things, but you're still looping through them. The other inefficient thing about this workflow is that along the path to this final single number, you're loading all these things from storage, and they could be potentially very large objects that you don't even want to look at for your use case. So these are the problems, and we're actually going to show some very simple ways to solve each of them. To overcome the looping problem, what you can do, and what's very natural from software engineering actually, is to put this loop into a function; now, instead of looping through 50 things, you just have to go through one function call. And to deal with the problem of potentially loading huge things from storage, you're going to do lazy loading, which only ever loads something from storage if it's really needed for computation. So this is where we're going with this.
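The lazy-loading half can be sketched with a small reference object; the names are invented and the real system's value references differ, but the idea is the same: a reference carries only a storage pointer, and the underlying object is pulled in only on demand.

```python
class LazyRef:
    """A reference to a stored value: holds a UID, loads only on demand."""
    def __init__(self, uid, storage):
        self.uid = uid
        self._storage = storage
        self.loaded = False
        self._value = None

    def obj(self):
        if not self.loaded:
            # The potentially expensive load happens only here.
            self._value = self._storage[self.uid]
            self.loaded = True
        return self._value

# Toy storage: one big object (the trained trees) and one tiny number.
storage = {"trees-uid": ["<big tree>"] * 50, "acc-uid": 0.94}

trees_ref = LazyRef("trees-uid", storage)
acc_ref = LazyRef("acc-uid", storage)

# Retracing hands back lazy references without loading anything...
assert not trees_ref.loaded and not acc_ref.loaded
# ...and asking for the small final number never touches the big trees.
assert acc_ref.obj() == 0.94
assert not trees_ref.loaded
```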
And now, in a little bit more detail: we're going to fix the first problem by essentially using subroutines and adapting them to data management. These are the so-called higher-level memoized functions. If you recall the problematic part of the workflow that we had, this highlighted line here, this loop, what we're going to do is extract it as a function. So I'm defining this function, train_trees; I'm exposing some of the things it needs, to compute whatever it's computing, as inputs, and it's returning this list of decision trees. I've just taken this piece of code and put it into this function. And if you notice, the decorator here is different: before, it used to be op; now it's superop, which reflects the fact that in this higher-level memoized function, you can actually call other memoized functions, which is not something you can do in the lowest-level memoized functions. So you extract this as a function, and then you just go back to your workflow and refactor it. We can do exactly what I just described; we can execute this. And here's the refactored workflow: this is the refactored line, and everything else is the same. I can run this, and as you can see, again we're getting to the same thing we had before, but now we're going through this new function. And now, if I run this again, something a little bit different is going to happen compared to the first run, and this, I think, deserves a little bit of a note. Going back to this workflow that we refactored: if you think about what happens the first time you pass through it, a lot of the stuff is the same as in the inefficient workflow we had before, except now we're going to get to this call to the new function, train_trees. When this happens, because this is a brand new function, there are no calls to this function in storage.
So you're not going to find this call for these inputs in storage, and what you'll have to do is actually go inside the body of this function and retrace whatever is happening in there, which means going through the loop again. Finally, you come out of the loop, you get the list of trees, you return it, and, importantly, you save the call to this new function train_trees in the storage. Then you proceed with everything else as you did before. The important thing that happens here is that by running this code, you've essentially created a shortcut in the storage. You had to go through the loop again, but now that you've saved the call to this new function, the second time you pass through the new workflow you again get to the call to train_trees, but this time you find it in storage, so you skip directly to the final result, and everything else proceeds as it did before. You've essentially patched your storage to introduce a shortcut that lets you jump over these uninteresting calls you don't want to retrace every time. Clearly a nice way to solve this problem. However, this very simple, natural idea of using subroutines is also useful for a range of other things. What subroutines are usually employed to do in software engineering is to manage the complexity of code, and this abstraction lets you carry that over to manage the complexity of data. As mentioned, it can be used to optimize retracing.
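The shortcut can be sketched with the same kind of toy call cache (all names are illustrative, not Mandala's API): here train_trees stands in for a higher-level memoized function that calls other memoized functions, and once its own call is saved, a second pass needs only one lookup instead of fifty:

```python
import hashlib
import json

CACHE = {}   # toy call cache; illustrative only
LOOKUPS = 0  # counts cache lookups

def memoized(f):
    def wrapper(*args):
        global LOOKUPS
        LOOKUPS += 1
        key = hashlib.sha256(json.dumps([f.__name__, list(args)]).encode()).hexdigest()
        if key not in CACHE:
            CACHE[key] = f(*args)
        return CACHE[key]
    return wrapper

@memoized
def train_tree(seed):
    return ("tree", seed)

@memoized  # a memoized function that itself calls memoized functions
def train_trees(n):
    return [train_tree(i) for i in range(n)]

# First pass: a miss on train_trees, so we enter the body and loop
# (1 + 50 lookups); the call to train_trees itself is then saved.
train_trees(50)

# Second pass: the saved call acts as a shortcut over the loop.
LOOKUPS = 0
result = train_trees(50)
print(LOOKUPS)  # 1 lookup; the body is never re-entered
```

Saving the outer call is exactly the "patch to storage": the inner calls are still there, but retracing no longer has to walk through them.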
On a more philosophical note, these high-level functions make the bet that in any sufficiently complicated project created by humans, there are going to be stable hierarchies of abstractions that humans come up with, just to be able to keep things in their heads at all. These high-level memoized functions let you tap into this phenomenon and go beyond how we use it in software engineering, where we only structure code that has no data associated with it; they let you generalize this to also manage the data connected to the code. A final thing, which I'll show some more pictures of in the next few slides, is that this hierarchical structuring also optimizes the declarative query interface that I showed you before. As a quick example, here is a different workflow, just for the purposes of illustration. Imagine a workflow where you preprocess some data, then train a model on it, and finally evaluate the model, just very basic machine learning stuff. And imagine you've run this workflow with many different instantiations of all the quantities here: many different datasets, many different values of alpha, many different labels (though the labels would vary jointly with the dataset). You've run this many, many times and computed many different accuracies, depending on what you start from. If you think about the relational picture of this, what it might look like in a database as a mental picture, you're going to have something like this: here in green, you can think of these as one-column tables that keep all the different values of the local variables in this piece of code.
And in red, I have tables with more columns that correspond to the functions being called here. This is a very useful mental picture of what's going on behind the scenes when you use the declarative query interface. These red arrows are essentially foreign keys from the memoization tables of these functions to the values of the inputs corresponding to a call. If you want to use the declarative interface to query something like this, what you typically want is a table of the joint dependency between x, y, alpha, and accuracy, and to get it you'd have to do a join of these tables. Joining a few small tables is fine, but if you have a project with, say, 100 functions, it may be more problematic. What these high-level memoized functions let you do is essentially draw an abstraction boundary around this and replace it with a single function that encapsulates whatever is going on inside the box we have here. Once you do this, you don't have to join any tables anymore; you literally have the answer to your query in a single table. So this is how high-level functions let you optimize queries. Maybe this is a good time to stop and see if there are any questions at this point. No questions? All right. So that was high-level memoized functions. The other optimization I mentioned, which I'll describe very briefly, is lazy loading. It's basically what you would expect. We have one keyword for this: you add the lazy keyword to your context, and this makes the entire context use lazy loading. You run this, and you see that the final thing I end up with, this accuracy, is not in memory.
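The relational picture can be made concrete with toy memoization tables (all table names, value ids, and numbers here are invented for illustration): the joint query over the low-level ops is a chain of joins along the foreign keys, while a higher-level function's table already contains the answer:

```python
# Toy call tables for three ops: preprocess(x) -> y,
# train(y, alpha) -> model, evaluate(model) -> accuracy.
preprocess = [
    {"x": "d0", "y": "y0"},
    {"x": "d1", "y": "y1"},
]
train = [
    {"y": "y0", "alpha": 0.1, "model": "m0"},
    {"y": "y1", "alpha": 0.2, "model": "m1"},
]
evaluate = [
    {"model": "m0", "accuracy": 0.9},
    {"model": "m1", "accuracy": 0.8},
]

# The declarative query "joint dependency of x, alpha, accuracy"
# is a chain of joins on the value ids acting as foreign keys:
joined = [
    {"x": p["x"], "alpha": t["alpha"], "accuracy": e["accuracy"]}
    for p in preprocess
    for t in train if t["y"] == p["y"]
    for e in evaluate if e["model"] == t["model"]
]
print(joined)

# A higher-level memoized function wrapping all three steps would have
# a single call table that already holds the answer, with no joins:
pipeline = [
    {"x": "d0", "alpha": 0.1, "accuracy": 0.9},
    {"x": "d1", "alpha": 0.2, "accuracy": 0.8},
]
```

With two ops the join is cheap; the point is that the number of joins grows with the number of functions on the path, while the encapsulated table stays a single lookup.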
In fact, none of the things along the way are loaded into memory. What happens when I run this is that when I get to the first line, X and Y are just pointers to storage locations without any data. And trees is even better: technically, in Python, trees is going to be a list of things, but the lazy loading is so lazy that it doesn't represent it as a list of pointers to storage; it represents it as a single pointer to a list of things in storage, and it will only load this list if it's really needed for something. For example, in the code we have here, trees is a single pointer, but that's enough, because this pointer has a UID, and the UID is enough for the next function call down the line to see: oh, I've done this call before, I don't care about the contents of this list, I can just give you the result of this call. So that suffices. There are all sorts of other details involved in the implementation of lazy loading, but the bottom line of all these optimizations is that they're defined so that whatever control flow you have, adding lazy=True to your code won't break it. If you have conditionals or iteration or things like that, they keep working with lazy loading; it figures out what it needs to load. Next, I'll very briefly mention some more advanced query patterns. If you remember the query we had in the beginning, it's a very simple one, but there are some things to note about it. One is that all the constraints you specify apply to whatever you're querying. Another is: what if I have data structures in my workflow? How do I even encode those in the computational graph? There are answers to both of these questions, and if you're interested, I have more slides on this in the appendix.
So I can talk about this afterwards. The bottom line is that, first of all, you can match data structures in these queries. It's a little more involved, but you can do it; that's what this make list construct here is for. You can also partition the constraints in your queries with this branching construct. The effect is that for each block of code in a branch, either all of its constraints apply or none of them do, and which constraints apply is determined by context. If I request a table involving variables only from this block, I'm not going to activate these other constraints; however, if I request variables from this block and they depend on things from that block, then I activate all the constraints in both blocks. This is just scratching the surface, and there are more details to it, but the bottom line is that you can use this to partition the logic of your workflows. Why is this actually useful? At least in machine learning, it often happens that you have some initial data processing and then branch into several different ways of extracting information from the data, whether that's training a machine learning model or doing some data analysis. This branching structure lets you reuse the code for the initial stages and add the different branches you're exploring on top, in a way that doesn't let the constraints from the branches interfere with each other. And of course, you can do this recursively: you can branch further from one branch, have multiple branches afterwards, and so on and so forth.
Another nice thing is that the different levels of abstraction I've been talking about, with these high-level memoized functions, can coexist in the structure of your queries. If you want to treat one of these higher-level functions as a black box, you can do that, but you can also have a different branch in which you look inside what's going on in the function. The whole intention of these patterns is that, however complicated a project you have, with many different branches and pieces of logic, you can write down a single queryable piece of code that captures that logic, so that when you want to query some relationship, you only have to point at the variables you're querying and don't have to write extra code. So that's the motivation. Finally, I'll very quickly mention refactoring, because I think we're nearly out of time. There are various ways to refactor things in this model of storage. One is to extend the functionality of a function. Before, we had this function that creates our dataset and didn't take any arguments; now, if I want, I can expose some parameter from inside the function as an argument, and I can do this while keeping the relationship to all the calls I had to this function before. Behind the scenes, on the level of the memoization table, this looks like retroactively adding a new column to the table, where every entry in the column points to the default value I've created for the argument. This has a bunch of useful applications, especially in contexts like exploratory data science, where you always want to tweak functions and add new behaviors to them. The other way you can refactor a thing is to create a completely new version of something.
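The retroactive-column picture can be sketched directly (a toy memoization table with invented names; the real mechanism in Mandala is not shown here): exposing a new argument adds a column whose existing cells all point at the chosen default, so old calls still match lookups that pass the default:

```python
# Toy memoization table for a no-argument function get_data().
memo_table = [
    {"output": "dataset_v0"},
]

def expose_argument(table, name, default):
    """Retroactively add a new input column; every existing call
    is recorded as if it had been made with the default value."""
    for row in table:
        row[name] = default
    return table

expose_argument(memo_table, "random_seed", 42)
print(memo_table)
# old calls now look like calls with random_seed=42, so lookups
# that pass the default still hit the existing results
```

This is why the extension is backward compatible: no stored call is invalidated, and only calls with a non-default value trigger new computation.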
Then you forget about all the past calls to the function. For example, as you can see here, we have this function called inc, which actually decrements a number. This illustrates one use case, which is just fixing a bug in a function: if you notice that one of your functions has a bug and you want to fix it, you create a new version and then recompute everything you've done, and it will recompute only the things that depend on the faulty function. In general, you can make other sorts of backward-incompatible changes this way, or, if you want, you can also force recomputation of a function. All right, so that was a bit about refactoring, but I just want to say a few words about the vision for this whole thing. Where I hope to take it is essentially to enable, through these programming patterns I've been showing you, the metaphor of an infinite interactive session. If you think about it, in an infinite interactive session you don't really have to save or load anything, because all of it is in memory; you don't even have to think about how you're going to organize, save, or load something. What Mandala essentially lets you pretend is that you're working in this infinite interactive session. What's more, it makes the metaphor practical by adding all these memoization and queryability features, so that you can address all the objects in this infinite interactive session directly by the code that produces them. If there's a one-sentence metaphor for the whole motivation, it would be something like this. So yeah, I think I'm pretty much done with the presentation part. Thank you very much. Thank you so much, Alex, for joining us today for the Protocol Labs Research Seminar.
For anyone who wants to follow along and make sure you catch these live, follow us at Proto Research on Twitter. You can also sign up for the monthly emails, with the schedule sent directly to your inbox, via the link in the description below. So thanks again. Yeah, thanks everyone. Sounds great.