 Welcome to today's PostgreSQL conference webinar using the PostgreSQL recursive CTE to compute bacon numbers for actors listed in the IMDB. We're joined by Brynn Llewellyn who will discuss how to compute bacon numbers using good old PostgreSQL and demonstrate it with working code. My name's Lindsay Hooper. I'm one of the PostgreSQL conference organizers and I'll be your moderator for this webinar. A little bit about your speaker. So Brynn is a developer advocate at Ugabyte with a specialty in SQL and stored procedures in the context of distributed SQL. He's worked in the software field for more than 40 years and he started working with SQL when he joined Oracle UK in 1990. Brynn started off doing image analysis and pattern recognition at Oxford programming in Fortran and then worked in Oslo, first at the Norwegian computing center and then in a startup. In Norway, Brynn programmed in Simula recognizes the first object-oriented programming language and which was the inspiration for C++. So welcome, Brynn. So with that, I'm going to hand it off. Take it away, Brynn. Thank you very much, Lindsay. I'm going to assume that I'm speaking clearly enough and everyone can hear me unless I get a signal of some kind to the contrary and Lindsay's already told you the title of the talk. I'm going to assume for the time being that you all know what baking numbers are and if you don't, a moment will arise when I explain it and your suspense will be dissolved at that point. By the way, my title got changed. Who knows what they all mean, but I'm a so-called technical product manager these days. So who am I? Well, Lindsay already gave you a long spiel there and recently, I was one of these Postgres people of the week and when I poured out my whole life story and got it recorded, I thought it was useful just to have it there and I can point you to it and you can find out whatever else you need to know. Briefly, I'm interested in Postgres because I work at this outfit called Ugubite and Ugubite has a Postgres fully compatible scheme on top of a Google Spanner-inspired distribute storage system and the experience for the SQL developers indistinguishable in the two environments. So that's me and about, what's happening here, about you, I'm going to assume and it's hard for me now to know if my assumption is right so it'd better be that you know Postgres SQL very well and it's your business to actually develop applications using SQL and quite probably stored procedures, PL, PG, SQL as well and you're very used to this command line interface called PSQL and you use it every day I would think and I don't have to advocate the point of SQL and tell you that no SQL is rubbish or any of that business and you're not in any way upset by the fact that it has things go in the modern world. This is almost ancient history, cod and date were laying the foundations a long, long time ago as long ago as the 1960s. You may though or may not be an expert in this recursive CTE business, I'm sure everyone is used to the basic idea of a with clause with stuff in it but the recursive flavour of what you can put in it is regarded by some as a bit scary, threatening and impossible to understand and it's my mission to explain that to you today and when I was struggling to understand it, which I had to do in order to document it in our documentation, I certainly found a lot of it challenging and I hope that I can have you benefit from what I found difficult and came to understand. It doesn't matter if you have or haven't had exposure to a Ugabyte DB but I can always hope that I can inspire you to get interested. I mentioned documentation so here's a quick segue to it. A lot of my time and energy is spent writing our documentation. You might wonder why we're doing it because I just said that the Postgres developer, the SQL developer has the same experience using the PSQL prompt as they do using our version of that and when it's connected to our so-called Ugabyte DB, well, never mind exactly why we decided but when we had decided we should have our own doc, the question is, well, is it just going to be a straight copy and paste job? I have seen such things from other various variants of Postgres but we decided that we could complement the Postgres doc, which I'm sure people understand when I say it's never wrong or exceedingly rarely wrong but it's also very, very, very terse and sometimes you just can't find an example that brings the understanding you need together and that's you can often do things that you've just been studying and the aim with this doc is therefore to provide you lots of use case illustrations and it's structured in an obvious way. I'm in our SQL system and we come to the actual SQL language and this whole battery of create this, drop the other and so on and the with clause is part of the SQL operator so I can go in and that and indeed down into that into this here recursive CTE and under that we've got this vacant case study and for that matter an employee's one as well, which is another famous example. So all that is to say that anything I seem to rush today, you can look it up in our doc and all the examples that I'll be running were taken directly out of the doc and just put into little files for convenience. So let's get on and for the benefit of those who are a little bit uncomfortable in this space, I'm just going to do a basic refresher of this whole topic recursive common table expression. I've never heard anyone say common to table expression. They just say CTE and they probably have forgotten what it stands for. It's basically a view anyway. I don't know why the SQL standard committee invented this term common table expression when they already had view in their hands, but they did. So here we go. This is just to remind you and what I showed you that it's all there in our doc for you to read if you find the postgres version of the account, a little terse, and this is it then at the crudest level of saying it. This optional keyword recursive fundamentally changes the meaning of one of these things that look like views that you define after the keyword with and before you've run out of all that lot and come to the actual SQL statement at the end of it all. And as mentioned, they call these guys that look like views CTEs and the recursive variant lets you do things in SQL that without it, at best, they require phenomenally verbose formulations involving things like self joins and sometimes in the limit, it lets you do stuff that you just couldn't otherwise do and you would be forced to use a procedural approach. The whole thing is not exactly new, but in the history of SQL, it's relatively new. I think the SQL standard formalized all the stuff sometime just before the year 2000. So this is its basic look. The with clause starts with a word with and it goes blah, blah, blah, blah, all these things separated by a comma. And then you've got the thing at the end, which is very frequently, but not necessarily a select statement. And this is the thing we're talking about. And you might well have comma and some other CTEs following it, but you can only have one recursive one and it must be the first. And its basic structure is this bit, just as you know and love, its name and the columns that it will return, just like when you define an ordinary viewer at schema level or an ordinary CTE in with clause. And then what follows the as parentheses, its definition has got two parts, a select statement that they call the non recursive term. It's basically the thing that gets you started. And then union, all if you're confident that that you know what you're doing or if you don't, not all, you know, the union all is quicker than the bare union, but it might give you duplicates. But if you know, you're not going to get duplicates, you can save time checking. And then this, the recursive term. And I'm going to say, and it'll come clear as we go through. But I think that this basic syntax is utterly obscure and entirely non intuitive. It's only when you've understood what it's all about that you can in retrospect get a meaning from it. And of course, you need that in order to write one for your present purpose. But the key observation here is I've called the guy R and I'm referring to R here, which is why they like to call it a recursive self-reference. But the contents of R here are not perhaps what you might think. And there's certainly what not what they are here. And I will make that point clear now as we go on. So this is in a sense, this is odd, because in the message here is odd, because SQL is supposed to famously say, just describe what you want and let the implementation work out how to do it. And yet it seems that you're actually giving a procedural recipe here, but somehow you have to find your own way of feeling comfortable with that. You could say that we are defining, describing an ultimate set, but the way we're doing it is has a certain feel of proceduralness to it. And if this procedural kind of under the covers kind of interpretation of it helps you, well, be my guest. So this says that a non recursive term is invoked just once to get you started. And then the recursive term is invoked time and again. And on its first invocation, it acts on what the non recursive term produced and on subsequent invocations, it acts upon what the previous invocation of it did, not on everything it's accumulated to date, only on what the previous invocation of this recursive term did. But behind the scenes, completely inaccessible to anything you can express in your SQL, each of these evaluations appends its output to the growing result set. And then it stops when you've got no more things coming in next time you evaluate the recursive term. So going back to this then here, this means exactly and only what I got last time I did this one, but this means the whole shooting match, everything I've accumulated to date. OK, so here's an actual fully runnable example. And it's just all in all the books, all the databases use this no matter which one it is, it seems to me. And we're saying we're going to start with a relation, which has got a single column and it's got the single value one in it. And then each time we do this, we're going to select what we had before, which in this case is one, add one to it and then use that next time. So that's how we get one, two, three, four, five coming out. There we are. And this is, if you like, a slightly more wordy version of this. And it's my way of rewriting what the Postgres doc says. And this is a very much a pseudo code kind of way of thinking about it. But it certainly helped me. So we have to imagine we've got some invisible tables that are spirited up to support the implementation of this during the execution, a single execution of the recursive CTE. And there are some, there's a final results table, there's a previous results table, and there's a temp results table. Now, significantly, you can't see or touch these in any complete fashion, but you implicitly see the previous results table because that's what. This guy here is, it's the contents of the previous results table. Sorry, where I gone here. And you certainly can't see this temp thing, but you know what it will have. If you understand the effect that your sequel is having, you've written your sequel exactly to populate this temp results table and you most certainly can't see the final results table. It's just getting accumulated as you go. That's to say you can't see it until you fall out of the end. And then that's what this is, the final results table. So that's the way to think of it as we go through them to get started. We purge all the tables here. I'm saying purge the final and the previous ones. And then we evaluate the non recursive term and we put it as I've laid the stage for its results into the previous results table. And we also copy what we got this time around, appending it into the final results table, which at the moment is empty. We've just got it started. And then we do what the use of the word recursion encourages you to understand this recursion, but the postgres doc itself kind of teases you by saying this is really iteration, not recursion. Anyway, this is the loop and it goes on as long as we produced something last time around. And we purge the temporary results. We evaluate the recursive term using whatever the previous results table has in it for this recursive self reference. This is the recursive self reference here. And then we put what we get into the temporary results table, purge the previous results table and then move them over. You know, it's just like a two table little twizzle dance because we've got to preserve the previous ones until we've completed our use of it, then it's ready to be populated from what we got this time. That's why there's both the previous and the temp. And this temp is rather transient in this picture. It's content is. And then also this is the key. We append what we got this time around into the final results table. And then we done. And when we fall out of the end, everything is visible here in actually the final results table. So there we have it. And here there's a little demo of this. So I'll go over here now. And this is only what I showed you on the screen. And if I'm over here, I'll run up Postgres and I will just to be sure I know where I am, run a little shortcut I've got who drops and recreates my user. So you can see I'm fully up to date. Here we are. And this user is empty, nothing in its scheme and no tables, no procedures, no nothing. So then I can just do this thing here. And there we are. I rushed very slightly by saying this isn't quite the example I showed you at first. It just adds another little bit of subtle here by saying that I've got a two column CTE, two column view, and the non recursive term establishes three rows in it, zero one, zero two, zero three. And then when I do this, I'm acting on obviously the set of three. And I am producing another set of three and I am adding one to the first and 10 to the second each time around until I've done enough, which is why I see, you know, all the zeros with one, two, three, which is what I started with, all the ones with 11, 12, 13, all the twos of 21 and so on. So I hope that's clear. This is inevitably a bit of a rush. But my, you know, conscience is clear because you can look at all this afterwards. So having done that, we'll just look at this stored procedure implementation of what I described here as the pseudo code mental model pseudo code ultimately is not much use of its own in plain English. And it's far more useful if it's in real code and what better real code to use than what we know and love in our environment. So I'm doing what I said, I'm purging this stuff out for to have a clean slate. I'm inserting this stuff, which is exactly the moral equivalent of this, right? There it is. Same thing there. And then I'm putting it into the final as well. I'm copying over from the previous into the final. And now I'm ready to start. Here's my recursive term, the same while loop that I talked about. While I didn't get anything last time, then I am deleting from the two working tables and why I'm sorry, I'm purging at the temp results table. And I'm just doing this, which is exactly what I was doing over here. It's the same thing. Except I'm doing this little dance, twizzling around and accumulating everything in the final results, but also setting up the previous results for next time around. And then when I'm finally done, I'm just going to invoke it in the same way as before. So just to hammer the point home. What files have I got here? I can never remember the names. OK, so if I run the pure one first. Sorry, then what I showed you before. And now no surprises if I run the hybrid. Output identical. So I hope I've done enough to convince you that you could do your own experiments, of course, that this statement about how it all works holds. And actually, that understanding that there are two alternative implementations of the same basic idea is gold dust when it comes to getting a workable solution to the bacon problem at hand. So I'm just going to hammer home one other point now. Which is this business of recursion. Let's knock it on the head once and for all. I'm just getting that ready and we'll do a demo, first of all. So to do that, I'm going to come down here. Those are the files that I just got handy. No, sorry. I mistyped. Did I not? This is what I'm talking about now. And we're ready to run them when we need to, but let's look first of all at the code. This is the thing that everyone who studies computer science learns. It's the classic demo of recursive function. So the function is called Fib and its input is integer, meaning give me the N, you know, the Fib, Fibonacci number for the input N. And you notice we're calling. This is the thing that Wikipedia tells you defines the series. You know, Fib of N is equal to Fib of N minus one plus Fib of N minus two. And how on earth does it get started? If I come in with zero or in fact anything less than one, I'm returning zero. Otherwise, I return one. Otherwise, I do the recursion. And of course, if I last that in. This is absolutely nothing to do with CTEs yet. Then, of course, I can just select. Fib of 12, which everybody knows is under 44, at least I do the number of times I've done it in the past few days. And then sometimes it's just nice to step through the whole lot and show it for you to see this is the whole series now. I don't know if you are familiar with table functions, but this is one of the funny things about Postgres. If you want a nice neat output of something, you can't use raise info because it comes out on the screen and not onto a school file, the way to do it is a table function. I'm not going to explain that now. I'm just going to do my select. And of course, I'm just seeing the series as expected. And that gives us a clue to this very banal observation that you can always produce this series completely conventionally without any recursive invocation of anything, just doing a four loop. You set up the first term, you set up the second term, and then you loop until you've got far enough. And if I do this one here, oh, sorry, I forgot the stupid me. I forgot to drop it before doing it. Right. Now I can simply select it and no surprise of what's coming. It's what we've seen before. And just to blast that through more quickly. That's the the straight procedural version of it. Forgive me for my mistyping. And there we've got it. This output and this output is the same. But the second one is from a completely conventional iterative solution. And now the interesting bit is here is a use of a recursive CTE. Very artificial, but it works. I'm just using it as a means to do iteration. So I'm starting off funnily enough. I just have to establish a relation here with only one tuple in it, because otherwise I'd get as many tuples as I had coming in every time. And that would be massively confusing. So I'm starting at fib of one, not fib of zero. And then this bit is basically just what I had before. I'm saying that my my X is going up by one. And the actual fib thing is sorry, this here, what I had last time is the previous one, you know, because this is referring to the previous. And then I had the previous one, real previous one from last time. And now I'm adding them together to get the result. That might seem a bit of a rush, but if you just think about it afterwards, you'll be perfectly OK. And of course, it produces the same answer. So this did nothing other than this little diversion. Emphasize the point. And this is in the Postgres talk to a slightly reworded it. But strictly speaking, they say this process is iteration and it's not recursive, the recursion at all. In other words, the SQL's standard committee screwed up. They chose the word recursive, but they shouldn't have. They should have called it an iterative CTE and would have had a much clearer life. But that's all history and it can't be changed. So now moving on, I did that and now the employee hierarchy. My only point in showing this is because it's the textbook thing. And it's all documented. This is a screenshot from the doc I showed you, so I'm not going to explain anything. I'm simply going to run it and show you what it does. OK, which is over here. So I have to go up to and down one. And I'm simply going to run my master script and show you what happens. I'll be as brief as I can. This is our bare table where every employed person has a name and unless they're the top guy, they have the name of their manager. This is our textbook, a so-called pig's ear, self-referential foreign key. And then this is the basic use of our recursive CTE, where we've annotated each basic employee tuple with the depth they are in the hierarchy, calling arbitrarily the top one. We could have called it zero. That's just a choice. And then this shows another way of showing the same information. Notice this is depth first and this is the same order. The information is the same, but we're here. We have Bill and Joan, and you have to use your naked eye to work out where Bill fits in things, our bill is under John, whereas John is under Mary. Here we've got Bill and Joan, as we had before, going back through Mary and John. So this is more useful as presentation. And by the way, we can order it once we've got it depth first, as well as breadth first, and that's what I'm doing here. But you might appreciate knowing that the basic actual mechanism is depth first. And though some textbooks say that that's rubbish because it uses a lot of unnecessary temporary storage space and depth first would have been better than breadth first, it is what it is. But when we've got the result, we can order it how we please. And here's a depth first thing, which is far more conventional in the ease of understanding. And this is just a little trick anyone can do with padding and so on to to show it for what it is, which is much easier to read than this. But it's the same information, you know, here's Joan under Bill and John and so on. And then we can even, having got that far, make it look like this eunuch's tree presentation. More or less, there's this funny little gap there with enough energy and the right use of a stored procedure, you could fix that too. And this perhaps is not so interesting, just showing them, you know, sort of extract from it. But this is another way of doing it. If I just say, where is Marion the whole thing? And I can go back the other way and see. Sorry, not where's Mary? Where's Joan? Joan is here. Where's Doris Doris is here. So enough of that, except to say that the key thing here is what the actual CTE at the heart of it all looks like. And I'm only going to show you this as almost some liminal flash now here because it's on the screen too. Here it is. And I hope it's relatively clear because the approach for the real bacon thing is pretty much identical. We get started by picking the top person and the top person is the one whose manager is nothing at all that simple. And then this bit, we're basically treating the employees from the previous iteration as managers and we're joining these with their reports if they have any. And in doing that, we increase the emergent depth by one with each step. In other words, with each iteration, it's far more useful to say iteration than recursion, and then it all stops when we didn't find any people reporting to the guys we have in our hand at the moment. And that's what this is saying. We're adding on one to the depth and we're using the manager name that we had from last time and the name that we had from last time and we're joining it with sorry, I'm saying completely wrong. We're getting these out of our table and we're joining these to the people we had from last time on the basis of a guy and their manager, which is what this is saying here. So again, I'm rushing, but it's the only way to get through the material. Now we're on to the real thing. I hope we can get through in the hour. There's a lot of symmetry in the proposition, but some of the detail is important. So. We're using synthetic data first, just so you can get your head around it. And then once I've explained all the methods, then we can simply plug in not the entire IMDB, but a curated subset of it from one of these universities who publish the thing as an exercise for their students. And the full references in the doc. So this is the basic proposition. We've got actors in the IMDB and we've got movies there and an actor wouldn't be there if he hadn't been in a movie and movie as guaranteed got at least one actor in it. So it's the famous many to many, which when you get into the database implies you've got a table of actors and for this example, it's perfectly sufficient to say that an actor is identified by its name and then we've got the movies where a movie is identified by its name in real life. There'd be a ton of other facts about them, but that's not important here. And then the intersection table implied by this many to many is conveniently called cast members and it's got an actor and a movie. And then the constraint is that they both, you know, they form the primary key and the constraint not typed up here shows that those are foreign keys to these two up here. This is again out of the textbook. Now, there's latent information here, which is a graph of some kind, because you can imagine that if you put a circle on a piece of paper for every actor, you could draw lines between pairs whenever you found that that pair actor had been in at least one movie together and you could annotate the edges in that graph with the list of movies. So this is the starting data and I'll go on to the next step now in a second and then we can populate mechanically an edges table, which has got the node at each end and the list of movies. Now, this is a bit non-codden date, a bit denormalized having an array here, but it is overwhelmingly convenient for the practical solution, particularly because for that basic path finding, that actual list of movies is irrelevant. All you need is to know the edge exists and that's given by this information here and then these constraints just do the right thing. So this is the derived but completely robotically unintelligently derived list of nodes where Alfie and Chloe are joined by Hamlet and Chloe and Emily are joined by Julius Caesar and so on. And this is the picture. OK, so this is our toy data set with how many four, five, six actors in it and how many edges, four, five, as many as you can count up there, that's a very manageable thing and I've annotated these edges. I drew the whole diagram by hand, but that information is what we saw there. If you like, I got the information from somewhere else. I typed it all in, I processed it in the database. I got this picture and I drew this from the picture. I hope you're all with me so far. So now let's see what this thing looks like here. But before I do that, perhaps I'd just better see where I am when I decide I should show you some stuff. Yes, in a second. So this is the heart of it all. And this is to be compared. With this, so this is how you do it in the hierarchy of employees. You get some place to start and then you get the next one or many people as the reports of the present set of managers by joining on the particular relationship here, manager ID and employee ID. And here is the same idea, except now instead. We get started very simply by mentioning the one we're interested in. And I've encapsulated this in a procedure. And there's more obviously outside at all, but this is the heart of it. And then this bit here, I'm considering who I got last time, that's the path. I should say the paths thing here, paths and I'm using what I got last time to go out into my table, which is the edges and I'm joining. So I've got these two relations in question, the edges, which is the table out in the database and the paths. And this is my recursive reference to what I got last time around it. So I'm basically saying all the people who are one hop away from the people I got last time, add them into the result set and remember that those are the people you got last time so I can come next time. And this is basically then identical to the logic for the other one. There's a trick here and I'll say it as briefly as I can. And that is to make this code simple and maximally symmetrical with the employee thing. I've acknowledged the fact that it's better to put every pair of actors in twice. Alfie going to Emily and Emily going to Alfie so that when I just come across Alfie that I can find Emily, whether one is no one and the other is no two or vice versa, so that's why there's a tried implementing it and you can see it in the doc where I didn't do that denormalization and the code is noticeably more complicated. So the denormalized approach wins. Now, this is the key point here because obviously there are cycles on any one the possibility of cycles as you follow any one path. For example, if you go Emily, Chloe, Helen, James, that's all perfectly legal. And then you could go back to Emily. Oh, dear, you don't want that because that doesn't tell you anything and it would go on forever more to point to your thing would never finish. And in the same way here, Emily, Chloe, Alfie, Helen can go back to Chloe and you don't want that. So on any one path, it is easy for you to eliminate, I should say to stop if you're about to get a cycle and that's what this is doing. I don't know if I should have stressed it more, but the array notion comes to our rescue here because we are accumulating as we go around the whole path to where we've got as an array and we can at least inspect all the elements in that array whenever we need to. And the Postgres stock gives a very, very let's say cryptic clue to this. It says you can manage to prevent cycles by using arrays intelligently more or less. And here we are doing it right there. So I'm just going to demonstrate some of this now by going over to the place in question and finding the. Bacon stuff and getting out the master code script here. And I don't need to show you it all in action because a lot of it is just superstructure. Sorry. So that's all the stuff there. I don't have to clear down everything, but I will just because then if I want to see what I produced so far, I've got a clean environment in which to look. So I won't just blast it all through, but I could. Rather, I'm going to get this far. Just create the tables I need, create the edges table as well, and the procedure who populates it, and I'm going to insert some data, the data I showed you. And there it is, nothing to see. This is where we see something of interest. So let's just see that. All I'm going to do is get my stuff into the edges table. Bang, that was easy enough. And then I'm going to see what I got. So here I'm looking at the interesting actors among the set of node one and node two, given the fact that everyone is in there twice and those are the ones. That's what I expect. Here I'm going to get the movies because it's an array for the list of movies. I have to do some stuff with distinct and unnest, but it's no harder than what's on the screen, and those are the movies. Sorry, Shakespeare, but they're movies for me. And this then is an example of the edges. So in fact, it's all the edges. OK, that's what I showed you before. And incidentally, I can look backwards at it if I prefer. And I see the same actual information, but they're distinct in a sense because Chloe and Alfie here, Alfie and Chloe there, united by Hamlet. And as mentioned, you don't have to do it that way, but doing it this way leads to the maximally terse actual code in the recursive CTE. And that's why I preferred it. So we've done that. And then what comes next, this bit. So. We've yes, OK, we've got that far. And now we're going to do this find paths naive, I've called it. And you'll see why I've called it naive in a minute. Which way has that one gone? This one here, that's what we're going to look at. But before we do, just to remind where we're going, I'm going to have income this far. Just create some superstructure, which you don't have to see anything about. Just imagine there's got to be some tables and a little procedure that shows me the paths and then this is our thing, number seven. Here it is. And I'm going to do stuff with arrays and to save myself typing in the body of the code. I'm just making a trivial procedure who encapsulates cardinality to get the guy at the end. So that's what this is. Terminal of the path tells me the element at the end of the array. So I can blast them through. They're uninteresting in a sense. And then here, this is the main guy. And this is what I showed you on the screen. Right. This is exactly what I showed you on the screen. So I need to talk about it again. I just need it to be in place. OK, now I can run it. Of course, all it's doing is inserting into a table so you won't see anything. But now I want to see what I've got. So I'm going to call this utility thing that I wrote. But it's actual implementation doesn't need any recursive CTEs or anything. So I'm not going to explain how I did it. I'm just going to run it. And there it is. What a shock. There's forty four paths. They're listed just with this is the self-populating surrogate primary key. So you can see there's forty four of them. And they are in the order they came, which is not surprisingly, then, by the number of nodes being two, the number of nodes being three. It's listed in the card analogy order and they get as far as six. OK, so if you look back at the picture here, you can see what's going on. I'm doing Emily to Chloe. Then I'm doing Emily, not in that order, but at some later bite. I do Emily, Chloe, Alfie, Helen. OK, I mean, while I do Emily, Chloe, Helen, so I get to Helen twice. Then Emily, James, Steve, Helen, I got to Helen three times. There's nothing in the recursive CTE sequel apparatus that tells me I don't want that sometimes I might want that. But in this case, I don't want that. All I want is OK, there exists one path from Emily to Chloe, and that will do. There exist two paths from Emily to Helen, but one of them will do me. I would say in general, the shortest of them. But it doesn't matter which of these two of length two I pick. I just want one of them and I certainly don't want Emily, Chloe, Alfie. But I'm getting this is the problem so far. This is why here I've got that humongous number. So how do I rescue myself? And so simple, I simply go over to the actual procedural implementation of the same idea and I intervene with the pruning I need. And if you're like, this is gold dust, so it's a funny idea. You get your head wrapped around the recursive CTE, you understand what it's doing, what it's all about, and then you work out, it's not going to work for you, you throw it away and you say, well, that's OK. Because I know a procedural way of doing the same thing and you do that. Sad, in a sense, that we have to do that, but it works and it's quick. So don't worry about it. So you can see here, all we're doing is what I showed you. Here we're getting started in the same way. Exactly the same way. This is what we had before. And the stuff around it is just managing those tables, as I mentioned. And then here, the recursive term, again, there's the stuff about managing the tables and what we're putting in is this, which, as I'm sure you'll see in a heartbeat, try and remember that, at least it's shape. It's this, OK? So it's very direct. If you've done one, you can do the other just mechanically more or less. But the point is now, having got this far and put in our little code as efficiently as I could do it, to know when to stop, there's an opportunity for us to do pruning. And I'll show you that in a second. This is in a procedure who offers you the Boolean. Do you want to prune yes or no so that we can test at least when we don't prune, it does produce the same results and then more table management to the end. OK, so this is the pruning. But let me come to that perhaps in a second. Let me just first demonstrate with no pruning. So we've done that, number seven. And we're just going to get this one ready and do that. And no surprise, but we hope for this and are pleased when we get it. The output is identical, OK? All by the way, this is just reporting how many I accumulated each time, which is only what you get if you counted up all these with number six and so on. There are 12 here corresponding to this 12. So so far so good. We've shown that we've got a scheme where we've got an opening to do our pruning, but we didn't use it that time. So now let's look at the pruning. Which is here, OK? So this is the whole trick. If we're going to do it, then we want to get stuff we don't want out of this temp paths before we establish it over into the previous paths for the next time around. So we're certainly going to get rid of any paths who are equal to one of the shortest. This is a random choice. But among all those that are the shortest, we're going to keep only one. And if there are any longer ones, we're certainly going to get rid of them. And the thing is now we have total visibility, not just the particular path we're extending, but all paths that we produced in this repeat of in this iteration of the recursive term. That's the whole key to this. If you're just using the pure SQL, you only see the single path that you're extending in that very much key whole view. But if you get access to the tables, then when you've got first time around all the candidates, you can actually operate on all of them in the classic set way, which the actual recursive SQL formulation doesn't give you an opportunity for. So this is first bit of it. And the other thing I certainly don't want to accumulate a new path. If it goes to a place and it's certainly going to be longer because it's the next time around to a place I've already got. So that's what this is doing here. I don't know if it's immediately obvious to you. This is going to make sure that I retain only one among the two or several or one, two or several, actually, shortest paths. And then having done that, I want to make sure that that one shortest path I've established isn't going to a node I've already met before by a yet shorter path. There I have to see the raw paths. And again, you can't do that in the textbook pure SQL. Just a side note, if someone thought that the SQL should have an extra clause in it expressing the idea, retain only one shortest path to any given node, then this could be implemented behind the scenes. And we'd have a wonderful feature. But I don't suppose I can encourage anyone to do that. I wish I could just a little sort of existence proof oracle in this general space has a no cycle keyword. Postgres doesn't you have to do all this cycle elimination? Where was it now here? By hand, but in Oracle, you simply write no cycle. So clearly someone could do the same over, take that idea and do the same as I have been talking about with a keyword for this pruning, no pruning, in other words, retain only shortest paths. So let me show you that now in action. So I'm going to do the pruning business, but on the way, I'll just show you one little kind of side curiosity, almost. Sorry, I'm going to produce from this set of forty four only the longest paths that contain all the others. So you can see this first line says there's a path Emily Chloe, there's a path Emily Chloe Alfie, there's a path Emily Chloe Helen and so on. So this actually is the whole story of this in a more compact form. And if you wanted all every single path, this would be a useful way of distilling the things down to something you get your head around. But it's not what we want for the bacon problem. In case it's not crystal clear, I don't know if I'm really, really, really said, but I'm going to say it now, the whole aim. Sorry, this is the best picture for it is to say, by the shortest path, is there a route to Helen? And if so, how many hops is it? The answer is one to same for Alfie. The answer is one to now for Steve. The answer is one to that way around for Helen. It's one to that way around. That's what we're after. That's what the bacon number is. How many hops do you have to get through through the common movies to reach the guy of interest starting from the one of interest? OK, so back to this. Sorry, if I didn't make that completely clear before, I'm now going to do the sorry about that, that running the same procedure that I showed you before, but this time saying, yes, please do some pruning. I forget all the time I should say that. And there we have it. So this is the answer. There are these, you know, you could say that the number of hops is one. Emily to Chloe, Emily to James. And here there's two hops and that's the full story, as you see from the picture. And then if I want it yet more compact, showing only the containing, that's I can prune the answer down to three. So that's essentially now job done, at least on synthetic data. And here we are. That's what I just showed you. And I could I was able to use that to annotate this picture. These are the five paths. And if you said, well, that's not all the paths, I would say, but the wine says it is useful. Show me another interesting path that's shorter to any particular place than I've shown you here and you won't be able to. So all I have to do now is completed. Oh, actually, just for the hell of it. I'm just going to go through that table of paths, join it back on onto the edges and produce this. This is nothing to do with CTEs. It's completely ordinary programming where I've in the first thing just done it for everyone, you know, you can see from Emily, you go through these plays to Chloe, through that plays to Alfie and so on. And down here, just a slight twist on the same procedure. I've gone into the table looking only for Helen and the relationship to Emily. And you find these websites that let you calculate the bacon number for anyone of interest, they will have pre-calculated stuff and they're doing something just like this, querying up out to the record of what they got when they did the total reversal so they can quickly get the answer for the one in question. So I'm almost there now. Now I just have to blast through on the real IMDB. So it's exactly the same structure of data. I'm just going to delete it all out and put in that curated subset of IMDB. And do a couple of queries at the end to see what I got. So these are just the first few actors, first few movies and the first few edges going one way and the first few edges going the other way to get you a rough idea. But it's interesting to know there are now this many intersections, that many connections between actor and movie among these actors and these movies. So it's a much bigger data set and you saw we got 44 paths that most of which we didn't need, we needed only five out of a data set with only six nodes and about the same number of edges. So you can imagine what we'd get here. It's an actual combinatorial explosion. And if you try the ordinary, non pruning implementation on this bacon data, the thing just never finishes. So I'm just going to now run the stuff. There's nothing to see now, apart from the answer, basically. And all I've done is say that, well, I got that many paths and the deepest, the longest is six. I don't get to Kevin Bacon, obviously, because I start there and the one who's furthest away, there might have been several, is Christopher Nolan. And should I prune them down to those unique containing paths, there's slightly fewer than there are actual paths. And here I've done the annotated version, the thing you'd see if you Google such a thing on the internet. And then I've simply turned it on its head, started with Christopher Nolan. Same number of paths, obviously, because this curated dataset was set up so that you can get to every single actor from every other one. It's a fully connected, undirected, undirected cyclic graph, as they call it. And obviously, this stuff is corresponding. And therefore here, you know, it's just the reverse of that. And that's what I've captured on the slides for you. So you can see, as I go from slide to slide, the symmetry, starting with Bacon, ending with not reaching Bacon, of course, starting with Bacon, ending with Nolan, starting with Nolan, ending with Bacon. OK, same number of paths, same number of maximum length. So there we are, we're done more or less on time. And now I guess you can appreciate that this recursive CTE out of the box as a pure sequel thing is very powerful and it does support, among the other things, completely unrelated to graph work that it can do. It certainly lets you find graphs in, sorry, paths in graphs of all these kinds. We were looking with Bacon at the most general, the undirected cyclic graph. Clearly, you can do a directed cyclic graph and a directed acyclic graph. And as a special case, the rooted tree, the hierarchy. And that's what the employer hierarchy I showed you is. As it happens, because the rooted tree is so simple, you can use a slightly simpler implementation, but the basic principles is the same. But the problem is and it's not a problem in the hierarchy because, you know, all the paths is what you want. But in the Bacon thing, it is a problem because exactly as illustrated, there's no way to express find only one among each of the shortest paths to each distinct. Reachable node and no problem because a hybrid solution inspired by what the Postgres doc says is behind the scenes in the recursive CT. It comes to the rescue exactly because it gives you an opening where you have visibility over the whole works of what's going on. The entire results you've just produced this time around and the entire results are date and you can prune away so that you will get both rid of all the things you don't need and reach the end of the iteration much, much quicker. You'll produce an empty set in the case I showed you with Emory and so on. After only a couple of iterations rather than going on forever. And that's the point then it works. So finally, just a plug for our Yugabyte DB thing, it uses the Postgres SQL code from the distribution as is to do the what they here call query layers, better to call it the SQL processing layer, and it uses its own storage layer, which gives you global scale, fault tolerance, cloud native, all that stuff. Based on the basic thinking of Google Span it, but our own implementation of it. And if you want to know more, there's references in the slides. And there we are. I'm done three minutes over the bell, I'm afraid, but the first five minutes were taken up by Lindsay introducing me. So I'm going to say I made it. And if there are questions, I'll take them down. Thank you for your attention. Brilliant. Actually, technically, we are at 58 minutes and 57 seconds, not counting my introduction. So yes, you made it. Well, not a question, but you got a comment. All right, good. I understand the CTE now. Yes, I can see that. I'm going to read it out really good. I understand all this business now like I never did before. Well, I'm pleased and thrilled. Yes. Question is the presentation and the scripts. Are they on the documentation at Yougabite? Well, yes, there's two answers to that question. One of them is that as I think I showed you, but if we go to, for example, this place, there's a few references to it. I'm saying download a zip that includes all the examples. Now, this is a much bigger apparatus than I used here. And I produced what I showed you simply by reading this here bacon thing from start to finish. Of course, I have to think of it because it's a long time since I've written it in the dark, but when I read it through, I, you know, saw all these things here and I copied and pasted and I copied and pasted into scripts. And I arranged them into groups for demonstration purposes. So what I used here contains no new material that isn't available by copy and paste here. Or if you prefer to save some effort, you could just simply get it all in one humongous big lump by downloading off our website. Having said that, it did strike me that having that, if you like, curated extract that is focused on exactly what I talked about today as a companion for the slides would be useful so I can, during the rest of the day, simply zip up what I showed you, correct a couple of typos that I saw along the way before I do it and then post it in a kind of as a sibling to this thing and include the URL in the slides before I send the slides to Lindsay and then you will have exactly what I was showing you today, as well as the possibility to create it yourself by going through the dock. Having said that, of course, if you do go through the dock and read all the words around everything, you will have a real deep understanding that even what I said today didn't come close to giving you. Does that answer that question? It does. And, Brynn, if you want to send me that URL, I can go ahead and include it in the follow up email when I send out the record. Yes, I'll put it in my slides too, but you can I'll send it naked as well for you to put in, but it'll be, let's say, tea time, my time, my day. Sorry, tea time here in San Francisco before I do that today. Beautiful, beautiful. So I think that was our only question. I want to thank all of our attendees for spending a little bit of your day with us. I want to thank Brynn, especially the attention to detail here was pretty outstanding. And even I learned something and that's saying something given that I'm not technical and I just hope everyone has a great rest of their afternoon. So cheers and thank you. And thank you for having me, as they say.