Good morning. Hello. Thanks for getting up at the crack of dawn; 10 a.m., while being more civilized than eight or nine, is still early, so I appreciate you guys coming. My name is Asya Kamsky. I come from MongoDB, and I'm gonna talk about SQL and MongoDB and how we try to get them together, and the role that Postgres had in it, although I don't want to get too far ahead of things. First of all, how many of you guys are familiar with MongoDB? Most of you. Okay, good, because this is not actually an intro to MongoDB or anything like that. Hopefully, if some of you didn't raise your hand, you'll be able to catch on.

So I've worked with MongoDB specifically for the last four years, but I've actually worked with databases for more like 20 years. In fact, one of the first databases I worked with was Postgres in the early 1990s, because I happened to have a job at UC Berkeley. So, pretty cool. After that, I actually got a job at a database company called Empress. You probably haven't heard of it, because it's a tiny little company started by some University of Toronto grads, and they just kept the company small. It's actually still around. It's kind of cool. I then worked with many, many different databases. I worked with Sybase, which has a bit of a relationship to Informix, of course, which I thought was excellent technology. Sadly, not the best business execution. I worked with Red Brick, which is a specialized data warehousing database, so I did a stint in data warehousing as well as OLTP land. I used TimesTen, an in-memory database, sadly since then acquired by Oracle. I even did a POC in MySQL once. I sadly then actually used Oracle in production. And then I used Postgres in more modern times; it was like 8.4 by then, very different than the first time I encountered it. And for the last four years, I've been working at and with MongoDB. More recently, I've re-acquainted myself with Postgres again, now at the current version.

Now, why do I say this, other than to impress you with just how old I am? Well, I kind of want you to see that I'm not coming at this problem from the SQL-hater point of view. I have used relational databases. I've worked at many small companies, so I've worn many hats. I've been the DBA, the data architect, the back-end developer, the person that yells at myself like, how could you think that the schema would ever handle this use case correctly? What were you thinking? I've been the sysadmin, et cetera, et cetera. So I try to be realistic about the features that the tools I use provide, and I try to use them to the best of their ability. I'm certainly not a NoSQL fangirl because "it's web scale." But I do understand that things are designed with different strengths in mind.

So, wanting to be realistic about it, I also want to make it very clear what this talk is not: it's not MongoDB versus Postgres. There probably are talks like that somewhere out there, or there have been; I've seen some slides. This is not it. While it's not a MongoDB primer, I am going to highlight some of the features that cause it to be selected, which also happen to be the things that make it challenging when trying to map SQL to the type of data that MongoDB usually represents.
Which, of course, is a problem for business users, and that is the problem that we tried to solve: this chasm that exists between MongoDB, with the dynamic, ever-changing shape of the data that usually goes into it, and business users with their tools, which expect more regular, predefined, tabular data. I'll talk a little bit about some of the assumptions that we made, some of which turned out to be wrong, some of the technical difficulties, and the specific implementation with Postgres that we did end up using.

So I'm sparing you the marketing slide about MongoDB, because I think we know it's quite popular. What I do want to highlight is what, in its initial design, appeals to modern application developers. MongoDB is a document database. Of course, we know document here is kind of like JSON, not like PDF or something. And it was designed attempting to bridge the gap between the flexible functionality that relational databases provide, with an ad hoc query language and the ability to query by many different attributes, and the scaling and deployment model of distributed key-value stores. So there were some things it was definite that MongoDB had to keep: a flexible query language, the ability to index fields that are not the primary key, et cetera, but also the ability to scale horizontally, because that's more like what we do these days for scaling.

So this is an example of a JSON document. Hopefully the font is big enough for you guys to see, more or less. You can see it's just a bunch of key-value pairs. I'm sure you're all familiar with JSON. Postgres actually supports it. I like to say JSON is what we thought, 15 years ago, XML would be, because it turns out it's actually more human-readable, and people like to be able to read the data occasionally, even if you're not gonna examine the entire data set. JSON allows you to have values that are of different types, including arrays. So you can have a key in your document where the value is an array of scalar types, or it could be an array of other documents, called subdocuments more formally.

Here's an interesting catch. You've heard people say MongoDB is schemaless. I hate that term. There's no such thing as schemaless. Schemaless would be like a binary blob, right? What it has is what we might call self-describing schema: if you look at a document and you look at the different fields, you can tell what type each field has and you can see which types exist. What MongoDB doesn't have is schema pre-definition, a way of saying: here's my table definition, we're only gonna have these fields and they're only gonna be of these types. If you've read that MongoDB does have schema validation or document validation, yes, that's a relatively new feature, but it's really more akin to relational database constraints. It's an optional thing you can tack on after the fact to check the data before it goes in, which is very different than schema definition: not required, something that might get tacked on after you've put a bunch of bad data into your database. Yes, that's my experience.

So, flexible query language. The syntax looks quite different than your typical select, but really find is like a select, and the query predicate is also expressed as JSON, right? And you can have different types of queries. There are a lot of different operators, and they map pretty easily to operators you'd already recognize. You don't have to get the full document back.
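Just as an illustration, here's roughly what a simple query like that might look like in the mongo shell, with a hypothetical people collection; the collection and field names are made up, not from the talk's slides:

```javascript
// Roughly the SQL equivalent of:
//   SELECT name, age FROM people WHERE age > 30;
db.people.find(
  { age: { $gt: 30 } },        // query predicate, expressed as JSON
  { name: 1, age: 1, _id: 0 }  // projection: only return these fields
)

// Sort, offset and limit chain on, much like ORDER BY / OFFSET / LIMIT:
db.people.find({ age: { $gt: 30 } }).sort({ age: -1 }).skip(10).limit(5)
```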
There's the idea that you can say, select only this field and this field, don't return the entire document back to me. We call that projection. It's also expressed as a JSON document. There are range queries, right? You can sort the results, limit, offset, the kind of stuff we're familiar with, right? So people don't feel like they're giving up the flexibility of querying.

When querying arrays, you can actually query them really flexibly. You can either say, for example, return me all the documents that, in the array cars, have at least one model called Bentley and at least one year that's less than 1966; it could be two different elements of the array. Or there's an operator to say, return back to me the documents where the same element has both the car model Bentley and the year less than 1966. So semantically, you can express different things. Obviously, the more complex your data structure, the more expressive you want your language to be, to be able to express these different concepts. There's support for geospatial queries and a bunch of other stuff I'm not gonna go into. There is support for secondary indexes, right? And there are a lot of different types of indexes, basically different types to support all the different types of queries you can have. So there are compound indexes, the special geospatial ones, either flat 2D or spherical, text indexes, multi-key for indexing anything that's in arrays. Indexes can have properties: unique, sparse, partial, TTL (time to live), et cetera.
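To make the array-querying example and a few of those index flavors concrete, here's roughly what they might look like in the shell; again, the collection and field names are just illustrative:

```javascript
// Matches documents where SOME element of cars has model "Bentley"
// and SOME element (possibly a different one) has year < 1966:
db.people.find({ "cars.model": "Bentley", "cars.year": { $lt: 1966 } })

// Matches only if a SINGLE array element satisfies both conditions:
db.people.find({ cars: { $elemMatch: { model: "Bentley", year: { $lt: 1966 } } } })

// A few of the index types mentioned above (names are hypothetical):
db.people.createIndex({ lastName: 1, firstName: 1 })                // compound
db.people.createIndex({ "cars.model": 1 })                          // multi-key (array contents)
db.people.createIndex({ loc: "2dsphere" })                          // geospatial, spherical
db.people.createIndex({ email: 1 }, { unique: true, sparse: true }) // unique + sparse
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 }) // TTL
```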
Okay, great, that's a whole lot of stuff that's there, but some stuff had to get left out, right? We already mentioned schema definition got left out. So you're flexible, you can put any document into any collection. Good news, bad news. Also, to be able to scale horizontally, which is essentially partitioning your workload across multiple servers, that meant leaving out joins across different tables. There is no built-in support for multi-document or multi-statement transactions. And of course, having our own query language means there is no support for SQL. Well, the developer doesn't have to learn SQL, which we all know they don't actually know anyway, even though they've been working with relational databases for many years. So how does a developer talk to the database from their application? Well, there are drivers that provide native support for over a dozen different languages, and I think that's not even a complete list. There are also a lot of frameworks built on top of those native drivers.

So let's look at this from a modern application developer's point of view. As far as they're concerned, they've just arrived in heaven. They don't have to learn some fancy new language, which they never actually tried to learn to begin with. There are no pesky DBAs or data architects telling them what they can and can't put into their storage layer. They can basically just do whatever they want and treat the database as a persistence layer for their incredibly clever objects. Stop me if that sounds familiar. I think that happens with all databases, but here it's almost intentional, by design. All right, so developers love it, because great, they can do whatever they want. What about ops? Why would ops like MongoDB? Well, as somebody who used to carry a pager for a production system, I can tell you this: the main reason is probably replica sets.

MongoDB pretty much, almost from the very beginning, had a concept of replica sets for high availability, which was achieved via automatic election, by consensus, of the secondary node that's most up to date if the primary went down. So if you have a replica set of three nodes, one's a primary and the other two are replicating off of it. You could have seven, or 50, or however many; I think 50 is the limit. If lightning strikes your primary, instead of getting a call in the middle of the night saying you have to log in and promote a slave to be the new master in your barely awake state, the most up-to-date secondary gets elected to be the new primary and the cluster continues to work. When you come in in the morning and realize that somebody just tripped over the network cable and accidentally disconnected the primary from the network, you plug it back in, it rejoins the replica set, catches up, great. I never liked getting calls at three o'clock in the morning more than once, because if you get the call once, you're a hero. If you get it more than once, then you're the bozo who didn't configure things correctly, right?

Okay, so the other thing that ops tends to like is the idea of horizontal scaling, right? The idea of, well, if I have too much load for one replica set to handle, I'll just add more replica sets, we'll call them shards, and the work will get partitioned across them. And one of the reasons that's really appealing nowadays is because, well, we operate a little bit differently now than we used to: we operate everything in the cloud. The cloud is magical. You push a button and a bunch of servers magically appear. The way things used to be, like 20 years ago, was that you spent a few million dollars, which was like a lot of money, on a big powerful server, you'd put your system-of-record single database server on it, and you had to treat it like it was your pet, like it was your family member. And if it got sick, you had to nurse it back to health, because your business depended on it. And now, with the cloud and Amazon and pushing a button, we pretty much treat those servers and the databases on them like cattle, which is: we push a button, we get a few thousand of them or a few hundred of them, and if one of them is not performing its best or is acting sick, we don't generally go to the vet, we just, you know, kill it, whatever. We just turn it off and we provision a new one to replace it. So it's the whole servers-as-pets-versus-cattle analogy.

All right, so developers love MongoDB. They wrote their POC in record time. So you know what they do next, right? They ship it. I hope no developers in the audience are getting offended. I feel like I can say that because I've been a developer and I've been guilty of exactly that. It's just that when I did it, I called it agile, and that was even before it was a buzzword. Okay, and DevOps says, all right, we can push a button and provision more servers for you, because you didn't actually load test it, so it needs more servers than it would if you had learned about indexes and stuff. And it's cloud friendly, great. Everybody's friends, except there's one group of people that MongoDB didn't make friends with during this process, and that would be the business users. Who are the business users? Well, in a large enterprise, there may be thousands of them in there, generating reports that pretty much tell the business its health. You deploy a new application: how many users are using it? What sort of ramp-up rate is it showing?
What different things that you're trying are having an effect, or not having the desired effect? And the problem is that not only do business users pretty much know SQL, because that's what they've been using for the last couple of decades, but they're also using tools. They're using tools like Tableau or MicroStrategy or Excel, everybody's favorite reporting tool. And those tools understand that the entire world is made up of data that's in tables, rigid, predefined tables; that's what they expect to speak to, and they send SQL to it. Most of them don't even understand arrays, even though relational databases, a lot of them anyway, have provided support for arrays for a while now. So on one hand, you have this modern, dynamically shaped data that's got arrays and subdocuments and flexible data types. And on the other side, you have your regular reports. And the people who write and run those regular reports say, hey, we want to be able to access the data in MongoDB. How do we give them this access?

So we figured we had a few options. Option number one, everybody's favorite from our data warehousing days: ETL, extract, transform and load, right? The source data comes from one of your bazillion systems, all OLTP-ish systems, and one of them now is MongoDB. You just extract that data, you normalize it, you flatten all your shapely data, you load it into your data warehouse or your data mart, and you report against that. The problem with that is, of course, the business users are saying, why should we have to wait? They're more and more actually going directly to the source systems, hopefully to secondaries, and they're saying that takes too long. Plus, they don't want to depend on another group. They don't want to depend on resources on another team, right, to update this ETL process when the schema changes, et cetera, et cetera.

So okay, option two: let's teach the business users the MongoDB query language. They'll love it. They'll love JSON. Everybody does, right? All those parentheses that look so pretty. Yeah, I didn't think so. These are people who have skills that they're already experts in. Plus, even if we teach them the Mongo query language, how are they going to stick that into their Excel or Tableau, right? So that's probably a no-go.

What if we teach the BI tools the Mongo query language? Now, honestly, in my world, that's the ideal scenario. The ideal scenario is these tools come into the 21st century, realize that relational databases and other databases support arrays, support differently shaped data now, and start representing that to the end user, so that reporting can be more comprehensive and more true to what the data behind it actually is. But that's not my windmill to tilt at, because I don't work for those companies. I work for a company that's trying to solve: how do we get business users access to MongoDB data?

So, okay, what if we just tell them that they can switch their BI tools? It turns out that there are a handful of BI tools that can natively connect to MongoDB, among other less traditional databases, right? There are not many of them. So imagine going to some giant bank that has 100,000 employees, of whom like five or ten thousand are these business users: hey, you know that tool you standardized on like a decade ago? Time to switch. Here's a tool you've never heard of. It supports MongoDB natively. And they go, yeah, right. Goodbye.
So that leaves us with the option to add some kind of a mapping layer, to translate between SQL and the Mongo query language and to somehow provide the information about schema from MongoDB. And that's what we picked, because we're a bunch of masochists.

So, the other thing I gotta tell you, to be honest, is that my position was: this will never work. I said, fine, let's build a prototype, give it to some of the end users that we're working with closely, right? Have them try it with their reporting tool, and they'll tell us that it's unacceptable, that it's too slow, that it's too complicated to maintain an extra layer in between. They'll tell us that SQL can't really represent the way they want to query their data, that they can't possibly use a fixed schema definition to describe their fancy, multidimensional, complex MongoDB data. And here's the funny thing: that's not what happened. The users actually said, yeah, that works fine. Great. Yeah, do more of that. And I was like, really? You sure? And they were like, oh yeah. It turns out no matter how fancy and complex the data might be, the people who are generating reports usually know which fields they want to report on, and those tend to be the fields that, well, they expect are generally gonna be there. So there were some technical issues around that that we had to solve, but it kind of proved that maybe this approach would work. I said, okay, sometimes you gotta give the users what they tell you they actually want, rather than what you want them to think that they want. No, really, this is better. Okay, fine. What you want, I guess, is gonna be good enough.

So we gotta build this bridge between your Excel and MongoDB; call it a BI bridge, a business intelligence bridge. The product is actually called the BI Connector. And we said, well, our prototype was based on a Postgres foreign data wrapper, so let's see how far we can take that in doing what needs to be done in order to make this workable for, let's say, 95% of all the users, right? Because maybe the users we were working closely with, maybe their data is not very representative of what's out there.

So what are these problems? Well, these are all things that Postgres already knows how to do with the reporting tool; the question is, can we fill it in on the right-hand side? So Postgres already knows how to present schema to a BI tool that says, hey, what's your schema? But Postgres needs to get that schema from MongoDB, and that's a big problem, because you remember that self-describing schema: it might be very differently self-describing in different documents. I'll go into the detail of that issue in a minute. SQL goes from the BI tool to Postgres. Postgres needs to push down some kind of a MongoDB query, right? Hopefully pushing as much of the work as possible down to MongoDB, so that the work is done on the large cluster, not on the single thing that's sitting in between. And then of course the BSON is gonna be streamed back, which needs to be translated into rows, making sure to only return the fields that are requested and only the rows that match the predicate.

Great, so off we go, trying to solve this. The biggest problem, of course, is schema. Right, so here's a typical document. I don't know, I guess a black background might have been a poor choice; hopefully you guys can read it. What this is, is a sample document from a user collection in the Yelp sample data set.
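The slide itself isn't reproduced here, but based on the description that follows, the document looks roughly like this; the values and some of the field names are approximations, not the actual Yelp data:

```javascript
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "user_id": "abc123",                                      // made-up value
  "name": "Asya",
  "votes": { "funny": 42, "useful": 57, "cool": 31 },       // embedded subdocument
  "compliments": { "cool": 3, "writer": 1, "photos": 2 },   // ~11 possible subcategories
  "friends": [ "user456", "user789" ]                       // array of other users' ids
}
```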
So Yelp makes available every year a bunch of sample data. It has users, businesses, reviews, right? So I downloaded it and, of course, searched for myself, and found myself. Turns out there's only one Asya in all of Yelping, at least for that data subset. So I was like, okay, we've got some embedded fields: votes, there's like funny, useful, cool. I got lots of votes, yay. You know, there's compliments; I didn't really know what that meant. And then there's an array of friends. So I have some friends on Yelp. That's like people who are like, oh, I know this person from Facebook, so I friend them.

So we try to map that onto a relational schema. I said documents are self-describing, so you say to the document, what's your schema? You look at a bunch of them, because some fields might not show up in every document. And this is what you end up getting. The weird one is compliments, because there are like literally 11 of them; it just so happens that in most documents you don't have all of these subcategories. I'm not a fan of how this was modeled, but sometimes you don't really get a choice, right? Because sometimes even the developer didn't really think about how this would be more usefully modeled for reporting on it later. And of course, part of the problem here is we might have missed some fields. We might have missed some fields because they were very sparsely populated, and let's say there are 30 million or 30 billion documents, you're not gonna look at each one of them, right? So you're gonna try to sample them. And I guess you're gonna need to give a way to the user to fix or correct or enhance the schema that you generate, based on what they know about the data.

Of course, an obvious problem is also friends. So notice friends is not on the left-hand side. Friends is an array. Now, I said, well, Postgres supports arrays. Could we just map it onto an array? And I went looking around to see how BI tools handle arrays in relational databases. And I can tell you, it was extremely disheartening, because of what I found on the various Tableau boards. Let me translate this for you in case it's too small; I just don't wanna read the whole thing. Basically: arrays are a dirty hack for when you don't really wanna do things the right way. You should just flatten them, or un-nest them in Postgres terms, or unwind them in MongoDB terms, because they don't belong in a relational schema. So it's not considered a flaw of the reporting tool that it doesn't understand arrays; it's considered a flaw in your schema design. Rather disheartening, but I guess I knew what we had to do. We have to un-nest or unwind the arrays, and the way to model them really came down to one of two ways. We could model it as a completely separate table. So say friends is a separate, normalized relational table with a foreign key back to the person to whom the friends belong. So we essentially unwind the array and represent that as a single table. Or we could represent it essentially like the user table pre-joined with the friends array being un-nested.

So this is what it would look like if it's represented as a separate table. Conceptually, that might be easier for the report user to understand; they've already seen this pattern. Each user may have zero or more friends, so they can join to the friends table. If they wanna keep all of the users, whether or not they have any friends, it's a left outer join. If they only wanna report on people who have friends, it's a regular inner join.
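On the MongoDB side, both of those representations boil down to unwinding the array; a rough sketch, using the same made-up field names as the document above:

```javascript
// One output row per (user, friend) pair, roughly what the flattened,
// pre-joined "user + friends" table would be built from:
db.user.aggregate([
  { $unwind: "$friends" },
  { $project: { user_id: 1, name: 1, friend_id: "$friends" } }
])
// The "separate table" representation is the same idea, just exposed as its
// own table (friend_id plus a foreign key back to the user) rather than
// pre-joined with the user's other columns.
```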
The problem is, if that join doesn't get pushed down to Mongo, and, I'm jumping ahead a little bit, but in foreign data wrappers it doesn't get pushed down to Mongo, then Postgres doesn't know that there isn't actually a join needed, that the query is really against a single table, right? And a lot of unnecessary work might get done. So presenting it as a pre-joined, flattened, un-nested array kind of seemed safer. It would discourage people from joining tables that, underneath, were really a single collection. But again, not only weren't we certain we were gonna get it right, I was pretty certain we were gonna get it wrong for some subset of the users, which meant the users had to have an option to represent arrays one way or the other, depending on what they wanted to do with them.

The other problem we were gonna have is types. The simpler type problem is that MongoDB supports some types that don't exist in regular relational land. So for example, in MongoDB the underscore id field represents the primary key that is always present in every document. And if the application does not provide its own value, a natural primary key, one will get generated on the system and added to the document. And that is of type ObjectId. It's 12 bytes, the first four bytes represent a date, blah, blah, blah, et cetera, kind of meant to be more or less globally unique. But it's not a type that exists in other systems. It can very easily be visualized as a varchar; we can certainly show it as a varchar, we can return it, have it converted to a varchar, and that would be fine. The problem is, how do you test for it correctly? An ObjectId of a certain value doesn't equal a string that looks the same, right? So luckily, foreign data wrappers allow you to have options on individual fields, so we can annotate a particular field with what its actual underlying type is. This is also kind of a convenient place to allow people to rename things. So let's say they have location, loc dot coordinates, and they don't want dots because they think it'll confuse people. So replace the dot with an underscore and keep track of the actual underlying Mongo name with another option. That part was actually easy.

The hard part is the mixed types. And there are two different scenarios where you will have mixed types of data in the same field in MongoDB. By the way, what happens if you get an unexpected type in a foreign data wrapper is something like this: hey, I expected a numeric, I got this string, and I couldn't convert it to a numeric no matter what I did. So one scenario is an intentional scenario, and that is when people have what I would call polymorphic collections, right? They're modeling dynamic attributes, or they're modeling several different types of entities in documents in the same collection, and there's like a type field that indicates whether some other attribute is gonna be a number or a string or a Boolean or something else. For that one, there are a few solutions, right? You could map them into several different fields of different types. You could track all the different types and do comparisons with all of them. It turns out the much harder one is where people just have the wrong data, because it turns out that very clever code still had bugs in it. And it's actually the reality of code that bugs will happen. Or sometimes it's an intentional decision, and you're like, hey, you know what? We shouldn't represent dates as seconds since the epoch; we should change to use an ISO date, or a string in a particular format.
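As a made-up illustration of how a field ends up with two types that way, say an application changed how it stores a created timestamp partway through; the collection and values here are invented, not from the talk:

```javascript
// Older documents: seconds since the epoch (stored as a number)
db.events.insert({ name: "signup", created: 1466424000 })
// Newer documents: a real date type
db.events.insert({ name: "signup", created: ISODate("2016-06-20T12:00:00Z") })

// Both shapes now live in the same collection; $type lets you see (or filter) them:
db.events.count({ created: { $type: 9 } })   // 9 = BSON date
db.events.count({ created: { $type: 1 } })   // 1 = double (the epoch-seconds ones)
```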
But as a result, you now have different types, which they might actually find out about for the first time when they try to use the BI tool and it says, oops, I just found a number where I expected a Boolean. And even though JavaScript tells you that one and true are the same, they're actually not. Go figure.

So we came up with this idea that instead of just inferring the schema and pulling it in, we would create some kind of a definition language. We call it DRDL, pronounced like dreidel because that sounds fun, and it stands for Document Relational Definition Language. After the schema is inferred and generated, it's stored in a file. The users can then edit it, add fields, fix types that they knew should be different types. The kind of cool, flexible thing is the pipeline field. That's where all the different aggregation pipeline operators or stages go that should be applied to the Mongo collection before it's presented as a flat view. The obvious one is unwinding an array field. A less obvious one is maybe doing an entire aggregation of some kind, to show an already pre-aggregated view that the BI tool now just does some sort of filtering or slicing on.

All right, so what about performance? How's the performance? Well, one thing that became very obvious is it was critical to push down as much of the work as possible to the Mongo cluster. Because imagine you're doing a sort, and let's say you have like a 15-shard cluster. You managed to push down the sort to Mongo. You're gonna have 15 separate shards, each one sorting a subset of the data. In fact, if there's an index in Mongo, they can just read the data from the index, already in order. The sharding router process just has to merge-sort those results and return them already sorted. The alternative is Postgres has to wait for all of the results to be returned and then do its own sort on that one single machine. That's just never gonna scale. So it was important to tell the Postgres query planner that Mongo could sort by any field in any direction; whether or not that sort would be supported by an index, it still made sense to push it down.

Obviously, we can do stats and index lookups on the table. When we first connect to a MongoDB collection, we figure out what indexes exist and provide the information about path keys for the Postgres query planner. We also can tell if any of the indexes are unique, so we can tell the cardinality or the selectivity of a particular index. The really challenging part came when we tried to figure out what we could return for relation size. We know the total size of the collection, so if there's no filter, we know of course the full count that will be returned, except that if an array is being unwound or un-nested, you actually might be returning a lot more documents, or a lot more rows, than are in the underlying collection to begin with. So what do we say? Do we figure out the actual total number of rows? Do we return the number of rows that we have up front? Would it matter for most people? The other thing is the size of each row. Now, when we look at stats for a MongoDB collection, what we have is usually depicted as average object size. So if we're returning the whole collection, or some subset of it, we can tell you on average what the size of each record would be, and multiply that out by the number of records we think we're returning.
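The raw inputs for those estimates come from collection stats; a rough sketch, with a made-up collection name:

```javascript
db.user.stats().count        // total number of documents in the collection
db.user.stats().avgObjSize   // average BSON document size, in bytes

// A naive relation-size guess for a full scan is roughly count * avgObjSize,
// but an unwound array can multiply the row count, and projecting only a
// couple of fields can shrink the bytes per row dramatically.
```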
The problem, though, is normally you'd only be asking for a handful of fields. One or two fields might be a one or two percent contributor to the average size of each record, or they might be a 98 percent contributor. It's like asking how big a varchar is going to be: it could be a couple of bytes, it could be a lot more than that. I hope you're not holding your breath waiting for the perfect answer, because there isn't one, right? There are basically best guesses, best estimates, and, as much as we hate it, sometimes maybe knobs to allow the end user to specify which way they want things to err. Really, the best thing would be to be able to push down as much as possible to the MongoDB query planner, right? But that brings us to the SQL.

So predicates obviously can be pretty easily pushed down; most of them can be pushed down. Most of the operators in SQL map to equivalent operators in MongoDB, everything from the usual equals, not equals, less than and greater than, to in, any, all, even, you know, regex, case sensitive, case insensitive, et cetera. But joins are not pushed down, as I mentioned, and more importantly, aggregations are not pushed down. Now, sometimes there are tricks that you can use. For example, if Postgres asks for no fields at all, we know it's doing a count. So we don't have to do a query for all documents and then return no fields; we can actually run a count and then synthesize that many documents to return. There's only one little problem. If that count comes back as 2.7 million, it's a lot slower to synthesize and return 2.7 million empty documents back to Postgres to turn into rows than it would be to just return the actual number, right? But that's not a problem that's specific to, you know, our implementation; that's a problem for any foreign data wrapper that wants to be able to push down the aggregation, because the remote system can do it itself, and it would be much more efficient to return just the result, which is gonna be a lot smaller.

So for this reason, and for a couple of other reasons, in particular that there are things you might wanna express in the Mongo query language that can't easily be expressed in SQL, or that, if they're expressed in the Mongo query language, can use a specialized index, like a 2dsphere index for example, we decided we needed an escape hatch. I came up with that name; I can't remember what it's actually called in the docs. But an escape hatch basically says that you should be able, with any arbitrary SQL, to push down a Mongo query language predicate, some kind of a clause that will get applied before everything else. So if you wanna do like a geo-near, you know, particular coordinates within a certain distance, you just push that down, it'll use the 2dsphere index, and then the subset of results will go through the rest of the transformation.

So some of you may have noticed, in an earlier schema that I showed, there was a field called underscore underscore mongo query. That was kind of the original idea: you make up some name that you hardwire to mean this special query. Engineering actually came up with a better idea: how about we use a special foreign data wrapper attribute to indicate that this is a synthetic type? It's not a real type. So it's a string, but we know that this is the synthetic type filter, and that leaves it open to have other special types for whatever other special, exceptional handling we'd want. And now, in some cases, you know, we can push some predicate before all the un-nesting so it can more efficiently use an index, right?
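For instance, the kind of predicate you might want to push through that escape hatch is a geo query, which has no clean SQL equivalent but maps straight onto a 2dsphere index; the collection name and coordinates here are made up:

```javascript
db.places.createIndex({ loc: "2dsphere" })

// "Places within ~5km of this point", applied before any un-nesting or other
// processing, so the 2dsphere index can do the heavy lifting:
db.places.find({
  loc: {
    $near: {
      $geometry: { type: "Point", coordinates: [ -122.33, 47.61 ] },
      $maxDistance: 5000   // meters
    }
  }
})
```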
So for Yelp you say, from the user table join friends, I wanna get the names of all of the users who are friends with me; that's my ID, that's my user ID in that Yelp dataset. And that took like three and a half seconds, because it had to go flatten the giant table, apply the filter to it, and return it. Or I can push down JSON syntax that's gonna get applied first, before we do the un-nesting. Well, actually, here I can even apply it to a table that doesn't have the array defined in the relational schema, but the underlying collection has the array. So the filter is applied to the underlying collection, and we return just the columns that the top-level table needs.

A more complex SQL query would be, don't just give me all of Asya's friends, give me the list of friends of Asya's friends. In SQL that essentially involves a subquery: first we say, give me the IDs of Asya's friends, and then we query again saying, give me the friends of all the people whose IDs are in this set. And that took like 200 seconds or something like that. Or you can push down, say, select all the users who happen to have this user in their friends array; that gets applied before the un-nesting, and then 132 milliseconds later you get back the result. And yes, the SQL looks more complex, but also you have to query the same table multiple times, and that's just gonna be extremely inefficient in the workflow that we have with the foreign data wrapper.

Okay, well, so what's next? I mean, we clearly have to improve performance, right? We can handle all of SQL conceptually, and we can let people push down the stuff that they can't express in SQL, but it would be nice to be able to push down all operations. Unfortunately, aggregations and joins, which are critical for giving the user with a large data set a good experience, are not pushed down by Postgres foreign data wrappers, so we have to come up with something else. I think maybe we'll embed Oracle in the next version. I'm just kidding, turn off the recorder. You guys awake? That was a joke. Okay, so yeah, and that's pretty much it. And look at this, 40 minutes on the button, and we have 10 minutes for questions. Wow, sorry, I spoke a little fast. I guess I have more time than I realized. Any questions? Yes.

The early prototype was kind of like an internal skunkworks project that just did a very basic foreign data wrapper with Multicorn, and it just pushed down the very trivial equals, not equals, greater than, less than, and it didn't have any optimizations. It assumed the schema would be flat, and it assumed there wouldn't be any type conflicts. And it turns out if all those assumptions held, it would be awesome. In fact, if nobody had any arrays, it would be really fantastic and we'd never have to even think about anything else. So yeah, and my assumption, coming from the point of view of having built data warehouses and data marts, was just: you're using something that handles flexible schema, it must be for a reason, therefore you're not gonna be able to apply very simple SQL-ish type statements to that data. And I think the reason I was making that assumption is that I was mentally mapping the business users and the developers of the application onto the same kind of group. The developers expected the data to be frequently changing.
They might start adding, like, linking with Facebook profiles for users, which means they may be getting attributes tomorrow that they don't even know about, because Facebook doesn't know what it's gonna be sharing tomorrow, or something like that. So they weighed the flexibility as being most important, and then the ability to check for existence of fields and query on those fields. From the business side, reports, even though they can be run ad hoc, are not designed ad hoc. You don't run a general query and see what comes back. You usually say, what questions do I need to answer for the business, and then look for the data that answers those questions. So what I realized, as I was working with some of the early users giving us feedback, is that that was the place where mentally I didn't put the gap, the separator, between what the developer requirements were versus what the business user requirements were. And as long as I was thinking of them as one uniform group of people, I couldn't see how they could possibly be happy figuring out a fixed schema for something that they specifically said wasn't fixed, and applying SQL to it, right? But they were two different roles, and so actually, I was wrong. It doesn't happen often, but I'm a big person, I can admit it.

And yeah, sure, in my ideal world, I think I said this before, the Tableaus of the world, the MicroStrategies of the world, would start recognizing multi-dimensional schema. They could start by adding support for arrays, right? And then support for other data types that maybe can be somehow discovered dynamically. I mean, Postgres has exactly the same problem, right? It supports JSON; business users eventually are gonna be like, how do I report on that stuff? It just shows up as one field in my reporting tool called data, what do I do? And the answer is gonna be like, oh well, if you wanted to report on those fields, you should have pulled them out into regular relational columns. And they'll be like, we don't have control over that. Developers have control over that, right? So it'll be back to the same thing, ETL or whatever. So the problem is not actually specific to MongoDB; it's specific to any data that doesn't fit the standard relational, traditional SQL type queries. Sorry, that was a much longer answer than it needed to be. Somebody else raised their hand. Yeah. Ah, well, you don't have to.

There are a couple of things that could be done, right? And the right thing might depend on the needs of the business users, the application, or why the field is like that. Let's say the field has two different types because it can represent one of two different entities, right? One thing you could do is actually map the collection onto two different tables that have that field, one type in one case, the other type in the other case, and have your pipeline include a filter: match type equals document type one in one, and match type equals document type two in the other, and now you actually expect that subset to be uniform, right? Obviously Tableau supports arbitrary SQL, so you could actually union those and see them again as a single table. I can't remember what happens if you union things that have inconsistent types across the same name. Well, you can actually name the columns differently, you don't even have to name them the same thing, right? The other way is, in any particular table definition, you could have the same MongoDB field represented as two different columns.
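Before getting to that second approach, here's a rough sketch of the first one, where each relational table is really the same collection seen through a different $match pipeline; the collection, field, and type names are all made up:

```javascript
// "invoices" table: only documents of one type, so the shared field is uniform
db.things.aggregate([ { $match: { docType: "invoice" } } ])

// "receipts" table: only documents of the other type
db.things.aggregate([ { $match: { docType: "receipt" } } ])

// ($type can do a similar split purely by BSON type when there is no
//  explicit type field, e.g. { amount: { $type: 1 } } for doubles only.)
```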
So let's say I have a field foo, and it could have a boolean or a number. I can have foo underscore boolean, which maps to foo as a boolean, and foo underscore number, which maps to foo as a number, and then use, you know, the foreign data wrapper column attributes to indicate the type that you essentially want to filter on. And Mongo has a dollar-type operator, a query operator that says only match a particular field if its value is of that type, boolean say. Or, less efficiently, when returning the fields you could actually massage the values into the appropriate type. In fact, I had a bit of code in the skunkworks prototype that actually would do that, but clearly, if you're gonna be returning a couple of million rows with a dozen columns in each, you don't wanna be examining the type of every one of them, right?

Yeah, I love that one. Yeah. DRDL? DRDL. Dreidel, I mean it. Sorry.

Was that an option already? Yeah, so this is where the workflow could be that you have a little job that constantly re-samples and adjusts the types, because you can load a new foreign schema. There's no actual data living in Postgres, right? You can update or alter the definition of a foreign table, and the very next query that comes in is gonna get the refreshed definition, and all the BI tools have mechanisms for noticing and refreshing the schema so they can expose the fields to the end user. The problem is that if somebody hand-edited their file, you don't really want to be generating new schema and overwriting things. So this is kind of a balance between, do you do something for the end user, or do you just explain to them how they can set up a cron job and do stuff, unless they want to apply some transformations there. We don't really have a good answer to this, because you want both to just magically work the way the person wants them to work, but you can't really read their mind. So right now it's, you know, they'll have to make a choice, and we're still trying to figure it out. If anybody has any suggestions, I'll be around at lunch, by the way, if you think of some questions or suggestions. Okay.

You mean like the one that's in my personal GitHub repo? Yeah, that's not the product, but that is like the very early prototype. That, I apologize, has bugs in it. I've never had time to go back and fix stuff as I actually figured out that there were bugs; I probably should, because that would make me a good open source citizen, but yes, no. So I mean, the BI Connector is part of the enterprise product, so nothing that actually ships is in any public repo, if that's your question. Yeah, sorry, they tell me we have to charge money for something, and I like my paycheck, so I'm not against that in principle, you know, or in practice I should say.

Honestly, there were only a couple of things I found where I wished things worked better. One of the things I looked at was using MongoDB for authentication via PAM, and the PAM implementation in Postgres is really limited. Like, it will not let you pass any additional arbitrary fields, which we have to be able to do in Mongo in order to be able to authenticate properly. So it was kind of a, oh, if you use the most trivial authentication on Mongo, and it's always gonna be on either the admin database or the database you're connecting to, then it works.
That might be in my repo somewhere, that might be public, but it's of extremely limited usefulness. And my thought, of course, was to immediately, like, you know, add the missing stuff in Postgres, and then I'm like, okay, I don't have time for that, because it was probably already like 11 p.m. on a Saturday or something.

So the cardinality, yeah, the selectivity for a particular predicate. What we can do in Mongo to emulate that, and I kind of played around with this a little bit in the prototype, is you can look at the query, and if there's no index on it, you can just punt, because you know you're not gonna be able to estimate in any performant fashion, right, and just return the size of the collection. If there's an index on it, you just run a count with that predicate. That uses a shortcut within Mongo to just do a fast count off an index scan; it doesn't even have to scan the leaves, it just kind of does a quick pass. And in Mongo there's a concept of maxTimeMS that you can tack onto any query to make sure it doesn't run for longer than a certain number of milliseconds. So you can always essentially do a best effort to find out the actual cardinality, the actual selectivity of the query, and return the real number back to the Postgres query planner. And if it fails, you punt and you say, well, I don't know, it might not be selective at all.

Okay, we're about a minute over the original time. Like I said, I'll be around at lunch if you guys have any questions. I'm happy to chat about this stuff. So thank you very much.