I'm Mark Friedman, and I'm sorry if you guys didn't see me earlier — I just landed in Singapore today for this conference. I'm from Pivotal; we do databases, and most of them are open source now. I'll give you an insight into how we actually got Postgres and Hadoop working together: why we did it, what the takeaways were, and which takeaways the community should take a look at. Just a quick overview.

So, as I was saying, this combination is not something that was imaginable until recently. I was at the Postgres conference in Madrid in 2014, and I got this question: why isn't Postgres working with Hadoop yet? Why are we closed off from the whole ASF community? Because if you look at it, it's not only about Hadoop; it's about the whole ASF ecosystem. They have Apache Ambari, Hive, and a myriad of other projects hanging around there — this one does something on HDFS, that one sits on Hadoop, and so on. And we're locked out of all of that.

Traditionally, we've been a relational database, and this is something we've always been good at, right? A word of caution: the slides suck, so please don't hold that against me.

So, having a look: we've always been a great relational database as a standalone server. We started off as a standalone server, right? And for Postgres, something that really mattered was the quality we maintained. I think most of us here are part of the developer community, so we know how hard it is to get a patch in. There are always the jokes about how Tom brutally rejects your patches, and how you spend less time writing a patch and more time arguing it over. That's the kind of process we have, but it all boils down to the quality of the database we've always maintained.

Moving ahead: over the years we started moving toward OLAP capabilities — analytical functions, reporting. And recently — I'll take some credit for that — we got grouping sets, and CUBE and ROLLUP, stuff like that. So we've always been a feature-rich database. SQL standard compliance: we're mostly compliant — I think we're competitive there, I'm not sure. We have some things that are not part of the SQL standard, but they're good for us and for the customers we have, so we've kept them. And there was a long-standing complaint that UPSERT wasn't present; it's present now, so even that complaint is gone.

Strong transactional semantics: it's pretty hard to mess up transactions in Postgres, given the variety of isolation levels we have now. You'd really have to do something extremely stupid that we haven't thought of in order to bungle your transactions.
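Since the talk leans on these features, here is a quick, self-contained illustration of the grouping sets and UPSERT functionality just mentioned — the table names are made up for the example, not from the talk:

```sql
-- Hypothetical tables, just to make the features concrete.
CREATE TABLE sales (region text, product text, amount numeric);
CREATE TABLE inventory (sku text PRIMARY KEY, qty int);

-- ROLLUP / CUBE / GROUPING SETS (PostgreSQL 9.5+):
-- several groupings computed in one pass.
SELECT region, product, sum(amount)
FROM   sales
GROUP  BY ROLLUP (region, product);  -- (region, product), (region), grand total

-- UPSERT (PostgreSQL 9.5+), the long-requested feature mentioned above.
INSERT INTO inventory (sku, qty) VALUES ('A-100', 5)
ON CONFLICT (sku) DO UPDATE SET qty = inventory.qty + EXCLUDED.qty;
```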
Interactive queries: I wouldn't call Postgres a real-time system, but it's a near-real-time system. Unless you're using Postgres in an aircraft or something, you'll be good to go — you'll get pretty good response times, and the TPS is fairly high, high enough to keep you happy. Relative to something like Hadoop, where you can't really do interactive queries with MapReduce, right? So that's a win for us.

Now let's have a look — and this is a bit controversial, so please don't throw things at me — at what we're not really good at. Maybe none of this was applicable ten or fifteen years back, but now, given the scale of data we work at, the number of data formats we have around, the ad hoc queries our data scientists want to run today, and the massive shifts we have between data formats, it matters.

We don't support a variety of data formats. We have one specific on-disk format — the page format, in 8K pages or whatever size you built with. That's it; there's no other way to store your data. Now, somebody may argue that we support ingesting a lot of other data formats into Postgres. That's a different story altogether: you take some data, you use some converters, and you convert it into the native Postgres format. Then you're following an entire ETL process.

So suppose there's a data format called X, and you have SQL functions that convert from X to the native Postgres format — basically relational tables. You had tons of X, you called your functions, you converted it, you stored it in Postgres. Nice and happy. Now, tomorrow your data scientist says: hey, X had an attribute — or maybe two attributes — that I want to query in some other form. I need to do that now. But you don't have those attributes; your conversion functions didn't take care of them. What do you do? Do you go back, get the X-format data, and write a new conversion function? Yes, that's the only way. Eventually you get bored of it, and your data scientist will say that you can't do ad hoc queries or ad hoc experiments, right? And that's the need of the hour in data science.

Next — and points three and one are correlated, I'll come to that — there's no way to bypass the SQL interface. Here's the thing: say I have 10 gigabytes of data. Everybody here knows what a Postgres database looks like on disk. I can't go into the data directory, create a new file — one-four-four-something, some random OID — dump my data into it, and have Postgres know my table is there. It does not happen, right? So that prevents me from ingesting large amounts of data in whatever format I have. Even if I manually munge my X-format data into the Postgres format, I can't just drop physical files in place and expect Postgres to pick them up. That just does not happen. And the other way around: even if I have my own weird C or C++ program that understands the Postgres data format, I'd be really stupid to go and touch the data directory, right? So the only way in is the SQL interface that Postgres provides.
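To make that ETL pattern concrete, here's a minimal sketch — all the names (`events`, `ingest_x`) are hypothetical, not from the talk. The point is that whatever the conversion function drops on the floor is gone until you re-ingest:

```sql
-- Hypothetical target table for the "format X" story.
CREATE TABLE events (id bigint, ts timestamptz, payload text);

-- Flatten one X record (modeled here as jsonb) into the relational table.
-- Any attribute of X not listed here is simply lost at ingest time.
CREATE FUNCTION ingest_x(raw jsonb) RETURNS void
LANGUAGE sql AS $$
    INSERT INTO events (id, ts, payload)
    VALUES ((raw->>'id')::bigint,
            (raw->>'ts')::timestamptz,
            raw->>'payload');
$$;

SELECT ingest_x('{"id": 1, "ts": "2016-01-01", "payload": "hi", "extra": 42}');
-- "extra" is now unrecoverable from Postgres; when the data scientist asks
-- for it tomorrow, you rewrite ingest_x and re-run the whole ETL.
```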
No resource management. That's well known and well complained about, so I won't even argue it here. Everybody says there's no way to put limitations on a query. Okay, I can restrict work_mem, but then Postgres starts using stupid nested loop joins and things like that. I can't really manage resources here.

Scalability, yes. Because we're a process-per-backend server, we scale up to a limit. It's not an infinitely scaling system; we don't use threads, and we're not a massive cluster system. There are limits to our scalability. Even XIDs, for example, were 32 bits until recent times — a colleague of mine wrote a patch for 64-bit XIDs, which I think got accepted, I'm not sure — so we're getting there, but there's a ceiling you hit when it comes to scaling.

And, as I mentioned earlier, there's a whole lot waiting for us in the Apache Software Foundation ecosystem that we just cannot touch. Look at it this way: Hadoop is the entrance to a lot of things. It's not just Hadoop; it's a lot of things we can leverage off of. We don't need to reinvent the wheel for a lot of them, and we're locked out of that right now.

So, okay: Postgres is moving toward more of an MPP style — massively parallel processing. There's been a lot of great work done on FDW-based sharding, foreign data wrappers, and on pushdown — we have sort keys pushed down, and maybe one day we'll even have query plans flowing around through foreign data wrappers. But then there are also the capabilities that HDFS and Hadoop offer us. So, what do we do?

So, Pivotal. There's an experimental product which tried to combine the two. We won't discuss it in depth — I'll go over it a bit, and then we'll come to what we learned while building it. The product is called HAWQ; it's now Apache HAWQ, having gone to the Apache Software Foundation. I'll go over how HAWQ is related to Postgres. Some of you might know that Pivotal did a proprietary fork of Postgres many years back, somewhere around PostgreSQL 8.2, which was called Greenplum. Greenplum is an MPP engine, and a lot of code rewriting happened there: parallel query execution, parallel processing, plan distribution among multiple segments, and so on. HAWQ was originally an internal fork of Greenplum, and the first objective we had was making Greenplum work on HDFS. That was the first thing we did. And why the name HAWQ? Don't ask me; it just sounded cool.

Okay, so this is what HAWQ basically does, and it's now completely open source — we'll go over how you can contribute to it. It works natively on HDFS. We're not using some weird connectors, and we're not pulling data out of HDFS onto the local file system and then working on it, nothing like that. There's a library called libhdfs3, which we use, which talks directly to HDFS. There's no middleman in between. Just an FYI: to make HAWQ's writes work, we needed to add a truncate operation to HDFS. Why we needed it, I'll go over later. It's in Pivotal's HDFS; I don't think we open sourced that part, or maybe we did. But it was something we needed to add.

So here's what we got: data processing on HDFS, scaling with HDFS, and full access to the Apache code — everything in Apache suddenly became open to us.
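For what it's worth, from the user's side a native HAWQ table is ordinary-looking DDL; the storage just lands on HDFS via libhdfs3. Roughly like this — the storage option names follow the Greenplum family and vary by version, and the table itself is made up:

```sql
-- In HAWQ this data lands on HDFS; in Greenplum, on local disks.
CREATE TABLE clicks (
    ts      timestamptz,
    url     text,
    user_id bigint
)
WITH (appendonly = true,      -- HDFS is append-only, so the table is too
      orientation = row,      -- column orientation is also available
      compresstype = zlib)    -- on-disk compression
DISTRIBUTED BY (user_id);     -- MPP-style distribution across segments
```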
The ASF community came to our support. We have a dedicated team that works on the integration of HAWQ and Apache Ambari, and some engineers who work on YARN as well. Two of our engineers actually became Ambari committers through their work on Ambari support for HAWQ. That's how the ASF grows. And, as I said, Ambari installs it, and the resources are managed by YARN itself. So, suddenly, we're very happy.

So, this is what it looks like. Something really cool here is the cost-based optimizer. The Postgres optimizer was developed over many years by great minds; it was leveraged by Greenplum, which enhanced it further, and then it was used by HAWQ. So Hadoop suddenly got a system with a great query optimizer. The query optimizer actually became one of HAWQ's salient features, and people were surprised that something on Hadoop could have such a good cost-based optimizer. We'll also go over dynamic resource allocation, which is driven by the optimizer itself, based on the cost of the plan that finally gets executed. So that's something great. ANSI SQL is leveraged directly from Postgres, and we can continue enhancing it.

MADlib is another open-source project — Apache MADlib, developed by Pivotal — which is basically a library of machine learning operators that you can use for any sort of weird k-NN, clustering, and that kind of thing. The interesting thing is that it works with Postgres. I'm not sure if anybody here knows, but back in 2013 we did a MADlib project in Google Summer of Code — I was the co-mentor — in which a student from the Postgres community added some features to MADlib. That's the kind of relationship we've always had with MADlib, and the people using MADlib are on Postgres, writing functions. So MADlib is fully applicable to Postgres; it's not something specific to Greenplum or HAWQ.

A quick overview of HAWQ history: it was originally called "Greenplum Database on HDFS", because that was the only goal it had. It went alpha in 2012, there were subsequent releases, and in 2015 it entered the Apache incubator. Hopefully it will be an Apache top-level project soon enough.

So, you may be confused about which part of HAWQ is Postgres; we'll get to that. Okay, architecture. This is what it looks like. If you've looked at clustered MPP systems like Postgres-XC or Postgres-XL — I love XL — it's a master-slave setup; they're called the master and the segments here. The master basically runs the YARN resource manager, the catalog service, the HAWQ master, and the HDFS name node. So HDFS is already up and running, and the name node sits on the master of the HAWQ cluster. The HDFS data nodes run on each segment. Segments are nothing but slaves; it's a fancy term. There's HAWQ's internal node manager, and then the segments themselves. Now, segments are physical boxes — one box equals one segment — but a segment can be used to process multiple queries at the same time, so it spawns virtual segments. Maybe the term's not ideal, but imagine a segment as a physical box that spawns multiple workers for different parallel queries; those are called segments as well. It gets a bit confusing, but it's simple enough.

So this part, the grayish part of the diagram, was inherited from Postgres through Greenplum.
The query optimizer, the parser, and the analyzer are mostly Postgres; we just added to the grammar, the tokenizer, and such. The resource manager is specific to Greenplum, so it's not something from Postgres. The dispatcher and the catalog service are also specific to Greenplum, but they're there, right? And this is the main client-invocation code.

Interconnect is, again, an open-source alternative to the Postgres communication protocol, FE/BE. Instead we use Interconnect, which is much more stable and uses state machines. I won't go over it much; it's just there.

This is what execution looks like. Pretty standard: parse, then plan. The dispatcher is the additional piece here, because all the planning happens on the master, and then the dispatcher sends your plans out to the executors, which are the slaves. The resource manager checks whether there are enough resources available to execute this query; if there are, the plan goes out to the slaves and all goes well. Otherwise, the resource manager says it's not able to execute this query right now and it waits.

Just another note on plans: with the recent addition of parallelism in Postgres, we have the Gather node, and there's a pretty similar gather motion in HAWQ as well — distribute, then gather results. In addition to gather, we have broadcast and redistribute motions, which we won't go over here because they're pretty standard and not that relevant to this discussion. But those are the three types of motions we have. Broadcast is basically sending tuples to all segments.

Okay, this is something interesting. What happened was that the catalogs started to become a bottleneck for HAWQ, in the sense that the catalogs only exist on the master, and whenever a segment needs to access something in the catalogs, it goes and talks to the master. That started becoming a problem, because that code was never built to scale at this level. So internally, a language called CAQL was invented, which is internal to HAWQ — if you're a client and you want to access the catalog, you still use SQL. CAQL brought a lot of caching and that sort of thing, so it basically optimized catalog access. This is part one of what's really different between Postgres and HAWQ. Now, I'm not saying that Postgres needs to become a distributed system first in order to move to Hadoop or HDFS. But if we do want to move to HDFS, this is something we need to look at, because the catalogs really can't live on HDFS — we need to store them on a normal file system. But then how do we make sure that catalog access doesn't eventually prove to be a bottleneck? So that's one part of it. There's another thing called self-describing plans that we'll go over.

Virtual segments, which we discussed: there's a physical box called a segment, and virtual segments are the processes spawned by that physical segment to execute parallel queries. A segment is pretty much stateless; it doesn't maintain any state.

Self-describing plans are pretty interesting; that's something we'll go over. So right now, if you know the Postgres query...

Q: Is it the front end you changed, or the back end — from the raw file system to HDFS?
A: The back end. The front end is mostly the same.
In the front end you do get enhanced capabilities, like resource limitation, which isn't available in Postgres. You can actually go and set your resource limits — don't allow more than 2 GB for this, or something like that. But apart from some additional features I mentioned, not much has changed in the front end. The core changes went into the back end.

Q: How does this relate to HadoopDB — the integration between Postgres and Hadoop?
A: HadoopDB, okay. I think this is more of a marriage of MPP systems with Hadoop; that's what we call it. We're not saying they're comparable. And just to be clear, this is not trying to be an OLTP system. No.

Interconnect is the replacement for FE/BE. It's more of a state machine, and it uses both TCP and UDP — UDP gives a much better performance benefit, and TCP provides more stability. We won't go into it too much. This is what the state machine looks like, but don't ask me questions about it.

Okay. Transaction management is something new again. Something that proved to be a problem for HAWQ is the presence of 2PC and MVCC. How do you allow multiple writers to the same table, or to the same tuple? And remember, the biggest constraint of HDFS is that it's an append-only system. You can add to a file, you can now truncate it with the new feature, but you cannot overwrite it. So how do you manage that? That's a problem for WAL as well — you just keep appending WAL. And pages just keep growing. How do you figure out to what extent a page is valid, and what part of it needs to be truncated? Maybe a transaction never committed, but its data was already written to the page on HDFS, and you can't delete it — you'd have to truncate the entire page, which is not a great idea. So what do you do?

Okay. So there's something here called the swimming-lane protocol. In a nutshell, it allows multiple writers to create multiple files for the same table on HDFS, and each of them writes to its own file. And there's a separate table — not on HDFS, in the catalogs — which maintains which HDFS files are relevant to which table. Because this is an OLAP system, we don't expect a lot of transactions. OLTP would have a million transactions together, and this would have gone haywire — a million files for one single table of updates, no thank you. For OLAP we don't expect that kind of load: there are fewer transactions writing more data. That's good for us, because maybe there are five transactions, each writing 10 gigabytes into its own file, which is fine. The catalogs don't grow that much — they only see five files — while the data on HDFS grows with each writer. So writers aren't blocking each other; they write concurrently, and there's just a bit more catalog overhead for maintaining which files belong to what.

Updates are handled the standard way.

Q: But if you have multiple versions of the same table in multiple files, you need to update all of them?
A: The catalog knows which files are relevant to each table, so yes, you go and write to all of them.
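As a concrete sketch of the bookkeeping the swimming-lane idea needs — this is entirely hypothetical, not HAWQ's actual catalog:

```sql
-- Hypothetical mapping from a logical table to its files on HDFS.
CREATE TABLE hdfs_segfile_map (
    relid       oid,    -- the table's OID, as in pg_class
    hdfs_path   text,   -- one writer's private file ("swimming lane") on HDFS
    logical_eof bigint  -- committed bytes; readers ignore anything past this
);

-- Two concurrent writers to the same table append to different files,
-- so they never block or overwrite each other:
INSERT INTO hdfs_segfile_map VALUES (16384, '/hawq/16384/seg0.0', 0);
INSERT INTO hdfs_segfile_map VALUES (16384, '/hawq/16384/seg0.1', 0);
```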
Q: Are the extra files covered by the Hadoop side — the distribution and all of that?
A: Yes, that's it. All the distribution is handled there.

Q: And how do you identify which files belong to which table?
A: So, the data is on HDFS, and it's handled by Hadoop, that's fine. HAWQ only maintains internally — in the catalogs on the master, not on HDFS — which files on HDFS are related to which tables. There's another new catalog which tracks the file IDs on HDFS against the table ID known to HAWQ. We get the OID from pg_class and record that these five or six files are for this relation ID.

Q: It's kind of a bookkeeping process, then?
A: You can definitely call it that.

Q: Does it create a replica asynchronously on the side? Say you've got another writer — how does it get the next file?
A: We use the standard mechanism; we don't mess with HDFS's way of handling files. Whatever method HDFS uses, that's what happens — it's not in HDFS code; we let HDFS handle the entire file writing.

Okay. So that's the swimming-lane transaction model: each writer writes in its own swimming lane, and we just maintain which swimming lanes belong to which table. I'm running out of time, so — resource management.

Resource management is another critical area for Postgres, which we don't do and should do. There are two levels of resource management in HAWQ. One is YARN, or something external, which gives you resources: HAWQ sits on top of YARN, and whatever resources YARN grants are the only resources known to HAWQ. That makes HAWQ readily deployable on shared clusters and in clouds — you can use commodity hardware, create shared clusters, and allocate resources to HAWQ itself through YARN. Then HAWQ's internal resource manager — basically resource queues — manages the available resources internally across the different queries. So YARN allocates X amount of resources to HAWQ, and HAWQ internally divides that X among the queries.

How that allocation is done is, again, something you set through the resource manager. You set your limits: no query should use more than two gigabytes of memory, or this much CPU. You can actually cap the percentage of CPU a query uses — say it's at 37%, it should not exceed 50%. It estimates per-operator CPU usage, so it's mostly a predictive system in these cases. But yes, it does work.

Hierarchical resource queues are basically the pretty standard computer-science hierarchical resource queues, with aging of processes. What happens if five queries cannot be executed at this point in time? They're put into a queue, they age, and as resources become available they're pushed through. This all supports prioritization, so you can set a query's priority, and in some cases you can say: back off whatever number of queries you need to, but execute this one — it's business critical.
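For flavor, resource queue DDL in the Greenplum/HAWQ family looks roughly like this — option names vary between versions, so treat it as a sketch rather than exact syntax, and the queue and role names are made up:

```sql
-- A queue for ad hoc work: limited concurrency, capped memory, low priority.
CREATE RESOURCE QUEUE adhoc
  WITH (ACTIVE_STATEMENTS = 5,     -- at most 5 statements run at once;
                                   -- the rest wait (and age) in the queue
        MEMORY_LIMIT      = '2GB', -- aggregate memory across the queue
        PRIORITY          = LOW);  -- business-critical queues outrank this

-- Queries from this role are admitted through the queue above.
CREATE ROLE analyst LOGIN RESOURCE QUEUE adhoc;
```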
So this is what it looks like. This is the HAWQ QD, which is basically the master. YARN and Mesos interact with the resource broker, which is the entry point to HAWQ's resource manager, and behind it sit the policy store — what kind of resources each query gets — the resource allocator, and the request handler. The optimizer, which as I said is cost-based, interacts directly with the request handler: at the optimization level, we decide whether the resources are available or not. We don't do it at the execution level, because by the time we start executing we're already on the slaves, and at that point it doesn't make a lot of sense to roll back — you've already sent the plan to the slaves and spent the resources, the network I/O, whatever. It doesn't make sense.

Storage, again: there are row-oriented and column-oriented tables, and we can compress them. Basically it was mostly about inventing new scan methods and access methods; not much more. PXF is the extension framework — basically a library of access methods. It allows you to, say, start reading Accumulo tables in HAWQ: plug the access method into HAWQ, tell it where the Accumulo table lives, and you're good to go. It's essentially a black box to HAWQ.

Okay, how to contribute? There's a slide for that. It's the pretty standard Apache process: there's a JIRA where you can look at the bugs — we have newbie bugs — and there's a whole wiki on how to contribute. The community is pretty helpful, so definitely go have a look; you can also catch me outside if you have any specific questions. You can find it all on the wiki. Nothing too special — pretty much like Postgres, except the JIRA process.

These are the things we'd really love to have: indexes on top of HDFS — something the community is really looking forward to. Replication: there's some work going on around it, but the more the merrier. An integrated ecosystem: that's a huge pain point for us right now. We have limited resources, and we need to integrate with the entire Apache framework — there are tons of projects: Accumulo, HBase, I can't even remember all the names. So if you work with them, if you have experience with them, please come talk to us and add some access methods; a lot of integration work is happening and we're really thankful for it. And maybe you even get to become an Apache HAWQ committer. That's something good.

Okay, these are all references. Now, this is the part that's interesting for us as the Postgres community. What does it all boil down to? Suppose I wanted to make Postgres work on HDFS today. These are the two main areas I'd need to look at. Point one, transaction recovery, is basically because HDFS is append-only. Once you get things working on HDFS, you can work with almost anything, because most of the Hadoop ecosystem members use HDFS as their file system. So once Postgres works on HDFS, we're good to go. Now, that's not easy. Off the bat, it's not a matter of adding five or ten or fifteen methods and getting it to work; it's a lot of rework of transaction recovery. I doubt it would ever make it into Postgres core. Tom would probably shoot the author.

So, what can Postgres take from HAWQ? Eventually, when foreign data wrappers start shipping query plans — when they start acting as the connection points that tie shared-nothing systems into one whole cluster, right? — then we will need to figure out how we handle metadata, the catalog tables. Right now it's fine; they all act as independent systems. But once they start depending on each other in terms of catalog, we might need to look at how we scale the catalog tables (a sketch of the existing FDW building block follows below).
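For reference, that foreign-data-wrapper building block is already in stock Postgres; a minimal postgres_fdw setup looks like this — the host, database, credentials, and table names are made up:

```sql
CREATE EXTENSION postgres_fdw;

-- A remote "shard" described to the local server.
CREATE SERVER shard1
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'shard1.example.com', dbname 'sales');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER shard1 OPTIONS (user 'app', password 'secret');

CREATE FOREIGN TABLE sales_2016 (id bigint, amount numeric)
  SERVER shard1 OPTIONS (table_name 'sales_2016');

-- Quals (and, in newer releases, sorts and joins) get pushed to the remote:
SELECT sum(amount) FROM sales_2016 WHERE id > 1000;
```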
Self-describing plans are what HAWQ does to avoid that catalog-scaling problem. If you look at the amount of catalog access we have, executing a plan is a time when we do a lot of catalog lookups too — not that we don't do them during parsing or planning, but execution requires a lot. So what you can do is self-describing plans: basically, decorate your query plan with the catalog data it needs. One aspect of that is visibility for that specific plan: you don't need to call the HeapTupleSatisfiesMVCC-style functions all the time to check whether a tuple is visible to this transaction or not. All of that data gets decorated onto the query plan, so your catalog accesses and your clog accesses go down drastically — the number of times you go back to the master to access data, clog or, more generally, the visibility functions, goes down massively. And if you need operator data — operator types, whether they're merge-joinable, that sort of thing — that's pretty static data, so it can all be fetched at plan time and added to the plan tree itself. That makes the plan tree pretty heavy; that's one downside. But yes, it can be done.

Multiple types of tables: I think there was some work being done to support columnar tables in Postgres. I'm not sure where it went, but Postgres is definitely moving in that direction. Compressed tables would be next, I assume.

The swimming-lane protocol: if you don't do 2PC, swimming lanes actually give you the multiple-writers, concurrent-updates thing. And that's something which, if not impossible, is pretty painful to do with 2PC. It's something we might have to look at if we want to move to HDFS. I'm not sure — HAWQ only supports two isolation levels, and I don't know how well this would integrate with the variety of isolation levels we have in Postgres right now. But it's a start; it's one way of looking at it. The writing on the wall is that 2PC maybe won't serve the needs of HDFS — or at least we won't be able to exploit HDFS's capabilities completely if we stick to 2PC alone.

Beyond WAL. Now, in HDFS you can only append to a file, so the file has a physical length. But maybe some records at the end of the file — or somewhere in it — are no longer applicable, or need to be deleted. What do you do? You define a logical length. You simply say: only the first, say, 100 bytes of that file are valid. Read up to there, and if there's any access beyond those 100 bytes, ignore it, or error out, or whatever — but do not depend on the data beyond those 100 bytes. That's the logical length of your file. And that's basically a one-time write into your catalogs, or wherever you store that data, which saves you a lot of pain compared to appending a lot of WAL records. That's one thing we should look at.
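Reusing the hypothetical mapping table from earlier, the logical-length idea might look like this — again a sketch, not HAWQ internals:

```sql
-- On commit, publish the newly appended bytes by bumping the logical EOF;
-- on abort, do nothing. Readers never look past logical_eof, so the
-- physically-appended-but-uncommitted bytes are harmlessly dead.
UPDATE hdfs_segfile_map
SET    logical_eof = logical_eof + 8192   -- bytes this writer appended
WHERE  relid = 16384
AND    hdfs_path = '/hawq/16384/seg0.0';
```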
The second part is resource management, which again is something discussed all around. Fancy cost-based dynamic resource management is secondary; we need a resource manager per se first. The way HAWQ does it internally is resource queues, so if we ever want to move to Hadoop — HDFS specifically — we will need our own resource manager, and we could probably start off with resource queues. A resource queue is pretty simple: you define, say, 10 slots, and you say the aggregate of those 10 slots should not exceed 5 GB of memory or 70% of CPU utilization. That's not something very hard to implement, though it might be hard to test. But if we get that, we'll make a lot of customers happy today, and if we ever decide to move to Hadoop, it will integrate very well with YARN and the other native Hadoop resource managers.

Okay, so that's mostly it. To sum it all up: the two biggest things I personally feel need to be done if we want to move to Hadoop are, one, the transaction management — the recovery part, the WAL part — and two, seeing how we can scale our metadata access, the catalogs. And as a bonus, resource management, right? If we can get resource management in place, I think that would be something really cool to start off with. I hope somebody comes up with a patch for that. So, yeah, thank you. If you have any questions, I'll be glad to take them.

Q: Tables are regular tables with regular columns, and they're integrated?
A: Yes, they are.

Q: How does that map to unstructured data — JSON and all of that?
A: We use the regular JSON support that we have in Postgres — you have JSONB, right? It's the same format; we just take that and write it out in the HDFS layout. The only thing that changes is the page layout: the Postgres page headers, the structured tuple data we put in, and the footers telling how many tuples there are. The rest is the same — if Postgres's internal functions can convert JSON into its own format, good; we use the same thing to write it to HDFS.

Q: I'm guessing you guys are not using shared buffers?
A: No, that's something we had to do without.

Q: Have you added a new storage manager for HDFS?
A: Yes, that's right. The storage manager was the biggest part we had to write.

Q: Was it very difficult?
A: It was — not difficult to write, but very difficult to get right on HDFS.

Q: Is that kind of integration possible?
A: Yeah, we can definitely do that. I think somebody was writing a connector for that, and it shouldn't be too difficult if you treat them as two separate entities; you can probably use the connectors available. I think somebody is writing a foreign data wrapper for it too. So yes, it's definitely possible. Any other questions?

Q: Nested loops don't make sense for this kind of data. So you don't support them?
A: We do, but only if you're very unlucky — they don't make any sort of sense here. We actually say that if you see a nested loop join happening, you should contact us and let us know, and we'll see what's going wrong.

Q: No, I mean joins in general.
A: Joins in general, yes — that's the same problem as Hadoop. Joins don't make sense in Hadoop, but since HAWQ essentially is still a relational system, we need to support joins, so we do them by spawning multiple workers and so on. But yes, they don't make a lot of sense at this scale. There are talks of making joins better using some in-memory systems, but that's not something we've really worked on yet.

Q: Essentially, you took the Greenplum parser and planner, and the storage manager is HDFS — that's what you've got?
A: Yes.
So yes — this was originally called "Greenplum Database on HDFS", and eventually it became HAWQ.

Q: What are you doing on resource allocation in the front end?
A: So, you can set your resource limitations, right? As a user, you might not want your individual queries to exceed some set of resources. You might have ten queries and want to ensure each of them gets a fair slice, even if one of them is very heavy — if there's one elephant in the room, you don't want it to hog all the resources; you want it to wait. So you can set those limits, but that does not affect the amount of resources we get from YARN. The YARN resources are not something you see as a user; what you see is the per-query resource management that you set at your level. If HAWQ gets X resources from YARN, that's not something you control, but you can tell HAWQ how to invest those X resources across your queries. That's the hierarchical part.

Okay. Thank you, guys.