Let's get started. So, welcome to today's Postgres World webinar on how to query Hadoop with PostgreSQL features. We're joined by Venkatesh Raghavan, senior manager, VMware Tanzu Greenplum, who will discuss the Platform Extension Framework, or PXF, an open source project that enables users to query heterogeneous data sources via pre-built connectors. That is a mouthful. My name is Lindsay Hooper. I'm one of the Postgres World webinar organizers, and I will be your moderator today. A little bit about your speaker: Venkatesh has been with VMware for nearly four years. Prior to that, he earned his PhD in query processing from Worcester Polytechnic Institute, while also working at Greenplum. Welcome. So with that, I'm going to hand it off. Take it away, Venkatesh.

Thank you, Lindsay. So as she said earlier, if you have any questions while the presentation is going on, either put them in the chat window or unmute yourself and ask. I would like to have a somewhat more interactive session, and given it's a small, cozy group, I don't think it will get too rowdy. So let me take it away. I think everyone can see my screen, correct? Sure can. All right.

So during my journey with Greenplum, there was a point where all of the Hadoop vendors had difficulty with querying, because there are traditional structures and optimization techniques that have been well established in databases but are not available in the Hadoop ecosystem. Yet the Hadoop ecosystem has a rightful place in our data pipeline. It's a data lake. It has a lot of great features that we don't want to get rid of; we don't want to start fresh again. So how do we manage all of these disparate data systems, when some of them can't do the powerful query optimization and query execution that a traditional database can? That's when our PXF solution came into being. PXF connects Greenplum to all the ecosystems outside, so you can use the power of Greenplum's query processing to get data from all these disparate sources, and get over the hump of queries being too slow because those systems don't have great join optimization, predicate pushdown, indexes, and many other such things. So that is the premise of the talk.

Greenplum is a PostgreSQL-based data warehouse, and we have developed, and are still developing, PXF. We want PXF to also be usable at some point in the near future, hopefully next year, with PostgreSQL, because once we get it working with foreign data wrappers in Greenplum, doing the same with PostgreSQL should be straightforward. So the agenda for today: I want to present the use cases of PXF, then the product architecture, some features and best practices we have found, a little bit of PR for recently released features, and then I want to focus a little more heavily on PXF and the foreign data wrappers API of PostgreSQL.

All right. So the issue we have is that many of our customers, both open source and enterprise, have a lot of data in different sources. Their usual question is: how can I run all of my algorithms and queries over the semi-structured and structured data in many different external sources, when I also have a Greenplum data warehouse holding terabytes or petabytes of data?
So rather than individually querying all of these different systems, can I have a unified view where I submit a query to Greenplum, and Greenplum does the job of collecting the data and presenting the results to me? That's a very commonly asked question. Every DBA and ops person in every company, and even in universities, has multiple databases or systems they maintain connections for. We want to extract value from all of those different places rather than building a new thing and migrating all of the data into a new system, because there are always going to be new systems coming up.

So Greenplum has this Platform Extension Framework, PXF, that enables a user to go gather all of this data. Right now we support traditional relational databases: Oracle, MySQL, Hive, SQL Server, all of those. We can query files, Parquet or ORC files, on Hadoop HDFS or on NAS storage. We can query Google or Azure or any of the cloud storages. We can query MinIO and other object stores that support S3 APIs. And we support many different file formats, whether it's CSV, Parquet, ORC, Avro, JSON, it doesn't matter. We also have an image processing project: how do we take images stored in, say, any cloud storage, use PXF to extract that data, and then use machine learning algorithms to query it? So we have a project going on for getting even image files into the database, querying them, and doing machine learning.

So clearly we have many different use cases, and anyone in this field can ask, hey, what are the different areas where I can use this powerful tool? The first one is federation. That means you have different sources, with warm and cold data across different systems, and you want to query all of it. The second is exploration. Many data scientists want to take a sample of the data; they don't want to pull the whole data set. Any researcher, scientist, or machine learning analyst wants to explore the data collected across these different sources. How do they do that without building their own data warehouse or database? This gives you an opportunity to do that. The third one is data migration. Once the data scientist explores a sample and decides it's useful, they can bring all of that data into the data warehouse, migrating it from the external sources. Say, for a simple example, you're migrating from one product to another: your Postgres instance has reached its capacity, it has 100 terabytes of data, and it's very slow. You want to move to Greenplum, which is Postgres compatible, it's based on Postgres. So how do I migrate? You can bring all of the Postgres data into Greenplum, and since both Postgres and Greenplum are open source, a researcher could just stand up a Greenplum instance and migrate their Postgres data into it. So data federation, data exploration, and data migration: those are the three main use cases PXF tries to address.
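To make the migration use case concrete, here is a minimal sketch, jumping ahead a little to DDL syntax covered later in the talk. The table, column, and server names are all illustrative assumptions: you define an external table over the old Postgres instance through PXF's JDBC connector, then land the rows with a plain CREATE TABLE AS.

    -- Hedged sketch: table, column, and server names are illustrative.
    -- An external table that reads public.orders from the old Postgres
    -- instance over JDBC.
    CREATE EXTERNAL TABLE ext_pg_orders (
        order_id int,
        state    text,
        amount   numeric
    )
    LOCATION ('pxf://public.orders?PROFILE=jdbc&SERVER=oldpg')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- Migration is then a single statement: Greenplum pulls the rows
    -- through PXF and stores them in a native table.
    CREATE TABLE orders AS SELECT * FROM ext_pg_orders;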
And PXF provides multiple native connectors that connect to these different sources. We have JDBC connectors that connect to traditional databases and data warehouses, Hadoop connectors, cloud storage connectors, Hive connectors, HBase connectors. Each of these connectors has a specialized connection, because each of those sources needs different parameters and settings and supports different features; the native connectors are built around that. JDBC is the default, in the sense that any source that supports a JDBC connection can use the JDBC connector, even if we don't have a specialized connector for it.

Greenplum supports polymorphic storage. What does that mean? You can have heap tables, append-optimized (AO) tables, partitioning, vertical partitioning, horizontal partitioning. Then, on top of that, from the external disparate-source perspective, you can tier your data. Very hot data, say an instant ticket booking system, something really hot, you can keep in GemFire or MySQL, something that is a very quick, very fast transactional database. Warm data, say the past 10 years of sales since 2010, you can keep in Greenplum. And archival data that you don't frequently access might as well go in cold storage, where you pull it up only when you need it. You can put it in any cloud storage or any Hadoop data lake, in an archival format or more compressed; you can do many different things, rather than spending Greenplum's resources on it. So Greenplum provides polymorphic storage not only in how data is stored inside the database, but also in where you store it.

From a customer perspective, if they do, say, an aggregate query, a count-star query or something, they submit the query to Greenplum, and Greenplum knows: hey, the latest data, 2020 to 2022, is hot data that I have to go fetch from MySQL or one of those systems. But from a query perspective, the user just interacts with Greenplum, and Greenplum does the job of aggregating all of the data, pushing down predicates, doing column projection, all of those things. With this polymorphic storage across disparate sources, the Greenplum user gets a seamless integration with all of these different data sources.

So Greenplum has a native API called the external table framework. It's basically an extension: you create a table. The first thing is the location: where is the data located? Here the example is sales_2032; that's where it's located. Then you have a profile: it's an HDFS Parquet file, so that is the connector and format I'm trying to use. And then the server; here I have an example called datalake. The server is how you connect to the outside system: the location, the password, all of those things.
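Putting the location, the profile, and the server together, here is a minimal sketch of such a definition. The column list is an assumption; the path and server name just follow the example above.

    -- Hedged sketch: columns are illustrative; profile and server
    -- follow the example from the slide.
    CREATE EXTERNAL TABLE sales_2032 (
        id    int,
        state text,
        total numeric
    )
    LOCATION ('pxf://data/sales_2032?PROFILE=hdfs:parquet&SERVER=datalake')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');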
So you have a config file that tells you how the server is defined: what, say, the Kerberos properties are if you use Kerberos, what the keytab is, so that you can authenticate. And then you have the custom format: you can have your own formatter for how the data is written, and you can specify the formatter that reads that external data. For the latest Greenplum, Greenplum 7, which we are planning to beta at the end of this year or early next year, we are slowly migrating this external table API to foreign data wrappers.

So what is the architecture of PXF? As you know, Greenplum is a distributed, shared-nothing architecture. I'm giving an example of a Greenplum instance with two segment hosts, each host with three segments. Each segment can be understood as, for example, a PostgreSQL instance. The query comes into the master; the master plans the query and sends a copy of that plan to each of the segments, segment one through segment six. All of them get a copy of the plan. Say the data was in an internal source rather than an external one: each segment would hold a portion of the data. Say segment one has all the California data, segment two the Massachusetts data, segment three the Wisconsin data, and so forth. If you have a query that says, give me the average sales of all of the customers, grouped by state, each segment does an aggregation in parallel and gives its aggregate back to the Greenplum coordinator, which is called the Greenplum master.

Now, if you have an external source, each of these segments triggers a PXF thread instead. Greenplum PXF 6.0 ships as a Spring Boot microservice with embedded Tomcat, so each segment's request creates a thread in the PXF server, which goes and gets the data, formats it to fit the Greenplum format, and gives it to the segment. It takes the query, sees which predicates it can push down and pushes them down, and sees which columns are needed so that column projection happens, all of those things. These PXF threads, triggered by each of the segments, run in parallel, so the workload is split and you can get a large amount of data into Greenplum.

We are currently working on a more expansive PXF model where, if you need more fragments processed, you can start up another PXF host that just consumes and processes data, rather than being tightly coupled with Greenplum. In principle, we have tested PXF running on a different host from the Greenplum segment hosts. Java uses a lot of memory, and we have to be very careful about memory utilization, so this lets PXF run on a separate host with its own dedicated system that can consume as much memory as it wants.
And the segments don't get affected by the large amount of data being ingested into Greenplum. Any questions so far? No questions as of yet, but this is a great reminder to get your questions in the chat. Thank you so much. Cool.

So, moving forward. I'm going to touch on a few things PXF provides, and its features, and then we can go into the PXF roadmap with the PostgreSQL foreign data wrapper API. The first one is predicate pushdown. Whenever you have a query with, say, state = 'California', the Greenplum planner and optimizer take that predicate and say: hey, I have this predicate, I know you're an external table scan, here's that predicate, can you help me push this down? The PXF server takes that information and pushes the predicate directly down to the external system. If it's JDBC, it augments the query that it sends to the external system, saying, hey, I also want to filter on these predicates. And if you have Hive tables, or one of the many external systems that partition the data set, say a Hive table on Hadoop partitioned by state, and you're filtering on California, then the PXF server just asks Hadoop: don't touch all of the files, just touch the files that have California information. This predicate pushdown is very useful for reducing the number of tuples consumed by the Greenplum system. Otherwise you'd have to pull all of the petabytes of information into PXF, which would just not be efficient, and would probably disturb other workloads running concurrently.

There's a question: can PXF push down joins? Very great question. PXF currently does not push down aggregates and joins; that's what we're working on with the FDW work. We want to push down aggregates and we want to push down joins. Great question.

Right, so the next one is partition pruning, which I just touched on: if you have external data sets that are partitioned, we will not touch the partitions the query doesn't need. The next is column projection; again, this is trivial, not brain surgery. If you only need certain columns, PXF will not bring in whole tuples, just the columns that are needed. Of course, it depends on the data type and on what the external system supports; based on that, we pull in only the columns that are needed. This works best with column-oriented storage.

One point, I don't have a slide for it, but this is the right time to talk about it: we have something called named queries. Let me scribble a little bit here. Say you have an external source that has two tables, the TPC-H lineitem and orders tables, stored in the external data source. And you have a subquery inside your larger query that joins these two tables in the external source. This goes back to the question one of our attendees asked: can we do join pushdown? We currently don't have native join pushdown, but what we do support is called named queries. Say you have this simple join: you can create a named query joining lineitem and... I think my screen crashed. Can everyone see my screen? I can. It says user impersonation. Let me clear this out.
Always remember that annotating while doing a presentation sometimes causes problems. So, same example: you have lineitem and orders, and you have both as external tables. I can create what is called a named query. What is a named query? You can view it as a view. You create a view, say joining lineitem and orders with the join condition, and you store it as a named query, named_query.sql, alongside the PXF server configuration. Then you create an external table based on that named query. Then, if you have additional predicates on top of the view, Greenplum treats this whole view, not a materialized view, just a logical view, as one external table, and pushes the whole query into the external system. I would say it's a workaround for not being able to push aggregates and joins down right now; I don't know whether that answers the question from Slava. The named query feature enables one to push these kinds of complex queries into the external system, but it has to be user initiated; it's not initiated by the system as a whole. It takes a little bit of engineering effort, but it's an option.

Let's look at the JDBC connector. With a traditional JDBC connector, you're connecting to MySQL, say, and you just say, hey, get me this table from MySQL; currently only one segment will go and fetch all of the rows back. If you don't want that, if you want parallelism with Greenplum, then when creating the external table you have to say, hey, I'm going to partition by a range, with these intervals. Then each of the intervals is taken up by one of the segment servers, which rewrites the query. Say the external table is just orders, and people say SELECT * FROM orders: one segment says get me this month, another says get me this month, each segment takes one portion, basically rewriting the query into a smaller query, and submits it to the external system. So it's much faster; we have performance numbers, I think one of our blogs talks about them, something like two or three x faster by doing that.
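Here is a hedged sketch of both ideas; the file, server, and column names are assumptions. The named query lives as a .sql file in that server's PXF configuration directory and is exposed through a query: resource, while the PARTITION_BY, RANGE, and INTERVAL options tell the segments how to split the fetch.

    -- Assumes lineitem_orders.sql, containing the join of lineitem
    -- and orders, sits in the JDBC server's PXF configuration directory.
    CREATE EXTERNAL TABLE ext_lineitem_orders (
        o_orderdate date,
        revenue     numeric
    )
    LOCATION ('pxf://query:lineitem_orders?PROFILE=jdbc&SERVER=mydb')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- Partitioned JDBC read: each segment rewrites the query for one
    -- one-month slice of 2020, so fetches run in parallel instead of
    -- through a single segment.
    CREATE EXTERNAL TABLE ext_orders (
        order_id    int,
        o_orderdate date,
        amount      numeric
    )
    LOCATION ('pxf://public.orders?PROFILE=jdbc&SERVER=mydb&PARTITION_BY=o_orderdate:date&RANGE=2020-01-01:2021-01-01&INTERVAL=1:month')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');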
And the next one I wanted to talk about is impersonation. In a traditional setup you don't have any impersonation. Say there are two users of Greenplum, Alice and Scott, and they each submit queries over their own data; say they are doing competing research on sensitive information, and they don't have the privilege of seeing each other's data. But when PXF goes to write to an external system, or Alice wants to access her own data in her Hadoop data lake that she has privileges for, PXF goes in as gpadmin, the user the PXF server is installed and runs as, and for gpadmin to access any of the external data, it has to have access to all of it. If Alice and Scott cannot access each other's data, this is a problem.

What you need for that is impersonation, which says: Alice's persona in the HDFS system is X, and Scott's persona in the HDFS system is Y, and I have mapped my Greenplum user Alice to X and Scott to Y. Then the PXF service knows: Alice on that server is X, so I'm going to present that identity to the HDFS system and access the privileged information, because Alice has those privileges. And Scott has access to a different set of data. So user impersonation enables Greenplum to maintain the data protections of the external system, so that people don't get access to everything they don't have the privilege to access. And if you want to take it up a notch and say, hey, I have a Kerberos authentication server, here are my keys and keytabs, then you can ask those same queries of the external system in a more secure way. Greenplum PXF supports that, so you have an extra level of strong, secure authentication. Any questions so far? Nothing yet. Okay.

All right, so what are the best practices I really want to point out? When we have data across different sources, we need to use the right tool for the right workload. PXF is great for certain use cases; there are other tools, like GPSS for, say, Kafka, which is going to go open source at some point in the future. Different tools are needed for different things. If you have Hive tables, partition them if you can; use the different options that the Greenplum external table API and PXF provide to speed up access. If you know you are joining some tables in the external system, create a named query so that you can push the join to the external system. Use column projection and predicate pushdown wherever possible. If you are using S3, prefer S3 Select if you can bear the additional cost. Connection pooling for the JDBC connector is very important to tune. Figure out the right profile and connector so that you get the predicate pushdown and column projection you need. And as usual, since you're working in a Java environment, make sure the JVM settings are correct, so that the JVM gets enough memory for the load, the volume, and the concurrency you have.

The new features we are coming out with, and are very excited about: we now have constrained delegation with Kerberos; we have enhanced support for Avro logical data types; we support ORC writes and reads, and reading JSON arrays; we are adding connection timeouts, so that if the data is too large you don't hang the system; and PXF now produces metrics in Prometheus format, which you can graph or monitor on an external system. There's a question. Yep.

All right. So this is the part about Greenplum and foreign data wrappers. Currently, as I mentioned, Greenplum 6 comes with PXF, and it supports the external table API. With Greenplum 7 we are shifting towards foreign data wrappers. Just for public information, Greenplum 7 is based on PostgreSQL 12. We are planning to beta it by end of this year or early next year, by January, and then GA sometime mid next year.
So what does the foreign data wrapper API allow us to do? It allows better predicate pushdown, and aggregate pushdown and join pushdown, which we're working on. We can now do statistics on foreign tables. Currently, as it stands with the external table API, statistics for external tables in Greenplum are just a constant. You could have a data lake with a petabyte of data, and the optimizer inside Greenplum thinks, hey, it is just 1000 rows, because that's a constant in one of the header files, because we have no idea of the number of distinct values, the number of rows, any of that information. If you have another Postgres instance that you're querying, and it has histograms, you could easily take those histograms and put them inside Greenplum; we currently don't do that. The foreign data wrapper API will enable us to get statistics on the table. There are also many hooks into the PostgreSQL planner through FDW, contributed by the many open source committers on Postgres, that we can avail ourselves of, plus validation options, and this also enables wider use of PXF in the PostgreSQL community. As a future step, beyond the first set I just talked about, we want PXF for PostgreSQL and PXF join and aggregate pushdown.

A little overview of what our implementation of foreign data wrappers provides to the community. We have a set of foreign data wrappers: a JDBC foreign data wrapper, an HDFS foreign data wrapper, and so on. Each of these wrappers is associated with one or many servers. You can have multiple Oracle servers, for example, each going through the JDBC foreign data wrapper. Then you have user mappings, where you can say: for this HDFS server, server X, the Greenplum user Aiden maps to the username Aiden has on that HDFS server. And then you create the foreign tables, where you say: this is the path, this is the format, this is what to do on error.

All right, so let's look at how the external table API maps to what we have done for foreign data wrappers. In both, we have to create an extension: pxf in the external table API, but here pxf_fdw. Then, instead of the profile, say s3, we have a foreign data wrapper: what used to be PROFILE=s3 becomes a server whose foreign data wrapper is s3_pxf_fdw. Then, when we define the user mapping, we say, hey, my user mapping is my_s3, and all of the configuration information, login, password, where the service is located, all of those things go in there. Then you have the foreign table, where you provide different options. For instance: where is the resource? What is the format? What do you do on error? Before, we had to give both the s3:csv profile and the format, which was kind of confusing; now it's a little cleaner, it's in only one place. What you do when you hit an error limit, or how many rows you fetch at a time, those are things you can put in the options. With the external table API that we currently have in GPDB 6, if you have to change something, you can't do an ALTER TABLE; you have to drop the table and then redo it. It's a pain.
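A hedged sketch of the FDW flavor of the same definition follows. This API is still work in progress, so the wrapper, server, and option names here are assumptions that just follow the slide.

    -- Hedged sketch: pxf_fdw, s3_pxf_fdw, and the option names are
    -- work-in-progress and may change.
    CREATE EXTENSION pxf_fdw;

    -- What used to be PROFILE=s3 becomes a server on the s3 wrapper.
    CREATE SERVER my_s3
        FOREIGN DATA WRAPPER s3_pxf_fdw;

    -- Credentials live in the server's configuration files, so the
    -- mapping itself can stay minimal.
    CREATE USER MAPPING FOR alice SERVER my_s3;

    -- One table definition serves both reads and writes.
    CREATE FOREIGN TABLE sales_archive (
        id    int,
        state text,
        total numeric
    )
    SERVER my_s3
    OPTIONS (resource '/archive/sales', format 'csv');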
For foreign data wrappers, it'll be easier, because we can always alter the table and not drop it. Another wonderful thing: the external table API currently has a separate definition for writing and a separate one for reading, but the FDW framework will have the same table for reading and writing. Every table is read-and-write. That is also very good.

So what are the completed tasks so far? We have finished reads. We have done the validation. We have done performance testing for reads. We have finished writes through the master, and we are working on parallel writes, where each segment goes and writes in parallel. We are also moving the end-to-end testing framework from the external table API to foreign data wrappers. As you know, both PXF and Greenplum are open source, so if you are interested in looking at our progress, please reach out to me or Lindsay and we can give you access to our work-in-progress branch.

What are we looking at going forward? One thing is high availability: if for any reason one of the PXF servers goes down, how do we bring it back up? Another is integration with data lake solutions, for example Databricks Delta Lake. Another is workload management: if you have a data warehouse or database and you want to monitor and restrict usage, do some form of workload balancing, how many resources can users consume, how much memory, how much CPU, how much data can they bring into the system? You need some form of workload management, so we are working on integrating with Greenplum resource groups and cgroups, so that different kinds of users can have different resource allocations. We are also looking at flexible data and column mapping. And this is a pet project, I don't know how realistic it is: many times a customer wants to do discovery-based mapping, saying, I have this table, can you just go discover what it is and map it to the appropriate Greenplum columns, rather than me specifying all of these mappings one to one? Then there is Hive 3: we currently don't do a great job supporting Hive 3 managed tables, so we want to make sure we can support the ACID transactional tables on Hive 3. Those are the main roadmap items. The GitHub repo is right there; there's also greenplum-db/gpdb, which is Greenplum itself, and also our open source Slack. So that is some basic information on PXF and our roadmap. I'm looking forward to any of your questions.

Wonderful presentation. Thank you so, so much. A reminder to everyone to get their questions in now. We'll kick off with the question that we do have, which is: when configuring FDW, where do you store user/password data? Can it be pulled from single sign-on providers like Okta? Yes, that's a great question. We are working on that. Currently, gpadmin has access, and they are encrypted. Basically, let me share my screen: our Greenplum instance has folders for servers, where the login passwords are in XML files, which are encrypted. But we are looking at the single sign-on aspect of it. Wonderful. Next: we have a graph database; can PXF support graph databases? We haven't tried that, but we have had multiple requests for graph databases, so let me get back to you; that's a very good question. Yeah.
You were on it before I even read it. Amazing. So then, let's bring it home. Thank you so much to all of our attendees for spending a little bit of their day with us. Thank you, Venkatesh. This was wonderful, clearly a very thought-out presentation, and I can say that I personally learned something. We hope to see all of you on future Postgres World webinars, and I hope you have a great rest of your morning, afternoon, or evening. Thank you so much. Bye bye. Bye bye.