Welcome to processing more than a billion rows per second. I'm joined by our speaker KaiGai Kohei, founder of HeteroDB. Today we will cover how SSD-to-GPU Direct SQL, implemented as an extension of PostgreSQL, optimizes data flow from storage to processors over the PCIe bus for efficient execution of analytic and reporting workloads. My name is Lindsay Hooper, and I'm one of the Postgres Conference organizers. I'll be your moderator for this webinar.

Let me tell you a little bit about your speaker. KaiGai has been a contributor to PostgreSQL and the Linux kernel for over 10 years. Since 2012, he's been the primary developer of PG-Strom, an extension of PostgreSQL used to accelerate analytic and reporting queries using GPU and NVMe. He founded HeteroDB in 2017 to productize the PG-Strom technology and to democratize data analytics. With that, I'm going to hand it off to KaiGai. Take it away.

Thank you for joining my talk. I've already uploaded today's slides to SlideShare; if you want to open today's material on your desktop, please use this link. Let's start my talk.

First of all, a short self-introduction. My name is KaiGai Kohei, chief architect and CEO of HeteroDB. HeteroDB is a small startup company in Tokyo that has operated for three years. We started the company to productize and develop PG-Strom, which provides GPU-accelerated functionality as a PostgreSQL extension.

Let me introduce what PG-Strom is: a custom scan provider for PostgreSQL that offloads some kinds of SQL workloads to the GPU. It has multiple features to accelerate scan workloads, typically evaluation of WHERE clauses, JOIN, and GROUP BY, which are the typical operations in analytic workloads. You can get PG-Strom from GitHub; it is an open source project that has been developed since 2012. We currently support PostgreSQL versions 10, 11, and 12, and support for version 13 is in progress.

So let's start today's main topic. In this talk, we process billions of rows per second on a single-node PostgreSQL. What workload is our target? It is IoT or M2M class log data processing. Many devices generate log data day by day, and it accumulates on a data processing system, likely some kind of database. Administrators, users, or applications want to process this large log data and analyze the results for visualization, understanding trends, and so on. This kind of dataset brings some problems: its size is usually very large, easily reaching terabyte class; importing such a large dataset takes a long time; and system administration tends to be complicated.

Let me introduce one typical usage scenario. When I talked with folks in the semiconductor industry, they said a semiconductor factory generates 20 terabytes of data per year. A factory has many machines to produce semiconductor materials, and these machines have sensors; the sensor data is collected and stored in a PostgreSQL database. Administrators run queries to process the data and get knowledge and insight from it. The data is kept for up to five years, so in their environment, up to 100 terabytes of data is processed. On the other hand, recent server hardware can hold more than 100 terabytes in a standard 2U rack server. For example, one such server has 24 slots for NVMe SSDs; with 8-terabyte SSDs, that means we can serve 192 terabytes, and its cost is reasonable enough. But that much data is not easy to handle; we need some kind of optimization to process it efficiently.
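To make that workload concrete, here is a sketch of the kind of reporting query involved. The table and column names are hypothetical, and which expressions actually run on the GPU depends on the device functions PG-Strom supports; treat this as an illustration of the scan/join/aggregate pattern, not a guaranteed GPU plan.

    -- Hypothetical IoT log schema: sensor readings accumulated day by day.
    -- PG-Strom's custom scan provider can offload the WHERE, JOIN, and
    -- GROUP BY steps of a query shaped like this to the GPU.
    SELECT d.factory_id,
           date_trunc('day', r.ts) AS day,
           count(*)                AS n_samples,
           avg(r.temperature)      AS avg_temp
      FROM sensor_readings r          -- large log table (terabyte class)
      JOIN devices d ON d.device_id = r.device_id
     WHERE r.ts >= '2020-01-01'
       AND r.temperature > 80.0       -- most rows are filtered out here
     GROUP BY d.factory_id, date_trunc('day', r.ts);

Note how the WHERE condition discards most rows and the GROUP BY collapses the rest: billions of input rows may shrink to a handful of output rows, which is exactly the pattern the following technologies exploit.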
Today, I would like to introduce three key technologies. The first is SSD-to-GPU Direct SQL. The second is Arrow_Fdw. And the last is table partitioning.

So the first one is SSD-to-GPU Direct SQL. First of all, I'm not sure what your impression of GPUs is. They are widely used for supercomputing, simulations, computer graphics, and machine learning, for example. All of these are computing-intensive workloads, and the GPU is usually used as an accelerator for computing-intensive workloads. You may wonder how a GPU can accelerate I/O-intensive workloads. Let's see.

This is an overview of an x86-64 server. The server has a CPU, RAM, and many peripheral devices, usually connected over the PCIe bus. When we process a large dataset in a database, we have to load the data blocks from storage over the PCIe bus into main memory. When we process that dataset, we apply WHERE conditions, JOIN, and GROUP BY in the SQL query, and many of the rows are filtered out by the WHERE conditions. It means we load data from storage into main memory only to carry junk data; it is a waste of I/O bandwidth.

So PG-Strom changes the flow of the data over the PCIe bus. We use peer-to-peer DMA from SSD to GPU: PG-Strom loads the data blocks on the SSD into GPU memory over the PCIe bus. It is a direct transfer, so it consumes neither CPU cycles nor main memory buffers. Once the PostgreSQL data blocks are moved to the GPU device, we can run the SQL workload on the GPU with thousands of threads in parallel, executing WHERE, JOIN, and GROUP BY. Even if the source data has billions of rows, it is not rare for the GROUP BY clause to reduce the data to hundreds or thousands of rows. Once the data size is reduced so dramatically, it is not a tough task for the CPU.

This figure shows the background technology of SSD-to-GPU Direct SQL. We use GPUDirect RDMA by NVIDIA. This technology allows mapping a part of GPU device memory into the physical address space of the host system. Once GPU device memory has host physical addresses, we can use those addresses as the destination of an NVMe READ command. When we send an NVMe READ command to the NVMe device, the NVMe controller loads the data blocks into GPU device memory directly.

Now I would like to show a benchmark result of SSD-to-GPU Direct SQL using the Star Schema Benchmark. We used a single 1U rack server with one GPU (Tesla V100) and three NVMe SSDs striped by md-raid0. The Star Schema Benchmark simulates an IoT-class workload well: it has a single large table, lineorder, that is the core of the workload. In the case of scale factor 400, the size of lineorder is 350 GB. The queries join lineorder with other tables and run GROUP BY on various columns; the benchmark defines 13 different queries on this dataset.

In the results, PG-Strom with SSD-to-GPU Direct SQL stood out with good performance in comparison to PostgreSQL. PostgreSQL loads the dataset via the filesystem, and even when PostgreSQL runs the query in parallel, data access over the filesystem is a little bit expensive. On the other hand, PG-Strom uses direct access to the SSDs and runs the Star Schema Benchmark queries on the GPU side in parallel. Its performance is about three times faster than the plain PostgreSQL results; its query execution throughput is nearly 8 GB per second on a single-GPU, 1U rack server configuration. To give a feel for the workload, a sketch of one of these queries follows.
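The Q1-series queries of the Star Schema Benchmark filter and aggregate lineorder against a single dimension table. Here is a sketch of Q1.1; SSB ports often rename the date dimension (here date1) to avoid clashing with the SQL type name.

    -- Star Schema Benchmark Q1.1 (sketch): filter the big fact table,
    -- join one dimension table, and aggregate down to a single row.
    SELECT sum(lo_extendedprice * lo_discount) AS revenue
      FROM lineorder, date1
     WHERE lo_orderdate = d_datekey
       AND d_year = 1993
       AND lo_discount BETWEEN 1 AND 3
       AND lo_quantity < 25;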
We implemented SSD-to-GPU Direct SQL with a PostgreSQL extension plus our own Linux kernel module. The PG-Strom extension sends a special request to this kernel module; the module translates the file offsets into block numbers on the block device, then sends NVMe READ requests for those block regions. That is our first key technology.

The second one is columnar storage: let me introduce our own FDW, Arrow_Fdw. You may know Apache Arrow, an open source project supported by the Apache Foundation. It defines a data structure on the filesystem: a column-oriented structured data format, designed for data exchange among many big data applications. We added functionality to read and write Apache Arrow datasets from PostgreSQL. Apache Arrow defines various kinds of data types, and fortunately, they have an almost one-to-one mapping with PostgreSQL data types, so a usual dataset can easily be transformed to Apache Arrow format.

One other piece of background I would like to introduce is the characteristics of log data. When we run a traditional reporting query on an OLAP system, it processes data generated by an OLTP system; we can say this kind of data is generated inside the database system, so there is no need to import the data again. But with IoT and M2M log data, the devices usually don't connect to the database directly. Usually, a gateway server receives the raw data, and it is then imported into the database system. So data importing is a quite heavy operation on this kind of system.

When we use Apache Arrow, even if many devices generate data and the gateway server receives it, as long as the gateway writes out Apache Arrow files, there is no need to import another copy of the data. Once the gateway server writes out Apache Arrow files, we can map those files as foreign tables; since it is a file mapping, we don't need to copy the dataset row by row. We can also generate Apache Arrow files from existing database tables using a utility tool, pg2arrow. Arrow_Fdw also has write capability to append data into an Apache Arrow file, but it only works for batch inserts.

Once we map a dataset in Apache Arrow format, from the standpoint of PG-Strom it can also be a source of data to be loaded to the GPU. Even though it is not PostgreSQL heap data, we can load the dataset from the Apache Arrow file. And Apache Arrow is a columnar format, so as long as we can identify which columns are referenced in the query, we can load only the referenced columns. It dramatically reduces the amount of I/O and also improves the performance of SSD-to-GPU Direct SQL.

This is another performance benchmark result; it shows the number of rows processed per second. The blue and orange results are what I showed several pages before. Arrow_Fdw reduces the amount of I/O according to the query pattern. The Q1 series references a very limited number of columns, so the amount of I/O is small; as a result, the number of rows processed per second is a very high score on the Q1 series. The Q2 series requires more columns, so its performance looks slower than the Q1 series, but it is still much faster than the row-based data structures.

When I saw this benchmark result, I doubted whether it was reasonable, since it is so fast. In the case of query Q2.1, it references four columns of lineorder, and the total amount of I/O for this query is about 159 GB, which is about 23% of the entire dataset. Its execution time was 24.3 seconds, so the actual I/O bandwidth is 6.46 GB per second, and that looks quite reasonable for a triple-SSD configuration. So that is the second technology we use to process large volumes of data. Below is a small sketch of how such a file is mapped.
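Here is a minimal sketch of that mapping, assuming the server name arrow_fdw and the "file" option as used in the PG-Strom documentation; the schema and path are hypothetical. Dumping an existing table into an Arrow file would look something like: pg2arrow -d mydb -c 'SELECT * FROM logs' -o logs.arrow.

    -- Map an Arrow file written by a gateway server as a foreign table.
    -- The column list must match the schema embedded in the Arrow file.
    CREATE FOREIGN TABLE flogdata (
        ts          timestamp,
        device_id   int,
        temperature float
    ) SERVER arrow_fdw
      OPTIONS (file '/opt/arrow/logdata_2020_09.arrow');

    -- The file can now be queried like a normal table; being columnar,
    -- only the referenced columns are actually read from storage:
    SELECT device_id, avg(temperature) FROM flogdata GROUP BY device_id;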
The third technology is table partitioning. When we look at log data characteristics, here is a difference from transactional data: log data is usually insert-only, very rarely updated or deleted, so MVCC is not very significant. For example, when we partition the data by year, once we reach 2020, no more data will be added to the 2019 or 2018 partitions. And log data usually has a timestamp, assigned when the data is generated. Due to these characteristics, we can use a partitioned table whose child tables mix different physical storage.

When we define the partitioned table with normal PostgreSQL heap tables and Arrow foreign tables, we can run queries on all the data with higher performance thanks to the columnar data structure, and operation is relatively simple. New log data is written to the PostgreSQL heap table only. Then, once a month, we can export the log data accumulated during the last month into an Arrow file; that file contains the data for one month, and the PostgreSQL heap table keeps only the current month.

But there is a problem when we use the partitioned table for table joins. The current implementation of PostgreSQL partitioning tries to gather the scan results from the partition leaves first, and then joins them to the non-partitioned table. This is the typical query execution plan when we join a partitioned table with a non-partitioned table: the optimizer constructs a plan with table scans on partition 0, partition 1, and partition 2, which are gathered by Append, and then joined. Once this row set is back on the CPU side, it is hard to optimize with SSD-to-GPU Direct SQL, because the data is already located in main memory.

So I added an enhancement to PG-Strom to rewrite the query execution plan and push the join with the non-partitioned table down to each partition leaf. Once the table join is distributed to each partition leaf, we can run scan, join, and preliminary aggregation (equivalent to a partial aggregate) for each leaf, and then Append gathers the partial results from each leaf. That is the modified query execution plan: you can see the non-partitioned table T1 is distributed onto the partition leaves and a hash join is executed for each leaf; after that, Append merges all the results. That is a feature I call asymmetric partition-wise join.

This kind of partitioning is tightly coupled to the physical layout of the storage and the GPUs. When we run queries on GPU-optimized servers, some models have PCIe switches that let peer-to-peer DMA traffic bypass the CPU, and an I/O expansion box can also have a direct path between nearby devices. If we have a partition leaf on each SSD, the ideal data flow is to load from an SSD to its closest GPU. Once the data blocks are loaded to the closest GPU and we run the scan, join, and GROUP BY workload on the GPU, all we need to transfer to the CPU and main memory is the pre-processed result, which is very small. So as long as we can choose a close GPU over a PCIe switch, the CPU workload to transfer peer-to-peer PCIe packets is quite limited.

For this kind of special optimization, PG-Strom checks the bus topology of the PCIe bus. In this example, GPU0, NVME0, and NVME1 are connected very closely on the PCIe bus, so when a query scans a table on NVME0 or NVME1, the query execution plan tends to choose GPU0. On the other hand, a table scan on NVME2 tends to choose GPU1 in this case. This is shown as a GPU-SSD distance matrix, which gives the distance and the preferable GPU for each NVMe device. That is a key part of the query optimization. A minimal sketch of such a mixed partitioned table follows.
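Here is a minimal sketch, assuming yearly range partitions; foreign tables have been usable as partition leaves since PostgreSQL 11, and the arrow_fdw server name and file option follow the previous sketch.

    -- Partitioned log table: the current year stays in a writable
    -- PostgreSQL heap, older years are read-only Arrow files.
    CREATE TABLE logdata (
        ts          timestamp NOT NULL,
        device_id   int,
        temperature float
    ) PARTITION BY RANGE (ts);

    -- Current year: normal heap table, receives new INSERTs.
    CREATE TABLE logdata_2020 PARTITION OF logdata
        FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

    -- Past year: Arrow foreign table, exported once and never rewritten.
    CREATE FOREIGN TABLE logdata_2019 PARTITION OF logdata
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
        SERVER arrow_fdw
        OPTIONS (file '/opt/arrow/logdata_2019.arrow');

A query on logdata then transparently reads both the heap and the Arrow partitions, and partition pruning on ts skips the years a query does not touch.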
So those are the three technologies I introduced; they are the key to achieving beyond-a-billion-rows-per-second processing.

We then ran the Star Schema Benchmark on a larger-scale HPC server. First of all, we can expect an 8 to 10 GB per second physical data transfer rate per unit of one GPU and four SSDs, extrapolating from the small 1U benchmark result. Once we move the dataset into columnar format, we see 25 to 30 GB per second of effective data transfer throughput, since the columnar format allows loading only the columns referenced by the query, and usually those are very limited. And once we run the query execution on all four units in parallel, the effective data transfer rate is expected to reach 100 to 120 GB per second. It is an effective rate, but it also means we could process 100 TB of log data within about 20 minutes (100 TB at 100 GB/s is roughly 1,000 seconds, or about 17 minutes), depending on the requirements. That is a reasonable response time for large-scale batch jobs.

So we built a benchmark environment using a 4U HPC server with 4 GPUs and 16 SSDs. Here is a photograph: these are the Tesla V100 GPUs, with the SSD connections next to them; the NVMe SSDs are installed in JBOF enclosures, connected one to one. We built a partitioned table on lineorder, distributed into four child tables using hash partitioning. In total, the database size is 3.5 TB, and each individual partition is up to 880 GB.

The result was quite good. When we ran the Star Schema Benchmark queries, plain PostgreSQL processed about 65 million rows per second. In the case of row-based SSD-to-GPU Direct SQL by PG-Strom, it processed about 250 million rows per second. And there are some query patterns that exceeded a billion rows per second. That is the result with 4 GPUs and 16 SSDs.

Here is another aspect of the benchmark result: we visualized the I/O statistics during query execution. This span is the execution of query Q1.1; the next ones are Q1.2, Q1.3, and so on. It shows that each group of SSDs sustains about 10 GB per second of read throughput, so each individual partition achieved 10 GB per second; in total, we can read the dataset from the SSDs at 40 GB per second while processing it. That is almost the hardware limitation. That is the result I wanted to introduce; I had planned to present it at the Postgres conference in New York City last March, but it was cancelled.

For the last part of today's talk, I would like to introduce our future directions. One movement I would like to highlight is GPUDirect Storage, recently announced by NVIDIA. That is almost equivalent technology to our own software stack in the Linux kernel: this new API mediates data transfer from SSD to GPU, but it does not need a special kernel module. So as long as applications support GPUDirect Storage, we don't need to install our own Linux kernel module. Some other third-party vendors also support GPUDirect Storage, which means PG-Strom can use those third-party solutions, like supported SSDs or software-defined storage over fabric networks, and so on.

One other new feature is PostGIS support. IoT and M2M data often contains geolocation data, latitude and longitude. These location data accumulate day by day, and people want to analyze them against predefined geometric polygon data. That is a very common workflow, and I was asked about it many times, so I am implementing it. Currently, the GPU version of PostGIS supports a very limited set of functions, such as distance and containment checks, and we are also developing GiST index support for the GPU. That is a good fit for GPU characteristics, since a GPU can run a very large number of threads in parallel: we can search the R-tree index for each point gathered from mobile devices, cars, drones, and so on. Once we can search the R-tree index with thousands of threads in parallel, I expect it to be a much faster option than a CPU-based index search. A sketch of the kind of geofencing query this targets follows.
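As an illustration, a geofencing query of that shape might look like the sketch below. The table names are hypothetical, and whether each function actually runs on the GPU depends on the limited set of PostGIS functions PG-Strom currently covers; treat the function choice as illustrative.

    -- Count GPS points reported by devices that fall inside each
    -- predefined area polygon (assuming matching SRIDs, here 4326).
    SELECT a.area_name, count(*)
      FROM areas a      -- predefined polygons, e.g. city districts
      JOIN gps_log g    -- billions of (lon, lat) points from devices
        ON st_contains(a.geom,
                       st_setsrid(st_makepoint(g.lon, g.lat), 4326))
     GROUP BY a.area_name;

The point-in-polygon test is evaluated independently for every point, which is why thousands of GPU threads searching the index in parallel can pay off.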
So that's all I wanted to talk about today. Here is the repository of PG-Strom and the package distribution; documentation is provided in English and Japanese. The contact address is here; it is an email address, and you can also ask me any kind of question over Twitter: @kkaigai (myself) or HeteroDB. Thank you for joining my talk.