Hello, good afternoon everyone. My name is Kisung Kim. I'm the CTO of Bitnine Global, and I'm presenting a graph database project based on PostgreSQL called AgensGraph. Before we start, let me introduce our company, Bitnine Global. Bitnine Global is headquartered in Seoul, Korea and was founded in 2014. There is also an R&D center in Santa Clara, California. We provide technical services for PostgreSQL and big data systems in Korea, and we have partnerships with IBM and Cloudera. We have core technologies for data processing, and currently we are developing a graph database system called AgensGraph.

This is the outline of my talk. I'll introduce graph databases and our motivation to develop a graph database based on PostgreSQL. Then I'll introduce our product, AgensGraph, and explain how we implemented it using PostgreSQL. Finally, I'll present AgensGraph's performance results and conclude with our future roadmap.

A graph database is a change in data representation. Gartner says it is a radical change in how data is organized and processed. Let's compare the graph and relational data models. In a relational database, every piece of data is a row. Data is managed as rows of tables regardless of whether it is an entity or a relationship. And if we want to retrieve some data, we usually have to join the related tables ourselves; the database does not provide any special facilities to handle relationships. But in a graph database, the data model uses nodes and relationships rather than rows and tables. The relationship is a first-class citizen in the database system. This is the biggest difference in a graph database: relationships are explicitly managed and handled directly by the database engine. In a graph database, you can create nodes and relationships explicitly, and you can query in terms of relationships. So we can make our data more connected, and managing that connected data becomes much easier. Let's see how simple graph modeling is with an example.
In this example, sellers have their products, and customers order and review these products. If we represent this as a relational model, we make tables for both the entities and the relationships. For example, the order table has the information about which customer ordered which products, so the order table stores relationship information. But from the viewpoint of the database, it is just a table. In a graph database, this real-world model maps directly to the graph model: the relationships between entities are stored as relationships in the database. So you can see that the data model of a graph database is much simpler than the relational one.

To show the conciseness of graph queries, I want to introduce the Cypher query language, which was created by Neo Technology. Cypher is a declarative query language for the property graph model. It is inspired by SQL and by SPARQL, the standard query language for RDF. Cypher borrows many clauses from SQL, like WHERE and ORDER BY, and its graph pattern matching features from SPARQL. Neo Technology opened the Cypher language as a public project, openCypher. To see how concise a query can be in Cypher, let's look at an example. This is a Cypher query for finding ancestor-descendant pairs in a graph. In SQL, we use a recursive common table expression to express this kind of query, and we have to specify in the query how to retrieve the relationships using joins with the CTE. As you can see in this example, it is not so simple. But in Cypher, this kind of relationship can be expressed in an intuitive form like this. The pattern is described using ASCII-art diagrams, so it is easier to read and understand the meaning of the query than the SQL version.

Graph databases are gaining popularity. There are already many graph database vendors, and existing database vendors like IBM, Oracle, and SAP also provide graph features in their products.
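The CTE-versus-Cypher comparison can be sketched like this, assuming a hypothetical `parent_of` relationship over a `person` entity (the names are illustrative, not from the slides):

```sql
-- SQL: ancestor-descendant pairs via a recursive CTE,
-- over a hypothetical parent_of(parent_id, child_id) table
WITH RECURSIVE ancestry AS (
    SELECT parent_id AS ancestor, child_id AS descendant
    FROM parent_of
    UNION ALL
    SELECT a.ancestor, p.child_id
    FROM ancestry a
    JOIN parent_of p ON p.parent_id = a.descendant
)
SELECT ancestor, descendant FROM ancestry;
```

```sql
-- Cypher: the same pairs as one variable-length path pattern
MATCH (a:person)-[:parent_of*]->(d:person)
RETURN a, d;
```

The Cypher version leaves the join strategy and the recursion entirely to the engine; the pattern itself is the whole query.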
And recently, NoSQL vendors like MongoDB and Elasticsearch have also added graph features to their products. So I think it is time to provide graph features in PostgreSQL too. There is a broad spectrum in the graph database area. What we want to implement is a graph database that provides the property graph model and the Cypher query language, ACID transactions, and graph analysis functionality. We chose to implement a graph database based on PostgreSQL because it already has the great features we need, like a storage engine, a transaction layer, and a cost-based query optimizer. Of course, each module should be extended and optimized for graph workloads, but PostgreSQL gives a good starting point. Recently, we released AgensGraph version 1.1; it is a fork of PostgreSQL for graph data. At first, we considered developing it as an extension, but we could not, because we have to modify the query parser and add plan nodes for graph processing. You can visit our homepage, though it is currently under development, and AgensGraph is an open source project, so you can clone the source code from our GitHub.

From now on, I'll explain some features of AgensGraph and how we implemented them based on PostgreSQL. Let's start with AgensGraph's data model, which is the property graph model. The property graph model is a graph model in which vertexes and edges can have properties, and vertexes and edges can be grouped into labels. For example, you can make vertex labels like person, student, or teacher to distinguish them. We extend this model in two ways. First, we allow vertexes and edges to have JSON documents as their properties, rather than flat key-value pairs. These properties can be indexed using any kind of index PostgreSQL provides; this is one of the great benefits of building on PostgreSQL.
Second, we introduced a label hierarchy for the property graph model, so you can organize labels as a hierarchy. AgensGraph supports multiple graphs, so you can make several graphs in one database. Before I show you an example, let me introduce more details about Cypher. Cypher has many clauses. Some of them are borrowed from SQL, so their meanings will be familiar to you. For graph pattern matching, you use the MATCH clause, and for updating the graph, you can use CREATE, MERGE, or SET. This is the railroad diagram for Cypher syntax; you can make a query by chaining these clauses as you want. I'll show you an example of this later.

This is an example of AgensGraph. We provide several DDL statements to create graph objects and vertex and edge labels. The first statement, CREATE GRAPH social_network, makes a graph object called social_network. Then you can make vertex and edge labels. In this example, I made the vertex labels person, conference, and community, and then I created two edge labels, attend and related. So using these DDL statements, you can create graph objects and vertex and edge labels. And if you want a label hierarchy, you can create a label using the INHERITS keyword, just like PostgreSQL provides for tables. Then you can make vertexes and edges using Cypher queries like this. The first query makes a vertex whose label is person and whose properties are described in JSON format. Then I made two more vertexes, one conference and one community. After that, to check these vertexes, we use MATCH queries. In this MATCH query, there is a single node pattern, so the query returns all vertexes in the graph, and we can see there are three vertexes with the labels person, conference, and community. Then we can create property indexes for interesting properties like this; you just need to specify the property names to be indexed with their data types. And you can create relationships using Cypher MATCH and CREATE clauses together, like this.
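Putting the DDL and Cypher statements described above together, a session might look roughly like this (a sketch based on AgensGraph 1.x syntax; the property values are made up, and exact syntax may vary between versions):

```sql
CREATE GRAPH social_network;
SET graph_path = social_network;

-- vertex and edge labels
CREATE VLABEL person;
CREATE VLABEL conference;
CREATE VLABEL community;
CREATE ELABEL attend;
CREATE ELABEL related;

-- Cypher DML runs in the same session as SQL
CREATE (:person {name: 'Kisung'});
CREATE (:conference {name: 'PGConf'});
CREATE (:community {name: 'PostgreSQL'});

-- a single node pattern matches every vertex in the graph
MATCH (n) RETURN n;

-- property index on a JSON property of the person label
CREATE PROPERTY INDEX ON person (name);
```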
To make a relationship, we first have to find the vertexes to be connected, so we use MATCH and CREATE clauses together in a single query. This example first finds the vertexes to make relationships among them. The MATCH clause finds a person vertex and assigns it to a variable, finds a conference and assigns it to the variable b, and finds a community. Then the next clause, CREATE, creates relationships between those vertexes. After creating the relationships, I run MATCH queries again, and this time I use a path pattern. This MATCH query finds who attended a conference where the conference is related to some community, and we can find this information.

AgensGraph is not a layered architecture; rather, it is developed at the level of the core PostgreSQL engine. We implemented graph objects and designed storage layouts to store graph data, and we extended the PostgreSQL query engine to process Cypher queries. For that, we modified the PostgreSQL query engine, including the parser, optimizer, and executor, but we kept the storage and transaction layers, so most PostgreSQL external modules and extensions can be used with AgensGraph too.

This is the storage layout. We use heap tables to store graph data and B-tree indexes to index edge and vertex information. The heap table for vertexes has two columns, vertex ID and properties, and the edge table has four columns: edge ID, start and end vertex IDs, and properties. There are two composite indexes on the edge table, (start, end) and (end, start). We build these two indexes to quickly find the incoming and outgoing edges of a specific vertex, and we use composite indexes to exploit index-only scans for graph traversal where possible. We haven't implemented a dedicated storage engine for graph data yet. This storage layout works well for medium-sized graph data, but we plan to implement one for handling large-scale graph data. Cypher queries are processed in the same query engine as SQL.
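The storage layout described above corresponds roughly to the following relational shape (illustrative only; the actual types, catalog names, and per-label layout in AgensGraph differ):

```sql
-- one heap table per vertex label: id + JSON document properties
CREATE TABLE person_vertex (
    id         bigint PRIMARY KEY,
    properties jsonb
);

-- one heap table per edge label: id, two endpoints, properties
CREATE TABLE attend_edge (
    id         bigint PRIMARY KEY,
    "start"    bigint,   -- start vertex id
    "end"      bigint,   -- end vertex id
    properties jsonb
);

-- two composite B-tree indexes: one for outgoing edges of a
-- vertex, one for incoming edges, both usable for index-only scans
CREATE INDEX attend_out ON attend_edge ("start", "end");
CREATE INDEX attend_in  ON attend_edge ("end", "start");
```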
We integrated Cypher query processing with the SQL engine starting from the parser, so you can use PostgreSQL expressions or functions in Cypher queries where they are allowed. The result of a Cypher query is a relation, so we treat a Cypher query as a subquery, and the existing query optimization techniques can be applied to Cypher queries too. As I already mentioned, in Cypher several clauses can be used in a single query. Basically, each clause produces results, and the following clause reads those results and processes them. When processing a Cypher query, the query is first transformed into a query tree: each clause is transformed into a Query structure in the PostgreSQL engine. A MATCH clause becomes a query with joined tables, and the chained clauses are combined as subqueries.

Let's look at this example. This query is composed of three clauses: two MATCH clauses and one RETURN clause. When we process this query, the first MATCH is transformed into a query with three tables as inputs. In the upper picture, you can see the Query structure for the first MATCH, with the actor, acting, and movie tables as inputs. To process the second MATCH, we create another Query structure which includes the first one; the Query structure for the first MATCH becomes an input to the second. After we create a Query structure like this, the PostgreSQL query optimizer optimizes these queries further. In this case, subquery pull-up can be applied, and after applying it the query structures merge into one.

We implemented some new plan nodes to improve query performance. As an example, I'll present the VLE plan node. VLE means variable-length edges. A VLE pattern is a concise way to express a range of matching path lengths. We already saw VLE patterns in a previous example. In this example, in the MATCH clause, there is a pattern whose length is variable.
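Variable-length edge patterns look like this in Cypher (a sketch; the `knows` edge label and `person` vertex label are made up for illustration):

```sql
-- '*' marks a variable-length edge; a range can bound the length
MATCH (a:person)-[:knows*1..3]->(b:person)   -- paths of length 1 to 3
RETURN a, b;

-- with no bounds, any path length >= 1 matches
MATCH (a:person)-[:knows*]->(b:person)
RETURN a, b;
```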
We use the asterisk character to express variable-length edges. This kind of query can be implemented using a recursive common table expression, as we saw previously, but we found that the CTE approach has some performance issues. The CTE approach is a breadth-first-search approach which needs to buffer all its intermediate results in memory. To improve performance, we implemented a new plan node for VLE patterns which performs depth-first-style processing and needs no buffering. Using this VLE plan node, we improved VLE pattern performance by two to three orders of magnitude. This is the EXPLAIN output for this Cypher query. The query uses a variable-length edge pattern, so you can see there is a new plan node called Nested Loop VLE. This node conducts the graph traversal in a depth-first-search fashion.

There are many considerations in graph query processing; among them, I want to explain three points here. First, for graph pattern matching it is usually more efficient to use random page reads rather than sequential reads, so it is recommended to decrease the random_page_cost parameter value. Graph processing uses random I/O extensively, so it is more efficient to cache the data in memory or store the graph data on SSDs. Second, we designed the edge indexes so that index-only scans can be used when possible; graph traversal relies on the edge indexes, so it is more efficient to use index-only scans on them. And finally, query optimization is crucial for graph processing, but it is usually harder than for SQL queries because graph queries involve more joins than normal SQL queries. We found that the PostgreSQL optimizer usually works well even for graph queries, but it needs to be improved, and that requires more research.

We conducted a performance comparison using the LDBC benchmark. LDBC is a consortium for making benchmark tools for graph workloads, and big companies like Oracle, IBM, Huawei, and SAP participate in this project.
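The first two tuning points above can be applied with ordinary PostgreSQL settings; the values here are illustrative starting points of my own, not recommendations from the slides:

```sql
-- graph traversal is dominated by random reads, so lower
-- random_page_cost (e.g. for SSD storage or a mostly cached graph)
SET random_page_cost = 1.5;

-- keep index-only scans enabled so the composite edge indexes
-- can be traversed without heap fetches
SET enable_indexonlyscan = on;
```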
There are three benchmark workloads. The first one is the Social Network Benchmark, which simulates a social network service in which users create friendships, write posts and comments, and can express that they like a post or comment. It is a kind of OLTP workload, similar to TPC-C. There is also a graph analytics benchmark which conducts large-scale graph analytics. In this experiment, we used the SNB Interactive workload, and this is the result. We compared AgensGraph version 1.0 with Neo4j 3.1, the newest version. I want to emphasize that even though we optimized both databases as much as we could, the result could change at any time according to the server configuration. For now, the results show that AgensGraph is much faster than Neo4j for both updates and complex queries. The left side is the performance for update queries, and the right side is for complex queries. The update queries are relatively simple: if a person becomes friends with someone, the benchmark creates a relationship between the two people, or a person writes posts or comments. So these are very simple update queries. The complex queries are mostly read-only, and their query patterns are very complex, so they take more time than the simple update queries. But for both workloads, AgensGraph is much better than Neo4j.

Neo4j is the most popular graph database system. They argue that they are the most optimized for graph data, with native storage and a native graph processing engine. But although we use PostgreSQL, a relational database, as the back-end system, we showed that our system performs better than Neo4j. There are several reasons AgensGraph is better than Neo4j. For the update workloads, it is because of the efficient concurrency control of PostgreSQL; I think Neo4j is not yet mature for high-concurrency workloads.
And for the complex queries, as I already said, the query optimizer is very important for performance. When the query pattern becomes complex, it becomes harder to optimize the query because there are many joins, and we found that the PostgreSQL query optimizer is very good at optimizing complex queries. Furthermore, we can improve performance by using PostgreSQL's advanced features.

Now I'll conclude my presentation with our future roadmap. First, we plan to extend AgensGraph to distributed and parallel systems; currently, we are considering applying Postgres-XL to AgensGraph. We are also implementing a graph analysis framework based on the vertex-centric programming model. The vertex-centric programming model is a programming framework for graph analysis algorithms with a very simple API, so users can easily implement their own algorithms with it. And finally, we are also considering integration with big data systems for large-scale graph processing. Currently, we are considering making a Hadoop file system FDW using a native C++ library, and we are also considering external FDW modules for the Parquet data format. Parquet is a columnar data format developed by Cloudera. I think with this kind of integration we can extend AgensGraph to large-scale graph processing. Yes, that's it. Thank you for your attention. Any questions?

Yes? You mean a self-relationship? Yeah, it is possible. Yeah, yeah, it is possible. Sorry, I'll repeat your question, please. What? Edges have directions, so when you design the graph model, you have to decide the direction of each relationship. Yes? Yes. The first thing is that Cypher is very popular, and we found that Cypher is easier to understand than other languages. I saw OrientDB's language; yeah, it is good.
But we found that Cypher is more intuitive because it uses ASCII-art-like diagrams, and it is very similar to SPARQL. And Neo4j currently wants Cypher to become a standard query language; they created the openCypher community and are encouraging large companies like Oracle to participate in the standardization. So I think Cypher will be developed further in the future; that's why I chose it.

Yes? Yeah, GraphQL has become very popular today, but I think it is not for graph databases; rather, it is more general. I don't know how to say it, but in GraphQL we cannot express query patterns for variable-length edges. It looks more like a query language for JSON documents, I think, so the name is very confusing.

Yes? So your question is a comparison between the graph model and the relational model? Yeah, it's a good question. Actually, in the LDBC consortium, there are relational database vendors like OpenLink (Virtuoso) and SAP. They use the relational model to run the LDBC benchmark, and the result is that the relational databases win. I think it is because graph databases add more abstraction on top of relational systems, and one reason the relational databases perform better than graph databases is that relational databases are more mature and more optimized. Actually, Virtuoso provides columnar storage, so they compress the data extensively, and I think that is why Virtuoso can beat the other graph databases. But a graph database provides other benefits for managing graph data, because we can write queries like "give me any vertex" and "give me any relationship", and you cannot write that kind of query in a relational database, because there is no concept of vertexes and relationships there. So I think the performance will improve in the future, and graph databases have a benefit for data modeling; the conciseness can help clients. I think it depends on the machine's resources, especially the size of memory.
If you can load all of the graph data into memory, then you get the best performance. So for a medium-sized graph, we can say two or three times the memory size. But if the data exceeds the main memory size by a lot, say 10 or 20 times, then it can be classified as large-scale. Yes, we provide our packages as a single bundle, including the PostgreSQL and AgensGraph modules. Thank you very much.