Thank you, thank you. This is a session about the Neo4j Streams project. I'm Andrea Santurbano, and I work on the project together with Michael Hunger. I work at LARUS, which is a system integrator and Neo4j partner. LARUS was founded in 2004 in Venice by Lorenzo Speranzoni. We became the first Italian Neo4j partner and we have a strong presence in the Neo4j ecosystem: we are the creators of the official Neo4j JDBC driver and of the Apache Zeppelin interpreter for Neo4j, we built the Neo4j ETL tool, and we also developed over 90 APOC procedures.

Let's see the agenda. We will talk about the Neo4j Streams project, starting with a brief introduction to Apache Kafka and to how we combine Neo4j and Kafka. Then we will go through the three modules of the Neo4j Streams project: the change data capture module, with a demo; the sink, with a demo; and, in the end, the streams procedures. The demos all run in Docker containers, and all the code is on GitHub, so after the presentation you can download it. All the demos run on Apache Zeppelin. How many of you already know Apache Zeppelin? Oh, that's great. For the others: it's something like Jupyter, a notebook runner that can connect to Neo4j or Apache Spark, for instance, and show the output of a computation directly in the web browser.

So, what is Neo4j Streams? It's a Neo4j plugin that enables Kafka streaming on Neo4j. And, as I said before, what is Apache Kafka? It's a distributed streaming platform. It allows publishing and subscribing to streams of records, it stores those streams of records in a fault-tolerant and durable way, and it processes the streams of records as they occur.

Kafka is based on two main concepts: topics and partitions. A topic is something like a container for events of the same type: it's a category or feed name to which the records are published, and each topic can have one or more subscribers. Each topic is organized into one or more partitions, and each partition is an ordered sequence of records. The position in this sequence is identified by a number called the offset, which is unique and sequential within the partition. The consumer is in charge of committing the offset, and this allows the consumer to read the stream of records however it wants: it can start from the beginning of the stream, from a specific offset, or from the last committed offset, because the commit is up to the consumer.

How is Kafka used? It's used for two general classes of applications. The first is building real-time streaming data pipelines: think about real-time ETL pipelines, where you can use Apache Kafka in combination with Apache Spark. The second is building real-time streaming applications: think about a microservices environment where you need to exchange messages between microservices; Apache Kafka is well suited to that use case.

So, what is Neo4j Streams? It's a Neo4j plugin, as I said in the introduction, that integrates Neo4j with Apache Kafka. The project was started about one year ago, if I remember correctly, by Michael Hunger. At that time there was only the change data capture module. When I joined the project in October, we added the sink module and the streams procedures, and we also built a Kafka Connect plugin, which we'll see in the following slides. So, what is change data capture?
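To make the offset mechanics concrete, here is a minimal Java sketch, not from the talk, of a consumer choosing where it starts reading; the broker address, topic name, and group id are illustrative:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker
            props.put("group.id", "demo-group");              // illustrative group id
            props.put("enable.auto.commit", "false");         // we commit explicitly below
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("my-topic", 0);
                consumer.assign(List.of(partition));
                // Pick a starting point: the beginning, a specific offset,
                // or the last committed offset (the default after assign()).
                consumer.seekToBeginning(List.of(partition));
                // consumer.seek(partition, 42L); // ...or a specific offset
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync(); // the commit is the consumer's responsibility
            }
        }
    }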
In databases, change data capture is a pattern that allows tracking database changes over time so that, in response to a change, an action can be taken. What are well-suited use cases? Change data capture is used, for instance, in data warehouses, because a data warehouse needs to track how the data changes over time, for example in its dimensions. It's also used in database replication, because the transaction log is a log of change data events.

How does it work? Once you install the jar into the Neo4j plugin directory, and once you configure your Kafka endpoints, the plugin is automatically up and running. The change data capture module basically installs a hook on each transaction: at each commit there is a listener that exposes events about creations, updates, and deletes of nodes and relationships. Each event provides information about the data before and after the change, and the change data capture plugin allows you to configure property filtering for each topic; we will see in the demo how that works. These events are sent asynchronously to Kafka, so the commit path should not be affected by them.

So, let's do our first demo. This is the Apache Zeppelin environment, and here is our first notebook; every box is a paragraph. Let's talk about how the Neo4j Streams CDC module deals with database changes. Think about this creation event: it's just one transaction in which we are creating two nodes and one relationship. The CDC module will unpack this transaction into three events, two node creations and one relationship creation, and each of them is streamed by the CDC module with the following general structure. Can all of you read it? Great. This general structure was inspired by Debezium: we have a meta field, which contains all the transaction metadata, and we have the payload field, which contains all the data related to the transaction, with the before and after information.

So, let's create our first consumer. Oops, sorry, I need some space. Let's do a count just to be sure that the database is empty, and let's create our first transaction. These are the events that are streamed by our module. Let's take the first event and beautify the JSON in order to show the structure. In the payload we have the id, which is the identifier of the entity, and the type, which tells us that the entity in this case is a node, and then we have the before and after information. The before is null because this is a creation event, so we don't have any data related to that id before the transaction. The after field contains two fields: properties, which contains the properties of the node, and labels, which are the labels of the node. Then we have all the transaction metadata: the timestamp, the username that made the transaction, the transaction id, and two fields about the transaction events. As I said before, this transaction creates two nodes and one relationship, so the transaction event count is three, and the transaction event id identifies each event inside the transaction. We have the operation, which is "created", and the source, which is all the network data related to the Neo4j instance. I created these extra paragraphs in order to let the people who download the code from GitHub understand what happened, so I will skip them today.
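As a reference, this is roughly what such a node-creation event looks like on the topic; this is a sketch reconstructed from the description above, and the exact field names and values are illustrative:

    {
      "meta": {
        "timestamp": 1532597182604,
        "username": "neo4j",
        "txId": 3,
        "txEventId": 0,
        "txEventsCount": 3,
        "operation": "created",
        "source": { "hostname": "neo4j-instance" }
      },
      "payload": {
        "id": "1001",
        "type": "node",
        "before": null,
        "after": {
          "labels": ["Person"],
          "properties": { "name": "Andrea" }
        }
      }
    }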
So, now let's perform an update on a node: we basically take the node Andrea and we set a new field called surname. Let me get some space again. Okay, let's see what happened. We have just one node event; let's check the structure. Now we also have the before information, because we have the data before the database change: the only property of the node was the name, Andrea. The after information contains the new property, surname, and the operation changed accordingly to the update.

Now, let's see the structure for a relationship; remember the first transaction. This is the structure of our relationship event. In addition to what we have for nodes, we have the start and the end fields, which carry the information about the start and the end node of the relationship. In Neo4j this is fixed information: once you create a relationship between two nodes, you cannot change its start node or end node. The other information is basically the same: we have the before and after information, the label, which is the relationship type, the start and end nodes, and the type, which is the entity type, in this case a relationship.

Let's go further and see what a deletion event looks like. Here we have only the before information, so only the data before the change and no after, because the data is deleted: there is no data after a deletion operation. And the operation field changes accordingly. In the same way, for a relationship we have the start and the end node and only the before information. Okay, just to be sure, let's run the count again.

Now let's talk about the property filtering that I introduced in the slides. You can configure property filtering in your neo4j.conf using this general configuration: we have a prefix, streams.source.topic.nodes, then the topic name, and then the filtering configuration. The syntax of the filtering configuration is the following: we have the label that identifies the nodes, and then inside the braces we can put the properties; the star is a wildcard. There are two kinds of property patterns: we can say that we want to include prop1 and prop2 for Label1, and that for Label3 we want to exclude prop1 and prop2. A full example could be a person topic: we want to have only the Person nodes, and we want to include in our filter the properties socialId and age.

Now I need to go into the compose file to configure the person topic, and let me restart the database. Just to be sure that the database is up and running... it takes a while... it should be online now. Okay. So, the topic is person, and now we will create a new transaction in which we create two Person nodes and one relationship between them. Each Person node has three properties, age, name, and socialId, and by our configuration we are excluding the name. Okay, let me clean the terminal, launch the transaction, and see how the CDC module deals with this property filtering. As you can see, only the properties selected by the filter are included in our events. Now let's see what happens if we add a property to a node that is not included in the filtering properties: it should not be shown in our event. Because it's an update event, we now have the before and after information, and in both of them, as you can see, the property is filtered according to our configuration.
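The configuration used in this demo would look roughly like the following sketch in neo4j.conf; the topic and property names match the demo, but treat the exact spelling of the keys as illustrative:

    # Publish Person node changes to the "person" topic,
    # keeping only the age and socialId properties
    streams.source.topic.nodes.person=Person{age, socialId}

    # Equivalent exclusion form, since Person nodes here
    # only carry age, name and socialId:
    # streams.source.topic.nodes.person=Person{*, -name}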
Now let's change a property that is included in our configuration: we will update the age, and let's check how our event changes. We have the before information, where the age was 23, and the after information, where it is 34, filtered according to our configuration. Now, let me get some space, and let's see the deletion event: again, according to our configuration, we have only the two included properties. So, thank you; this was the demo of our change data capture module.

Now let's talk about the second module of the Neo4j Streams project, the sink. The sink allows you to ingest data into Neo4j. Initially we thought about a generic consumer with a fixed projection of the event into nodes and relationships, but then we thought about something smarter, let me say: we want to give the user the power to transform any Kafka event into an arbitrary graph structure. Let's see how we allow this. We have a general configuration in which, for each topic, you can provide your own Cypher template query. For instance, here we have our configuration for a generic topic called my-topic, and on the right side we have our Cypher template query. As you can see, it contains a special reference to an event property: this event refers to the Kafka event. Basically, the Neo4j Streams sink module takes a batch of Kafka events, unwinds those events, and executes our templated Cypher query, so the final statement for a configuration like this is exactly that Cypher query: an UNWIND over the events followed by the template.

As I said before, we also have a Kafka Connect sink plugin; we released it two weeks ago... no, one week ago, sorry. What is Kafka Connect? It's an open-source component of Apache Kafka that allows Kafka to deal with external systems such as databases, file systems, and so on. The Kafka Connect sink plugin works in exactly the same way as the Neo4j Streams sink module; the only difference is that the Kafka Connect plugin must be installed into Kafka. Our plugin is available on Confluent Hub, which is a marketplace of Kafka Connect plugins.

So, let's see the second demo. This is our configuration, and this is our query; we will talk about the query later. We have an open data endpoint: the open data, from the Italian Ministry of Health, is about pharmacy stores. We basically download the CSV from this open data API, then we read it via Apache Spark, and from Apache Spark we publish this data to a Kafka topic that is intercepted by the streams plugin, and these Kafka events are transformed into this graph data structure.
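Before looking at that graph structure: in neo4j.conf the sink configuration looks roughly like this minimal sketch, where the topic name and the Cypher template are illustrative and event is the reference to the Kafka message:

    # Enable the sink and attach a Cypher template to a topic
    streams.sink.enabled=true
    streams.sink.topic.cypher.my-topic=MERGE (p:Person {id: event.id}) SET p.name = event.name

    # The module batches the incoming events, so the statement
    # that actually runs is roughly:
    #   UNWIND {events} AS event
    #   MERGE (p:Person {id: event.id}) SET p.name = event.name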
So we have four nodes: the pharmacy, in red, which is connected to a pharmacy type on the right side by an IS_TYPE relationship; the pharmacy is connected to an address by a HAS_ADDRESS relationship; and the address is connected to a city by an IS_LOCATED_IN relationship. Our Cypher query basically transforms each event into this graph structure.

Once you go home and try the notebook, please run this paragraph before using the whole notebook; I did it beforehand in order to save some time for the demo. We set the URL of the open data API; I pre-downloaded the data set from the URL, again to save some time. Then we load the data into a Spark DataFrame and print the schema. Our CSV file is composed of several columns; let's sample some data to show the structure: this one is basically the address, the next column is the pharmacy name, and so on. Then we store the data as a temporary view and we do some transformations; I will not focus on the ETL stuff, because it's not the goal of today, but I want to show you this table. We are creating here an open data Kafka stage table, just to let you know, which is a table composed of two columns: the first one is the key, the second one is the value, so it is basically the producer record of the Apache Kafka API.

Now let's create the constraints on Neo4j in order to speed up the import process, let's clean the database, and let's run a count just to be sure that the database is empty. In this paragraph we basically take the open data Kafka stage table and send it to the pharma topic, which is the topic that we configured with our templated Cypher query. Let's go back to the previous paragraph and see how our counts are going up. Let's now run a query over our data set: with this query we are returning all the pharmacy stores in the city of Turin. The node here in the bottom left is the city of Turin, these nodes in blue are our pharmacies, and the others are the addresses. Okay, let's go back to the slides.

So, the third pillar is the streams procedures. With the streams procedures we want to allow users to deal with Kafka topics directly from Cypher. There are basically two procedures. The first is streams.publish, which allows streaming custom messages from Neo4j to the configured Kafka environment using the underlying configured producer; it basically shares the configuration with the producer of the CDC events. The second procedure is streams.consume, which allows consuming messages from a given topic. Let's see how.

In the demo we first create a new consumer on the topic my-topic. The streams.publish procedure takes two arguments as input and basically returns nothing, as it sends the payload asynchronously to the stream. The first argument is the topic, so where we want to publish our data, and the second one is the payload, the data we want to stream. The streams.publish procedure supports all the Neo4j data types, so strings, numbers, nodes, and relationships as well, and if you are sending a node or a relationship to a topic that is configured with a property filtering configuration, the entity will be filtered according to the provided configuration. So let's publish our first event and look at its structure. Okay, as you can see here, the event generated by the streams.publish procedure is, let me say, quite different from a CDC event.
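As a sketch of what the demo runs, the two calls look roughly like this in Cypher; the topic names, the payload, and the timeout value are illustrative:

    // Publish a custom payload to a topic; returns nothing,
    // the send is asynchronous
    CALL streams.publish('my-topic', 'Hello from Neo4j')

    // Consume events from a topic, with a timeout (in milliseconds)
    // passed in the configuration map, and use them directly in Cypher
    CALL streams.consume('person-topic', {timeout: 5000}) YIELD event
    RETURN event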
The event has a payload field, which is the data we streamed, and the value is exactly what we published with our procedure: "Hello from Neo4j". Now let's publish something a little bit different, an array of maps, as you can see here, and let's check that our data has been streamed exactly as it was passed to the procedure.

Now let's see how the streams.consume procedure works. It also takes two arguments as input, and it returns a list of collected events. The first argument is the topic, and the second one is the configuration, a map that we can pass; for instance, we can pass the timeout for which we want to listen for Kafka events. Let's now create a producer. Okay, what we will do now is send some data from the producer that we have just created, and consume that data with our procedure. The producer starts to publish data to the person topic; now let's start the procedure. Okay, as you can see, after a timeout of 5 seconds we get the events that were streamed by our producer, and you can use them directly in Cypher in order to create arbitrary graph structures, or to do whatever else you want, directly from Cypher.

Okay, let's go to the lessons learned, what we have seen today. We have seen how to use the CDC module in order to stream transaction events from Neo4j to other systems, and we have also seen how our CDC module allows us to define, for each topic, a property filtering configuration, which is composed of inclusions or exclusions. We have seen how to use the sink module in order to ingest data into Neo4j by providing our own business rules; remember, you can transform any Kafka event into an arbitrary graph structure. And in the end we have seen how to use the streams procedures in order to deal with those data directly from Cypher.

So please give us feedback if you try the project; we need feedback from the community. This is the link to the project on GitHub, and this is the code repository of the demos that I showed today. So, thanks to everyone. I think we have some time for questions.

I'll repeat the question here: he asked if the project works with the Schema Registry. The streams plugin doesn't work with the Schema Registry, but the Kafka Connect plugin does. Okay, you're welcome. Any other questions?

So, Michael asks how we can use the CDC module in a big environment in order to coordinate different systems; think about a data warehouse. Okay, you can build a new kind of data warehouse that leverages these change events. As I said before, CDC is commonly used to build data warehouses, so you can basically use our module in order to feed a data warehouse in real time, and you can do it by using, for instance, a big data processing framework such as Apache Spark. This is one way you could use the CDC module. Any other questions? Okay.