Hello! My name is Wojciech Trocki. Welcome to this talk, where we're going to look at how to effectively stream your data with GraphQL and Apache Kafka. Today we're going to talk about data streaming for web and mobile applications. We'll learn about GraphQL subscriptions and how to use them effectively with Apache Kafka. Finally, we'll talk about moving from development to production with Kafka and GraphQL. Let's get started.

What actually is data streaming? The term streaming describes a continuous flow of data where server applications push data asynchronously to clients, without the clients making repeated requests. Streams are by definition targeted, giving us the ability to listen to specific chunks of the data coming from a server. They're also reactive, giving us the ability to react to various business processes or events in the backend. While event streaming and technologies like WebSockets have been in constant use for more than a decade, event streaming has become even more popular thanks to GraphQL subscriptions.

GraphQL subscriptions are the subset of the GraphQL query language focused on data streaming. They give developers the ability to subscribe to and receive an unbounded stream of future data in their application. Subscriptions are a great way to implement streaming on the client, as developers can request specific parts of the stream, provide a filter from the client as well, and use widely available protocols like WebSockets and MQTT.

GraphQL subscriptions start with a subscription query that is sent from the client to the server. This query initializes the connection to start receiving the data. It can contain the name of the object type we want to query, for example users; a filter, which describes the criteria for the event stream; and the names of the fields we want to receive as part of the notification events. Subscriptions on the client usually act on specific events, for example the update, deletion, and creation of objects.

We've seen the subscription query on the client, so let's talk about how subscriptions are handled on the server. Subscriptions typically use a publish-subscribe mechanism. Publishers are handlers in our code, or external systems, that publish the data. The event bus is any queue system that supports publish-subscribe, for example Redis or another MQ system. The event bus sends messages back to subscribers based on the topic, and the subscription servers maintain a registry of the currently active subscribers connected over WebSockets, so users can receive the data back on the client.

Let's dissect how GraphQL subscriptions work to see the actual challenges when we implement them. As we saw, we can have multiple publishers that are producing events. When an event is issued, it gets delivered to all active subscribers on the client side, based on a particular topic. The challenge in this scenario is filtering. If we're looking for a specific event, for example one created by a particular user, we still need to filter through the entire topic. This can be mitigated with more topics on the publisher side, but that means our publishers need to handle the complexity of more granular topics. As we can imagine, this architecture can be hard to scale and can lead to eventual performance problems. I have seen some of these architectural challenges first-hand through the community that uses our libraries.
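To make the pieces above concrete, here is a minimal sketch of a subscription resolver with client-side filtering, written in TypeScript and assuming the graphql-subscriptions package with its in-memory PubSub; the userModified event and its fields are purely illustrative, and exact method names vary between versions of the library.

```typescript
import { PubSub, withFilter } from "graphql-subscriptions";

// In-memory event bus; in production this would be an external
// publish-subscribe system (Redis, MQTT, or, as we discuss later, Kafka).
const pubSub = new PubSub();

// Server-side resolver for a hypothetical `userModified` subscription.
// The client would send something like:
//   subscription { userModified(id: "1") { id name email } }
export const resolvers = {
  Subscription: {
    userModified: {
      subscribe: withFilter(
        // Listen on the topic that our publishers write to.
        () => pubSub.asyncIterator("USER_MODIFIED"),
        // Client-supplied filter: only forward events for the requested id.
        (payload: any, variables: any) =>
          payload.userModified.id === variables.id
      ),
    },
  },
};

// A publisher (a mutation handler or an external system) emits an event:
pubSub.publish("USER_MODIFIED", {
  userModified: { id: "1", name: "Alice", email: "alice@example.com" },
});
```

Notice that the filter function runs on the server for every event in the topic, which is exactly the scaling concern described above.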
Almost two years ago, we started working on a project called Graphback, where we implemented a generic subscription solution for relational and NoSQL databases. Graphback offers a basic CRUD API for many database engines and allows developers to detect changes using GraphQL subscriptions. We quickly realized that the plain publish-subscribe model was very limited and started exploring different options. That led us to a very popular open-source solution called Apache Kafka.

Apache Kafka is considered the holy grail of publish-subscribe mechanisms due to a number of factors. It works great as a general log of events, which fits well with client-side data streaming. It has additional features that other queues are missing, like event storage, offsets, and consumer groups, which allow developers to scale and to re-consume messages that were consumed previously. It is also easy to scale and route your data, all of that using open-source technologies.

Before discussing some architecture patterns with Kafka, let's talk about the specific cases we want to focus on. Let's imagine we want to build a mobile app where one of the application screens relies on real-time data. We should not have to rewrite our entire backend or adopt architectures that are specific to particular languages or technologies. In this context, GraphQL subscriptions should be considered a feature, not a mandate to refactor your entire backend.

When starting with event-driven architectures, it's really easy to get overwhelmed by complexity, and this is what can happen when adopting Apache Kafka. When we start reading about Kafka, we often see those Kafka-centric architectures where the backend is based entirely on a flow of events. Event-driven architectures can be hard to understand initially, and hard to debug when we notice some data inconsistency.

When I started exploring Apache Kafka, I could not find any reliable example of how to effectively use GraphQL subscriptions with Kafka in Node.js. Libraries that claimed to offer those solutions were often very simplistic and not really production-ready. Apache Kafka's ecosystem is also Java-centric, so it has a very small number of libraries for other languages, especially for streaming. This led me to the question: can we have production-ready GraphQL subscriptions with Apache Kafka and use them with ecosystems like Node.js, Python, Go, or PHP?

It would be hard to answer this question for every use case, but if our project uses one of the open-source databases, we can use a Change Data Capture stack that works on top of Kafka Connect. This technology allows developers to connect directly to the database: Kafka events are published whenever data is stored in the database. Apart from using Change Data Capture for building GraphQL subscriptions, we can also use the same stream of data to build effective text search, cache invalidation, and an audit log for our system.

Kafka Connect comes with many different connectors, mostly developed by the community. Debezium is the leading open-source solution that works on top of the Kafka Connect API, and it works with the major open-source databases. Debezium provides a standard for multiple relational and NoSQL databases, and it has been used across the industry. Debezium gives us a raw stream from our database that often needs to be filtered and transformed in order to be consumed by clients, and sometimes by other parts of our backend.
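As an illustration, registering a Debezium connector is just a matter of posting a small JSON configuration to the Kafka Connect REST API. The sketch below, in TypeScript, assumes a Postgres database; the hostnames, credentials, and table names are placeholders, and the exact property names vary between Debezium versions.

```typescript
// Register a Debezium Postgres connector through the Kafka Connect REST API.
// Runs under Node 18+ (built-in fetch) inside an ES module.
const connector = {
  name: "app-connector",
  config: {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "app",
    // Prefix used when Debezium generates topic names for each table.
    "database.server.name": "app",
    // Tables we want to scan for changes.
    "table.include.list": "public.comments",
  },
};

await fetch("http://kafka-connect:8083/connectors", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(connector),
});
```

With a configuration like this, Debezium emits change events for the public.comments table to a topic such as app.public.comments, following its server.schema.table naming convention.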
Debezium provides first-class handling for those use cases through simple properties. There is no code involved: you can manage and route things using just basic configuration. And in some cases, if we want to transform a payload, we can even use scripting languages like JavaScript. This is really important if we don't want to write and host any Java-based code.

Joining Apache Kafka, Debezium, and GraphQL subscriptions gives us a reference architecture for client data streaming. Our reference architecture starts with Debezium connecting to our database, like Postgres or MongoDB. When starting the connector, we usually specify what tables we want to scan for changes, plus additional transformations. For example, if we want different topic names or want to change the content of the payload, we can use the transforms property. The next stage is a Kafka consumer that is a separate part of our application. The consumer reads the data from Kafka and prepares it to be sent to the GraphQL engine. Then our processed streams are supplied to the GraphQL subscription engine and sent to the clients using WebSockets.

The most awesome feature of Debezium is that it does lots of the Kafka heavy lifting for us. Debezium by default creates topics representing our database structure, which is valid for most use cases. If we want a different topic layout, we can always change it in Debezium, or have a separate stream-processing engine transform it later. Topics and their underlying partitions are organized by default using primary keys, so we have a guaranteed order of events at the table and collection level. Changes happening to an individual database record will always land in a single Kafka partition, which, once again, guarantees ordering.

Let's talk about the patterns we can use to build end-to-end streaming architectures using GraphQL. The simplest option would be to consume a single event stream. This architecture typically uses one Apache Kafka topic and one consumer group. Sadly, it also ignores most of the features that Kafka offers on top of publish-subscribe, and it's impractical to use in production because it restricts how we can scale. This approach is used by most of the already-available pub/sub implementations for Apache Kafka, which try to adapt the pub/sub interface to Kafka, but as I said before, they're not very efficient.

The second approach, which is more Kafka-specific, is to make our server application store the records directly in the database as a form of projection. Those can be saved as a result of stream filtering and aggregation; we can do some preprocessing, and we can also produce separate topics in the same way. This approach is pretty much standard in the Java ecosystem, with many blog posts, and libraries like Kafka Streams playing a major role in the creation of those artifacts. It's hard to find counterparts for other languages like Node.js, and creating an extra artifact also increases the complexity of our backend. Can we have a leaner approach?

As an alternative to both patterns, we can use event-marker consumption. In this approach, we configure our producer to send only information about what changed within our database, ignoring the payload completely. When consuming, we use that information to execute a full query when the subscription event is sent to the subscriber.
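Here is a minimal sketch of such an event-marker consumer, assuming the kafkajs client (v1-style subscribe API) together with graphql-subscriptions; the topic name, the key format, and the findCommentById database helper are all illustrative.

```typescript
import { Kafka } from "kafkajs";
import { PubSub } from "graphql-subscriptions";

// Hypothetical database helper: re-reads the current state of a record.
declare function findCommentById(id: string): Promise<unknown>;

// In a real app this instance would be shared with the resolver map
// shown earlier, so published events reach active subscribers.
const pubSub = new PubSub();

const kafka = new Kafka({ clientId: "subscriptions", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: "graphql-subscriptions" });

async function start() {
  await consumer.connect();
  // Debezium-generated topic for the `public.comments` table.
  await consumer.subscribe({ topic: "app.public.comments" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      // The event is only a marker: the record key carries the primary key
      // (assuming the Connect JSON converter without schemas), and we
      // ignore the payload entirely.
      const { id } = JSON.parse(message.key!.toString());
      // Execute a full query so subscribers always receive fresh data.
      const comment = await findCommentById(id);
      pubSub.publish("COMMENT_MODIFIED", { commentModified: comment });
    },
  });
}

start().catch(console.error);
```

Because the events carry no payload, the topic stays small, and what the client receives is always the current state of the database.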
This way, we can have a lightweight system that doesn't require long-term event retention or extra storage, and that works perfectly with GraphQL. Personally, I found the marker approach really practical for GraphQL subscriptions, since a GraphQL subscription can be based on more than a single database entity. If there is any extra data that the GraphQL subscription requested, we can simply fetch it from the database. For that, we don't need advanced streaming operations like joining topics, mapping data, and so on. It's fast, and it has low storage requirements for the topic data.

However, it might need some extra tweaking in cases where our database has significant replication delays. Our handler for the event marker can read stale data from the database, because the change hasn't yet propagated through a replica set. Since our events don't carry any data, this pattern can't be effectively reused by other solutions, and we also can't go back in time to stream past changes to specific clients. While this could be seen as a limitation, we found that in practice clients can use the concept of delta queries. Delta queries typically fetch all the changes the client hasn't seen and then initialize the stream. This way, we don't rely on the stream to bring us large chunks of data; we're just consuming what's happening at the moment.

So, how will this work when connected to GraphQL subscriptions? The trick is to utilize fine-tuned consumers that fetch messages in batches. Those will usually be more efficient, because we don't need to commit topic offsets as frequently. We can also fetch from multiple topics, which gives us an efficient way to satisfy individual subscriptions. We can compact events for systems where data changes frequently: for example, if a comment was created, updated, and then deleted in our database, we won't include those events in the stream, because technically there is no net change in the system. Individual operations are later dynamically matched with active subscribers. Each of those might have a separate filter that is applied, and a message is then sent directly to the client. This architecture is fully dynamic and can scale by increasing the number of consumers and servers that hold individual user subscriptions.

While the concepts I explained in this presentation sound simple, the actual integration of these services may require lots of time. That's why there are so many open-source communities out there that can help you succeed. All elements of the reference architecture we presented here are backed by open-source projects sponsored by Red Hat. Strimzi lets us run Apache Kafka in production on the Kubernetes platform. Debezium provides reliable database-based event producers. And the AeroGear community provides examples of various GraphQL integrations that bind all of those projects together.

As this talk is limited to only 20 minutes, we showed mostly our motivation and ideas. If you have any questions or ideas, feel free to reach out to me directly. We plan to publish a series of blog posts, and source code with the example application will also be available as part of the AeroGear organization. You can reach me on Twitter, GitHub, and by work email. Since this talk is delivered remotely, I would love to get your feedback; you can leave your comments at wtrocki.com/feedback. Thank you so much for joining me today. See you soon.