So, a very quick introduction to Kafka, then some concepts about Kafka Streams, an introduction to Google Protocol Buffers, and what kind of stream processing you can do with that.

Apache Kafka started as a message queue system and has nowadays evolved into a distributed streaming platform. You can put messages into several queues, which are called topics. Each topic is partitioned and replicated, so it allows you to scale out by putting more servers into your cluster and more partitions into your topics, and each partition is usually replicated, so if a machine goes down, you still have the data somewhere else on a different machine.

One important part, especially when it comes to Kafka Streams, is the consumer group logic. You can have several consumers forming a consumer group and reading from a particular topic, which allows you to scale the reading process as well. In this case we have two different consumer groups, each doing different things, but within each group you have several threads, and they coordinate with each other to distribute the partitions among themselves. If one of the consumers in a group goes down, the remaining ones take over the work that consumer was doing, and if you add more consumers, the group again coordinates to redistribute the workload, which allows you to easily scale consuming from your Kafka cluster.

All right, so what is Kafka Streams? There have been a lot of stream processing frameworks out there, and you might wonder why there is a need for another one, why Kafka Streams is different. One very important feature is that it's a really small library, not a huge framework like you get with Spark. It doesn't have any further external dependencies: when you write a stream processing app with Kafka Streams, you just put a dependency on Kafka Streams in your Gradle build, and that's it. What you get in the end is a very simple, small Java application, usually less than 10 megabytes, and you can run it everywhere, so you can put it on YARN, Mesos, Docker, Kubernetes, all that. You can also just copy the JAR to a server somewhere and start it there, so it's pretty agnostic about how you want to deploy it, while most other stream processing frameworks have very explicit requirements on where you deploy them. Kafka Streams basically uses the Kafka consumer group logic I just introduced for scaling and fault tolerance. Other stream processing frameworks had to invest a lot of work to get elastic scaling working, while Kafka Streams got it essentially for free by reusing the consumer logic, and that logic has been battle-tested in Kafka for many years by now, so it has proven to work. And due to its very small footprint, it's ideal for microservice architectures: your different services can all run Kafka Streams applications that read from the Kafka cluster, do their transformations, and expose the results back as topics in the Kafka cluster.
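To give an idea of how small such an application is, here is a minimal sketch of a complete Kafka Streams app, assuming a current Kafka Streams version; the topic names and the filter logic are made up for illustration:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class FilterApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            // the application id doubles as the consumer group id, so all
            // instances of this app coordinate and share partitions automatically
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("input-events");
            // keep only non-empty events and write them to an output topic
            events.filter((key, value) -> value != null && !value.isEmpty())
                  .to("filtered-events");

            new KafkaStreams(builder.build(), props).start();
        }
    }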
Kafka Streams has two different APIs, both in Java at the moment: a high-level DSL, which is very similar to what you would get in Spark, with those filter, join, and map functions that you can use, and a low-level processor API, which is more explicit and at the moment also a bit more powerful; it's similar to what you get in Storm and Samza.

It also has the concept of tables and streams as separate things. Streams are basically infinite event logs: you have all these messages coming in, and the assumption is that it goes on forever. Tables are finite, with updates, deletes, and inserts on a primary key. Both can be represented as message queues, as Kafka topics, by just taking the changelog of a table: if you have a table, each action that modifies its content is a changelog entry in a stream. This allows you to have, for example, static lookup tables that you join with incoming data streams. One classic example is that you have customer events coming in and you want to look up customer information and enrich them.

Kafka Streams of course also allows stateful processing, and it stores its state locally in RocksDB by default; you can also plug in another store if you'd like. So you usually have RocksDB as a local state, and this state store is a key-value store: it's again some kind of table, which, as we saw before, can be seen as a changelog, and a changelog is a stream. So the way this state store is made available is that it's written back to Kafka as a changelog topic for your state store, and this is how Kafka Streams handles rebalancing and fault tolerance, because any other instance of your application can just read the state of your streaming application from that changelog and recover it. So instead of the checkpointing or lineage that you get in other streaming frameworks, you can really recover the full state of your application in case it goes down, or in case you want to launch additional instances of it. This allows you to scale your streaming applications up very seamlessly: you just launch another instance on a different server, for example; it connects to the consumer group, reads from the changelog, recovers the local state, and takes over.

Nowadays, in the new version, you can actually even query the local state. So depending on your use case, you no longer need the pattern where your streaming application writes the data back to Kafka, from Kafka you write it to some other kind of database, and from that database your application can finally read it. Instead you can query the state of your streaming application directly; you don't need additional infrastructure in between, which again allows you to build very lightweight applications that don't need a lot of additional tools and components.
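To make the stream/table duality concrete, here is a minimal sketch of the customer enrichment join in the DSL; the topic names are made up, plain strings stand in for real message types, and the surrounding setup (configuration and KafkaStreams startup) is the same as in the earlier sketch:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();

    // customer events, keyed by customer id: an infinite stream
    KStream<String, String> events = builder.stream("customer-events");
    // customer master data, keyed by customer id: a table materialized
    // from its changelog topic into a local RocksDB state store
    KTable<String, String> customers = builder.table("customers");

    // enrich each event with the latest customer record for its key
    KStream<String, String> enriched =
        events.join(customers, (event, customer) -> event + " | " + customer);
    enriched.to("enriched-events");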
A lot of tools and components around the Confluent Platform and Kafka make the implicit assumption that you're primarily using Avro. Tools like Kafka Connect, which is responsible for connecting Kafka to other systems like relational databases or Hadoop, usually work with Avro messages, same with the schema registry, which at the moment only works for Avro messages. So Avro usually is the natural choice if you start using Kafka: it doesn't require a lot of effort to implement, since you already get everything, and it's a great solution. But one downside is that you always need to have the correct schema the message was written with in order to be able to read that message reliably, and if you don't have that schema available for some reason, then all your data is just byte garbage that you can't really do anything with.

This is where Google Protocol Buffers come in. It's very similar to Avro in a lot of features: it's a binary message format, it has a defined schema, and it has support for lots of languages to read and write protobuf messages. But a big difference is that you can also read protobuf messages with a different version of the schema, or even with no schema at all, and still get quite some information out of them.

As an example, take one of the tutorial protobuf messages, the Person message. This Person message has several fields: a name and an ID as required fields, an email as an optional string, and then phone numbers as an embedded message, so you can also have hierarchies of protobuf messages. Each field basically consists of four parts: a modifier saying whether it's required, optional, or repeated; the type, like string or integer; the field name; and a field ID.

So this is one example Person, John Doe. With the full schema we get all the information: the name, the ID, the email, and the phone number. As a binary message it looks like this: you can still see all the strings in there, but everything else is just bytes that you can't really read as a human. But now we can also read this message back with a different schema version. Let's say we have an old schema that doesn't know about the email field yet. We still get all the fields that we know about, the name, the ID, and the phone, but additionally we get this field that we don't know anything about, the email, and there we just get a field ID instead of the field name. But even just by looking at it, we can kind of guess that this is supposed to be an email. So even though we have never heard of this field before, we can still get the information out of there and read it. And if we try to read the message without any schema at all, this is what we're going to get: we still have all the fields in the same order, we just lose the field names and only have the IDs of the fields. So even knowing nothing at all about the message, we can still recover quite a lot of information. This is basically the output of protoc --decode_raw: you can put any protobuf message in there and it will tell you what fields are in it. And this is the kind of thing that would be very hard to do with Avro.
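For reference, the tutorial Person message walked through above looks roughly like this; I'm using proto2 syntax because the example relies on the required and optional modifiers, and I've left out the phone type enum from the original tutorial:

    syntax = "proto2";

    message Person {
      required string name  = 1;   // modifier, type, field name, field ID
      required int32  id    = 2;
      optional string email = 3;   // an old schema without this field still sees
                                   // the data as an unknown field with ID 3

      message PhoneNumber {        // embedded message type
        required string number = 1;
      }
      repeated PhoneNumber phone = 4;
    }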
All right, another very cool feature are unknown fields. These are the fields that we don't know anything about, and not only can you read them, as I showed you, you can also pass them on. If you read a message and create a new one from it, then all the unknown fields found in the original message will be copied over to the new one, which allows you to do things like decorating messages, extracting common fields, or building envelopes without needing the schema information of your original message. You just put that message inside another message; it will take over all the fields and send them on, and you don't have to worry about breaking or destroying the original content: it will still be fully there.

So now, how can you use these two components together in a stream processing architecture? Generally you still have all the usual message producers, like logs, databases, APIs, and external sources. They all write their information in protobuf to your Kafka cluster, where you have streaming processors that enrich, join, filter, and aggregate your topics, all the usual stuff, and create new topics with the output, and that output can go to your various consumers, for example databases, reports, dashboards, other applications, machine learning, or storage in Hadoop. You can exchange schema versions through, for example, a git repository or some other central place where everyone can retrieve the definitions. The developers and engineers that write the input data provide their protobuf definitions; the engineers writing the streaming processors provide the definitions for their output while reading the ones from the upstream engineers; and the consumers just need to read the schema definitions to be able to parse everything correctly.

So how would that look in an example? Here we have three different systems that emit events to Kafka. For example, a web service writes log events from your website to a stream. In there we have a timestamp field, a session ID, and a type, like "user looked at a product" or "customer put a product in the shopping cart", and then we have the optional fields customer ID and product ID. From your customer database you get customer messages, which have a customer ID, a name, and an email, and your product database provides the product messages, with a product ID, a name, and a price. Then you have a streaming application joining these three topics and emitting a new message, a very simple RichEvent message that basically has three fields: one event field, one customer field, and one product field. So I actually don't need to write a lot of schema definitions, because I can just import the previous ones. The output of this streaming application is again a stream of these rich events in my Kafka cluster, and into that I can, for example, plug another application that does something like sessionization: it takes the session ID and groups by it, and my Session message is again very simple, it just consists of repeated RichEvent messages. This makes it very easy to define these things.
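A sketch of how these two output schemas could look; the file and message names are hypothetical, but the point is that the streaming applications mostly just import and embed the existing definitions:

    syntax = "proto2";

    import "log_event.proto";   // LogEvent: timestamp, session id, type, ...
    import "customer.proto";    // Customer: customer id, name, email
    import "product.proto";     // Product: product id, name, price

    // output of the join: it just embeds the three input messages
    message RichEvent {
      required LogEvent event    = 1;
      optional Customer customer = 2;   // only set for events with a customer id
      optional Product  product  = 3;   // only set for events with a product id
    }

    // output of the sessionization step: all rich events of one session
    message Session {
      repeated RichEvent events = 1;
    }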
Of course, at this step I can also do further aggregations, like including fields for counting how many different products a user has looked at or how many interactions they had with the website, and so on, but this is the basic, very simple version. And that is then something that can be read, for example, by machine learning algorithms for personalization or recommendation engines; you can put it in reports, or you can let your analysts do their ad hoc analyses on it.

Now, going back to this chart: the git repository, or whatever you're using, is basically just like the Avro schema registry that comes with the Confluent Platform, so what's the advantage here? It can always happen that not all the protobuf definitions are actually up to date. For example, a developer could have forgotten to commit their message definition, or the host of your git repository accidentally deleted their production database and is now trying to recover a backup. So it can always happen that this doesn't quite work and someone doesn't have all the schema versions they need. What happens in that case? Let's assume, for example, that in your product database someone adds a new field for the color of a product and starts writing that to the Kafka product stream. Now your streaming application is processing these events and putting them in the RichEvent message, and all that happens is that the product now contains an unknown field, the color, but everything else just keeps on working. Even further down it causes no trouble at all: my Session will still contain these rich events, these rich events will all contain a product with an unknown field, and even at the end the applications can still read these messages. And if a BI analyst or a data scientist looks at the messages at the end, they can find that unknown field, look at the value, and might even be able to determine that it's supposed to be a color; maybe they were even the one who wrote the ticket for the engineer to put the field in there in the first place. So the whole pipeline basically still works the same way.

And similarly, if you just delete a field, like the price here: for protobuf there's no difference between a field that simply doesn't exist and an empty field with a null value; both are simply not included in the protobuf message. So all you're going to have now is a product that contains a null price, and since the price was marked as optional from the very beginning, hopefully the engineers writing the streaming application put in logic to handle null prices already, so your streaming applications will continue running uninterrupted.

So in general, what does this allow you to do? It means that the different teams that are responsible for producing the data, processing the data, and consuming the data can all move at their own speed; no strict alignment of releases is necessary. If you have a traditional OLTP system that feeds into a data warehouse and you want to do schema changes across that system, with several MySQL servers involved for example, it's very, very painful, because you need to make sure that each of the queries takes the change into account. But here we can really do hands-off data engineering: the people in the middle writing the streaming processors don't need to do anything, they just read the data and write it to the output, everyone can upgrade to new schema versions at their own speed, and only the actual producers and consumers need to align on including new information in the messages; the pipeline in between just forwards it without needing to know about it.

Of course, not everything is great about protobuf; there are also some downsides compared to Avro. You kind of need to handcraft your schemas and be very considerate about them: you cannot reuse your field IDs, that's the important thing, you really need to stick to them, and you also need to think about what you mark as required and optional, because once something is required, you basically can never change it. There are also fewer implementations around the Kafka and Hadoop ecosystem and fewer people using it, so you will find that you often have to write the serializer and deserializer yourself, whereas for Avro you find enough existing implementations. A big issue is that Google kind of wants to remove these unknown fields in proto3; there's a very long discussion going on in the related issue on GitHub. And one more thing is that protobuf messages are usually one or two bytes larger than Avro messages; nowadays I don't think that's a really big deal, but for some people that's an important factor.
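As a side note on the field ID problem: recent protoc versions let you block old IDs and names explicitly with the reserved statement, so accidental reuse fails at compile time; the field numbers here are made up for illustration:

    message Product {
      reserved 3;           // the deleted price field; this ID must never be reused
      reserved "price";     // block the old name as well

      required int64  product_id = 1;
      optional string name       = 2;
      optional string color      = 4;  // the newly added field
    }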
All right, thank you. I hope you've learned something about Kafka Streams and protobuf and maybe got some inspiration yourself. Do you have any comments or questions?

So the question was: if I only have two servers in my Kafka cluster and one of them goes down, how can I make sure that I don't lose any data and that things keep on running? Well, if you set all your topics to have a replication factor of two, it means each of your partitions should exist on both servers, so they would basically be identical; both would have the full data set. If one of them goes down, the other one still has everything, and the producers would then all go to the one running Kafka server and write everything there, and the consumers would also start reading only from that one. Did that answer the question?

I mean, usually the Kafka cluster sends an ack signal back to the producer once it has received the message and processed it, and you can also define when it does that, so you can say: only acknowledge a message once it has been replicated on both machines. That way you can make sure that nothing goes missing. There's actually a talk by Gwen Shapira about when messages absolutely have to be there; if you want very strong guarantees that your messages arrive, I recommend looking that up, it's very helpful on that topic.
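As a sketch of the acknowledgement setting mentioned in that answer, this is standard Kafka producer configuration; the broker addresses and topic name are made up:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AckExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // "all" means the leader only acknowledges once all in-sync replicas
            // have the message; combined with min.insync.replicas=2 on the topic,
            // this guarantees both servers have a copy before the producer's ack
            props.put("acks", "all");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("events", "key", "value"));
            producer.close();
        }
    }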