But I hope that you're here to listen to me talk about Kafka, because that's the room that you are in. So yeah, first things first, my name is Stella Cotton. I am an engineer at Heroku, and like I said, I'm going to talk to you today about Kafka. You might have heard that Heroku offers Kafka as a service. We've got a bunch of hosted plans, from tiny plans to giant plans, and we have an engineering team that's strictly dedicated to doing cool stuff to get Kafka running on Heroku at super high capacity. I am not on that team. If you were here to see that talk, this is the wrong talk. I don't actually know anything about running a Kafka cluster or tuning Kafka to handle super high load.

So who am I? I am a regular Rails engineer, like many of you. I wasn't actually familiar with Kafka at all when I joined Heroku. It was this mysterious technology that suddenly seemed to be everywhere: all these hosted Kafka solutions, not just on Heroku but at other providers, and Kafka-like systems such as Kinesis, just sprang up. It seemed important, but I wasn't sure why. And then when I joined Heroku, I was suddenly in a world where not only is Kafka an important part of our product offering, it's actually a really integral part of our system architecture overall.

So today I'd like to talk about three areas that I hope will help other Rails engineers become more familiar with Kafka. We're going to start with what Kafka is, then talk about how Kafka can power your services, and then cover two practical considerations and challenges that were unfamiliar to me when I started using event-driven systems.

So what is Kafka? The docs on kafka.apache.org, which are actually super good by the way, describe it as a distributed streaming platform. But that doesn't really mean a lot to me as a web developer. One of the classic use cases people talk about for Kafka is data flow. If you're running an e-commerce website, for example, you want to know more about what your users are doing on that platform. You want to track each page they visit, each button they click. But if you have a lot of users, this can be a lot of data. And if we wanted to be able to send that data from our web applications to a data store that our analytics team uses, how could we record and stream that high volume of data? One way is to use Kafka.

The basic data structure that powers Kafka is the idea of an append-only log. And when you think about logs, what do you think about? For most web developers, that's going to be an application log. When something notable happens in our web applications, we log it in chronological order, appending each record to the record prior. And once that record is persisted in the application log, it's going to be there indefinitely, until we truncate earlier versions of our logs. In a similar fashion, Kafka is also an append-only log. Kafka has an idea of producers, which are applications that produce log events. You can have one producer of events, or you can have multiple producers of events. But unlike an application log, which is typically written so that you as a human can read it later with your eyeballs, in the Kafka world applications are going to be the consumers of these events.
Like with producers, you can have one consumer or you can have a bunch of consumers. And a big question for web developers is often: what's different about Kafka compared to something like Sidekiq or Resque? In Sidekiq and Resque, a job is typically added to a queue, and once something picks it off to actually do the work, it disappears. In Kafka, it doesn't matter how many consumers are reading these events; those events continue to persist for other consumers to consume until a specific retention period is over.

So let's go back to our original example, our e-commerce app. We want this e-commerce application to create an event each time a user does something on the platform. We write each of these events, or records, which is what Kafka calls them, to a user event log. And Kafka calls a log a topic, basically. If you have multiple services, they can all write events to this user event topic. Each of these Kafka records that we write is going to have a key and a value, kind of like a hash, plus a timestamp. And Kafka is not going to do any validation on this data; it's just going to pass along binary data no matter what format it's in. In this scenario we're using JSON, but you can use a lot of different kinds of data formats. And the communication that happens between the clients writing to Kafka and reading off of Kafka happens over a persistent TCP socket connection, so there isn't a TCP handshake for every single event.

So how can our Rails app specifically interact with the Kafka cluster? We can use one of a few Ruby libraries. ruby-kafka is a lower-level library for producing and consuming events. It gives you a lot of flexibility, but it also requires a lot more configuration. If you want a simpler interface without a lot of boilerplate setup, there's DeliveryBoy and Racecar, which are also maintained by Zendesk, and also Phobos. These are wrappers around ruby-kafka that abstract away a lot of that configuration. You've also got Karafka, which is a different standalone library, and similarly to ruby-kafka it has a wrapper called WaterDrop, based on the same implementation that we saw before with DeliveryBoy. My team actually uses a custom gem that predates ruby-kafka, so I haven't used these in production, but ruby-kafka is the gem that Heroku recommends you use if you look at our Dev Center documentation.

So, a brief look at how you can use DeliveryBoy. We're going to use the simple version, which, like I mentioned earlier, is built on top of ruby-kafka, and we can use it to publish events to our Kafka topic. First, we install the gem, the usual, run a generator, and it generates a config file that will probably look pretty familiar if you use databases. DeliveryBoy is meant for getting things up and running very, very quickly, so you don't have to do any other configuration except the brokers: give it the address and the port, which might be localhost if you're running it locally. The DeliveryBoy docs will tell you how to configure more things, but this is the MVP. And using DeliveryBoy, we can publish an event, outside the thread of our web request, to a user event topic. You can see we passed in user event as the topic.
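To make that concrete, here's a minimal sketch of what that DeliveryBoy setup and publish call might look like. The topic name, the event payload, and the controller context are placeholders I've made up; check the DeliveryBoy README for the exact generator and config options.

```ruby
# Gemfile
gem "delivery_boy"

# `rails generate delivery_boy:install` creates config/delivery_boy.yml, where
# the only thing you really have to set is the broker list, for example:
#
#   development:
#     brokers:
#       - localhost:9092

# Then, somewhere in the app (say, a controller tracking a page view),
# deliver_async hands the event off to a background producer thread instead
# of blocking the web request. The payload and topic name are made up here.
event = { user_id: current_user.id, name: "page_view", path: request.path }
DeliveryBoy.deliver_async(event.to_json, topic: "user_event")
```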
And each of these topics can be made up of one or more partitions, and in general you want two or more partitions. Partitions are a way to split up the data that you're sending to a topic, and that allows you to scale the topic up as you have more and more events being written to it. You can have multiple services writing to the same partitions. It's the producer's job to say which partition to send an event to, and DeliveryBoy will help you balance events across those partitions. You can let it assign your event randomly to a partition, or you can give it a specific key.

And why would you want to give a partition key? It's so that specific kinds of events go to a single partition, because Kafka only guarantees that events are delivered in order, like in our application log, within a partition, not across a whole topic. So for example, if you want to make sure that the events related to a specific user all go to the same partition, you pass in the user ID there. That way you can make sure that every event related to a user shows up in order, so it doesn't look like they're clicking all around the website out of order because their events went to different partitions. Under the hood, ruby-kafka uses a hashing function: it converts the key into an integer and mods it by the number of partitions. So as long as your partition count stays the same, you can guarantee the event always goes to the same place. It gets a little tricky if you start to increase your partitions.

So now we're writing events to our user topic, and if you have a Kafka cluster running, that's really the only thing we need to do on the producing side. To consume those events, you can use another gem called Racecar. Same thing, a wrapper around ruby-kafka maintained by Zendesk, and the same very small amount of upfront configuration, just a generator. Its config file will look familiar because it's pretty much the same, but this will also create a folder called consumers in your application. An event consumer subscribes to our user event topic with that subscribes_to method at the top, and this one is just going to print out any data that it receives. We run that consumer code inside its own process, so it runs separately from the process that runs our web application.

In order to consume events, Racecar creates a consumer group, which is a collection of one or more Kafka consumers reading off of that user event topic. Each consumer inside that consumer group is assigned one or more partitions, and each of them keeps track of where it is in those partitions using an offset, which is like a bookmark, a digital bookmark. And the best part is that when one consumer goes away, if it fails, its partitions get reassigned to other consumers. So as long as you've got at least one consumer process running, you've got a good availability story.
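Roughly, the partition-keyed publish and the generated consumer might look like the sketch below. The class, topic, and payload names are placeholders, and the hashing detail is ruby-kafka's concern rather than something you write yourself.

```ruby
# Producer side: events for the same user always land on the same partition,
# because ruby-kafka hashes the partition key and mods it by the partition
# count. (That only stays stable while the partition count stays the same.)
DeliveryBoy.deliver_async(event.to_json, topic: "user_event", partition_key: user.id.to_s)

# app/consumers/user_event_consumer.rb, run with `bundle exec racecar UserEventConsumer`
class UserEventConsumer < Racecar::Consumer
  subscribes_to "user_event"

  def process(message)
    # message.value is just the bytes we produced; Kafka doesn't care that
    # they happen to be JSON.
    puts JSON.parse(message.value)
  end
end
```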
So we talked a little bit about what Kafka is under the hood; let's talk more about the technical features that make it valuable. One is that Kafka can handle extremely high throughput. One of the key performance characteristics that I thought was pretty interesting is that Kafka does not use the message broker, meaning the Kafka cluster itself, to track where all the consumers are. A traditional enterprise queueing system, like AMQP, is actually going to have the event infrastructure itself keep track of all those consumers. And so as your number of consumers scales up, your event infrastructure has more and more load on it, because it has to track larger and larger state.

Another thing is that consumer acknowledgement is actually not trivial. It's not easy for the broker itself to know where the consumers are. Do you mark a message as processed as soon as it gets sent over the network? Well, if you do, and the consumer is down and can't process it, how does the consumer say, hey, event infrastructure, can you actually resend that? You could try a multi-acknowledgement process like with TCP, but that's going to add performance overhead.

So how does Kafka get around this? It just pushes all of that work out to the consumer itself. The consumer service is in charge of knowing where it is, what bookmark it's at, in that ordered commit log. And this is kind of cool because it means that reading and writing event data is constant time, O(1). The consumer either has an offset and says, give me this specific place in the log, or it wants to start at the very beginning and read all the events, or it wants to start at the very end. There's no scanning over large sets of data to figure out where it needs to be, so the amount of data that exists doesn't change the lookup time. Kafka is going to perform similarly whether you have a very small amount of data in your topics or a large amount.

It also runs as a cluster on one or more servers. It can be scaled out horizontally by adding more machines, and it's extremely reliable: data is written to disk and then replicated across multiple brokers. So it has this scalable and fault tolerant profile. If you'd like to know more about the actual data distribution and replication side of Kafka, this is a pretty good blog post. And by the way, the slides will be on the website, and I'll tweet them out at the end, so don't bother trying to write down these links, because that would be a little complicated. And to give you a more concrete number around what scalable looks like, Netflix, LinkedIn, and Microsoft are literally sending over a trillion messages per day through their Kafka clusters.

So Kafka is cool. It's awesome. It can get data from one place to another. But we're not in a data science track; we're in a track about services. So why do you care about this technology in the context of services? Some of the properties that make Kafka valuable for event pipeline systems also make it a pretty interesting fault tolerant replacement for RPC between services. If you aren't familiar with the term RPC, there are a lot of different arguments about what it means, but it stands for remote procedure call, and TLDR, it's shorthand for one service talking to another service using a request, like an API call.

So what does this mean in practice? Let's go back to our example of the e-commerce site. A user is going to place an order, and that's going to hit the create API in our order service. When that happens, you want to create an order record, charge the credit card, and send out a confirmation email. In a monolithic system, this is usually going to be a method call with blocking execution; you might be using Sidekiq or something to handle sending the email.
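That slide isn't in the transcript, but the monolithic version being described is roughly this shape; the class and method names here are invented for illustration.

```ruby
# Everything happens inline, in one blocking method call, except the email,
# which goes through ActiveJob/Sidekiq.
class OrdersController < ApplicationController
  def create
    order = Order.create!(order_params)            # create the order record
    PaymentProcessor.charge!(order)                # charge the credit card (hypothetical service object)
    OrderMailer.confirmation(order).deliver_later  # send the confirmation email
    render json: order, status: :created
  end
end
```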
But as your system grows more and more complex, you might start extracting these out into their own services, and you can use RPC, remote procedure calls, to talk between those services. So what are some of the challenges you might find with using RPC? One is that the upstream service is responsible for the downstream service's availability. If the email system is having a really bad day, the upstream service is responsible for knowing whether or not that email service is available, and if not, retrying any failing requests or failing altogether.

So how might an event stream help in this situation? In this event-oriented world, an upstream service, our orders API, is going to write an event to Kafka saying that an order was created. Because Kafka has an at-least-once guarantee, that event is going to be written to Kafka at least once and will be available for a downstream consumer to consume. So if our email service is down, that event is still there, that request is still there, and when the downstream consumer comes back online, it can pick back up. It can use its bookmark and say, I need to pick back up from XYZ, and continue to process those events in order.

Another challenge that larger and larger organizations find is coordination. In increasingly complex systems, integrating a new downstream service means a change to an upstream service. If you'd like to integrate a new fulfillment provider, for example, that's going to kick off a fulfillment process when an order is created. In an RPC world, you need to change the upstream service to make an API call out to your new fulfillment service. In an event-oriented world, you would add a new consumer inside the fulfillment service that consumes the order created topic.

So there are some upsides. What's the big downside? In our first example, the dependency between those services is super clear. You look at that method and you see, oh, this, this, and this. The upstream service knows exactly which downstream services depend on knowing that the order was created. By abstracting away that connection, you gain speed, you gain independence, but you do sacrifice clarity.

So you've got Kafka. You're comfortable with the trade-offs. You understand what you're getting into. You'd like to start incorporating events into your service-oriented architecture. What might this look like from an architectural perspective? Martin Fowler discusses how, when people talk about event-driven applications, they can actually be talking about incredibly different kinds of applications. I found this personally to be true even inside of Heroku, and it can be a big pain point when you're having a discussion about challenges and trade-offs in these kinds of systems. So he's trying to bring a shared understanding to what an event-driven system is, and he's outlined a few architectural patterns that he sees most frequently. I'm going to zoom through these, but you can learn more on his website, or even better, watch the keynote he gave at GOTO Chicago that covers this in depth.

The first pattern he talks about is the event notification pattern. This is the absolute bare minimum, simplest event-driven architecture: one service simply notifies the downstream services that an event happened. The event has very little information; it's just saying an order was created at this time. If the downstream service needs more information about what happened, it's going to need to make a network call back up to the order service to fill that information out.
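As a sketch, an event notification might carry nothing more than an identifier and a timestamp, and the consumer calls back for everything else. The topic, field, client, and mailer names here are all hypothetical.

```ruby
# Upstream: the orders API just announces that something happened.
DeliveryBoy.deliver_async(
  { event: "order_created", order_id: order.id, created_at: Time.now.utc.iso8601 }.to_json,
  topic: "order_created"
)

# Downstream: the email service (or a brand-new fulfillment service) subscribes
# without the orders API knowing or caring that it exists.
class OrderCreatedConsumer < Racecar::Consumer
  subscribes_to "order_created"

  def process(message)
    event = JSON.parse(message.value)
    # With a bare notification we have to call back up for the details.
    order = OrderServiceClient.fetch(event.fetch("order_id"))
    OrderMailer.confirmation(order).deliver_later
  end
end
```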
The second pattern he talks about is event-carried state transfer. In this pattern, the upstream service is actually going to augment that event with additional information, so that your downstream consumer can keep a local copy of that data and not have to make that network call. This can be super straightforward when everything the downstream service needs is encapsulated inside the event generated by the upstream service, which is awesome. But one of the challenges here is that you might need data from multiple systems. So for example, our order service creates an order and writes the event, and the fulfillment service consumes that event. The fulfillment service needs certain details about the order, which is totally fine, because we have that order information and it's passed along inside the event. But if it also needs to talk to a customer service, for example, to know who to send a package to, it either needs to make a network call to actually retrieve that information, like we saw earlier, or it needs to find a way to persist a local copy of the data that it needs. And the idea is that you can also consider building that local copy off of events that the customer service writes. So the fulfillment service consumes a separate set of events from your customer service that it can then persist locally and join against inside its own database.

A third pattern that Fowler talks about is event sourced architecture. This takes the idea of an event-driven system even further. Not only is each piece of communication between your services kicked off by an event, but by storing every single event, or storing a representation of every single event, and replaying all of those events, you could drop all your databases and completely rebuild the state of your application as it exists in this moment, just by replaying that event stream. Fowler talks about an audit log being a good use case for this. He also talks about a really interesting high-performance trading system, which is not really something that's relevant to my interests, but might be relevant to yours. But having an audit log of everything that happens, and actually relying on it to rebuild your application state, are two very different levels of technical commitment, like seriously different. There are additional challenges that come into play when you need to be able to recapture that state of the world.

One is code changes. I worked on a payment system in the past, and a big challenge we had was that recalculating prior financial statements depended on business logic that was hard-coded in the application at the time, and that logic had since changed. So if you tried to recalculate all the money that you needed to send out to users based on the state of the world that day, you might actually get different values, and that's a problem. That was not an event-driven system, so this is an issue even in regular systems, but it's something you really need to understand when you're relying on the ability to rebuild your state of the world from events. In a similar fashion, another challenge is that the state of the world might not be the same for third-party integrations. That might have changed, and an API call out to a third-party provider, or even an internal one, might return a different set of data.
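To make the replay idea, and the hard-coded business logic problem, a little more concrete, here's a toy sketch that isn't from the talk; the event shapes and the projection class are invented.

```ruby
# Rebuild an order's state purely by replaying its event stream from the start.
class OrderProjection
  TAX_RATE = 0.08 # hard-coded business logic: if this rate changes, replaying
                  # old events recalculates history with today's rate rather
                  # than the rate that actually applied at the time.

  attr_reader :total

  def initialize
    @total = 0.0
  end

  def apply(event)
    case event.fetch("type")
    when "item_added"   then @total += event.fetch("price")
    when "item_removed" then @total -= event.fetch("price")
    when "order_taxed"  then @total *= (1 + TAX_RATE)
    end
    self
  end
end

# `events` here stands in for reading the order's topic from offset zero.
state = events.reduce(OrderProjection.new) { |projection, event| projection.apply(event) }
```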
Fowler talks about some more strategies for handling these issues. It's not an insurmountable task, but it is a big deal.

And the final pattern that Fowler talks about is command query responsibility segregation, or CQRS, which, if you've ever talked to anybody about event-driven architectures, I would say 50% of the time they think you're talking about this, and you may or may not be. So you might be familiar with CRUD, which is a way of encapsulating the logic for creating, reading, updating, and deleting a record in a database inside your service. It's at the heart of Rails controllers; this is very Rails-y. The idea of CQRS is that instead of thinking about all of those things as existing in the same domain, you can actually split them out, so the service that writes to your system and the service that reads from your system are separate. At its simplest, one service is responsible for writing orders, so any method or API call that feels like an order.create or an order.update_taxes, whatever, is gonna go to your order updater service; it goes one direction. And reading methods or API calls, like getOrderById, live in another service. This isn't just an event-oriented architecture thing, you can do it with API calls, but in event-oriented architectures you'll often see event systems nestled in the diagrams, usually at the place where commands are actually written. The writer service, this command handler, reads off of the event stream, processes those commands, stores them to a write database, and then any queries happen against a read-only database.

So when would you wanna use it? CRUD is much simpler, and separating out read and write logic into two different services really does add a lot of overhead. The biggest reason I think people use it is performance: if you're in a system where you have a large difference between reads and writes, or a system where your writes are super, super slow compared to your reads, you can optimize performance separately for those two sides. But it's an increase in complexity, so you really need to understand the trade-offs. I found this blog post to be pretty good and succinct; like the title says, when to use and not to use CQRS. And if you're thinking about implementing any of these patterns in your system, take a look at Fowler's blog post and watch his keynote to get a really deep dive into the trade-offs and considerations for these.
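A very rough sketch of what that CQRS split might look like; the topic, model, and service names are assumptions for illustration, and there are plenty of other ways to wire up the read side.

```ruby
# Write side: a command handler reads commands off the event stream and
# writes to the write-optimized database.
class OrderCommandHandler < Racecar::Consumer
  subscribes_to "order_commands"

  def process(message)
    command = JSON.parse(message.value)
    case command.fetch("type")
    when "create_order" then Order.create!(command.fetch("attributes"))
    when "update_taxes" then Order.find(command.fetch("order_id")).update!(tax: command.fetch("tax"))
    end
  end
end

# Read side: queries only ever touch a read-only copy of the data, for example
# a replica or a separate read-optimized store populated from the write side.
class OrderQueries
  def get_order_by_id(id)
    ReadOnlyOrder.find(id) # hypothetical model pointed at the read database
  end
end
```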
So we talked about what Kafka is, and we talked about how this technology can change the way your services communicate with each other. Next, let's talk about two practical considerations that you might run into when integrating events into your Rails applications. It's definitely not a comprehensive list, but these are two things that really surprised me when I got started.

The first thing to consider is slow consumers. The most important thing to keep in mind in an event-driven system is that your service needs to be able to process events as quickly as the upstream service produces them. Otherwise you're gonna drift further and further behind, and you may not realize it, because you're not seeing timeouts, you're not seeing API calls fail. You're just slower. You're just at a different state of the world than everybody upstream. Although you might start to see timeouts too: the one place you will see them is on your socket connection with the Kafka brokers. If you're not processing events fast enough and completing that round trip, your socket connection can time out, and re-establishing it has a real time cost. It's expensive to create those sockets, so that can add latency to an already strained system, making things worse.

So if your consumer is slow, how do you speed it up? There are a lot of talks at this RailsConf, and at prior RailsConfs, about performance in Rails. But a more Kafka-specific answer is that you can increase the number of consumers in your consumer group so that you can process more events in parallel. What does that look like in practice? You remember that in Racecar you run a consumer process by passing in a class name, like UserEventConsumer. If you're using a Procfile with something like Foreman, or on Heroku, you can actually start Racecar with the same class multiple times, and each of those processes automatically becomes an individual consumer that joins the same consumer group. You want at least two of these consumer processes running, so that if one goes down, its partitions get reassigned to the other. But it also means you can parallelize work across as many consumers as you have topic partitions. Of course, as with any scaling issue, you can't just add consumers forever; eventually you're gonna hit scaling limits on shared resources like databases. So you can't scale forever, but it will give you a little bit of wiggle room. And if you don't actually care about the strict ordering inside those topic partitions, you can use a queuing system like Sidekiq to help manage work in parallel and take a little bit of pressure off the system.

It's also extremely valuable to have metrics and alerting, like paging, around how far behind you are from when an event was added to the queue. ruby-kafka is instrumented with ActiveSupport::Notifications, and it also has statsd and Datadog reporters included, so you want to know if you're drifting a certain amount behind from when the events were added. And the ruby-kafka documentation, which is also super good, has a lot of suggestions for what you should be monitoring in your Rails systems.

Going one step further: on my team, because we use Kafka for a lot of our services, before we put a new service into production I will run game day failure scenarios on staging, but consuming off the production stream, where we'll move our offset back by a certain amount to pretend our service has been down for a day or two and see how long it takes us to catch up. Because if you have a serious outage on your downstream service, you're gonna want to know how much wiggle room you have, and to talk about what failure modes are acceptable for your service. If you're running an order service, for example, you might want to optimize for finishing those orders no matter what; it might take you four hours, it might take you a day, but you want it to be done. But if you're running a chat bot service, users are gonna be extremely confused if four hours later they start seeing things appear inside of their chat. So you'll wanna talk about what it means to process data at a delay.
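A sketch of the scaling and monitoring side we just walked through. The Procfile layout is just one way to run multiple Racecar processes, and the instrumentation event name and payload keys below are my assumption from memory, so double-check them against the ruby-kafka docs.

```ruby
# Procfile (Foreman or Heroku): one web process plus a consumer process type
# you can run multiple copies of; each racecar process joins the same group.
#
#   web:      bundle exec puma -C config/puma.rb
#   consumer: bundle exec racecar UserEventConsumer
#
# Locally, for example: foreman start -m web=1,consumer=2

# config/initializers/kafka_instrumentation.rb
# Log roughly how far behind the producer we are for every processed message.
ActiveSupport::Notifications.subscribe("process_message.consumer.kafka") do |*, payload|
  if payload[:create_time]
    lag = Time.now - payload[:create_time]
    Rails.logger.info("kafka consumer lag=#{lag.round(1)}s topic=#{payload[:topic]} partition=#{payload[:partition]}")
  end
end
```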
Also, a big thing people talk about with Kafka is exactly-once versus at-least-once delivery. You can actually design a Kafka system to have an exactly-once guarantee with newer versions of Kafka. That means consumers could assume that a message in a Kafka topic has only been sent once; there's never a chance that a message won't be sent at all, or that it'll be sent more than once. Confluent has a blog post that talks a little more about this. But the reality is that with ruby-kafka, you should assume your messages will be delivered at least once: they're gonna be there, but you might see them more than once. That's really important, and one thing it means is that you need to design consumers to expect duplicated events. So you can either rely on your database and use an upsert for idempotency, or... yes, thank you so much, got the dance in too. All right, so design your consumers for failure. Rely on an upsert and lean on your database, or include a unique identifier in each event and just skip events that you've already seen before. Either of those things will make your application more resilient in the face of failure, because you can move an offset back and replay events without worrying that you're gonna do the same work twice.
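A small sketch of both approaches in one consumer; the table, model, and event fields are hypothetical, and the upsert leans on a unique index plus ActiveRecord's upsert, which is only one of several ways to get idempotency.

```ruby
class UserEventConsumer < Racecar::Consumer
  subscribes_to "user_event"

  def process(message)
    event = JSON.parse(message.value)

    # Option 1: upsert keyed on a unique column, so replaying the same event
    # overwrites the same row instead of creating a duplicate.
    UserEvent.upsert(
      { event_id: event.fetch("event_id"), user_id: event.fetch("user_id"), name: event.fetch("name") },
      unique_by: :event_id
    )

    # Option 2: record IDs you've already processed and skip duplicates.
    # return if ProcessedEvent.exists?(event_id: event.fetch("event_id"))
  end
end
```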
And the second thing that was really surprising to me was Kafka's very permissive attitude towards data. You can send anything in bytes and it's gonna send it back out; it's not gonna do any verification. This feature makes it extremely flexible, because you don't need to adopt a specific serialization format to get started. But like we talked about before, freedom isn't free; freedom is a blessing and a curse. What happens when a service upstream decides to change an event that it produces? If you just change that event payload, there's a really good chance that one of your downstream consumers is gonna break, and your colleagues are gonna be really upset. They're gonna start seeing exceptions, it could take down their service, and you have no idea, because you didn't really know that anybody was using that data.

So before you start using events in your architecture, choose a data format, preferably one, not multiple, and evaluate how that data format can help you register schemas and evolve them over time. It's an issue in RPC systems as well, but the explicitness of those calls means that somebody changing an upstream service has a built-in prompt to think, oh wait, somebody might actually be using this information, especially compared with an outgoing event. Schema registries can provide a centralized view of the schemas that are in use in your system. And it's just much easier to think about validation and about evolution of schemas before you build all of this event-driven architecture than afterwards.

There are a lot of different data formats, but I'm gonna talk about two: JSON, which most people in this room are probably familiar with, it's the format of the web, and Avro, which might be new to some of you. And I know there are probably at least two Australians in the audience; there's like a 75% chance I'm gonna slip up on the next two slides and call it arvo, but I do not mean afternoon, I mean Avro. So just ignore me.

So JSON's a pretty common data format. Pros: it's human readable, which is super nice, as a human I really appreciate that, and there's JSON support in basically every language. But there are a couple of cons. The actual size of JSON payloads can be really large. It requires you to send keys along with the values, which is nice because it's flexible, it doesn't matter what order things come in, but those keys are also the same for every payload. So when you send them over, it can mean that those payloads are larger than in other formats. If size is a big concern for you, JSON could be a bad fit. There's also no built-in documentation inside the payload, so you just see a value but you don't actually know what it means. And schema evolution is challenging: there's no built-in support for aliasing one key to another if you want to rename a field, or for setting a default value, so that you can evolve over time.

So Confluent, they're the team that built Apache Kafka, and they now have a hosted Kafka product. They recommend using Avro. Like many other protocol formats, it's not human readable over the wire; you have to serialize and deserialize it, because it's sent as binary. But the upside is that there is super robust schema support. A full Avro object is sent over, and it includes both the schema and the data. So it has support for types: primitive ones like ints, complex ones like dates. It includes embedded documentation, so you can actually say what each of those fields does or means in your system. And it has some built-in tools that help you evolve your schemas in backwards-compatible ways over time, which is pretty cool. avro-builder is a gem created by Salsify that gives you a very Ruby-ish DSL for creating these schemas. So if you're curious to learn more about Avro, the Salsify engineering blog has a really great write-up on Avro and avro-builder.

So that is it for my talk today. I have a couple more slides, though, so don't start leaving, because I always feel really weird when people go "and thank you" and then they have a couple more slides. If you're curious to learn more about Heroku, how we run our hosted Kafka products, or how we use Kafka internally, we do have two talks that my coworkers Jeff and Paolo have given that you can watch. And if you're interested in Postgres, related to Kafka but different, my coworker Gabe is giving a talk literally in the next talk slot, in room 306, about Postgres 10, performance, and you. And then finally, I will be at the Heroku booth today, after this and during lunch. If you wanna come by and say hi, you can ask me questions, you can get swag, we've got t-shirts and socks, and we also have some jobs, you know, the usual. So yeah, thank you so much.