Hello, everyone. Welcome to this conference. I hope you're looking forward to a fantastic day. It's my first time speaking at Open Source Summit, and I'm really excited about it. Today, I'm going to talk about Apache Kafka and the rise of a streaming platform. But before I begin, can I get a quick show of hands of people who have heard about Apache Kafka or used it? All right, that's most of you. It's still going to be a relevant talk. I promise you that.

I've been working in this real-time data and stream processing space for close to a decade now. And during these years, one of the biggest shifts I've noticed is the rise of real-time data in production. And this is happening as part of companies becoming more digital. As part of my day job, I get to talk to companies that are natively digital, lots of them back in Silicon Valley. But I also get to talk to brick-and-mortar businesses that are transforming themselves into digital ones. And one of the trends I've noticed is that the modern enterprise is moving to being digital. And this is the real impetus for the rise of real time.

So in my talk today, I want to put forth a simple thesis. I think we're witnessing the rise of a major new category of infrastructure software. And this new way of thinking about your data is not something we've had to do very often. If you've been in the industry long enough, you've seen this kind of thing happen with a number of major technologies, like database systems that shaped a whole category of business applications, and data warehouses that shaped a whole category of business intelligence and analytics apps. I think this whole area of stream processing is going to be like that. We are watching the emergence of another category of infrastructure software, which is the streaming platform.

But then what's the evidence that this is happening? Let's take a look at Kafka's journey so far. It started back in Silicon Valley several years ago. We created Kafka.
We open sourced it. And very soon, a growing list of the most technically sophisticated Silicon Valley companies started to rebuild their architecture around Kafka. And after that, it turned out that this wasn't just a phenomenon limited to the Silicon Valley crowd, like a few things we do there.

To give you an example, one of the things that we hadn't thought about when we created Kafka was this whole space of the Internet of Things. I think Jim mentioned that before. When I first heard about this, I was a bit skeptical. But then it turns out that companies are doing amazing things: connecting cars around the world, capturing streams of events from the car, and feeding those streams back into the features of those cars, into apps that go along with the car, and into analytics about the customer base. And this sort of thing is happening in a lot of industry verticals, from manufacturing all the way to logistics. Having been in a web tech company myself, I have a pretty good idea of what it means when you can instrument your business at that level and optimize it.

Not only is this happening in the Internet of Things space, but another industry that's undergoing this is the whole financial services and banking space. Now, this is an entire industry that's built on real-time data. But having a modern distributed platform like Kafka is making a lot more things possible. Being able to break down the silos that exist in these banks and get data flowing in real time, and breaking the monolith into event-driven microservices: this transformation is making a lot more applications possible, from instant credit card processing to real-time fraud detection, and so on.

Even traditional businesses like the retail industry are adopting Kafka aggressively to compete better and become more efficient. And today, we know that about a third of the Fortune 500 uses Kafka in mission-critical applications. And this includes the top banks, insurance, and travel companies.
And so we all know that we are onto something here. I think all this data points to the emergence of this new phenomenon, the streaming platform. This is the thing that's powering these use cases. This is the architecture of these technically savvy companies. I think this is destined to be a major infrastructure platform that will exist in every company in the world.

But then what's the role of this streaming platform? What is it supposed to do in a company? The role of a streaming platform is to sit at the center of a company, interconnect all your microservices, capture streams of events from applications, connect your data systems, and do all that in real time and at global scale. This streaming platform is what gives a company a central nervous system that captures everything happening in your business in real time.

But then what does it look like? There are a few core technical capabilities you need to have around streams of events. The first is the ability to publish and subscribe to streams of data. We've had messaging systems that have done this for a long time now. The second, and I think the real difference now, is the ability to store these streams of events and do that in a distributed, replicated manner at scale. And the final capability is the ability to process these streams of events, to act as a center for stream processing. What started off as a messaging system has today evolved into a full-fledged distributed streaming platform that embodies all of these characteristics.

So now, switching gears a little bit. When people encounter this new idea of a streaming platform, like some of you might be doing now, I think they come from a variety of different backgrounds. And each background leads them to view this technology in a slightly different way. The first lens is the enterprise messaging lens.
There has been a whole category of software around real-time delivery of messages between applications. And many people would think of Apache Kafka and, more broadly, a streaming platform as just an evolution of messaging. Messaging done right, if you will. I think there is some truth to this view. A streaming platform does support many of the core applications that enterprise messaging systems support. And there are lots of initiatives out there to replace these enterprise messaging systems with Apache Kafka. However, thinking of Kafka as just a messaging system is overly limiting. There are at least three big differences.

The first is that a streaming platform is built on a modern distributed systems foundation and can scale to the scope of an entire company, whereas enterprise messaging systems were built to support a handful of applications. Now, this might seem like a small difference, but it completely changes the nature and the scope of what this platform is supposed to do for your company. It isn't just that Kafka can handle more messages than enterprise messaging systems. It's that Kafka can act as a backbone not just for a handful of applications, but for hundreds or thousands of microservices. And because it has a proven ability to do that in some of the largest tech companies on Earth, it can act as a true integration plane for your company.

The second difference is that Kafka is a true storage system for streams of data. There are Kafka clusters out there that store petabytes of data. There are some that store data indefinitely. This ability to store streams of data in a messaging system did not arrive in Kafka by accident. We added it to solve one big problem that we envisioned a lot of companies have, which is integrating the batch analytics world with the online request-response world.
This is the ability that's required to integrate batch analytics with real-time messaging, so that you have one unified way of processing all your data in real time.

The final difference between Kafka as a streaming platform and messaging is the ability to process these streams of data. Not only is this possible through the native Streams API in Apache Kafka, but there are lots of stream processing systems out there that are built to work with the streaming abstractions that Kafka provides out of the box.

OK, so moving on to the second lens: this is about a real-time version of Hadoop, or even a data warehouse. I think there is some truth to this view as well. After all, just like a data warehouse or Hadoop, Kafka can act as a place where data comes together from the rest of your organization in one central location. In fact, Kafka now also has the kind of rich SQL layer that you've come to expect from Hadoop or data warehouses. KSQL is the open-source streaming SQL engine for Apache Kafka. It allows you to do sophisticated stream processing operations, from stream-table joins to aggregation, sessionization, and a lot more.

I think something like KSQL is a big step forward in enabling a streaming-first world. With months of history stored in Apache Kafka and exactly-once processing now possible on it, KSQL on Apache Kafka enables a lot of things that were previously not easily possible. The first is real-time monitoring and analytics, allowing you to shift away from batch analytics for things that are critical to your business. The second is making streaming ETL possible natively in Apache Kafka, so you don't have to duct-tape together a bunch of batch ETL scripts to get data flowing in an organization. In short, it is a true bridge between the batch analytics world and the online databases world. So that means that there are some parallels between the Hadoop and data warehouse stack and Kafka now.
But the difference is that, be it SQL queries or processing jobs or applications, things that are built with Apache Kafka are naturally built to update continuously with every single event that arrives, rather than in a batch fashion. And that particular difference changes the role of the streaming platform in an organization relative to what you might think Hadoop and a data warehouse are supposed to do.

Now, data warehouses are very good at the traditional domain they were created for: acting as a center for business intelligence and analytics. I think a streaming platform is unlikely to displace data warehouses for what they are built to do. But where a data warehouse or Hadoop might fall short is when you're trying to build applications that feed directly back into the business. After all, for creating reports, a batch ETL script suffices and works just fine. But for powering a much more real-time and richer customer experience, it is a non-starter. Your customers do not understand data that is 24 hours stale. And the simple mechanics of building an application that depends on a batch ETL cycle feeding back into that application are extremely complex.

So I think the domains and use cases where a streaming platform truly shines are the kinds of examples that I showed earlier in my talk. These are not examples of reporting on your business or analyzing it after the fact. They are very much about directly powering it.

So that brings me to the third and final lens for viewing a streaming platform. And that is about ETL and data integration. There has been a whole generation of technologies to handle data movement. We've had enterprise integration tools and enterprise service buses that handle low-latency data flow, but not at scale. And then we've had ETL tools that handle scalable data flow, but not in real time. So this gives us a hard choice: scalability and flexibility on one hand, and latency on the other.
I think one view of a streaming platform is as a kind of unification and up-leveling of this. In Kafka, the E and the L are Kafka's Connect API. It allows you to build and use connectors to a variety of different systems. And there are dozens of connectors out there that you can use today. And the T is stream processing, be it using the Kafka Streams API or any other stream processing system available out there. So there are parallels between this ETL view and a streaming platform as well. But what the traditional ETL view misses is the use of this platform as an application development platform. A streaming platform isn't just meant for getting data from place A to place B and munging it along the way. It is a true infrastructure platform that allows building sophisticated applications on top of it.

In isolation, these lenses don't communicate the full picture. They make it harder to see the full power of a streaming platform because each group has its own use cases and its own vocabulary. But I think this is the process of understanding a new category of software. In fact, I think it is the hallmark of a new category when you have something that cuts across a number of use cases in a way that wasn't possible before. So what does it look like when you put the streaming platform into practice in an organization? You have a real-time platform that powers applications, like a messaging system; that powers data flow, like an ETL tool; and that acts as a central hub for all data processing and analytics, like Hadoop or a data warehouse cluster.

So what does the future hold? I think the future here is pretty bright. We are seeing tons of innovation happening in the stream processing space. There are lots of stream processing systems out there. And there are lots of streaming data services being released by the public cloud providers.
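As a rough illustration of the three roles summarized in this part of the talk (messaging, stored data flow, and continuous processing), here is a minimal Python sketch. It does not use Kafka or any Kafka client library. `EventLog` and `RunningTotal` are hypothetical names for a toy, in-memory, single-process stand-in, and the per-key running total is just an assumed example of a continuously updated result.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only stream (hypothetical; not Kafka's API)."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset):
        """Return every stored event at or after the given offset."""
        return self._events[offset:]


class RunningTotal:
    """Consumer that tracks its own offset and updates per-key totals."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.totals = defaultdict(int)

    def poll(self):
        """Process events not yet seen; return how many were handled."""
        new_events = self.log.read_from(self.offset)
        for key, amount in new_events:
            self.totals[key] += amount  # updated per event, not per batch
        self.offset += len(new_events)
        return len(new_events)


if __name__ == "__main__":
    log = EventLog()
    log.publish(("alice", 5))
    log.publish(("bob", 3))

    live = RunningTotal(log)   # subscriber that follows the stream
    live.poll()
    log.publish(("alice", 2))
    live.poll()

    late = RunningTotal(log)   # new subscriber replays the stored history
    late.poll()
    print(dict(live.totals), dict(late.totals))
```

Because the log keeps its events, a consumer that joins late can replay from offset 0 and arrive at the same totals as one that followed the stream live. That is the sense in which stored streams let one system serve both the messaging role and the data-flow role.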
Confluent's role is to make the streaming platform more accessible to companies: to make it something you can download and use and put into production quickly, and to make it something that you can use in a public cloud. The Confluent Platform is meant to be a full open source streaming platform. We believe people want to create applications around a platform that is completely open. So the Confluent Platform is an open source distribution of Apache Kafka with all kinds of developer tools, clients in various languages, and connectors to lots of different kinds of systems. It is meant to get you started with Apache Kafka quickly.

And I'll say something. I think this is important not only on premises, but also in the public cloud. It turns out that these two major trends, the rise of real time as part of digitization and the move to the public cloud, are happening at roughly the same time. So what you want is a hosted streaming platform that is open, that has open APIs that you can program to, so you can preserve the optionality of switching between cloud providers if you choose to, without having to rewrite all the applications.

So that's the thesis that I would put forward: that this streaming platform category is really going to be one of the biggest and most exciting new categories of infrastructure software of our time. So as you think about your data, as you go back to work, think about your applications, and think about how the streaming platform notion changes your view of using Apache Kafka in a company. Thank you very much. And we're done.

Hold on. Stay up here. If you don't mind, I want to ask a couple of questions of you. So when you first got involved with Kafka, were you at LinkedIn at the time?

That's right. I was at LinkedIn eight years ago.

And so, in charge of streaming there. And so tell us how Kafka came about and your involvement. Was it one of these scratch-your-own-itch open source moments?
So I came to work on Kafka, even to think about it, sort of by accident. I was hired to work on search at LinkedIn. And the thing about search is that it's only useful if you have access to all the data in the company. And that was the problem we had at LinkedIn. There were two trends playing out. One is that we needed access to a lot more data sources than just the database feeds. And the other is that there were a lot of distributed systems starting to be put into place. Hadoop was one, but then there was Elastic, and there were lots of other systems. And the question was, how do you solve this N-squared data flow problem between applications of all sorts and systems of all sorts? And the thing was, enterprise messaging systems did not scale, and the ETL tools were not real time. So we thought that there had to be a real platform that brought these two worlds together. And we ended up creating Kafka.

Very cool. So you are a big leader in open source, and I know a lot of developers look up to you. What advice would you give to someone who wants to get started in open source or participate in a project like Kafka? Any advice that you'd give? Because I know there are a lot of folks out here who are just getting started.

Yeah, I think the thing I like about open source is that it's fundamentally meritocratic by nature. So this is what I've found useful: you can go read some docs, join the community, ask questions, and just take up some newbie issue and get started with it. I know that a lot of open source communities are pretty accepting and inviting to new developers. I know Apache Kafka is. But you do need some patience to stick with it, because it's all done for free. Committers may not get to your patch immediately, but that shouldn't be the thing that discourages you.

All right, well, good advice. Thank you so much for coming. All right.