All right, let's get started, guys. Good morning. I hope you are enjoying the conference so far. How many sessions have you had today before this one? Two, three? Oh, cool. So your brains are probably at the peak of understanding everything that's going to be told. That's good. That's good.

So this session, which has this very lengthy and weird title, literally aims to teach you everything that you would like to know about Kafka, but in a different way than you might think. The whole purpose of this presentation is to address two main topics. The first one: what is Kafka? I think that's the most interesting one for people who are beginning with Apache Kafka and would like to understand what to do with it and what to expect from it. The second one is even more important. The second section of this presentation addresses a couple of questions that we've been seeing on Stack Overflow. Those are recurring questions that developers from the community keep asking about Kafka, and I'm going to try to address some of them right here. So I think it's going to be interesting both for people who are starting with Kafka and for people who have actually been working with Kafka for many, many years.

This is my Twitter handle right here, and I work for this company called Confluent. How many of you were in my presentation yesterday? Fifty percent of you? Then I think it's worth explaining what Confluent is. Confluent is a company founded by the original creators of Apache Kafka: two gentlemen and one lady who created the technology at LinkedIn. They left the company and founded Confluent Inc., and that's the company that, pragmatically speaking, is behind Kafka. Technically, Kafka belongs to Apache, and there's a huge community of developers working on it, but 80% of the commits on Apache Kafka come from Confluent. So those are the origins of the company, and that's what we do, pragmatically speaking.

As I told you before, the main purpose of this presentation is to address this right here: how many of you have heard that Kafka is a queue? I mean, don't be shy. Pretty much everybody thinks that Kafka is a queue, and that's the whole problem. What we are trying to do in this presentation is demystify this, because it is not true, all right? Which is kind of interesting, because if you think about it, 99% of the people in the whole world think that Kafka is a queue. But it is not, believe it or not. So we're going to address that here in this presentation, right?

My name is Ricardo, and I work as a developer advocate for Confluent, as I explained before. This is my blog, and if you want to ask something about Kafka or Confluent, this is my email. The main purpose of my job, besides going to conferences and doing evangelism, is to literally advocate for you; that's why we call ourselves developer advocates. So if you have any problems with the Kafka documentation, or any PR you have filed on the Kafka project that's taking too long for developers to pick up and discuss, apply, or even reject, talk to me, and I will literally be your advocate with Confluent's engineering. That's what I do for a living, right? So use me, in the correct sense of the word. Use me in your favor, okay? All right, let's get started.
I think the first question I would like to pose here is: how many of you have heard the term "distributed streaming platform" before? Good, that is good to know. Fifty percent of the room, but still good. I have done this presentation about six times now and, believe it or not, that is the average of people raising their hands: 50% of the room. And the reason is that it's a brand new concept. So for those of you who have never heard it before, you are more than excused, that's fine, all right? I'm going to explain what it is.

The interesting thing is that most people go to the Kafka website and go straight to the download section to start using it. Perhaps they go to the documentation to understand a little bit more, but they always skip this term right here. They never read it, never. We've asked a lot of people: did you know that Kafka is a distributed streaming platform? And they say, really? Nobody knows this. And before I actually move forward: the answer to what a distributed streaming platform is, is right in front of you. Right here. I'm not even going to say it. What do you think a distributed streaming platform is, just by looking at this screenshot of the Apache Kafka site? Don't tell me it's a platform that is distributed and does streaming, come on, you're literally just reading. It's more intuitive than that. It is a platform that allows you to, yes, do publish-subscribe, but also to process records and events, and it is a platform that offers storage and persistence capabilities. That's it. In three main pillars, that's what a distributed streaming platform is. If you're one of those people, like me, who build simple mental maps in their heads, like key-value maps, for understanding things: Apache Kafka is this, or this is Apache Kafka. Create the relationship however you want, okay?
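To make the publish-subscribe pillar concrete before we go on, here is a minimal sketch of a Java producer publishing a single event to a topic. The broker address, topic name, key, and value are all invented for illustration; a real application would supply its own.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to a hypothetical "events" topic.
            producer.send(new ProducerRecord<>("events", "user-42", "logged-in"));
        } // close() flushes any buffered records
    }
}
```

Any number of independent consumers can then subscribe to that topic and read the record at their own pace, which is the publish-subscribe pillar in its smallest form.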
Let's dig a little deeper into this, and before digging deeper we have to go back in time. How many of you have watched Avengers: Endgame, by the way? So this is a huge spoiler for those of you who haven't watched it yet, but I think we're in the spoiler-revealing period at this point, because the movie came out, what, two months ago? Three months ago? I don't know. So, this is a spoiler. Don't be mad at me for showing this, because the whole resolution of the movie is that they go back in time to retrieve the Infinity Stones and defeat Thanos and all of that. Oops, I just told the spoiler. Come on, sorry. All right, let's go.

If we go back 30 years, 25 years, or however many years you want, you're going to see that databases were these huge, very powerful systems that solved all the problems in the world, right? That's a bit of an exaggeration, but the point is that if you go back 30 years, most systems were built on top of databases and had basically two layers: the database itself, where most of the business logic lived in the form of stored procedures or triggers or something like that, and the application, acting basically as a producer and consumer of the database, reading from and writing to it. And the database was this entity that remembers things, where you could reliably trust that all your information would be safe. That's fair to understand, right? That was 30 years ago, okay?

But these days, databases are still good, right? Don't get me wrong; I'm not saying databases suck these days. Databases still have their place, but in very specific places. And by databases I mean SQL databases and NoSQL databases, any type of database. The point I'm trying to make is that we're living in an era where not every problem in computing is solved by simply putting your data in a database. Fair enough, right? How many of you who develop microservices on top of distributed architectures or Kubernetes would agree with that? Because databases are not the silver bullet for every problem in computing. That's the whole point. Again, they still have their use cases. They're still good, okay?

All right, some of the problems that databases bring us. The first problem is that they are limited, and they make a lot of mess, all right? Why are they limited? Because of the problem of volume. If you go back 30 years, you could foresee an entire ERP system comprised of roughly, I don't know, 30 or 45 tables, a mix of tables and views. And that's that: you could build an entire ERP system on top of one database and be good to go. But that was because the internet hadn't happened back then, all right? The moment we started thinking about scalability, databases started to show their true limitations. Are any of you a DBA, or do you work with a DBA? You know what a DBA is, right? So imagine you go to a DBA, a very good, very professional DBA, and you ask: hey, I'm a developer, and I'm thinking about creating this application where I would like to store five terabytes of data, and the application is going to consume those five terabytes. Can I use a database for this? What's the DBA's answer going to be? Hell no. That's the answer they're going to give you, because databases cannot handle that volume. Databases are very good because they're transactional systems, but if you start pushing massive amounts of data into a single database, you have to start doing things like partitioning, replication, or dividing into silos, because the database itself cannot keep handling volumes and volumes and volumes of data forever. It's going to slow the database down. It's not going to work. It's proven; it's not me telling you this, the industry has proven this, all right?

So what happens is that you end up doing some kind of trickery to overcome this limitation. For example, how many of you have thought about this: why do we have data warehouses? Have you thought about it? The very reason someone came up with the idea of, let's build a database that's focused only on analytics, instead of using the databases that serve the transactional systems. Why? Someone answer me this. Why? Because the database cannot handle it. It cannot handle analytics and transactions at the same time; it would slow the database down, all right? That's why we came up with the idea of using ETL, or batch. ETL is extract, transform, load, right? So we have this data warehouse where periodically, usually at midnight, a small program extracts data from the transactional database, brings it to the data warehouse, and then we answer all the company's questions through the data warehouse.
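To picture the kind of midnight batch job being described, here is a rough sketch of a toy ETL program. The JDBC URLs, table names, and columns are all hypothetical; real ETL jobs are vastly bigger, which is exactly the problem.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// A toy nightly ETL job: extract yesterday's orders from the
// transactional database and load them into the warehouse.
public class NightlyEtl {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection strings for the OLTP database and the warehouse.
        try (Connection oltp = DriverManager.getConnection("jdbc:postgresql://oltp/orders");
             Connection dwh  = DriverManager.getConnection("jdbc:postgresql://dwh/analytics");
             PreparedStatement extract = oltp.prepareStatement(
                 "SELECT order_id, amount FROM orders " +
                 "WHERE created_at >= now() - interval '1 day'");
             PreparedStatement load = dwh.prepareStatement(
                 "INSERT INTO fact_orders (order_id, amount) VALUES (?, ?)")) {
            ResultSet rs = extract.executeQuery();
            while (rs.next()) {
                load.setLong(1, rs.getLong("order_id"));
                load.setBigDecimal(2, rs.getBigDecimal("amount"));
                load.addBatch();
            }
            // Scheduled at midnight, outside business hours, so the scan
            // does not compete with the transactional workload.
            load.executeBatch();
        }
    }
}
```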
So the data warehouse is conceptually a replica of the transactional database, but with a different data model: a data model that's more focused on answering questions rather than handling transactions and throughput, okay? Fair enough? Transactional, data warehouse; that's why they exist. Cool, all right?

And for those of you who have written ETL programs before, you know that running ETL programs kind of hurts the transactional databases. It hurts because you might end up with locking problems, and you might end up with a database that at times is more focused on processing the ETL than on actually handling the transactional records coming from the transactional systems. That's why people choose to run these ETL or batch programs at midnight: because it's outside business hours, right? Are you getting the feeling that we keep coming up with workarounds for one fundamental problem, which is that databases cannot keep handling throughput? Can you get this feeling? Right?

So I told you that databases are limited and create a lot of mess. The mess problem is kind of a consequence of the limitation, because the moment we start creating multiple databases, things pile up: one for the data warehouse, one for the transactional system, one because I have a partner that is going to read the data from the transactional system, but just like with the data warehouse, I cannot have two systems competing for throughput, so I create a third database that I keep replicating just for the partner to read from, something like that. And because the company got acquired by another company, all the databases got merged together. So in five years of existence of one single company, you might end up with a picture just like this: a bunch of databases that keep replicating across the entire organization, and sometimes beyond the organization, because, as I said with the partner example, some other organization might be interested in the data. So database architectures create a lot of mess. Again, this is not me telling you this, it's the industry, all right? That's why every time I show this picture, some people practically start crying. They say, yeah man, this is my life. This is what I deal with all day, all right?

And maintaining these ETL programs is very hard as well. For example: you just wrote an ETL program that extracts data from the transactional database and loads it into a data warehouse system. You just finished the job. It took you one week. You probably never slept while doing it. And then someone comes to you and says: yeah, you know that column that used to be called order ID? I renamed it to order identifier, because it sounds cooler. So what's going to happen to your ETL program? It's going to break, right? It's going to break. The ETL program is going to break, and things like this, data schema changes, keep happening all the time. So it's hard to keep up. It forces the organization to have pretty much a dedicated team just for this, and that is very costly. This is what we call the collateral cost of having databases, because those teams, those human laborers, developers, are very costly. Unless they're paid in hugs or kisses, which, I don't know, right? They're not.

So we came up with some workarounds across the years to minimize all this mess and these limitations that we've seen with databases. By the way, how many of you have watched Thor: Ragnarok?
Yeah, so that's what I figured, because this slide right here is only going to make sense, or even be funny, if you have watched that movie. What I'm trying to say with "another day, another Doug" is that there has been an explosion of whole new databases in the last few years. I think there are three types of phenomena going on in the IT industry: people creating programming languages every day, people creating JavaScript frameworks every day, and people creating databases every day. So that's why: another day, another Doug. All right, you don't actually need to watch Thor: Ragnarok, but I would recommend it, because it's a very funny movie.

So let's think about some of the databases that we have come up with through the years. The first one, I think the most popular one, is Hadoop. Hadoop, how many of you have heard about it? Big Data, very cool. I think Big Data is probably one of the best LinkedIn skills that people have put in their profiles over the last 10 years. Like, I know Big Data because I did a hello world in Hadoop, something like that. It's kind of cool, working with Big Data. Hadoop is a spectacular technology. I have huge respect for Hadoop. I have worked with Hadoop before; I kind of made a living in the past using Hadoop. But Hadoop solves one problem in particular, which is the problem of volume. Remember the example I gave before about databases not being able to keep up with five terabytes of data? Hadoop can, because it was designed to handle large volumes. But still, Hadoop continues to be just like a database: it's a technology where you store data. The data is persisted in a file system called HDFS, and you still have to come up with programs or applications that read the data out of Hadoop, bring it into memory, and start processing it. The philosophy of how you do computing with Hadoop didn't change; in that sense it's just like databases. Again, it only solved the problem of scalability: you can handle large volumes with Hadoop.

So, how do we overcome this problem of having to process the data later? Because that's the whole problem: I acquire the data, I store it, and someone else has to process it later. So we came up, and when I say "we" I mean the industry, we as an IT community, with different types of what I call specialized databases, which are strictly focused on minimizing as much as possible the amount of processing, or post-processing, that you have to do with the raw data you have stored in the database. I will give you one perfect example. How many of you know these databases: Cassandra, Redis, CouchDB, MongoDB? Those examples are NoSQL databases, right? The example I like to give is graph-oriented databases. For example, when you are designing an application that behaves as a social network, the social network itself is modeled as a graph, right? Like a tree that has leaves, and each one of these leaves has more leaves, and you have the very concept of a graph in this. The whole graph is one single record, right? Some of those databases were designed so you can store the entire graph in the database. So when the application pulls the graph out to start reading it or using it, you don't have to process the whole graph again, right? So that's the whole point of using a NoSQL database: you store very efficiently, just like Hadoop.
It's very good as well at handling large volumes, all right? And when you go to retrieve the data, you don't need to reprocess it, because conceptually some of those NoSQL databases already processed the data during storage, right? That's the whole point. That's the beauty of NoSQL databases: when you retrieve, here's the key, give me the value, it's very fast, right? And you don't have to reprocess anything just to uncover the information that the value comprises. Makes sense? All right?

But as I said, another day, another Doug. We came up with this workaround of having different types of databases across the entire organization, and that brings us back to the same problem right here, right? I mean, if you have worked in companies, you might have Oracle, Sybase, DB2, Postgres, MySQL, CouchDB, Cassandra, MongoDB, lots of databases spread all over the organization, and you are responsible for making the data flow from one place to another, right? This is still a problem. So again, I'm going to pose the question: can you see that we are not effectively solving the problem? We continuously come up with excuses and excuses and excuses, and workarounds and workarounds and workarounds, and we're not effectively solving the problem. Okay? Makes sense? All right.

So let's talk about... oh, before that, actually, I forgot. I always forget this piece; I have to practice this presentation a little more. Messaging technologies, have you heard about these? Remember the question that I originally posed at the beginning of the presentation: Kafka is not a queue, right? Well, queues are an example of messaging technologies, right? Messaging technologies are really cool. I've worked with some of these technologies: JMS for sure, SonicMQ, ActiveMQ, a little bit of Amazon SQS sometimes. And they're very good for a single purpose, which is moving data from point A to point B. Some developers also use messaging technologies to come up with some sort of buffering strategy for their applications. For example: I have a backend microservice that cannot handle 10 concurrent requests per second. So what do I do? Instead of my, I don't know, my API gateway delivering the messages as they happen straight to the backend service, I put a messaging technology in between, so the messaging layer throttles the processing and I offload my backend a little bit. That's one of the many use cases that you can solve with messaging, all right?

So messaging technologies are really good, all right? But at the end of the day, they're just pipes. What I mean by "they're just pipes": they're just a way, as I mentioned before, to move data from point A to point B. They are dumb. Why dumb? Because they don't know what's happening. They're basically a carrier, right? They allow the movement of the data, but you cannot process the data as it is being moved, all right? Makes sense? Okay, all right.
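A quick aside: the buffering pattern just described can be sketched roughly as below, assuming a hypothetical "requests" topic fed by the gateway. The consumer drains the topic at its own pace, so a burst from the gateway never overwhelms the slow backend.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThrottledBackend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical cluster
        props.put("group.id", "backend");
        props.put("max.poll.records", "10"); // take at most 10 requests per poll
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("requests")); // hypothetical topic name
            while (true) {
                // The broker buffers whatever the gateway produced;
                // we drain it at whatever pace the backend can sustain.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record.value());
                }
            }
        }
    }

    static void process(String request) {
        // the slow backend work would go here
    }
}
```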
So let's go back in time one more time, but this time not 30 years. We are in 2019. That's a weird thing to say, like I don't know what I'm doing here, of course we are in 2019, but let's go back about 11 years, to the peak of this company right here called LinkedIn. How many of you know this company? All right, cool, probably pretty popular these days.

So LinkedIn, just like every other company, every organization, had all the problems that I discussed before. They had this explosion of databases inside the company. They were having lots of problems maintaining ETL programs to replicate data. They were consuming a lot of messaging technology to keep moving data from point A to point B. LinkedIn was suffering from all these problems, but LinkedIn chose to say: enough. And they came up with this idea, this internal project, whose title was essentially this: let's considerably minimize the number of databases that we have inside the company, because that's cost. Considerably minimize the amount of messaging technology that we have inside the company, because again, that's cost. Considerably minimize the number of ETL programs that we have to keep maintaining all the time, because again, that's cost, right? Remember, 11 years ago LinkedIn was a startup, so cost was a very big problem for them. And more importantly: minimize the complexity of our architecture. Remember the mess diagram that I showed before; let's minimize that problem. I don't want a mess. I want my life to be fun and beautiful, and I want to laugh. I don't want to cry when I go to work, right? That's what LinkedIn was doing.

So that's why LinkedIn is kind of the birthplace of this technology that you know today as Apache Kafka, right? Two gentlemen, Jay Kreps and Jun Rao, and one lady, Neha Narkhede; I probably mispronounce her last name, sorry. She used to be my boss, actually; that's a thing. They worked for, I think it was nine months, to create the first draft of the technology. It wasn't called Apache Kafka back then, but it was the draft of the architecture that allowed all of this, all right? And the project was so successful internally, because they literally solved all the problems I alluded to before, that they decided: okay, let's bring this to the community. Let's transform this into an Apache Incubator project. And guess what? One year later, this project graduated from the Apache Incubator to Apache GA. Apache Kafka was probably one of the fastest success stories at Apache, because Apache has a very strict set of policies for promoting an incubator project to GA, right? They're very picky about this. They expect the technology to have been used by the community around the world for long years before they decide to promote it to GA. Apache Kafka took one year. That's how successful it was, right? Forget about the scalability aspects of Apache Kafka: Apache Kafka as a project itself was very successful in the eyes of the Apache Software Foundation, which is pretty cool if you think about it.

And if you ask me what Apache Kafka is, I told you before: a distributed streaming platform. What that really means is this. Think about the advantages of messaging technologies, which are speed and low latency, and the advantages of ETL and database architectures, which are being highly scalable, highly durable, highly persistent, and highly ordered. If you mix them together, that is what distributed streaming platforms are, and therefore that is what Apache Kafka really is, okay? It's a platform, if it's not clear yet, that allows you to process data as it happens, at scale.
Terabytes of data are not intimidating for Apache Kafka, right? And more importantly, it can act just like a messaging technology, but without the limitations of messaging technologies; I'm going to discuss this later on, right? And it can still efficiently, durably, and consistently do all the things that a typical database can do, like transactions and the ACID properties: atomicity, consistency, isolation, and durability. It can do all of this. We're going to discuss this later, okay? So that's what Apache Kafka really is.

If there is one thing that I would like you to take away when you leave this presentation, it's this: what is Apache Kafka not? It's not a queue. It's a distributed streaming platform, even more powerful than a simple queue. It's not dumb like a queue, for God's sake, all right? That's the whole problem. We at Confluent, as the company kind of behind Apache Kafka, are trying to educate people around the world about this concept, and actually that was partly our own mistake. I'm not sure if it was our mistake alone, but we take the blame, because that's kind of our responsibility: we presented the technology as a messaging technology 10 years ago. So it's natural for people to compare Apache Kafka to other messaging technologies such as RabbitMQ, ActiveMQ, IBM MQ, or something like that. But we are slowly reminding people that it is not just that, okay? All right?

So, back to the question: what is a distributed streaming platform? I just explained what a distributed streaming platform is, but it all boils down to making you understand that a distributed streaming platform, just like a database, can be seen as your single source of truth, right? You know what that means. You know that phrase we use for databases, where we treat the database as the source of truth? All the applications write into the database, so other applications can confidently believe that the data coming from that database is reliable. It's consistent. It's durable. I can trust my processing and my decisions based on that data, because the database is believable, all right? So you can do the same thing with Apache Kafka.

And it all started with this guy right here. How many of you know who this guy is? Pat Helland. Have you heard about a very small company called AWS? And maybe Azure? Have you heard about this very small cloud provider, Azure, from Microsoft? This is the guy who basically architected much of the high-scalability platforms behind AWS and Azure. He's probably best known for that accomplishment. In other words, he's a very good guy. He's very smart. He's very good at distributed systems design and at solving distributed systems problems with very smart, simple solutions. And he wrote this paper; the name of the paper is "Immutability Changes Everything", right? In it, he basically came up with the strategy of thinking about data in terms of logs, right? And this is how Apache Kafka structures data: as a log. That's also where the name Apache Kafka came from. There's this author called Franz Kafka, who is known for writing short stories, or journals as he called them, and these mini-stories form a sequence, one after the other.
If you read the third journal, it has a link to the previous journal, because the story always continues to be delivered. That's how he created a log, and that's where the name Kafka came from: from this author, Franz Kafka, right? And when we start thinking about this concept of Kafka logs, we realize that Kafka as a technology has three main pillars. They're basically the three main pillars that I discussed before. Remember: publish-subscribe, storage and persistence, and the ability to process data, right? That's the main foundation of what Kafka does. So let's quickly discuss each one of them, and then I'm going to do a very cool demo. You're probably going to enjoy it. It showcases the power of Kafka and how you design systems around Kafka. Because everything I've told you so far, although I think it's interesting, probably looks a little abstract in your mind at this point, right? I know this because that's the feedback I receive from everybody who sees this presentation. So don't worry if it feels too abstract, right? Let's discuss this first, and then we can do the demo.

First of all: it's messaging, done right. What do I mean by that? A homework assignment for you. Not for now, of course. When you leave this presentation, if you are still interested in learning more about Kafka, go to the Kafka site, kafka.apache.org, I think. Go to the documentation section, do a Ctrl+F to search, and search for this sentence: "don't fear the filesystem." Pretty cool, right? And the author, by the way, is Jay Kreps, one of the guys who actually created Apache Kafka. He poses a very interesting question about how you should see file systems. For example: what does RAM mean, the acronym RAM? Random access memory. Focus on the first word: random. If you are dealing with storage from which you have to randomly retrieve data to start processing, memory, RAM, is going to be very efficient, because you are randomly searching for data on that storage, right? However, if you try to do the same thing with a disk, it's going to be terribly slow. Why? Because this little thing here called the needle, and the sectors of the hard disk, were not designed for doing things randomly, right? Makes sense? Searching for things on a disk requires moving the needle very, very, very fast, and still, even if you have the best storage disk in the whole world, the number of RPMs, rotations per minute, makes it infinitely slower than RAM, right?

So without me explaining it, how do you think we should search for things in this type of architecture? Just look at the picture. I'm not even going to explain; it explains itself. Come again? Sorted. I like it, and it's technically right, but I'm looking for another word, actually. Think about the format of the disk, which is a circle, and think about the movement the needle has to make. Think clockwise. What is this clockwise movement for you? Sequential, perfect. It's searching for things, or processing things, in a sequence.
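To ground the sequential-versus-random point, here is a toy sketch, not a rigorous benchmark, contrasting the two access patterns on the same file. On spinning disks, the sequential scan wins by orders of magnitude, and that is the access pattern Kafka's log is built around. The file name is a placeholder; any sufficiently large file will do.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Random;

// Toy illustration: reading a file sequentially vs. jumping around randomly.
public class SequentialVsRandom {
    public static void main(String[] args) throws IOException {
        File file = new File("data.bin"); // placeholder: any large file

        // Sequential: the disk needle (and the OS read-ahead) just keeps moving forward.
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* consume */ }
        }

        // Random: every seek forces the needle to reposition before reading.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            Random rnd = new Random(42);
            byte[] buf = new byte[8192];
            long range = Math.max(1, file.length() - buf.length);
            for (int i = 0; i < 1000; i++) {
                raf.seek(Math.floorMod(rnd.nextLong(), range));
                raf.readFully(buf);
            }
        }
    }
}
```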
So, back to the documentation piece I was explaining before: don't fear the filesystem. Jay Kreps came up with this idea: what if, instead of using the algorithms and data structures that we typically model in databases, which are B-trees, allowing you to easily retrieve data when you are randomly searching for it, we used data structures that structure the data as a log? Remember the log concept, right? One record after the other, where each position of the data has an offset that uniquely identifies its position in the log. And when you are about to search for data, remember that this table here is a log; that's why it keeps moving in front of you. When you are about to search for data, the only thing you have to do is move the offset forward or backward, right? No struggle at all. If I know the offset, the exact position of the data, I just move the offset forward or back. How fast do you think this is going to be, even for hard disks? And he proves, there's an article actually, he refers to an ACM journal article, he proves that, believe it or not, this can be faster than RAM. Faster, imagine that. So it's all about how you structure the data, the algorithms and data structures, as well as which type of storage you use, right?

So what are the advantages of this strategy? First of all, it scales much better. If we buy a machine these days, a typical regular Dell machine that you would buy for your data center would have what, 128 gigabytes of memory? Something like that, or 256, maybe 512 gigabytes, tops. That's the main memory available for you to use, right? But it's too limited. It's too small for these days, specifically. Now, how much storage does your laptop have as a hard disk? Terabytes? One terabyte, two terabytes? Imagine a server having five terabytes, 10 terabytes, 100 terabytes, one petabyte of disks, because you can cluster disks to come up with one petabyte of storage, right? Imagine using those disks to store data and process data the way I told you before: sequentially, moving the offsets. That's the magic of Apache Kafka, guys. Why is Kafka so much more scalable than the other brokers or messaging technologies? Because of this: it uses the disk for this.
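To make the "move the offset forward or backward" idea concrete, here is a minimal sketch using the Kafka consumer API, with a hypothetical "events" topic and an arbitrary offset. Rewinding is nothing more than pointing the offset at an older position; from there the broker serves the log sequentially.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical cluster
        props.put("group.id", "replayer");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(List.of(partition));
            // "Rewind" the log: point the offset at an older position
            // and read forward sequentially from there.
            consumer.seek(partition, 42L); // arbitrary offset for illustration
            consumer.poll(Duration.ofSeconds(1)).forEach(record ->
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value()));
        }
    }
}
```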
And you know something else that Kafka does that's pretty cool? Actually, there are two more things. One of them: have you heard of this feature of modern operating systems called the OS page cache? Linux uses it, Unix systems use it. It's a region of your available memory that the OS uses to cache data that has been stored in the file system, right? Kafka uses this heavily. A typical Kafka broker has only two or three gigabytes of heap memory. It's a JVM, right? And two or three gigabytes of heap for the JVM is more than enough for the broker, because the actual data is not loaded into the heap. It's served from the OS page cache. So compare Kafka to ActiveMQ, for example, which is written purely in Java and runs in a JVM: ActiveMQ is going to load all the data into the JVM. And can I load all the data I want into a JVM without having garbage collection problems? No, you cannot. Sixteen gigabytes of data in memory is already too much for a JVM, all right? Not for Kafka.

The third design decision that Kafka makes, counting the use of the disk as the first, is very cool. There is an API in the Linux and Unix file systems called sendfile. Have you heard about it? It's not well known, but let me create an analogy. Have you heard about InfiniBand? InfiniBand is kind of a competitor to Ethernet-type networks: whereas typical Ethernet has a throughput of maybe 10 gigabits per second, InfiniBand can do something like 80 gigabits per second, and there are InfiniBand architectures that go beyond that, like 120 gigabits per second at the network level. So the throughput is much higher, right? Now, imagine that your application, running on top of the operating system, is trying to send data out to another application running on another machine, right? In a typical Ethernet architecture, what's going to happen? You have the seven layers of the network stack, with buffer copies between them: copy to this layer, copy to this layer, copy to this layer, until the final layer, the network card, transmits the bits to the network card on the other end, right? You know what the sendfile function does? Here's your application; it writes straight to the network card. Straight, without buffer copies. So it doesn't burn CPU cycles copying data around. And let's come back a little bit: the data is already in the OS page cache, so there's no disk I/O at all. Kafka can transmit that data straight from the OS page cache to the network card, without any I/O. That's why Kafka is so much more scalable than the other messaging technologies. Can you see why now? Do you have a clear picture of why Kafka is so scalable?

So let's do something very fun, and I'm actually going to ask for your help. Pick up your phones and scan this QR code right here. When you do, I'm not even going to explain what the application that pops up in your browser does. I'm pretty sure that whether you are 16 years old, 25, 40, or even 60, you are going to know what to do, all right? And keep playing. You can start playing if you want, if you are already there. Let me know when I can skip this and open another window. Okay, so if you are there already, start playing. Like I said, I don't have to explain what this means: the good old Pac-Man game. The only different thing that you have to understand is that this Pac-Man game you are playing right now, and I assume you are playing, is continuously sending events. What's the name of the little balls that Pac-Man eats as he moves? What? Pellets. Pellets. Every time he eats a pellet, your score increases, right? So that's emitting events to an Apache Kafka cluster running in the cloud, and I'm consuming them right here: your names, the level you are currently on, and how many lives you have, right here, right? This is what we call the raw topic. So imagine those applications in the transactional layer of a database architecture that keep writing data into the transactional database, right? That's what you're doing right now: you're writing data into the database. And then, remember the data warehouse, and why we have data warehouses: to do analytics. So instead of copying the data to a data warehouse, we're going to do the analytics right here, on top of the transactional data.
There's no need to replicate that anymore, people, right? Because Kafka can handle it, right? That's the whole point. So, how do we model this in Kafka? You know this concept of tables that we usually have in databases? We still have the concept of tables here in Kafka as well, right? But before thinking in tables, you have to think in streams. Streams are like tables that keep receiving records continuously. They never stop, right? So if we look right now at how many streams we have, we should have no streams and no tables yet. All we have right now are the raw Kafka topics that are receiving data, okay? So what I'm going to do now is what we call a pipeline. We're going to design a pipeline with a number of instructions that model our analytics. The first step is to lift the topic into a stream, right? That's what this command is doing: I'm creating a stream called user game, which is built on top of the raw Kafka topic, right? And then I'm creating another stream called user losses. Every time you get a game over in Pac-Man, it sends an event to this user losses stream. It's a game-over kind of signal.

Three minutes? No, in three minutes we have to stop? Really? Okay, so one minute then. All right, so what I'm going to do is just run this real quick; I can show you the rest outside. And if you're playing Pac-Man, keep playing, because you have to generate the events, okay? So now, statistics. Yep, so you see here the highest score, the highest level achieved, and the number of losses per user. This is a table. Imagine a table, right? But I do aggregations to come up with that table. And all of this is being continuously updated, because the streams keep coming in. So this is being recomputed in real time, as the events happen. Pretty cool, right? Guys, sorry, the next speaker has to start. Thank you for coming, and I will be outside if you want to see how this works, all right? Thank you very much. Thank you.
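For reference, a pipeline roughly like the one in the demo could be sketched with Kafka Streams as below. The demo itself used KSQL-style statements, and the topic names, key types, and value types here are guesses based on the narration, not the demo's actual code.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class PacmanStats {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pacman-stats");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        StreamsBuilder builder = new StreamsBuilder();
        // Stream: the raw topic of score events, keyed by player name (guessed layout).
        // Table: a continuously updated highest score per player, recomputed as events arrive.
        KTable<String, Long> highestScore = builder
            .stream("USER_GAME", Consumed.with(Serdes.String(), Serdes.Long()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .aggregate(() -> 0L,
                       (player, score, max) -> Math.max(max, score),
                       Materialized.with(Serdes.String(), Serdes.Long()));
        // Publish the changing table back out as a stream of updates.
        highestScore.toStream()
            .to("STATS_PER_USER", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The stream/table duality shown in the demo is exactly this: the input is an unbounded stream of events, and the aggregation materializes a table whose rows are updated every time a new event arrives.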