Hi, welcome to my talk, Building Modern Data Streaming Applications. I'm Tim Spann. A little bit about me: I like to work with a number of different open source streaming technologies. I call it either the FLaNK stack or the FLiP stack, depending on which tools are in there; generally NiFi, Kafka, Flink, Java, maybe Pulsar, Spark, Iceberg. There are a lot of open source projects that work well together to build really cool streaming apps. I've got a blog at datainmotion.dev, and I run a couple of meetups, so I'm always looking to reach out to people interested in streaming; get in touch. Every week I put out a newsletter that covers all the streaming tech, open source, lots of cool tools that I find: presentations, articles, podcasts, videos, all kinds of content. You can check it out on the LinkedIn newsletter page, or I also keep everything in GitHub to keep it easy. Again, I run meetups: one in Princeton, one in Philadelphia, one in New York. Whenever possible these are hybrid and recorded on YouTube, so you can watch them even if you can't come out and have some pizza with us.
Let's get down into the nitty gritty of streaming. A lot of people see the word streaming and, if it's not Netflix, they think it's going to be tough: a lot of handwritten, very complex distributed coding, lots of things that could go wrong and make you crash and burn. Well, you don't have to do that anymore. We can do the really low level stuff, anything that's real time in most people's minds, with open source tools, without a lot of heavy coding, without deep knowledge of complex distributed systems, and without having to write machine learning, low level code, Scala, the really hard stuff.
In the old days you could probably get by with a batch maybe once an hour or once a day. To me, a micro batch is around a minute: your code runs every minute. It's not real time, but it's pretty quick. Real time in most people's minds starts around a second, maybe down to 10 milliseconds. That's a pretty big range, but generally once a second is pretty good; I can get a lot of data moving, and for most businesses we can run there and call that real time. Very often we need to go down to half that or less, where latency really matters. I might be doing transaction processing, I might need a workflow to determine what's going on, I might be doing something with Kafka where it's exactly once, I might be searching through tweets as they happen. That's generally the space we play in when we're talking real time in the context of tools like Kafka, Flink, NiFi, Spark. If you go lower than that, into super low latency, those systems are still hand coded in really low level languages or specialty frameworks. There you're worried about how close you are to some other machine and what kind of networking you have, maybe a big dish pointing at another dish. You're on your own there.
When I'm talking streaming, it's usually the destination that's exciting, but we do stuff along the way. We start somewhere and go somewhere else. It's a little more complex than that, but in general that could be your use case: I have a file, I want to put it in Snowflake. I have something in Oracle, I want to put it into Cloudera Data Warehouse.
Or maybe I have a thousand data sources and I don't want to hand wire them to a thousand different destinations: maybe from SQL Server to SQL Server, and to Snowflake, and Cloudera, and Redshift, and some other data store, and Mongo, and S3, and Google Cloud Storage, and maybe into a Kafka topic in Confluent, a Kafka topic in Cloudera, an open source one, something on premise. There are a ton of sources: standard on-premises data sources that people have had for decades, things like relational databases and enterprise apps, where we push and pull data; big data and cloud services people are using, like cloud storage, Kinesis, Azure Event Hubs, that we can grab data from in a streaming style; some kind of cloud lakehouse, data lake, warehouse, whatever you want to call it; different business services we want to get data into and out of; some kind of analytic service, whether that's a data platform, a log processing system, machine learning, lots of different things we can interact with. And logs: lots of things produce logs, often for security, other times to catch something doing what it shouldn't, or just to keep an eye on what's going on. Things like Kafka and NiFi get this data, transform it, and deliver it. That's the streaming we're thinking of. We're not talking about TV shows and movies, though they have a lot of metadata and are often part of this chain; we're usually thinking textual data, tabular data, things like JSON, comma-separated values, Avro, maybe PDF text. You can do that all day, every day, and distribute this data very straightforwardly, and we'll go into that.
Now, even on that slide I had two little figures in there, because when I'm doing a real-time app it's hundreds of things: a thousand tables from this system, some sensors, a truck, all these different pieces, so we need teamwork. Just like in the real world, the team has to come together: you've got your quarterback, you've got all the different positions working together to solve a common problem. There's a ton of great open source tools, and we pick and choose the ones that make sense for that play. If you need to grab things off your laptop, do some simple enrichment, transform it, and distribute it to the cloud, NiFi does all that and pushes it into Kafka, and then some cloud application can grab it. Maybe I need to do analytics, or join that with the streams that came off all the different laptops for my company or my project, whatever the aggregate is. Flink is great for aggregating all those different streams off the same Kafka topic or different Kafka topics; really powerful. At the end, maybe I'll stream it into Iceberg on a huge lakehouse so I can do continuous analytics on it later, or point something else at it and write apps against it. Again, once it's in a Kafka topic, that's a great place to be, as we saw on the previous slide: distribute it to whoever needs it, wherever they are, lots of different channels in and out.
Now, in a common application we'll have a bunch of data sources: again, logs, social media, marketing data, clickstream. NiFi is a great tool to ingest them, and we'll show you that. It's a WYSIWYG, open source, easy to use tool: get data in, route it, transform it, clean it up, and get it into some topics.
So downstream it can be used as part of real-time streaming apps, whether that's Flink or some other system; really easy to do. A simple pattern here is NiFi to and from Kafka, Flink to and from Kafka, however you want to draw that diagram: a very easy channel for getting your data distributed quickly. You don't have to worry about losing data, and you can process things in order; lots of benefits to using streaming.
There's also another project out there that's pretty powerful, called Apache Pulsar. It's similar to Kafka, and you could spend about 12 hours going into the differences and similarities, but I'll try to give you the highlights quickly. What's cool about Pulsar is that it runs in multiple layers that can be isolated and separated and can grow and shrink independently. That becomes really important when you have giant systems where storage can be massive: with Pulsar I can use tiered storage, and that's separated from all the communications. I can build standard streaming applications like we were just talking about, but also some of those old school, other types of messaging. Kafka-style messaging isn't the only way systems and apps talk to each other; pub/sub style and work queue style messaging are still common, and a ton of different apps use them. With Pulsar I can unite them under one cluster or one set of clusters. There's very nice support for a lot of different protocols, whether that's Kafka, AMQP, MQTT, or WebSockets, which gives you the option to have all your messaging done in one place. Of course it's open source, and with this distributed, layered architecture it works great in cloud native systems, whether that's the public cloud, a private cloud, Kubernetes, or a standard cluster on bare metal or VMs. It's very easy to run in Docker on your laptop: start getting used to it and you can build up some pretty cool apps. And you can decide which protocol you're going to use, or reuse existing apps. Say I've got some Kafka apps: I just point them at what they think is a Kafka broker, but it could actually be Pulsar. Same with MQTT apps, very common in IoT, or AMQP, which is commonly implemented in RabbitMQ. Use all those different protocols to get data into Pulsar, into various topics that can be partitioned. We partition topics so we can spread the data out and have it consumed faster.
If we look here, there are different ways to consume data in Pulsar, which is fairly unique in the world. This is Key_Shared: it lets me have a shared subscription where whoever is subscribed to a given key gets those messages in order, exactly once. Very nice, and very similar to what's commonly done with Kafka. What's cool is that, because of the support for multiple protocols, those could be Pulsar consumers or Kafka consumers. With Failover, I'll have a dedicated consumer of my data, and if something goes wrong I fail over to another one, so I don't stop, I don't lose any data, and nothing gets out of order. Great way to do it. If I want more messaging-style behavior, I can set up something like an Exclusive subscription, which means only one consumer can use it; no one else can. Great for security, great for keeping things exactly once and in order. But if that consumer stops consuming, you've stopped consuming: you're not getting that data, there's no backup, and no one else can take over until that consumer is back online.
Now, multiple people who want to read this data can each have their own subscription, and each gets its own timeline and its own ordering. So you could have a thousand different apps with their own subscriptions getting their own data; that will increase storage, but if that's what you need, that's what you're going to do. Another style, Shared, is great if you need a work queue or want to push data through the system as fast as possible. With Shared, I've got 10,000 little consumers sitting there on VMs, pods, Docker containers, whatever they may be, and each one just grabs the first message it gets, consumes it, acknowledges it, and keeps going. If you need to process more data faster, just throw more consumers at it. And they can be in different languages: Python, Go, Java, Scala, Kotlin, Node.js, C#, Spark, Flink, NiFi. Have as many as you want; whoever gets it, gets it, and you process it as quickly as possible. Great way to share workloads across lots of different machines. Very extensible, very nice.
Now, in Pulsar, Kafka, and pretty much everywhere else, when events or data come in, there are a lot of different terms you could use: objects, components, records. In Pulsar it's a message, which is pretty common. You've got the value in there, which is the data payload you really care about; in Pulsar it's raw bytes, but what's nice is you can map a schema to it. So if you know it's coming in as JSON or Avro or Protobuf, you can format it appropriately and set up a contract on your data. That's great, but that's not the whole message. There's a key. It's not required, but I recommend it: it helps improve performance by partitioning data, maybe compacting the topic in Pulsar, and it can be used for lookups, auditing, or uniqueness. Have a key. Properties are cool: if I don't want to change the payload but want to add some extra data or metadata, I can just put in my own key-value pairs, as many as I need. That extends the message, and it's readable on the other side if consumers want it; if they don't, they can ignore it. By default a producer name will be created for you, and it is never a good name. We want to know who made this data and who got it into the topic, in case you're trying to debug or just want to make sure you're getting the right data from the right producer, so give it your own name. The sequence ID is automated, so if we need things exactly once or in order, we have it. Great bit of information there. All these fields can be accessed by consumers, but also by other systems connected to it: Flink can access all of them inside SQL as if they were fields in a table, same with Spark, same if we're mapping Trino to it. Great way to get your data. I'll sketch a tiny producer and consumer below to make this concrete.
Pulsar, like everything else in open source, has a tremendous ecosystem, because everyone likes to play well with each other and there's more than one way to get your data. We mentioned the different protocols, and I touched on some of the languages; there are even more than that, but those are the most common. Lots of ways to automatically get your data in and out. The great thing about Pulsar and Kafka is that there are a lot of connectors where you set up a source, set up some config pointing somewhere, and it gets data into the topics for you, and then you can consume it with any of the processing engines out there. Really nice.
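To make the message anatomy and the subscription types a little more concrete, here's a minimal Python sketch using the pulsar-client library. The broker URL, topic, subscription name, and fields are placeholders I made up for illustration, not anything from the demo.

```python
# Minimal pulsar-client sketch: a producer with a key, properties, and a real
# producer name, plus a Shared-subscription consumer. Broker URL, topic, and
# field names are placeholders.
import json
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Producer: the value is the payload, the key helps with partitioning and
# compaction, properties ride along as extra metadata without touching the
# payload, and we name the producer instead of taking the auto-generated name.
producer = client.create_producer(
    'persistent://public/default/air-quality',
    producer_name='airquality-ingest-1')
reading = {'city': 'Princeton', 'parameter': 'pm25', 'value': 7.2}
producer.send(
    json.dumps(reading).encode('utf-8'),
    partition_key=reading['city'],
    properties={'source': 'openaq', 'ingested_by': 'nifi'})

# Shared subscription: many consumers can attach to the same subscription and
# each message goes to whichever one grabs it first. Swap in KeyShared,
# Failover, or Exclusive to get the other behaviors described above.
consumer = client.subscribe(
    'persistent://public/default/air-quality',
    subscription_name='workers',
    consumer_type=pulsar.ConsumerType.Shared)
msg = consumer.receive(timeout_millis=10000)
print(msg.partition_key(), msg.properties(), json.loads(msg.data()))
consumer.acknowledge(msg)

client.close()
```

Scale this out by starting more copies of the consumer half; because the subscription is Shared, they split the work automatically.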
And the same goes for sinks: something lands in a topic, a connector sends it somewhere, and you don't have to write code. Again, minimize code for things that don't add value. Work on business logic, solving problems, making things interesting, making more money for your company or improving the projects you're working on. Work on the good stuff, not the boilerplate that these systems handle for you. Pretty cool.
Now, we touched on Kafka a little bit. Kafka and Pulsar are very similar, but not exactly the same. Kafka is horizontally scalable and distributed among different servers. Data is partitioned, so it can be split up and spread around; that way you don't lose data, and it increases performance. Data is replicated between different servers, so if things fail you keep running and you don't lose data. Really important. It's also very fast; it's designed for performance. It was created at LinkedIn more than 10 years ago, has been an open source Apache project for a while, and is heavily used around the world. If the name Kafka sounds familiar, it's named after the very interesting writer Franz Kafka. I went to his museum wearing my Apache Kafka shirt; it was fun. Highly recommended if you're in Prague, and there's a good cookie shop around the corner.
What am I going to use Kafka for? Real-time tracking of what's going on on different websites: that can be pushed in from JavaScript on the client, or every submit can go through the backend, depending on how you run your websites. Grabbing any kind of logs, aggregating them together, maybe combining multiple sources of data and streaming them wherever they need to go. It's a great way to collect statistics from lots of apps, monitor them together, gather all the metrics, and push them into dashboards. Stream processing, as we showed with Flink: take the raw data, connect it together, move it between different topics, aggregate, change the formatting, maybe add schemas. And getting data into a system as fast as possible, which is very important with large amounts of data; Kafka manages that automatically across different servers, so I don't have to hand code how to distribute the workload between the applications doing the work.
The names are going to sound familiar, and we've already mentioned a couple, so let's define them. Kafka is for messaging. A topic is what we put messages into; it's a great way to logically define where data is, and it gives us a common way to talk about it. A message comes into a topic, a message gets consumed from a topic, and if I want to put data somewhere, I put it into a topic; different data, create a new topic. A producer publishes messages to a topic, and a consumer subscribes and reads them off. The servers in a cluster are called brokers, and you should have at least three, maybe five or more; if you have a managed service you don't really have to worry, someone else does that for you, which is awesome. It's a very reliable, distributed system for messaging, and it decouples your apps. Once I get my message into a topic, my job as a producer is done; I go do whatever I have to do, and I'm not tied to whoever's consuming it. The producer and the consumers can be in different languages, or different versions of languages, like the little sketch below.
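Here's a tiny sketch of that decoupling using the kafka-python package. The broker address, topic, and record fields are placeholders, not from the demo; the point is just that the producer and the consumer never know about each other.

```python
# Minimal producer/consumer sketch with the kafka-python package.
# Broker address, topic name, and record contents are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: once send() is acknowledged by the brokers, this app's job is done.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))
producer.send('weather', key='KEWR', value={'station': 'KEWR', 'temp_f': 71.3})
producer.flush()

# Consumer: a completely separate app, possibly in another language, another
# cloud, or on premise, subscribes to the same topic whenever it likes.
consumer = KafkaConsumer(
    'weather',
    bootstrap_servers='localhost:9092',
    group_id='dashboard',
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))
for record in consumer:
    print(record.key, record.value)
    break  # read one message and stop, just for the example
```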
One app could be in the cloud, one in a different cloud or on premise. It's a really great way to keep systems safe from each other, scale out, and work with partners, friends, all those sorts of things. It's very easy to horizontally scale, and organizing things by topics means you can support lots of different use cases. What's nice are the many-to-many relationships: one source can feed a lot of downstream systems at the same time. You don't have to manually do anything, and you don't have to worry about whether it's possible or what you'd have to change. No pain.
Flink, as we saw on the chart, is my favorite little pet here, with an awesome logo. What most people use Flink for is its SQL engine. I can do streaming analytics, things like continuous SQL, and I'll show you some examples. I can do really advanced processing of these events as they come through: join streams together, do aggregates on them, and do it while the events happen. So I'm not waiting, like we saw on that real-time chart, for a batch every hour or every minute, or for a clock to say a second has passed, let's check again. The event comes into the topic, I get it, and it's in my query result or in my custom application. Really awesome.
Here are a couple of example queries, and I'll show you mine later, but the key idea is being able to do windows of data over time. Sometimes I just care about an event as it comes in, but I might also want to know about the events that happened in the last five minutes, plus that one. Say I'm a credit card processor. I got a message five minutes ago that someone took out $1,000 with their card at this location, and now, within that same five minutes, I get the exact same one again. That may be a very big concern, and I don't want to wait an hour or some later time to figure it out. As soon as I see two seemingly duplicate transactions, something's up: maybe it's fraud, maybe it's some other problem, but let me raise an alert. What's cool with Flink SQL is I can do these kinds of selects and, when something comes up, do an insert into another topic. Someone's waiting on that topic, and when something arrives, maybe I send a message, maybe I close the card down, maybe that triggers a phone call. Some things you want to do right away. I'll put a rough sketch of that kind of windowed query just after this overview.
NiFi is the breakfast of streaming. It's the first thing I do in the morning: get data into the system, clean it up, send it on its way. It's as delicious as bacon and as necessary as a whole stack of pancakes. It's a really cool tool if you haven't seen it before: a streaming tool for that first and last part of your day. Get data into the stream, whether that's a Kafka topic or straight into a data store. Do some basic routing, encryption, maybe decryption; put things together, take them apart; change a schema, validate some data; put it into a Kafka topic and let people work with it right away, and I'll show you some cool things we can do with it. It's open source as well, and it supports a ton of different sources, a ton of transformations, and a ton of places it can deliver to.
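Before we go deeper on NiFi, here's that rough sketch of a continuous windowed query, written with PyFlink. It's only illustrative: it assumes the Kafka SQL connector jar is on the Flink classpath, and the topic names, fields, and broker address are all made up to mirror the duplicate-transaction example above.

```python
# Rough sketch of a continuous Flink SQL job submitted from PyFlink.
# Assumes the flink-sql-connector-kafka jar is available to Flink; topics,
# fields, and the broker address are illustrative placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: card transactions arriving on a Kafka topic, with event time.
t_env.execute_sql("""
    CREATE TABLE txns (
        card_id STRING,
        amount DOUBLE,
        txn_time TIMESTAMP(3),
        WATERMARK FOR txn_time AS txn_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )""")

# Sink: an alerts topic that some downstream app or person is watching.
t_env.execute_sql("""
    CREATE TABLE txn_alerts (
        card_id STRING,
        txn_count BIGINT,
        total_amount DOUBLE,
        window_start TIMESTAMP(3),
        window_end TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'txn-alerts',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )""")

# Continuous query: any card with more than one transaction inside a
# five-minute window is written to the alerts topic as the window closes.
t_env.execute_sql("""
    INSERT INTO txn_alerts
    SELECT card_id, COUNT(*), SUM(amount), window_start, window_end
    FROM TABLE(TUMBLE(TABLE txns, DESCRIPTOR(txn_time), INTERVAL '5' MINUTE))
    GROUP BY card_id, window_start, window_end
    HAVING COUNT(*) > 1""").wait()
```

The shape is the same as the demos later: create tables over Kafka topics, then run an insert-select that never stops.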
What's cool with NiFi is that it's all WYSIWYG, drag and drop, and extensible with Java and Python, but for the most part it's very low code. It's also very scalable and very solid: it supports things like back pressure, has built-in lineage and provenance, guarantees delivery, and has a huge ecosystem of upgrades and add-ons. It runs in Kubernetes, runs in Docker, runs on bare metal, runs on my laptop. And if you don't want to manage your own clusters, someone like Cloudera will do it for you. It has its own development lifecycle like any software: it can push things into version control, work with DevOps or command line tools, and expose a REST interface to move things between different environments, do blue-green deploys, all those fun things.
A major feature for me is the ability to know what's happening while you're doing it. This is provenance: it gives me a rich data lineage of everything going on in my system, so I know who touched my data, what it looked like, how big it was, all those sorts of things. And if I really need this information, I can push it into Kafka, into a database, into a file or some third-party system, or back into NiFi itself, and NiFi can treat this provenance and lineage data as first-class data and run its own flow on it. A really cool way to know what's going on in the system, and I'll show you some of that when we get in there.
We mentioned being able to extend it: extend any of the pieces you want with Java and with Python, and I'll show a tiny sketch of a Python one below. I've written a bunch of my own to do specialty things, like running some natural language processing (at some point I'm going to get a ChatGPT one in there), extracting text, converting things, pulling things out of images, grabbing your web camera feed, detecting what language you're using, sending stuff off to other services, what have you. And there's a ton of people adding their own all the time, so if you looked at NiFi a year or two ago, it has advanced. When I started using it, it was at version 0.6; it's at 1.20 now. Always new features and cool things. One of the cool things is the stateless engine: if you don't want a cluster that's always running, always processing data, you write a flow, which is your little app, package it up, press deploy, and run it as a Lambda; it runs when something happens and then goes away. If you don't want to run a server, don't run a server; run it however you want. Pretty awesome. And when data comes in, if it looks like a table, NiFi can handle it as records, and that includes things like XML, JSON, Avro, log files, and Parquet; you can switch them back and forth with no code, throw in some SQL, post things to Slack, a great way to move files or drive a chat.
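Here's that Python processor sketch I mentioned. It's a hedged illustration based on the FlowFileTransform interface in the nifiapi module that newer NiFi releases are introducing; the exact module paths and class details may differ in the version you're running, and the processor itself is made up for this example.

```python
# Hedged sketch of a NiFi Python processor using the nifiapi FlowFileTransform
# interface from newer NiFi releases; exact module and class names may vary by
# version. The processor simply adds an ingest timestamp to a JSON flow file.
import json
import time
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class AddTimestamp(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '0.0.1'
        description = 'Adds an ingest timestamp field to each JSON flow file.'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        record = json.loads(flowfile.getContentsAsBytes())
        record['ingest_ts'] = int(time.time() * 1000)   # epoch millis
        return FlowFileTransformResult(
            relationship='success',
            contents=json.dumps(record),
            attributes={'mime.type': 'application/json'})
```

You drop a script like this where NiFi can pick it up, and it shows up as a processor you can drag onto the canvas like any other.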
I can connect servers to other servers. Certainly I could do that with Kafka or HTTP or some other protocol, and we support a lot of protocols, but here it's built into the product with full security. I can communicate between clusters, so it's a great way to securely send messages, with an audit trail, between different servers, and that could be a small instance running on a Raspberry Pi or in a truck, going up to a local server, up to a server in an on-premises data center, then into the cloud, or between clouds. Easy way to do it. It has built-in load balancing that you can control at each step in your app, so each part of an app can run on different servers and you control how the data moves between them. Pretty awesome, and helpful for things like Kubernetes, where nodes might be pulled in and out of the cluster dynamically; it makes sure your data keeps moving. Lots of different ways to move data around. NiFi can also communicate with Pulsar, which is awesome: an easy way to get data into and out of your Pulsar cluster, same as we do with Kafka.
Once you're ready for production, having it hosted on Kubernetes is pretty important. Cloudera DataFlow is one option there, and it provides a ReadyFlow gallery of pre-built flows so you don't have to write them yourself: set a couple of parameters and deploy. You pull them out of your flow catalog; again, having a repository for your applications matters, just like source code, and that can be backed up to GitHub if you need to. As I mentioned, NiFi is just adding the ability to customize with Python. Right now it's a little extra work and not as clean as it could be, but these custom processors let me put a Python script somewhere and then drag and drop it onto the canvas; no fuss, no muss, very easy. And can NiFi be used for more than toy examples? Yeah: billions of events a second, scaling out to however many nodes you need; 150 of them will process trillions of events a day, not an issue, and these aren't giant boxes. If you want the specs, I'll give you the link to the article. Really cool stuff.
Let's look at some additional resources before we get to the demos. Like all tech, there's some tech debt that can creep up, and maybe for now you push it into a corner until you need to deal with it, but a few things will help you avoid some of it with streaming. Make sure you have version control for everything. With streaming it's sometimes hard to know where the assets are; in NiFi you can download your flow as JSON and push it into version control, or do that through the NiFi Registry or a data catalog. If you're going to production, it really makes sense to operationalize it with Kubernetes. Then there are the repositories, which are critical for NiFi and sit underneath the server: this is where the provenance, the flow files, and the data are stored on the file system. Keep that storage away from the nodes, so if a node has to be restarted you're not stuck waiting for it, since the data lives outside the cluster. Very important. Setting that up can be a little tricky, so sometimes it's easier to use a third party. Always use DevOps and the REST APIs: you can start and stop things, deploy them, back them up, change parameters, all those sorts of things, like the little sketch below.
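For example, here's a small sketch of driving NiFi from its REST API with the requests package. It assumes an unsecured local dev instance; a secured cluster would need a token and TLS, and the process-group id here is a placeholder you'd look up first.

```python
# Start (or stop) everything in a NiFi process group over the REST API.
# Assumes an unsecured local dev instance; the process-group id is a placeholder.
import requests

NIFI = 'http://localhost:8080/nifi-api'
PG_ID = 'replace-with-your-process-group-id'

# PUT the desired state for the whole process group; use 'STOPPED' to stop it.
resp = requests.put(
    f'{NIFI}/flow/process-groups/{PG_ID}',
    json={'id': PG_ID, 'state': 'RUNNING'})
resp.raise_for_status()
print('flow state:', resp.json().get('state'))
```

The same API covers deployments, parameter changes, and backups, which is how you wire NiFi into CI/CD instead of clicking around the UI.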
Use the latest JDK for Java: don't be running JDK 8, be on at least 11 or 17 if you can. Run a recent version of Python 3; definitely don't try to get away with Python 2, and even some of the older 3.x releases are just going to cause problems. Size your stream appropriately and use the parts of the stack that make sense; don't use NiFi as your messaging hub, put your messages in Kafka. Unit and integration test wherever possible; for things like Kafka and Flink, depending on how you call them, there's tooling like Testcontainers and JUnit. Backup, backup, backup: download your flows as JSON or whatever the file format is, get that into GitHub, get it onto Google Drive, get it onto a Zip disk, wherever you can put it, back it up every time, all the way. Keep at least three copies of data, three servers: one thing can fail when you have a backup, but having three puts you in a really good position, especially for the things that need it, and it's helpful for ZooKeeper as long as that's still around. There are a couple of resources here; you'll have these slides, so I won't dwell on them. Scan away, scan my cat. All your data flows are belong to us. Thank you.
Let's get into some code before we run out of time; time goes pretty quick. If I haven't shown you before, this is Apache NiFi. This is the local version; the cloud version only shows one flow at a time, so it's not too crowded, and when you're done developing you deploy with the open source local one: all the development and all the running happen in the same kind of environment, everything's there. Lots of things can be going on at once, which makes it busy but kind of exciting. Good for a developer, maybe not best for production.
This particular flow grabs air quality data. The air quality REST API is huge, but it only gives you back a certain number of rows at a time, so I've created a little flow cycle that keeps incrementing a counter for the API; every time it runs I get the next page, and there are a lot of pages if we run it for a while. I'll sketch that paging idea in code just after this part. When the data comes back, I split out the individual records based on what the data looks like, to make it easier to work with. If we go into the data provenance, we can take a look and see: there are some built-in things, like I said, the size, a unique ID, how long it's been sitting in this queue, a bunch of different attributes, and any attributes we've changed. We broke this one up into 250 pieces, so that's how many records we got in that batch, and we're going to split those out, because the ones I really care about are the air quality measurements per city. As you can see, there are a bunch of different sensors; every location has different ones, because some people run their own sensors and some are governments. Usually the PM2.5 and PM10 ones are the interesting ones; sometimes people have others, so there's a lot of variance depending on what they have. We can see all that data, and any of it can be grabbed as individual attributes, or I could grab that whole stream within NiFi and send it out as a reporting task somewhere, which is pretty cool, but we don't need to go into that now. So we have the data in there, I split it out, I take the couple of fields that I want, and I build a new record from all these split-out records, adding in a couple of extra parameters. Then I take all those different parameters, make a new JSON file, and add a timestamp.
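That page-counter pattern is easy to picture outside NiFi too. Here's a rough Python sketch of the same idea; the endpoint and parameter names are only illustrative of a paged REST API like OpenAQ's, not an exact reference to it.

```python
# Rough sketch of paging a measurements API the way the NiFi flow does with a
# counter attribute. Endpoint and parameter names are illustrative only.
import json
import requests

BASE = 'https://api.openaq.org/v2/measurements'   # illustrative endpoint

page = 1
while page <= 3:                                   # the real flow keeps incrementing forever
    resp = requests.get(BASE, params={'page': page, 'limit': 100})
    resp.raise_for_status()
    results = resp.json().get('results', [])
    if not results:                                # no more pages
        break
    for record in results:                         # one record per measurement, like the split step
        print(json.dumps(record)[:80])
    page += 1
```

In the actual demo NiFi handles this loop itself with a counter it carries between runs; this is just the same idea in script form.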
Then here I'm going to do some SQL on it. I take a select on all the records coming in, whatever type they are or whatever they look like, and I look at a couple of conditions: the value isn't null and the parameter is PM10; I can look at PM2.5 too. You can write as many SQL queries as you want. I filter it down here: all of them get sent to Kafka, but only as long as there's some data there and it's not tiny. Again, I don't have to do anything special: point to my broker or brokers, which I can make a parameter, and here I send it to the topic implied by the schema name that came in, so I can reuse this flow. And I just send it out. It didn't fail, and we can take a look at what changed: I sent out one message, pretty straightforward. Then we can check that the message went into Kafka. The topic name came from the schema name, which is openaq, so we'll go find it in the openaq topic. We'll go over here to the streams view; I'm also showing some ADS-B data, which I'll get to next. The openaq data is here, and you can see a couple of records showed up. We'll look at the latest one: there's the timestamp, there's that key I was talking about, and there's the data, which is just JSON. Pretty straightforward, not hard to do. And as you can see, we've got a ton of other data here, sorted by who's reading it, how big it is, all that sort of stuff.
So we looked at our air quality data, and I can start and stop things live, which is pretty cool. There's other data here: transit data for my state, New Jersey. I can get transit data from all over and send it where it needs to go. You can see I've got a number of different things going on. This data comes from a local government agency that provides updated events pretty frequently, in fact near real time, around the one-second mark, for all three states in the area. We run it once and get a bunch of data back; I convert it from XML, just an RSS feed, into JSON, and as you see here, 600 records came out in one run (I'll sketch that kind of conversion in code in a moment). Here I'm parsing out some of the data so I can see what the lat-long is, because then I can put it on a map, which is nice for analytics. I've got some data coming in here, but I've stopped it, because I don't like things running while I'm trying to look at slides, and I've put a delay on it because it sends data through so fast: between one and 5,000 records a second on my laptop. I've purposely slowed it down to a trickle because I want to show you the data as it comes in.
Let's take a look at this TRANSCOM data. I'll restart my job here. This is just Flink SQL, very simple; it posts a job out to Flink, and as you see, it shows up on the Flink dashboard and I can see the query running. It gives it a funny name, but what are you going to do? It's starting to run, so we should start getting data back, and as you can see, data is coming through. Something's going on at Yankee Stadium; maybe there's a game or some other event. Different data is coming through, and if I wanted to, I could look at this table and see which fields I might want: the title of the event, the date, the lat-long like we mentioned. Pretty straightforward.
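Just to make that XML-feed-to-JSON step concrete, here's roughly what it does in plain Python using only the standard library. The element names (item, title, pubDate, point) are illustrative, not the actual schema of the TRANSCOM feed.

```python
# Rough stand-in for the NiFi step that turns an RSS/XML feed into JSON
# records with a usable lat-long. Element names are illustrative only.
import json
import xml.etree.ElementTree as ET

feed = """<rss><channel>
  <item><title>Delays near Yankee Stadium</title>
        <pubDate>Sat, 10 Jun 2023 18:00:00 GMT</pubDate>
        <point>40.8296 -73.9262</point></item>
</channel></rss>"""

events = []
for item in ET.fromstring(feed).iter('item'):
    lat, lon = item.findtext('point').split()       # pull the coordinates apart
    events.append({
        'title': item.findtext('title'),
        'event_date': item.findtext('pubDate'),
        'latitude': float(lat),
        'longitude': float(lon),
    })

print(json.dumps(events, indent=2))                  # ready to publish to a topic
```

In NiFi this is a couple of processors rather than a script, but the transformation is the same: records out, coordinates parsed, ready to drop on a map or into Kafka.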
I've got some other data coming in here: weather data. I'm not sure which feed this one was pointing to; I have to reload these sometimes when they've been sitting around, and I don't want anything to be insecure. Here I'm grabbing all the weather data for the United States. It's all in a big zipped XML file, actually a lot of XML files, 2,500 or so. I break them up, convert them into JSON, add a timestamp and a unique ID, and send them over to our buddy here, into the Kafka topic for weather. Pretty straightforward, as you saw before.
Okay, this one's a little more interesting. Offscreen I have an ADS-B antenna that pulls from local air traffic, and that's what I'm getting here off my little device: these are the planes in the area where I live. I'm getting real-time data from each of these planes, coming off their transponders, saying things like who they are, what their altitude is, what their speed is. Of course these things are important: if you're in the air, you want to know there's a plane coming. Me, I'm just nosy; I want to see what's going on. So I'm consuming those messages off that device. They come back as one giant JSON string, which is a little too big to do much with; I don't think I want to run a SQL query against all those different messages in one big batch. So I bring them in, split them up, take just the fields I'm interested in, add a couple more plus a timestamp, and filter out the ones with no lat-long; they should be sending it, but sometimes they don't. Then I send that to Kafka, and I also send myself a note in Slack just so I can watch them coming in. So we've got those ADS-B records coming in, in fact a lot of them, maybe too many; we'll slow down that feed.
Then we'll go over here and we can see that table. This is the table I'm sending data into: it's not really a table, it's a Kafka topic that happens to have a schema. If we look here, this is how the table was built for me automatically, because the topic has a schema in a schema registry. If I look over here, here's my schema, in its first version, with all the fields, whether they're nullable, and their data types. You get the idea: because I have that, it sets things up so I get all these fields to work with. And I've got a query running here that's a little more complex than a select star: I'm doing maxes and mins and averages on a bunch of fields I care about, such as altitude and speed, things that are kind of important for planes, and I add a row count for each plane so I know how many records we're getting for it. The ICAO is the identifier for the plane, same with the call sign, like this one that's an American Airlines flight; you could track that down. I could also join this with a feed of data I got from an airline that gives me details about the plane: the type of plane, where it came from, all those sorts of things, right in the live query. If we look here, this one's running: it's doing a group by on the source, putting the calculations on there, and sending me records based on what's going on. Obviously this data also has lat-long, so again I could put it on my own map. And if we want to take a look at the data, let's go to our topics. Here's the raw ADS-B data, and this is my refined data that NiFi cleaned up for me. It looks like gibberish here, but it's not.
It's in Avro, and I can see the schema it's using, and now it's translated into something useful in real time. Now I can look at it: I've got a unique ID, altitude, all those fields that are important to me, and then I could join that with, say, the weather or the transit data to see what's going on in a very specific area and put it on a map. I can also send out alerts, like I said, based on what's going on with the data. I just sent some messages out here for those aircraft as they're going overhead, with some debug information for me, including the Kafka unique ID, so if I want to look it back up in the table or in Kafka, I can. Cool thing, and not too hard; that was pretty straightforward. Now, the one nice thing with this is that if I want to share this data with someone who isn't a Kafka person, I can create a materialized view off of it, pick whatever fields I want, and it just returns a REST JSON endpoint of that data, so I can import it into, say, a Jupyter notebook or something. Pretty straightforward, but that gives you an idea of what we can do with the power of open source streaming. I hope you enjoyed my talk. It's been great telling you about this cool open source. I will post these slides, you'll have this video, and I'll link to the source code so you can try it out yourself. Thanks a lot. Talk to you later.