Hi. Welcome to Learning the Basics of Apache NiFi for IoT. I'm Timothy Spann, principal DataFlow field engineer at Cloudera. Let's get into it. As I mentioned, I'm a DZone leader, a big data MVP, and I run a virtual meetup, so join me at our next event. What we're going to cover today is the basics of Apache NiFi. So first up, what is NiFi? We'll go through some of its general capabilities. I'm going to do a live tour of the environment, show you what we can do and what everything does, so you learn your way around. Then I'll show you a bunch of examples of Internet of Things flows: how we do data processing, a couple of options for getting data in from different edge devices, and then what we can finally do with that data, all from one easy-to-use open source tool.

Apache NiFi is great because it doesn't just work with one type of data. If I have structured data, that's fine. Semi-structured data like JSON or CSV works. Unstructured data works too, which can be anything down to raw binary: a zip file, an image, a video, any sort of binary data that maybe you don't need to process but you do need to move from one place to another. We can do that, and you can do it with almost any type of data you might have, across all different industries, or home data. I've done this with everything from military data to construction data. So today we'll cover a couple of the common data sources you get with IoT: things like MQTT; REST, web, and Internet sockets; and Apache Kafka, which is one of the messaging buses. I'll show you how we interact with files and different types of logs, and with output from Python, which could be running on devices. And when we're transferring files between different servers or devices, we often use something like FTP. I'll also show you some best practices for building these flows so you can solve any issues you might be having right now with edge data ingest.

As a reference, this is a typical layout for a simple NiFi-driven application, whether it's IoT or not. We've got data sources coming in. NiFi does the ingest and some transformation and validation. In most cases we'll push it to some storage layer such as Impala, or publish it for data syndication with Kafka so it can be used by other servers, other systems, other consumers of data. Very often that data will also feed into machine learning and data science applications, where we might want to do some model execution to get classification results, whether that's deep learning or just regular machine learning. And we've got a couple of other open source tools that help us do that. So today we're pulling in data sitting on routers and devices, we've got logs coming in, and NiFi is handling all of this and connecting us right into Kudu. I'm also going to show you a little bit of what Flink brings to the table. Apache Flink is another open source tool designed for ultra-fast event processing. What's nice for our use case today: I'm going to take device data, run it through NiFi to clean it up, get it into Kafka so I can share it, and then Flink is going to read it, but it's going to do it in an interesting way. You don't have to write any complex applications. You don't have to deploy things in a difficult manner. I'm just going to write a simple select star to grab that data, an event at a time, as it comes into the system. And you can imagine what you could do with the power of SQL.
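Just to make that concrete, here is a minimal sketch of what that kind of Flink SQL looks like; the table and column names are illustrative, not the exact ones from the demo.

```sql
-- A continuous query over a Kafka-backed table registered in Flink's catalog.
-- Table and column names here are illustrative.
SELECT * FROM scada;

-- The same idea with a filter; this is still a long-running streaming job,
-- not a one-shot batch query.
SELECT uuid, ipaddress, temperature, event_time
FROM scada
WHERE temperature > 80;
```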
And then what's data if no one can look at it? So I'm bringing it into a Kudu table. Kudu is a data storage engine that we have Apache Impala on top of, and that shows up just like any other database table. So we'll use the open source visualization tool Hue. That lets me run SQL and see the results, and I can also do some simple charts. But at a minimum this gives you an idea: I can get this data, it's coming in, and it's in a format people expect. If I need to build reports, extract data, or plug this into your reporting tool, I can do that from whatever open environment you have.

Now, I mentioned what NiFi can do, but where did it come from? What's some of the background here? This is right from Wikipedia, and I think they summed it up pretty well. NiFi has been around for almost 14 years at this point. It was originally a US government project. They needed to be able to grab data from all different systems wherever they may be: basically extracting it from systems, doing some kind of transformation, and loading it somewhere else. It was originally called NiagaraFiles, from the idea of a waterfall of data; they shortened it to NiFi, which is a little catchier and uses less space. Then in 2014 this technology, which was mature at that point, was open sourced and donated to the Apache Foundation, and it's been out there for people to use for a number of years. You'll get these slides and you'll see a number of resources I've included, including my blog and my Everything Apache NiFi GitHub, which has a lot of information that'll help you on your journey to learning Apache NiFi.

Now, one of the things that makes it really powerful, as you'll see when we go into the environment, is that it's an open source, web-based tool. It's all drag and drop. I don't have to write scripts, I don't have to compile, I don't have to deploy code. I'm dragging and dropping and connecting boxes, doing flow-based programming. But behind the scenes this is not just for simple use cases. It can scale to millions of events a second, it can ingest terabytes or petabytes of data, and I can make it dynamically controlled based on what I'm doing. So if more data comes in than maybe I can handle, I can set up back pressure so that it queues up while I'm waiting to process. I have the ability to adjust what kind of guarantees I have on the delivery of data. I can decide what the throughput levels are and how much latency I support. All of those things are adjustable right through the UI, and some of them can be preconfigured through configuration files or through administrative tools like Apache Ambari or Cloudera Manager. Everything allows for full security. We have SSL everywhere. HTTPS can be supported for ingest, for communications, for exports. Same with secure FTP, pretty much all the secure TCP/IP channels. And beyond that, something that was part of the original focus of the product is that we have full governance and data provenance. What this means is that for every piece of data that comes into the system, I have a full record of it, a full audit log. You don't have to hand-wire something up to do that. You don't have to come up with a strategy to figure out what's happening with your data. I know it every step of the way: what data I have, what size it is, when it arrived, how it changed, who changed it, what the value was before, and what it was after. This is great if you've got secure data or things that you really need to keep track of.
And then beyond that, you may say, well, I like NiFi, but it's missing one critical connector, one critical piece of data it doesn't work with. It doesn't have a connection pool for some particular item. There's some little thing missing. Fortunately, you can extend it. Everything, again, is open source. The API to add your own processors and controllers is very straightforward. We give developers a Maven archetype: you type one command at the prompt and it builds you a runnable example. You just have to put in your code to do something and whatever parameters you want, and then you drop it in and it shows up on the screen like one of the boxes you see here. So if you wanted to write your own type of routing, you could add that. That might let you integrate with one of your own internal systems that may be proprietary or may not have an open source component. Or if you want to connect to an existing security or monitoring system, those connectors are there and they're very extensible. If I need to create a new kind of task, it's very easy to do.

Now, one thing that's unique about NiFi, but will be very familiar to people who've used the web: when we bring in data, we have the content. That could be part of a file, it could be JSON, whatever it is; in the flow file we call that the content. Then there's another piece, which resembles the headers in HTTP. These we call attributes. They're key-value pairs where we put metadata, or any data that makes sense from the previous action. So if I'm loading a file, I want to know the name of the file, the size, the path I got it from, and when it was last changed. Any of that sort of metadata you have access to, we like to put there. If you write your own processor, you can add whatever values you want to these attributes. It's also a way to add enrichment around your data without changing the data itself, keeping the content unchanged. So having that header-like piece and the content live together but stay slightly separate is important. Again, they're all tracked with the provenance.

Now, overall, NiFi comes down to a pretty simple three-step process for most people. When you're looking at a problem, you usually say: I need to acquire some kind of data. NiFi has 100 or more different connectors out of the box to get and acquire that data, and there's also a ton in the open source. I've written some, because sometimes you come across something that maybe is only interesting to you, or it's a smaller community, or it doesn't belong in the standard part of NiFi, so you write a processor to do that. But out of the box you've got JDBC, you've got all the Hadoop components, all the cloud sources, all the TCP/IP sources, and all the major file formats; whether it's XML, HL7, or JSON, a ton of different things are supported. And when I bring it in, I can do those things I mentioned before. That could be routing, to determine what happens next, or maybe I throw the data away. It could be transformation: it's a zip file, unzip it; it's a tar file, split it into multiple files; if it's encrypted, decrypt it; if it isn't, encrypt it. Grab IP data and geo-enrich it to get the lat/long. Translate it from JSON to Avro. Make two copies of it. Turn two records into one record. Lots of different things you can do, and there's no limit to how many you can do.
I can take one source of data and send it to 50 different destinations. And that's the final part: you did some processing, now that data has changed, or I've changed the attribute metadata around it, so let me deliver it somewhere, or many somewheres. You'll see in the examples I'm sending it to Kafka, I send it to HDFS, I send it to an Impala table, I send it to a Slack channel. I could have sent it to a database, a file directory, email. Pretty much anywhere you need to send your data in an enterprise system or in some kind of research facility, you're able to do that. Under the covers we've got 300-plus of these processors, plus hundreds more in the open source you can just pull in yourself. One thing I want to add is that this is not a fragile system; it's designed to be clustered, to scale up to hundreds of nodes and millions of events a second, with guaranteed delivery. And even though it does look like just a GUI, there is version control under that. So if I need to make a flow to do something, I save my versions and then I can have a DevOps tool push that to another server, or make it available in the open source for people to use. I'll give you some of those flows in a link later; you can download some of mine that we're showing today and use them in your own environment.

Now, we touched on the provenance. This is the lineage that tells me everything that happened. And what's nice is that it's all indexed. It all has a unique ID for every file or event that came into the system. It's got a timestamp. It's got everything that changed. This becomes really important if someone tells you, yeah, I gave you a file with 10,000 records, and then when you deliver it to your final endpoint you only had 500, or none. What happened to that data? Did it really come in? What did it look like? You have all the data to know. And it's not just sitting in some table or directory somewhere; this is live data you can use as part of your flows. So you could check the number of records at some point and say, okay, I only got a hundred records, they told me a thousand when it started, it was 200 at this step; what happened? Let me send an alert, let me change the data, let me retry the data, all those sorts of things. Or if someone comes into the system later trying to fix something and asks, where is this particular record? I can look up the key, match that to a unique ID, and then, if it went to Kafka, I can match it to an offset and a topic and a partition. And all of this provenance data I can push to things like Apache Atlas or to your own tracking system, where it's stored in a table and people can search: where did this key come in? How did it change? How big was it when it came in? All of that data is available to you. This metadata can be very valuable if you have audits or if there are government regulations around your data, or if you just don't want to lose data and want to know what happened to it.

Another thing that's come in handy for me: sometimes you have different types of data. I have data that comes in every hour. That's a standard load; I expect it every hour, it's the standard data, we process it when we process it. But on that same channel I could also have an alert that says server seven is running out of disk space and has minutes before it's going to crash.
I don't want to wait for that big hourly load of data to be processed before I can send that message, and I don't want to have to run a separate NiFi cluster just to do alerts. Every step in here, and you'll see these queues when we go into the examples, has the ability to have priority. So I can prioritize those alerts: if an alert comes by, this attribute will be set, push it in front of these other messages. They happen every hour; they can wait.

Now, another thing that's come into play in the last couple of versions of NiFi: a lot of people have structured data. It's very common. Structured data can also mean semi-structured data that follows a standard schema, things like CSV and JSON. They might always look the same; I always have the same field names and always the same types. This is schemaed data I expect from a microservice, or it comes out of a database. It always looks the same and I'm going to put it in a table with the same format or a known format. So I can use these schemas, or NiFi can infer what that schema is for you, and it can do things like run a SQL statement against that data while it's in flight. This is nice. I don't have to learn regular expressions. I don't have to figure out some custom logic for translating data between types. How do I route this one? Okay, I have something that routes CSV; now they tell me the data is JSON or XML, and I've got to figure out a way to handle that. And at the end I always want the data to come out as Avro: where do I do that conversion step, and how do I make it easy? By having a standard schema, I can read and write the data no matter what type it is, and do those translations as part of the routing. So I can run SQL on the events as they come in, route based on that SQL, transform based on that SQL, all in a single step operating on thousands of records at a time. It's very fast, but it does all the functionality you need, and I'll show you a bunch of examples of that. This is a great paradigm if you know what your data looks like. It doesn't matter if it's coming from sensors; it works great with IoT, and I'll show you a couple of different sources of that.

Another feature NiFi has: some people tell me they don't want to have a cluster somewhere. I run NiFi when I need it. Maybe I want to consume a record from a file and push it to Kafka, or read from Kafka, manipulate and transform the data, and push it to another topic. It happens as a job, or I only want it to happen when an event occurs, or I want to schedule it. I don't want to set up a cluster for that. Maybe I'm running on Kubernetes, maybe in Docker containers, or maybe I just don't want to run a server. The stateless engine lets you do that now. It gives you the functionality of NiFi, but I don't have to run the UI and I don't run a server unless I'm running your flow. So I'm running one flow at a time: I can run it once and complete it, or I can have it continuously running until someone wants to kill it. That is very helpful for a lot of use cases, especially ones where you want something to happen as one transaction or one event. That's what the stateless engine is there for. So if you've seen NiFi in the past and you don't need a web UI, you don't want to use any disk for your actions, but you like the idea of NiFi's simple flows, stateless is the engine for you. It comes with the standard build of NiFi.
You just run the command line you see at the top to run the shell in stateless mode. You set up a configuration file that looks like this: it tells it which NiFi Registry to use (we mentioned that for version control), the login if there is one, what bucket you're using, which is something like dev or production, and the unique ID for the individual flow. It'll grab that flow, apply your parameters, apply your SSL and any Kerberos credentials you might have, and then just run it for you. Makes it really easy. This is something you could trigger with DevOps tools or trigger in a cloud environment: spin it up, run it, shut it down. That's very important if you have those event-at-a-time cases where you don't want to leave a cluster running.

Now, those parameters also come in handy if you are running a cluster. We can collect those same parameters into a group, share them, and keep them secure. It makes it very easy if I want to run that same flow but with different values: a different server name, different login, different Kafka topic, different table name. You know, parameters. This makes it very reusable, something you can save once, pull out of the registry, put in another cluster, or run in stateless mode.

Another nice one I'll show you today is retry flow file. This is a programmatic way for you to decide what happens if the system I'm interacting with doesn't respond correctly. Maybe it's offline, maybe it failed, maybe my network is slow; there are lots of possibilities. It lets you configure, programmatically, that I'm going to try three times, I'm going to put a wait in there, maybe 10 seconds between tries, and if I can't get it working in three tries, let me do something else. So in the case you see beneath here, I've got data coming in and I try to push it to Kafka. If the Kafka server is not up, or I can't get to it for some reason, I'll try a couple of times, and when that doesn't work, I push it to something very stable like HDFS or an object store.

Now, one thing you'll notice when I go into the demo is that on each little connection between the steps in a flow there's a little label that says queued, and it may have a number on it. This is where we do back pressure. These are little queues: while my processor is running, if it's running too slowly, the data will queue up here. This is configurable back pressure where you decide how many things you want to queue up before you tell the processor before you to stop running, and NiFi does this for you automatically. You can set a fixed size, or you can use our machine learning, with a couple of different mathematical equations or one you create, that decides, okay, I'm okay with this growing 14% based on average usage, and it'll dynamically adjust the size of these queues so we don't run out of space and trigger back pressure. Those are all options you can set. There are a lot of options in each individual step, and what's nice is that it's per step: if I have a thousand of them but only care about one, because that's the downstream one that goes to Oracle, I can configure that one and not worry about the other ones, though I can set defaults if need be.

Now, another thing that's important: if you've done Spark, if you've done Impala, Parquet files are extremely common. So we've added readers and writers for them, so I can do something like take a JSON file off a directory, do a SQL query on it, and write it instantly as Parquet. So you've got one step to take raw data and convert it into data that's friendly for Spark and Impala, fast and indexed, and you don't have to do much else. And because it's using these record paradigms, I can grab thousands of records at a time and send them out, so I don't have to worry about any kind of small-files issue. I've got links to a couple of articles that help you explore that.
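To make that record-plus-SQL idea concrete, here is a sketch of the kind of Calcite SQL you could put in a QueryRecord processor that has a JSON record reader and a Parquet record writer configured on it. The processor exposes the incoming records as a table named FLOWFILE; the field names below are illustrative, not from a real schema.

```sql
-- QueryRecord property; its name becomes the outbound relationship.
-- With a JSON record reader and a Parquet record writer, this one
-- statement reads the JSON, filters and reshapes it, and emits Parquet.
SELECT uuid,
       ipaddress,
       CAST(temp_f AS DOUBLE) AS temp_f,
       humidity
FROM FLOWFILE
WHERE temp_f IS NOT NULL
```

Routing works the same way: add a second property with a different name and a different WHERE clause, and the matching records go out that relationship.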
Some other features of NiFi that are nice: we've got a bunch of different reporting tasks, so if you have existing reporting tools, we can publish into them. We have a couple of different ways to do enrichment on these records, but I think the most critical, if you're doing any kind of ELT or ETL, is the lookup services. We had a couple before: some were based on CSV, some on XML files or Mongo or Redis. But we've added some real enterprise ones. We've got one for databases; this connects to any JDBC store, commonly a relational database. The Kudu one is a great one: Kudu is really fast, and so is HBase. These let me look at a field in a record and replace it with something. We also have one that does this with REST, so if you've got REST-based microservices you want to use to take a value and transform it, augment it, or just do a lookup (I get the name Tim in and I want to look up my client ID and put that in the record), I can do that in two steps, very fast, at scale, thousands of records at a time. That makes it great for loading data.

Now, we mentioned running all of this with stateless NiFi. That's a great way to do it. But sometimes we also need an actual cluster, and that needs to be stateful. We need to have three, five, seven nodes, a regular cluster size, in case a server goes down or gets slow or crashes. So we use ZooKeeper to keep that in sync. Every one of the nodes has a ZooKeeper client, and ZooKeeper will elect a cluster coordinator who decides how jobs are managed and who's in charge. Then you've got a primary node. We use this because in some use cases you need to run a job on only one server. That could be because a file exists and if five servers try to access it at the same time it's not going to make sense, or there's a lock on it. Or it's a server like Oracle and maybe I can't hit it with 25 concurrent connections and have a good experience. Or I only want one copy of the data and it really can't be broken up at that time: let one NiFi node read it and I'll distribute it after that. NiFi has built-in load balancing and distribution.

We mentioned version control; it's about as easy to connect as you can imagine. We talked about data enrichment and transformation; those lookup records make it really easy. It can be as simple as looking things up in a file, or, as we showed before, with Kudu or databases. Ingesting databases and putting them into multiple data stores is as simple as a handful of steps; I've included some articles there. Finally, I'll show you some example IoT flows. The one you're seeing right here is about as simple as it comes, but in two steps I read data from Kafka and put it in a permanent store. When you look at this, there's no mention of field names, no mention of anything other than a table name and a topic. We use a schema to know what that data is, transform it into the type it needs to be, and store it where it needs to go. That makes for a rapid development experience that you can push right to production. As part of this, NiFi does have some hurdles to get past, because it's so easy to use.
People will often not reuse things. They won't think of using parameters. They won't realize that what you do in NiFi can be very reusable. They'll make a lot of one-off things, they'll put a lot of hard coding in there, and then when someone else wants to use it, they'll have to write it from scratch. You're not getting your best experience with NiFi that way. Use parameters, make things reusable, and put them in chunks as different process groups so other people can use them in your system. If you need a custom processor to do something that's specific to your company, or if you already have Java business logic, wrap that in a custom processor so we can deploy it. Use the processors that are supported out there in the industry; Cloudera has written and tested a lot of these in the enterprise, and those are known to perform fast and do what you expect. And if you can use record processors, do that every time. If you know what your data is, it's a faster, easier, much better experience.

Let's look at a demo here. It's nice to talk about it, but let's see what NiFi can really do and where the power is. Now, this is perhaps the simplest example I could show you. I'm grabbing some transaction logs, and all I need to specify is the directory. I'm telling it to start at the beginning of the file. I'm going to split those up so one line becomes a flow file. As you see here, we already have 300,000; it builds up pretty quickly. This is that queue I was talking about. What's unique about NiFi is that I can start and stop things without breaking anything. I can configure this right here. I can name it; if this is something you might want to check later, you might want to name it. Remember that priority? I can change that right here: if I have an attribute, I can add it as the prioritizer. I can change how many objects, how big they are, and how I want to handle load balancing between nodes. I can round-robin it, partition it based on one of these attributes, or send it always to one node. I can also compress that data if it makes sense for me. What's also nice is that data sitting in the queue is not lost; it's waiting. As you see here, we can see a number of the attributes involved in this particular event: the name of the file, that unique ID we have for each one, how long it's been sitting in the queue, how big it is, what type it is, those sorts of things. I have access to the actual data, which is a line from a log. Then when I'm done, I can just push it to Kafka. I have a broker, I set my login information, what topic it's going to, whether I'm using transactions, whether I have a schema, what the name of this producer is, that sort of thing. Let's send 300,000 messages there. Pretty straightforward; this is one flow just to do that. As you see here, I stopped this, but this part is still running, and it keeps running afterward. We don't lose data because someone stopped something or something went wrong. Like here, it's warning me that I stopped it. Sure. What we can also see here is a summary of every piece of information about what's going on. You see I stopped this processor; that's how many records just came in; this is when it's running the task. I can see a graph of everything going on with the data. Those sorts of things are pretty cool. What's nice is that underneath the covers everything is a REST API, so if I want this data I can just open the developer console, see all those REST calls, and do it myself if I don't want to use the GUI. You can do the same with the command line tool that it comes with.
Very easy process. I've got another NiFi flow here that does some different things, with a couple of different options. What you do with NiFi really depends on what makes sense for your data and your use case. For this one, I'm getting data from an NVIDIA Xavier box, which is a pretty powerful edge device: it has GPUs, it has a lot of RAM, it does a lot for an edge device. And I can see what's happening with this box. I'm sending data from that device over HTTP, and I have all the information I need to know about it. Again, that unique ID. I've got a schema for this data. I've got the data here; it's a number of records. And I can decide here how I want to route this. I can look at any of those attributes, or I can look into the data itself. Here I'm just going to distribute it based on a couple of attributes.

If you'll notice here, these are images. Like we mentioned, we can deal with structured, unstructured, and semi-structured data in the same flow. This looks a little different from most of the data you're used to, because it's an image, and we'll just bring in the image. I can do whatever processing I want on images: I can send them to a deep learning library, and I have some built into NiFi for doing things like single shot detection. We have a lot of options for images. We mentioned those queries: I can do queries on data. Here I don't have a schema yet, so I'm just going to infer what it is. And when I'm done I still want it as JSON, so I'll just keep it as JSON. That's fine for what we're doing here, not a big deal. As you can see, I've got a lot of different flows within here, and I'm processing over a gigabyte of data as the data comes in.

Another common flow: I've got things coming off a Google Coral box, again doing some routing and checking whether it's valid data. And here, again, I've got some images coming in and some sensor readings. So here I'm going to do another query on those values. One of the values is a temperature, so I'm going to check whether that temperature is over 80 degrees. Hopefully it's not. If it's really hot, I'm going to do something with that data; that's our warm data here. We can take a look at some of the data in here, look at the actual data, and if we look for the temperature in Fahrenheit, we can see it's 96 degrees. That's pretty warm, and that's probably the reason it exceeded the threshold. We might want to send an alert, we might want to do other processing. Here I'm just going to send that to a Slack channel, and in Slack I formatted the message. It just tells me it sent a message; it just went to Slack. If I want to configure what goes in the message, I put in the attributes I want to display and some boilerplate text just to say, this is what happened in the system. And we can take a look at what channel it was sending to: coral. I can go take a look at coral, and I can see that our data got sent here. There's a unique ID, there's the label from the deep learning system, here's a start time, here's a score of 39%, just to give you an idea of what's going on in the system. Let's make sure we load a couple more of these here. So here I'm querying and getting another data source. This is thermal data. Again, I have the queue here queuing up as more data keeps coming in from the edge. And it's a lot of data, but it doesn't matter to NiFi.
I'm going to turn on routing, and now I'm sending a whole lot of data through the system, and that happened almost instantaneously. As you see, it's coming through the system, and I've got some queries here. This one just gives me all the data, but in that one step I'm taking CSVs and converting them into JSON; I don't have to do anything specific there, and now I've got JSON versions of the data. Here I'm taking JSON and writing it back out as cleaner JSON, and then I'm pushing it to our Kafka topic. And I can look into Kafka and see everything that's going on there. Let's pick an example. I can look at any of these topics, see which ones are getting data in the last six hours, and who's getting a lot of messages. These were the log messages we were looking at, and there are a lot of them. We can take a look at all of them; obviously that's waiting for someone who's interested in that data to consume it. I've got other logs, I've got gas sensors, tons of different data. We were just looking at that webcam one, so I can take a look at that data. This is showing me the results of some pictures I'm taking on a live web camera; we ran TensorFlow against them, and we can see what it thinks the results are. It's very easy for me to send that data.

But how do I write those flows on the edge? NiFi is one piece of it. The other piece is a smaller version of NiFi that we call MiNiFi. To deploy a flow to MiNiFi, I set up something similar to NiFi. Here I'm running a shell script; I can run a Python script, I can run native code, whatever it is. And then I'm also grabbing images; you saw those images coming into the system. So when I execute this process, it sends data to me, and I'm sending that to NiFi over HTTP, but it's also grabbing those images. That's a simple way for me to develop these applications. And when I change them, I can publish to any agent associated with this agent class, which is Raspberry Pi and Java, and it'll just send this flow to them. That's the code they'll run, and they'll run it, as we can take a look, every 60 seconds: it's going to run that capture of sensor data. Again, a very easy way for you to send those agents out there and get results from them, see what they're doing, any events they're generating, and when they got deployed. Very easy. I deployed one of them a couple of days ago, one of them today. So those flows are running.

So we saw the logs, we saw things go to a Slack channel. Over here, we're doing SQL queries against those events. And if I want to run additional SQL queries, I can just stop and add more. I can do some query, I can select a certain field; maybe I only want one field, like we saw temp F as a field, maybe I just want that field. Maybe I want to do a cast on that to convert the type; I can do that. Maybe I want to do a sum, maybe I want to do an aggregate, whatever I want to do with this data, very easy. Here I've had this one sitting for a little while from one of my other devices, at 21,000 events. I've got a couple of different types of data: one where I'm reading energy readings, one where I'm reading sensors, another where I'm getting images, and another where I'm running classification with the OpenVINO framework on that device. As you see here, I'm pushing through a number of records in real time. Over here I've got another query; this time the query is a parameter, and you can see those parameters. And we have a bunch of SQL queries. I can change them here. I'm looking at temperature, a pretty frequent thing for sensors, and I've got a couple of different SQL queries there. So if I want to share that SQL, very easy.
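Those variations might look roughly like this in the QueryRecord processor; temp_f is an assumed field name based on the temp F field mentioned above, and the statements again run against the FLOWFILE table.

```sql
-- Just the one field, cast to a numeric type
SELECT CAST(temp_f AS DOUBLE) AS temp_f
FROM FLOWFILE

-- Or an aggregate over all the records in the flow file
SELECT COUNT(*)                     AS readings,
       AVG(CAST(temp_f AS DOUBLE)) AS avg_temp_f,
       MAX(CAST(temp_f AS DOUBLE)) AS max_temp_f
FROM FLOWFILE
```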
Then I'm reading the data and just sending it to Kafka. Pretty common use case, but I'm sending it to a couple of different flows. If we look at all the different topics here, I've got a bunch of data, and now we should be getting data into this SCADA one, and then we'll start getting data into the energy one. I've got a lot of topics here, a lot of different IoT data. This one I just started sending data into. The data might look weird, and you're wondering, why does it look like that? It's Avro, so I need to use a schema to convert it into something that's viewable by me. I've got humidity, pressure, temperature, the IP address of that device, a unique ID for it, and a time. So I've got good time-series IoT data here that I can do what I want with.

Now, one thing I can do with this type of data is Flink, Flink SQL. In Flink, let me connect to this catalog and show you all the catalogs that have tables. I'm using the registry catalog; remember, I showed you those schemas in the registry. That is this guy here. I've got one for breakout, got weather, and this SCADA data we just saw. And from there I can show you that there are a bunch of tables; those match up to those schemas. So now I can figure out what's in there. Like, we were looking at the SCADA one; what does that one have? A bunch of fields. Want to take a look at them? Let's take a look. Let's select IP address, temperature, gas code, and low noise from SCADA. Just regular SQL. What's different about this SQL is that it's not really a regular query: this is a live distributed application that's just been built to do that query. It's grabbing the data and running on a cluster; it's been deployed on top of an Apache YARN cluster and does everything you need. I've just written a real-time streaming application that uses nothing but SQL. It's that simple. And if you look at the top of this display, it's pulled in 133 pages of data from the Kafka topic, and I can look at the details on that. Pretty simple. If I wanted to look at another table, I could do that; we have data coming in from breakout, so let's take a look at that one. As you see here, this job just got canceled; I've run a bunch of jobs.

What's interesting here is that this SQL supports things like INSERT. So if I wanted to take the results from breakout and join them with the results from SCADA, if I had some ID that would join them, I could do that join and do an insert into a third table. That table would be defined by a topic here; it could be whatever has the same number of fields. I have one in here that I call global sensor events, which is all the data together, and you can just define it how you want. These schemas are very easy to work with. That would just show up here and start populating into that other Kafka topic.
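That join-and-insert idea, sketched in Flink SQL: this assumes the two Kafka-backed tables share some ID to join on, and the column names are made up for illustration, not taken from the demo.

```sql
-- Hypothetical continuous job: join two Kafka-backed tables and write the
-- combined rows into a third Kafka-backed table (all names illustrative).
INSERT INTO global_sensor_events
SELECT s.uuid,
       s.ipaddress,
       s.temperature,
       b.humidity,
       s.event_time
FROM scada AS s
JOIN breakout AS b
  ON s.uuid = b.uuid;
```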
So we've got things in Kafka, and we're able to look at data as it's ingested with Flink. The last thing we want to show is storing this data. You have data coming in from a lot of different sources, routed in a lot of different ways, but how do I store it? Here I've got a lot of messages coming in from MQTT, as I'm reading from an MQTT broker on a very specific topic. As it's coming in, I'm going to write it to a Kudu table with an upsert, so if I get repeated data, I don't care. Remember that retry: if the store isn't available, let's retry again. I'm also going to take that data as JSON and validate it, make sure it matches the schema, and then push it to a Kafka topic. So we can look: is that breakout data in Kafka? We can find that data and go right there. A common way to do that is just to go here and search. Here's the Kafka topic. We've got an alert on there because no one's been reading the data. It's Avro; it has a schema. Boom, the data's coming in. Is it getting into a table? I can see it right here; it just came in. I'm grabbing some of the fields, and I've got over 100 records. I can do what I want with them and choose which fields to show. This is Hue. If I want to see different columns, I just add them to my query. The screen's a little small here, so I can't show you everything, but it shows you how easy it is to load data. And now that it's in Impala, I can connect this with Tableau or anything that uses JDBC to access a table. Very easy to write reports now that you've got it there. The same thing with that SCADA data we were looking at: I have that stored in a table too, so it's very easy for you to do queries, sort them, and do whatever you want to do.
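For the storage side, this is roughly the kind of Impala SQL you would run from Hue (or any JDBC client such as Tableau) against the Kudu-backed table the flow keeps upserting into; the table and column names here are illustrative.

```sql
-- Latest sensor rows from the Kudu table, as seen through Impala
SELECT uuid,
       ipaddress,
       temperature,
       humidity,
       event_time
FROM breakout_sensors
ORDER BY event_time DESC
LIMIT 100;
```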
Within Kafka, we have the ability to change that and join topics together if we wanted; very simple to do. And then we'll get to our running nodes here; you can see all the jobs that completed. Very easy to do. From the NiFi side, any of these simple flows do more than just pull from IoT. NiFi makes it very easy for you to acquire that data, validate it, transform it, enrich it, and then push it into enterprise systems, whether they're in the cloud or on-premises, whatever your data store is. It could be Mongo, it could be a relational data store, it could be an object store, any of these things. So if I wanted to send that data to another place, it's very easy to do. This makes for a very powerful tool for doing everything you need to do with IoT data.

Hopefully this showed you the basics of using Apache NiFi. I've included a number of articles that dive deeper into specific pieces and parts which might be important, things like: how do I write a query? What's supported in the SQL? Behind the scenes, this is Apache Calcite, which is a common SQL engine used in NiFi, Phoenix, and a number of other things. It's very easy for me to do IoT at scale. I'm running only a few devices in my house here; I think I have six behind me, so it's only six of them, maybe publishing a few records a second each, maybe only a few thousand images. It's very easy for you to start off at that level with a few million records and then move your way up to billions or trillions. As much data as you have, we can get it into one environment. And it doesn't matter if the data is binary, JSON, CSV, or XML, or if it's coming off a raw device and you need to convert it. I can do all of that here, with full provenance to know everything that's going on, and full tracking so I know how many records have come in, how big they are, what type they are, whether there are errors, what I want to do if there's an error, and how to handle it. Interacting with storage systems, interacting with distributed event processing, interacting with schemas that let me know what my data is supposed to look like so I can validate it: to say this one has to be an int, it has to have this field, or this field is optional because it could be null. All of that from one environment.

Being able to see that data as it's coming in, based on how many messages are coming through the system and what they look like, it's very easy to monitor this. I can see who's consuming my data. I have an application over here that's consuming the energy data; this is a Kafka Connect app that's just sitting on this topic, consuming that data and writing it to HDFS. It's very easy to share this data between systems. I hope you've learned a little bit about NiFi. To learn more, we've got an open community where we chat about NiFi and a number of other open source projects. I also run a virtual online event, Future of Data; you can join that one or any of the other ones we have around the world. This has been great. Hopefully you've learned something here, and thanks for coming to my talk about learning the basics of Apache NiFi for IoT.