My name is Andy LoPresto, I work for Hortonworks. I am a committer and Project Management Committee member of Apache NiFi and the sub-project Apache NiFi MiNiFi. So a little bit about me: I have worked in software engineering, mostly on security topics, for the last 13 years. About a year and a half ago I joined Hortonworks and started working on the NiFi project, and these slides are actually from a conference last week. But this also serves as an introduction to NiFi if anybody is not familiar with the platform. I'm going to very briefly go through what dataflow is, why there are challenges, what they are, and why we want to talk about them today. I'll give you a very, very brief overview of NiFi as well, and then we'll focus on IoT challenges, MiNiFi, and explore some of the features there. So what is dataflow? At a high level, dataflow is this: we produce some data at some location, at some point in time, and we want to perform analysis or other activities with it later. How do we get it from A to B? I've attended a lot of talks this week and I continually see a variation of this slide, either in the talks or in the vendor demos. We have all the cool IoT stuff on one side, we have the cloud on the other side, and then we just have an arrow between them, and somebody is responsible for that arrow. Maybe it ends up being an HTTP call, like a REST API call, or some kind of data push, and it's a blind push; maybe you're using Kafka. What I am here to talk about is the arrow between them: what we can do there, what the challenges are, and hopefully how we can make your life easier and get all the cool data out of these devices that you have spent time working on. Moving data effectively is very hard. XKCD has a comic for every problem in the world, and this is one of the key ones: there have been so many evolutions of this problem, so many different people and organizations have tried to solve it, that we have all these competing standards.
The next person is going to come along and say, I have a better idea, and now we just have another competing standard. One of the things NiFi tries to do is be very conservative in what we do but liberal in what we accept, and integrate with whatever is out there as best we can, so that you don't have to spend your time writing DSL transformations between two things that you don't really care about. You just want the data from one place to another. To get a little bit more into that: obviously we talked about standards, but that can be a lot of different things. We can talk about formats and protocols, we can talk about exactly-once delivery, the veracity and validity of information. How do you know if you're getting good information? How do you know if you're getting the correct information? Ensuring security on that information, overcoming security. And as a security person, I take huge offense at that completely true statement, that one of the biggest challenges is overcoming the security requirements that are put in place by somebody else, right? And it's usually somebody who doesn't actually know what your product does. They're just checking off checklists, they yell at you, and you have to go change something because they don't know what they're talking about, right? That's me, I'm that person. But I want to try and make that better for everybody. We've got compliance, we've got schemas. Sometimes the consumers change, sometimes the producers change. You have credential management. If you're in an organization, a lot of times it's just some other team or some other person who's causing all the problems. They don't know what they're doing. So we have to overcome that. The network: you might have a spotty network, limited bandwidth, links going up and down. And then, as I said, exactly-once delivery. How do we make sure that data is not going to be duplicated? So we'll talk about NiFi, and again, this is very brief.
I mean, I have entire other presentations just on this part. If you have any questions, feel free to talk to me; I'll stick around afterwards. NiFi is a robust platform for dataflow management and monitoring. What it does is try to connect whatever you have and get it wherever you want it, and do it in a reproducible way that you have very clear insight into, with the capability to immediately affect that dataflow. All these slides are online, so if you want high-quality versions, you're welcome to them. NiFi provides dataflow management using a number of capabilities. Data buffering, which, especially with IoT devices, is huge. It sometimes makes me smile when I just see X to Y and the big arrow between them, because what happens if all of your sensors start sending 1,000 times as much data as you expected? You have DDoSed your own service. We've got flow-specific quality of service, prioritized queuing to handle that. Data provenance is something I'll get into a little more in depth as we move forward, but believe me when I say that provenance is going to be one of the big features that NiFi brings to the table that a lot of other solutions don't have, and I think you'll find it very compelling once we get into it. It can push, pull, consume, produce, publish. There's a very fine-grained history of the data and of the operations that are happening to that data. You can build flow templates and export them. And it has visual command and control, which is something that's very uncommon in this space. There are a few terms that I'll be using quite a bit and I just want to give some context for them. I will say flow file: think of that as an atomic unit of data. It's not really a packet, but it's that piece of data that's moving around through your system. We have a flow file processor: that's a black box.
That's just something that is performing an operation on that data. We have a connection: a queue between two processors. We have a flow controller: that's like the system scheduler, the overarching process-management and resource-management thread that's controlling everything. And then process groups: just a logical abstraction that contains multiple processors and connections within the group. So NiFi is data agnostic. We don't care what the data is that we're moving around, but we understand that the users do. You don't just have binary data; you have some protocol or some specification or some messages that are coming through. And so you need to be able to handle that without having to transform it into some arbitrary intermediate format that everybody else has to agree on. You don't care about that. You have your data and you want to operate on your data. One of the best analogies for the construct of a flow file itself is HTTP data. When you make an HTTP request or response, there are attributes: there are headers, and those are usually key-value pairs. You have very strict definitions for what a header can be, the header name and the value, usually arbitrary text. But then you have content, binary content. It can be HTML text, it could be a video, it could be JSON. Same thing with our flow file. The flow file has attributes, and these are small, in-memory key-value pairs. And then we have binary content, and this is arbitrary. It could be anything, and it could be very small. It could be zero bytes, it could be gigabytes of data. It could be video, it could be hardware output from large industrial machines and ICS, things like that. So it can vary greatly in not only size, but internal format and structure as well. I mentioned earlier that we have visual command and control.
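To make the HTTP analogy concrete, here's a minimal sketch in plain Python of a flow file as attributes plus opaque content. The class and field names are mine for illustration, not NiFi's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical model of a flow file: small key/value attributes held
# in memory, plus arbitrary binary content of any size.
@dataclass
class FlowFile:
    attributes: dict = field(default_factory=dict)  # e.g. filename, mime.type
    content: bytes = b""                            # 0 bytes to gigabytes

ff = FlowFile(
    attributes={"filename": "reading.json", "mime.type": "application/json"},
    content=b'{"temperature": 21.4}',
)
# Processors can route on the small attribute map without ever touching
# the (possibly huge) content, mirroring routing on HTTP headers.
print(ff.attributes["mime.type"])  # application/json
```

The point of the split is the same as with HTTP headers versus body: decisions happen on cheap metadata, while the payload stays an opaque blob.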
So what we're trying to do: the project has been Apache open source for about two years now, but it's been in development for almost ten. It was developed in-house by the US National Security Agency, and was designed for large-scale dataflow monitoring and manipulation, especially on systems that were not necessarily robust in their connectivity and performance. What happens is, okay, you have some jumble of Perl scripts and file exfil from directories. That's great until something doesn't work. And then you have to call the one developer in the hemisphere who knows how that system works, and they're probably not available right now. So you have 24-hour operations centers where you have people staffing this stuff and the data has to come through. Data's happening, whether you can catch it or not. And if you don't catch it, you drop it, you lose it, you may lose it forever. So they needed something where at any time, anybody with the domain knowledge, not necessarily technical knowledge, could manage the system. And that leads to something like this. You can see, on what we call the canvas here, you have these processors, and you have connections between them. Over on my right, your left, are the navigate and operate palettes, and those allow you to perform the actions you're going to take. If you've ever used Photoshop, it's very similar to that. Along the top, you have the components that you can drag onto the flow. And in the far right, it's 2017, I guess we're still calling it the hamburger menu: that's where you can access some of the global controls, like access controls, users and policies, things like that. So I mentioned data provenance earlier, and data provenance really is a very detailed lineage of the data that comes into your system. If you look on the left here, we have the edge, right? The edge is everywhere that we don't have a data center.
It's all the devices, it's all the hardware: laptops, mobile phones, computers, whatever it is, anything that's producing this data. And we're sending it back over this data plane to our core, right, where we have our large clusters and big data systems. Everybody has things set up to operate on that data. Well, how do I know that I'm getting accurate data, that it's arriving at the right time, which hops it went through, and how it got to me? All of those pieces of metadata we call the provenance. And that provenance, you can see on the left here, in what we call a provenance chain, a lineage chain. It's small here, and we can get into it later, but it's not that important for MiNiFi right now. That is the history of every piece of data that's ever come through your system, and it allows you to go back, analyze it, look at what's happening at certain snapshots in time, look at what was happening when it was sent to foreign systems or back into your own system. And then realize: all right, we had this data, it was coming through, and the other team said they never got it. So we can go back and check in our provenance graph exactly when it was sent to that system and what protocol it was sent over. You know what, it was a two-phase commit, so you acknowledged that you received that data; it's not our fault. And they go, okay, fine, but we're all on the same team, right? You go, yeah, one click, you can replay that data, send it back to them. Everybody's fine. We have integration right now out of the box with over 180 different processors. So pretty much anything that's been on Apache's homepage in the last few months, you can connect to out of the box. Now we're going to talk about IoT challenges, because that's, I think, why most people are here today. Obviously, IoT is not just throwing a laptop or a clustered computer system out into the wild and saying, now it's a thing, so it's the Internet of Things. You have limited computational capabilities.
You have limited network and power. Obviously, these are sometimes small devices that are not physically accessible, or the scale renders it infeasible. As somebody said in the last presentation, you can't go around with a USB stick to 1.4 million cars, right? You can't update things that way. So you might have to do things where you're controlling entire classes of agents or endpoints. They're not frequently updated, especially if you're using the lowest-bidder manufacturer who's saying, okay, yeah, here's a chip, and, sorry, that one only has seven pins, I don't know, I wasn't counting. You may not be able to get the new firmware on there, or whatever it is you need to do. We talked about competing standards and protocols; that's old hat. Scalability obviously is an issue, and then privacy and security. How many people want their email credentials leaked because their connected refrigerator didn't get a firmware update, right? There are obviously more examples than I can list here, but you've heard of the Mirai botnet, the Dyn DDoS attack. When it has its own Wikipedia article, that means it's bad, right? That means it's no longer just affecting us; it's affecting real people in the world. I don't know, has anybody used Shodan before? Shodan is essentially a search engine for the Internet of Things. You type in a protocol or a device category, right? I want to look at internet-connected cameras. Well, it goes out and starts pinging... well, it doesn't start pinging them when you type it in, but it has essentially a graph and some indexes of all these devices that are out there connected to the network, happily accepting connections: yeah, MQTT connection status zero, you're connected, you didn't need a password or anything. You can look up ICS, industrial control systems, SCADA systems. If you don't want to sleep tonight, go check out Shodan. So I'm here to tell you that NiFi solves everything.
Small asterisk: for some values of everything. It runs on the JVM, has the UI that I showed, has security built in. We've got TLS 1.2, mutual authentication, client certificates for everything. Very, very fine-grained control of authentication and authorization with pluggable implementations. We've got encrypted data any way you want to encrypt it, and you can handle essentially almost any format and protocol. And if we don't have it already, it's a pull request away, or you can write your own. We also have processors that allow you to run arbitrary scripts. So if you have some weird protobuf format that we haven't seen before, but you already have an existing implementation in Groovy or Python or Lua, you just drop that into your ExecuteScript processor and you're up and running in 20 seconds. So NiFi does a lot of great work with IoT. This is a Raspberry Pi 3 Model B. NiFi out of the box does AMQP, MQTT, UDP, TCP, any acronym you want to throw at it. Especially for communicating with small devices, messaging systems, that kind of stuff: great, out of the box. If you do a little bit of surgery, you can get NiFi running on a Raspberry Pi. So you can start dropping stuff out there. An example I've heard of, theoretically, is a small little box that looks like a power adapter, plugs into the wall, has a radio on board, captures any Wi-Fi signal in the area, and does processing on it. Nobody ever notices it; it looks like a power box. That's theoretical, nobody's ever done that. But no, you can run NiFi on a Raspberry Pi, it's not that bad. Tim Spann, a colleague of mine, got a Raspberry Pi and a Sense HAT attachment: it has an 8 by 8 LED grid and some sensors on it, temperature, pressure, humidity. He set up a little NiFi flow, six processors, and he's got a Python Flask server running on the Pi. The Python script, I think, was maybe 30 lines. It's reading from all these sensors, wrapping the readings in JSON, and pushing them up to an endpoint.
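The Pi-side publisher described here might look roughly like this. This is a hedged sketch, not Tim's actual script: the Sense HAT reads are stubbed with fixed values so it runs anywhere, the device id is made up, and the MQTT publish is shown only as a comment:

```python
import json
import time

def read_sensors():
    # Stand-in for the Sense HAT reads (on real hardware this would call
    # sense_hat.SenseHat().get_temperature() and friends).
    return {"temperature_c": 21.4, "pressure_mb": 1013.2, "humidity_pct": 40.1}

def build_payload(device_id):
    # Wrap the readings in JSON with an id and timestamp, the same general
    # shape a small Flask/MQTT publisher on the Pi might push upstream.
    reading = read_sensors()
    reading.update({"device": device_id, "ts": time.time()})
    return json.dumps(reading)

payload = build_payload("pi-001")  # "pi-001" is a hypothetical device id
# On the device this string would go out to an MQTT topic, e.g. with
# paho-mqtt: client.publish("sensors/pi-001", payload)
print(payload)
```

A NiFi ConsumeMQTT processor on the other end would then pick up exactly this kind of small JSON message.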
NiFi is consuming that over MQTT and doing some processing. It could write it to HDFS, it could send it to Splunk, send it to Storm, Kafka, whatever you want to do, analyze it in Spark. That's great. Now, that's data exfil: we've gotten data off of the device, and we can do whatever we want with it. But there's no command and control there. There's no way for us to bidirectionally interface with that device and say, hey, here's some response, here are instructions. So what we could do is add a listener in the Flask server and say, okay, now I'm going to host an API as well, right? So now I have an endpoint running on the Pi. And NiFi can do all of its big data computing and combine this data with that of 10 million other Raspberry Pis and the Sense HATs that are attached to those, and get some analysis and outcomes from all of that. And then say, okay, I actually want to modify the flow. I need you to sample more frequently, or I need you to sample less frequently, or I need you to prioritize this data because it's more interesting. And start sending instructions back to the Raspberry Pi saying, okay, update yourself. The problem is, if you want to do that for millions of different sensors, you have to write that code millions of different times, right? You have to update it for each specific sensor. You're going to have to update it when there's a new version of the firmware for that sensor, or a new version of the library that does the JSON. There are all these different little edge cases that make it impossible to scale when one person needs to update all of that. So why do we want a different solution? Why do we want something in addition to NiFi? NiFi is designed to own the box. I mean, it is the primary resident of the computer that it runs on. The latest release is 760 megabytes, okay? There are efforts underway to prune that and build what we call extension registries, so that you're not downloading the entire package every time.
But it's still going to be big; it's a JVM process. The older version would start up on a Pi in about 10 to 15 minutes. This one takes about 30. So you can see my screenshot of trying to SSH into my Pi, where it just doesn't respond. Now, the size difference is understandable, because there have been a ton of new features added, there have been new processors added. The software is going to grow in capability, but that doesn't necessarily mean it's a good fit for IoT. So now we're going to talk about Apache MiNiFi. MiNiFi is a sub-project of NiFi. It is designed to get out to the edge where NiFi can't go, okay? An analogy from my old days: this is the tip of the spear. This is what's out there, where everything is actually happening. IoT, connected cars, legacy hardware that can't support full-fledged NiFi: this is where MiNiFi is designed to thrive. And it is designed to be a good neighbor, right? It's a guest on the system. It's a small agent that's running. It's not taking all the resources, it's not prioritizing itself. NiFi is a clustered system: you throw hardware at it and it will take advantage of it. We can get 100 gigabits per second of processing and throughput on clusters. But that's not what you want on your Pi. That's not what you want on some sensor that's living out in the world. So I think I just covered most of this, but yeah, you have to manipulate NiFi, you have to do surgery, if you want to get it out there on the edge. With MiNiFi you don't. The Java agent is 45 megabytes, a substantial difference from 760. But the C++ binary is 746 kilobytes. That's roughly a 1,000-times reduction in size. So how do MiNiFi and NiFi interact? Remember I talked about data provenance earlier? With NiFi, data provenance starts as soon as you get the bytes into your system: as soon as you ingest from whatever it is, reading a log file, receiving HTTP data on an endpoint, UDP, TCP, whatever it is.
As soon as you get those bytes, you start tracking the provenance. You start tracking the history of that data, what's happened to it, where it's going. But you can only start once you receive the data. So if you're operating NiFi in a cluster in a data center somewhere, that's when your provenance starts. That's helpful, but it's not the whole picture, right? You have no idea what sessions I attended before I walked into this room; I have no idea what context you bring to the conversation here. With MiNiFi out on the edge, we start capturing that data as soon as the data is created, as soon as it starts to exist. We can start capturing the provenance data. And so we realize: in this connected car, this data was generated literally milliseconds ago. We can now watch it all the way through the car, all the way through our satellite link or Wi-Fi link, whatever it is, to get back to NiFi. And then every operation that happens within NiFi, whether it's routed between nodes in a cluster, whether it's sent off to Spark for machine learning processing, whether it comes back through a Kafka queue, whatever it is, we have all of that history now. And we can even perform metadata analysis on that provenance history. We had an intern last summer who did machine learning and deep learning analysis on the provenance data: you could start doing anomaly detection in flows, you could start doing self-healing flows. Like, you start building a flow, you have Clippy pop up, and he says, it looks like you're trying to do this, and you might have dragged the wrong processor in here; can I recommend this one? Yeah, absolutely. I have an example of that a little bit later on, but that's a great question. So here you can see, this is a very simple graph, but it shows the trade-off between edge and core. As we get closer to the edge, MiNiFi is more correctly suited for the use case, and as you get closer to the middle, NiFi is more correctly suited.
So we've been talking about this as just a straight line, right? Edge over here and core over here. But that's not really how the world works. We have stuff all over the place; we have dimensionality to this. So look at these and you go, okay, in each of these scenarios I might have one or multiple NiFi clusters. I might have tens or thousands or millions of devices that are connected to MiNiFi. I might have regional centers where I have maybe one NiFi instance that's just aggregating data, batching it, and sending it on to further NiFi instances for analysis. So it's not as simple as I've maybe been leading you to believe with these straight-line, one-thing-here, one-thing-there graphs earlier. So, the two flavors of MiNiFi, as we call them: one is in Java, one is in C++. The Java version is essentially a very stripped-down version of NiFi. There's no user interface. It uses a very simple YAML configuration file, and it has a reduced processor count. I mentioned we have 188 processors in the newest version of NiFi; MiNiFi Java supports 63 of them out of the box. We've taken out things like Azure and HDFS writing, things that you're probably not doing from a MiNiFi device. But HTTP, TCP, UDP processing, syslog reading, encryption, hashing: all of that is still available out of the box, basically on any device that can support the JVM. So that's good to go. You can add additional functionality, as long as it's JVM compatible, by just dropping in what we call a NAR, a NiFi archive: it's just a collection of code and resources you can drop right back in, and there it goes. The MiNiFi version, I'm sorry, MiNiFi in C++, is written from scratch, very, very small. It is very early stage, with a very limited set of processors right now. But it has enough to perform the essential site-to-site communication, which is what we call our NiFi-specific protocol for replicating or offloading data from one instance of NiFi or MiNiFi to another.
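To give a feel for that simple YAML configuration file, here is a rough sketch of the shape of a MiNiFi Java config.yml. Treat every name and path here as illustrative: exact key names vary by MiNiFi version, and the URL and file paths are made up:

```yaml
# Rough sketch of a MiNiFi Java config.yml (key names vary by version):
# the flow definition and the agent configuration live in one file.
Flow Controller:
  name: tail-and-ship

Processors:
  - name: TailAppLog
    class: org.apache.nifi.processors.standard.TailFile
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 5 sec
    Properties:
      File to Tail: /var/log/device/app.log   # hypothetical path

Connections:
  - name: TailAppLog/success
    source name: TailAppLog
    destination name: ToCore                  # input port on the remote NiFi

Remote Processing Groups:
  - name: core-nifi
    url: https://nifi.example.com:8443/nifi   # hypothetical core instance
    Input Ports:
      - name: ToCore
```

The whole agent is driven by a file like this: no UI, just processors, connections, and a remote group to ship data to over site-to-site.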
So this just summarizes what I was talking about: we've stripped out the UI, we've reduced the scope of the NiFi framework. This is architecture stuff; it's not super important. So what does MiNiFi provide? What can you do at the edge? Data tagging and provenance, governance from the edge. I don't know if you do any work with China, for example, but geolocation information is not allowed to leave Chinese computers. You can't say, well, I'll encrypt it and send it to my US computer, and I'm never going to look at it. No, doesn't matter: legally, you can't export that data outside of Chinese control, for lack of a better word. So you can do this from the edge with MiNiFi. You can say, okay, I know where I am; I will extract, I will filter out, any of this sensitive data before I send it back to my processing instances. You can do security: certificate-based authentication, encryption. I think we have something like 25 different encryption algorithms supported out of the box, and six different KDFs. You can do low-latency operations. In this example, this is a connected-car chipset that we worked on with Qualcomm. On that chipset, we're doing low-latency decision making where, okay, we're sampling brake temperature, brake pressure, speed, geolocation, all of these kinds of readings. Well, I don't need to sample what the radio station was 100,000 times a second; that's not going to give me relevant or useful information. But I might want to do that with brake temperature, or maybe 100 times a second. Or maybe the car's been stopped for three seconds, so maybe I don't need to do it 100 times a second. You can start making those decisions closer to the edge, because exfilling data from a car is expensive, right? You're either in a city location where you have access to some Wi-Fi hotspots, or you're out and you're using an LTE radio that's embedded in the car. And that bandwidth is really expensive.
So if we can limit the amount of data that we're exfilling, especially the stuff that's not relevant, we can save money, which is usually what drives all of the decision-making processes. This is just an example of the architecture of MiNiFi on the car and how it can connect to the CAN bus, how it can connect to Ethernet, whatever the local in-car network is, or some protocol that we don't even know about yet, right? We work with the manufacturer, we work with an OEM or a third party who's providing that, and we can integrate with it. We can start performing these decisions and analysis on the car itself. And this is just a screenshot of the flow, and somebody driving through San Jose with the LTE and the Wi-Fi boxes showing the signal that's coming out and the data that's being transferred over each one. So, MiNiFi exfil: getting the data from the edge to some central processing system, right? That's where we're going to do our large-scale analysis, our big data analysis, our clustering, whatever. We have this thing called site-to-site; that's a NiFi-specific protocol. It has two implementations: one is raw-socket based, and that's the original implementation. One is an HTTP/HTTPS implementation, because we found a lot of users didn't want to have to ask their IT department to open up another firewall port to try and get data out. IT sees encrypted data coming out on some unusual port, and they usually shut that down. So we provided HTTPS. Now, that's not in MiNiFi C++ right now, but it eventually will be; it is in MiNiFi Java. If you don't want to use site-to-site, if you just want to send this data, let's say, to a product that competes with NiFi: you already have something that is doing dataflow and data management and putting it into Kafka or some other system. We can also just send HTTP or HTTPS, or FTP, SFTP, JMS. You can even run any arbitrary process on the shell from MiNiFi, if you want. Now, we have access controls put in place.
We have severe warnings that, hey, you're doing something restricted, and you probably shouldn't just let anybody run an arbitrary process. But you can do it; it's technically supported. We want to provide as many options as we can for you to get the data out of the system in the way that fits whatever you're trying to do. This is a feature proposal for flow persistence and versioning. It's not a giant issue for MiNiFi specifically right now, but in NiFi, and somewhat in MiNiFi, versioning flows and going from development to testing to production, promoting flows through that life cycle, is something that we've been hearing a lot of requests for. So there's an effort underway. MiNiFi Java does have what we call a command and control API. There are these three things we call ingestors. You have your MiNiFi flow living locally on that edge node. From NiFi, from the nice user interface where we get to design everything and see how things work, we want to build a new version of that flow. We can do that and then deliver it by writing it to a local file on the MiNiFi device and having MiNiFi read from that file, or by sending it to a REST API, or by having MiNiFi reach out at certain intervals and pull from another API, right? So there are three ways currently to get that new flow definition down to the MiNiFi instance and have it reload and start running the new flow. So let's explore a little. We're going to walk through a scenario where essentially we have some arbitrary IoT device that's generating log messages. We want to encrypt the data on the device, so everything that ever leaves that device is encrypted, okay? Now, obviously, the normal caveat is that we're doing this through MiNiFi. So if your device has some other embedded server that it's running to allow command and control, you'd have to turn that off. We're not taking responsibility for every bit of data that the device says it can handle on its own.
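The "pull" flavor of those ingestors, where the agent polls an API at intervals and swaps in a new flow only when it changed, can be sketched in plain Python. This is the idea only, with names of my own invention and the HTTP fetch injected as a function so the sketch runs without a network; MiNiFi Java's real ingestor implementation differs:

```python
import hashlib

def check_for_new_flow(fetch, current_digest):
    """Sketch of the pull-ingestor idea: fetch the candidate flow
    definition and reload only when its digest changed. `fetch` stands
    in for an HTTP GET against the configured flow-definition URL."""
    config = fetch()                        # bytes of the candidate config.yml
    digest = hashlib.sha256(config).hexdigest()
    if digest == current_digest:
        return None, current_digest         # unchanged: keep running as-is
    return config, digest                   # changed: caller swaps in new flow

# Simulated server response standing in for the REST endpoint.
new_config, digest = check_for_new_flow(lambda: b"Processors: []\n", None)
print(new_config is not None)  # True: first poll always differs
```

A real agent would run this on a timer; the digest comparison is what keeps a flapping network from constantly restarting the flow.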
We're saying that, through MiNiFi, we will make sure that everything we pull off of that device is encrypted. We also need to prioritize some of that data, because maybe we have an unreliable network connection, right? Maybe in this part of the world the satellite goes down twice a day. Maybe we have a spotty cell (LTE) connection. Maybe somebody comes by and clips the wire once a year. We don't know, but we know that we want to prioritize certain data because it's more valuable to us. We're going to transmit that data to a central node, then decrypt the data and analyze it, and then, in the future, make determinations from that and modify the live flow. NiFi also serves as a great test harness, especially when you're doing IoT work. It can be, not necessarily expensive in terms of power, but expensive in terms of your time, to design a flow and offload it to some edge device. If it's something that has to reboot or have a firmware update or whatever it is, that can make the feedback loop very slow. So NiFi can serve as a great test harness for you too. It can replicate and simulate all these different environments: you generate the data that you expect to consume directly within that NiFi instance, run it through your actual flow, consume it in whatever way you would expect to, learn lessons from that, and then update your test flow. So it's very good for shortening that feedback cycle. So, in this simple flow on the right, we're just generating a new message every three seconds, and the message has a timestamp in it. Then we are appending it to a log file, right? We're just simulating some device that's continually writing out to its own log file. On the inset here, that's just a little script I wrote in Groovy that's doing the log appending. And you can see, where it says log file path, that's a dynamic property, so you can set that from the processor itself.
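The Groovy appender on the slide isn't reproduced here, but the same simulator can be sketched in a few lines of Python. The message text matches the demo's output; the function parameter plays the role of the dynamic log-file-path property:

```python
import datetime

def append_log_message(log_file_path):
    # Mirrors the demo's generator: one timestamped line per invocation
    # (the flow schedules this every three seconds). log_file_path plays
    # the role of the processor's dynamic "log file path" property, so
    # the destination can be changed without touching the code.
    ts = datetime.datetime.now().strftime("%H:%M:%S.%f")[:-3]
    line = f"This is a message at {ts}\n"
    with open(log_file_path, "a") as f:   # append, like the device's own log
        f.write(line)
    return line
```

Scheduled on a timer, this produces exactly the kind of growing log file the MiNiFi flow will tail.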
So six weeks from now, I'm not here, but somebody is running this flow and they want to change where it gets written to. They don't have to go into the code and change the code; they can just change a property that gets read in by the processor, through the UI. So now we're going to build the MiNiFi flow, the flow that we're actually going to promote to our device and send out into the world. We want it to tail a log file and log the raw content. Now, depending on what our read window is, this could be 100 times a second or it could be every five minutes. So whatever it reads in may not be just one log message; it might be multiple messages, it might be nothing. So we're going to log the raw contents, split that into individual lines, and filter it using, in this case, just the parity of the timestamp. But obviously you can do that on any number of pieces of metadata or the actual data contents that you read in. We're going to prioritize it, encrypt it using AES-GCM, and then exfil it to our remote NiFi instance. So you can see the flow here, and I guess I don't have a pointer. Here, where it says filter: what we're doing is examining the content, and if the last digit of the milliseconds is even, we're going to send it one way; if it's odd, we're going to send it the other way. We're just tagging it with even or odd. In this contrived example, odd has higher priority than even, right? So about 50% of our data will be tagged as higher priority than the other 50%. We're going to export that template from NiFi, where we've designed it in the UI, to MiNiFi in order to run it on the edge. All we do is save it as a template in NiFi, which is a two-click user-interface process, and we get an XML file that we download from our browser. Then we run a command-line tool, which converts the flow in XML format into a config.yml file.
Now config.yml for MiNiFi is a combination of both the flow definition and the MiNiFi configuration values. In NiFi, as it was on an earlier slide, those are two separate things: there's the nifi.properties file and then there's a flow.xml.gz. Because we're working with a smaller footprint here, we're combining them into a sparser single file that contains both of those definitions. You run that command, and MiNiFi is ready to go. Now, I put a little asterisk there because in this example, I'm a security guy, so I always set up the security stuff; there's still an additional step to set up TLS and the certificates. But if you were doing this in your hobby shop and you just wanna get something out there, you are ready after those two steps. Setting up crypto is actually not as bad and painful as it usually is. We have a toolkit which makes it a one-line invocation to set up your own CA, sign and issue certificates, generate a client certificate that you can import into your browser, and have all that set up so that you have mutual authentication over TLS; you're immediately getting a TLS 1.2 connection, provided your hardware supports it. We're just trying to make that process easier for somebody who doesn't have a dedicated security team to be generating certificates, issuing certificates, revoking certificates, etc. Now, if we have TLS, why do we need to encrypt the data on the device, right? Because that's what TLS is gonna do: it's gonna do the authentication, it's gonna encrypt the data in the channel, we're fine. Well, as soon as it hits the other TLS endpoint, all of that data is decrypted. So it's back in plain text. What if our NiFi instance is just a proxy or a router to some follow-on systems? And follow-on system A doesn't really want us to be able to see all of the input that is coming through. Or some of it's financial data, so it's subject to PCI compliance. Or some of it is personal information or health care information, so it falls under PII or HIPAA requirements.
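To make the "one file, two roles" point concrete, here is a rough sketch of the shape of a MiNiFi config.yml. The key names and structure vary by MiNiFi version, and the paths and processor below are made up for illustration; in practice you let the converter tool generate this file from the template rather than writing it by hand.

```yaml
# Rough sketch only -- exact keys depend on the MiNiFi version.
# Top: the flow definition (what flow.xml.gz holds in NiFi).
MiNiFi Config Version: 3
Flow Controller:
  name: tail-and-ship
Processors:
  - name: TailLogFile
    class: org.apache.nifi.processors.standard.TailFile
    scheduling period: 5 sec
    Properties:
      File to Tail: /var/log/device.log     # hypothetical path
# Bottom: instance configuration (what nifi.properties holds in NiFi).
Security Properties:
  keystore: /opt/minifi/conf/keystore.jks   # hypothetical path
  truststore: /opt/minifi/conf/truststore.jks
```

The point is simply that flow and configuration, which are separate artifacts in NiFi, collapse into one sparse file for the small-footprint agent.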
NiFi can say, fine, you encrypt it out on the edge. We'll get it back, we'll end the TLS session, so we'll decrypt all the session data. But that means we can see the attributes, we can see the metadata around this flow file that we need in order to route it and operate on it, but the content is still encrypted. We have no visibility into that, right? So this is our third flow. This is our process-data-in-NiFi flow. We're receiving data from a MiNiFi instance. We're gonna log exactly the bytes that come into us after they exit the TLS session. Then we're gonna actually decrypt the content of those messages, and we'll log it again just to see if it worked. So does it work? Well, this is the data provenance view. I've done a provenance query and said I wanna see some of the records that have come through in the last few minutes, and I open up one of these provenance events. And it's the EncryptContent processor, which is in decrypt mode. So you can see on the left I have an input claim, and on the right I have an output claim. The input claim is the data as it appeared coming into this processor, and the output claim is the data as it left. Well, if I look at the top, that's the incoming data. I don't read Mandarin, but I don't think that's valid Mandarin. It's encrypted data rendered as UTF-8. I can look at that in hex and verify that, because, yeah, I see NiFiIV, which is the IV indicator and delimiter. And then I run through, I look at the output, and I see "This is a message at" a timestamp. So that's great. I've successfully decrypted that data using AES-GCM, running on whatever output device it was. So now let's verify the prioritization. Did it actually prioritize the data that was coming in? Well, I actually had to increase the write frequency in order to verify this, but you can see at the very top, you see priority value one. And then you see "This is a message at", so 22:27:30.017, right? So at 10:27 and 30 seconds, that message was generated.
Then you look at the bottom, you see priority two, but that message was generated at 10:27 and 29 seconds. So yes, even though that message was older, and in a first-in, first-out world would have appeared first, NiFi looked at the timestamps. It tagged it correctly. It said, no, this one's higher priority than that one. And then when it sent that data, it prioritized the higher priority. So what would our next steps be? Well, we could start doing something like a window aggregator and say, okay, if more than 60% of the data that I've seen in the last whatever arbitrary time window is even, right, I want to switch the prioritization. I want to start bringing in other data. So let's say I was doing temperature sensors in a connected conference building, right? I want to say, okay, I'm sampling this room every second, and the temperature hasn't changed in six hours. Maybe nobody's in there. Maybe I don't have to care so much about that temperature sensor. So I can dynamically drop the sampling frequency down. But I see this other one, and it's going up by like five degrees every time I sample it. Well, maybe I should get more visibility into that one, because people are probably in that room, or they're in and out of that room, and there's something going on. So you can start to balance where you're getting your data and how much influence you're allowing that data to have on your system by doing the analysis and then using command and control to react to it. We could encrypt at the edge with different keys for different tags and then send those on to different following systems. We could tell it to cache low-priority values and send those in batches if we had a spotty network connection, for example. We could tell it to perform rollover and prune the log so that you're not filling up your IoT device with data from six hours ago that you don't care about anymore.
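The ordering behavior verified above can be illustrated with a toy priority queue. This is not NiFi code, and the timestamps are made up; it just shows why the priority-1 message generated at :30 is delivered before the priority-2 message generated at :29, even though plain FIFO would deliver them the other way around.

```python
import heapq

# Toy model of prioritized delivery: lower priority number wins,
# regardless of arrival or generation order.
def deliver_in_priority_order(messages):
    """messages: list of (priority, generated_at, body) tuples."""
    heap = list(messages)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

queue = [
    (2, "22:27:29", "even-tagged message"),  # older, lower priority
    (1, "22:27:30", "odd-tagged message"),   # newer, higher priority
]
ordered = deliver_in_priority_order(queue)
# The newer but higher-priority message comes out of the queue first.
```

The tuple comparison on the priority field is what a NiFi connection's prioritizer does conceptually; the sketch just makes that visible outside the UI.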
And then we could also expel the MiNiFi provenance, because it's capturing that provenance data as well, and send it back to NiFi and perform analysis there. We have two quick examples from the community, and then questions. First is Roger Coacteco, and he had a Raspberry Pi. He was using Apache Thrift, Kismet, Cassandra. He was using this Raspberry Pi to monitor Wi-Fi signals: find hotspots, send it back to big data, perform analysis, using Wireshark, looking up the manufacturer files, and then getting those definitions so that he could, in real time, monitor Wi-Fi connections from wherever he was. But the glue to hold all this together was taking up most of his time. He's like, I wanna focus on the Raspberry Pi part and the Wireshark part. Why do I have to care about all these other things that are happening in between? So, I mean, I literally just found his blog, and there he is writing "Apache NiFi to the rescue." This is his graphic. He threw NiFi stickers on every piece of it and said, yeah, that just fixed all my problems: the scalability, the glue, the security of the connections between each endpoint. This is my solution. So thanks to Roger, I didn't have to do any work for this one. It's Thursday afternoon, you guys all wanna get out of here; this is pretty much the end of it. A buddy of mine, Jeremy Dyer, new dad. Apparently, weighing the poop from your baby is important for medical reasons. Don't ask me why, but that's what the doctor said. So he decided, okay, well, we're gonna make this a little bit easier. Really smart guy, understands everything except how to take landscape video. But here it is: "Alexa, ask Dataflow to log poop." "1.70 ounces of poop logged." So he doesn't have to get his hands dirty, quite literally. Collecting that data, providing it to his pediatrician for whatever reason that would be. Why use NiFi and MiNiFi? Hopefully this presentation has covered a lot of that. But again, it's a completely open source tool.
It's moving data, which is very difficult to do in a real scenario. It's very easy on paper, and it is almost always glossed over as we just draw a pipe connection between A and B. But it has real challenges, and hopefully NiFi can abstract enough of those away, to the point where you make some decisions, you deploy NiFi, and it works for you. And if it doesn't, it's very easy to adjust and moderate those decisions as you get real-time live feedback from your system. It's common tooling, it's extensions. It's very robust, but it's also very extensible. We have a ton of community contributions for various protocols we've never even heard of, but we've tried to make the system as open to extensibility as possible. So it's, in my opinion, pretty easy to write your own processor, drop it in, and be off and running. We do have a pretty healthy community. It's an Apache project, so it's not owned by any company. There are a number of companies that are providing input and contributions to it, but again, it's not owned by anybody. It was developed internally in the government, and it was open sourced to Apache ownership under the Technology Transfer Program in 2014. So I would encourage you to go check out the GitHub repository. It's mostly a Java project, but for MiNiFi, there's a fully C++ implementation as well. We had a little bit of a PR crackdown last week, so you can see that the green line countered the red for a minute. So you can check us out. The nifi.apache.org site is where all the documentation and examples and stuff are. You can follow us on Twitter. You can talk to us on Twitter; we're pretty responsive. The JIRA repository is a great place to check and see if somebody's already fixed the issue you're having, or to file one. We're pretty responsive to that. And the user lists are, to my chagrin, quite constant. So there's plenty of discussion 24/7 on there.
Thank you very much, and I'll take any questions that you may have. So you can prioritize that and say, okay, I have one NiFi instance that's the leader, right, and I have one that's regional. And so if I get an update from the regional one, that might not be the newest instruction from whatever the central system is back home, right? So I could prioritize and say, okay, if I get something from central, that's the one I want to trust. If I get something from regional, let me wait like ten seconds and see if I get something newer from central. You could do something like that. If there are two non-conflicting instructions, they would both apply, unless one is literally redefining the entire flow. There are certainly edge cases there where I could see that would be a challenge. What I would probably say is, unless there's a specific business purpose for communicating with multiple NiFi instances from the MiNiFi instance, I would say communicate with one NiFi instance from MiNiFi, even if that's a load balancer in front of a cluster, and then have the conflict resolution done on the NiFi instance, because it's built to handle that, and send just the single set of commands down to MiNiFi. But you certainly can have MiNiFi interact with multiple NiFi instances. Yes? Yeah, carefully. Sure, so the question is about certificate management. The toolkit that we provide is a drop-in tool to allow you to do this stuff, but it's not a replacement for a full-fledged CA and a security team and all that kind of stuff. If you want to use Let's Encrypt or TinyCert, you're more than welcome to do that. A lot of the places we deploy are in large enterprises, and they have their own security team and their own internal root and intermediate CAs and all kinds of stuff.
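The central-versus-regional arbitration described above can be sketched as a toy policy. Nothing here is MiNiFi API; the field names and the grace window are invented for illustration: trust an instruction from central immediately, but hold a regional instruction for a grace period in case something newer arrives from central.

```python
import time

# Toy arbitration policy for commands arriving from multiple sources.
# "central" wins immediately; "regional" only takes effect after a
# grace window with no central update. All names here are hypothetical.
def choose_instruction(pending, grace_seconds=10, now=None):
    """pending: list of dicts with keys 'source', 'received_at', 'command'."""
    now = time.time() if now is None else now
    central = [p for p in pending if p["source"] == "central"]
    if central:
        # Always prefer the newest instruction from central.
        return max(central, key=lambda p: p["received_at"])
    regional = [p for p in pending if p["source"] == "regional"]
    # Only act on a regional instruction once its grace window has expired.
    ripe = [p for p in regional if now - p["received_at"] >= grace_seconds]
    return max(ripe, key=lambda p: p["received_at"]) if ripe else None
```

As the answer suggests, in practice you would push this arbitration up into the NiFi instance and send a single resolved command stream down to MiNiFi.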
Revoking a certificate, I mean, actually performing the CA revoke operation, is outside the scope of NiFi, but if NiFi has a certificate and that certificate's revoked, then it won't be valid anymore. I mean, it certainly follows that. Yeah, so you could, it's gonna sound very like the snake eating itself, but you could use NiFi and MiNiFi to perform that task and say, hey, here's arbitrary data that I'm pushing out to the endpoints, right? So I'm pushing the new certificate. I want you to replace the one at this location on the file system, and then I want you to reboot. And now you have the new certificate. Now you're asking, how do you manage the trust for the time period between revocation and coming back online? Yeah, I certainly recommend you don't get the certificate revoked, but yes, exactly. You can just arbitrarily push raw data. You could have a back-of-your-mind security flow that you put onto all of your devices, and you just say, if this, heaven forbid, ever gets triggered, here's what you're gonna do: you're gonna accept this blob of data, you're gonna perform some checks on it, you're gonna write it into this location, and then that's your new certificate. Or, I mean, usually you would trust multiple cross-signed intermediate certificate authorities, right? I mean, that would be the responsible tack, so that, okay, you have one in escrow that is the break-glass-in-case-of-emergency one. My primary certificate authority gets compromised; I don't trust anything that's signed by that anymore. I trust things that are signed by this one that's never been used. But its public key is still put into all these devices. So if you're in a place where you're memory-constrained on the IoT device and you can only hold one certificate at a time, I don't have a great answer for you. But if you're capable of doing it physically and responsibly, I would suggest having a backup public key on the device.
Sir, did you have a question? The question is, is there a NiFi for Dummies book? There's not a book; there are amazing resources online. Even as somebody who's on the project, I have to say that our documentation would be good for a proprietary piece of software; it's incredible for open source software. It's a priority of the project management committee to ensure that that stays true all the time. If you go to the nifi.apache.org website, the documentation is comprehensive and impressive, and every instance of NiFi comes with all the documentation bundled in it. So you can be running on an air-gapped machine with no internet connectivity, and you still have every piece of documentation associated with the project available at your fingertips. And it's integrated in the UI, so you click help on anything, up pops the page, and it explains everything: all the settings, the instructions. Hortonworks also has what we call the Hortonworks Community Connection. That is an extensive, very Stack Overflow-like site where people come in and ask questions, subject matter experts publish articles, tons of screenshots. I'll throw out a tag for Jen Barnaby, who has excellent video tutorials on YouTube. So if you look up, I think it's KISS Tech Docs, keep-it-simple-stupid tech docs, Jen does fantastic videos. A number of the people on the project management committee have their own personal blogs where they'll focus on the topics that they're specifically focused on. So I can throw out Bryan Bende, who does great work on the authorization and authentication and site-to-site stuff. Matt Burgess, who wrote the ExecuteScript processor, has dozens of examples of, hey, here's a problem from the real world, here's how you would solve it. Bryan Rosander, Pierre Villard, I mean, there are more people than I could name right now who have put information about this out into the world. No book yet, though. Maybe that's what I'll work on next year. Yes, and then in the back.
Yeah, yeah, yeah. Yes. Yes, it was in gray though, right? Yeah, so there was active development on that, and then Amazon published a new SDK. So there is somebody working on it right now, but they're using the new SDK instead of the old version. Yes, MQTT support has been there for, I mean, I don't want to misspeak, but a long time. The AWS IoT SDK is being worked on right now to wrap that in a processor, so you can just drag and drop that in and automatically have it be set up for Amazon-type stuff. I don't know that we have something for Azure IoT, but we have Azure cloud support built in, yes. Not that I'm aware of right now, but, sorry, the question was about IBM Bluemix, is that it? Not that I'm aware of right now, but I would say check the mailing list, because I may have missed something the last couple of days. Or also, if you file a JIRA and just say, hey, I'm interested in this, somebody will come along and pick it up and investigate and possibly contribute that for you. Okay, sir? So the question was, I did mention earlier that we could theoretically encrypt different messages with different encryption keys, and the question is, where do the keys come from? So if you look on the right here, if you see this long string, that's an encrypted key. What happens is, on NiFi in the EncryptContent processor, the key is marked as a sensitive property. That means that it'll be encrypted when it's stored, and it will never be sent, in either plain or encrypted form, over the wire back to the user interface. Everything that the user interface does, it does through a REST API. So it's a good consumer of NiFi's internal REST API. That API is exposed to any client, so you could write something in whatever language you want to communicate with NiFi over that API. A sensitive property will never be exposed back over the wire, even in encrypted form; just a placeholder will be sent.
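The sensitive-property behavior can be modeled with a toy serializer. This is not NiFi's actual implementation, just an illustration of the contract described above: the value is kept server-side, but any response sent back to a client replaces it with a placeholder.

```python
# Toy model of the sensitive-property contract: the stored value is
# never serialized back to a client, only a placeholder. Names and the
# placeholder string are hypothetical, not NiFi internals.
SENSITIVE_PLACEHOLDER = "********"

def serialize_for_client(properties, sensitive_names):
    """Return a copy of processor properties that is safe to send over the wire."""
    return {
        name: (SENSITIVE_PLACEHOLDER if name in sensitive_names else value)
        for name, value in properties.items()
    }
```

The stored dictionary is untouched; only the wire representation is masked, which is why even an encrypted form of the key never leaves the server.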
That encrypted key, I copied and pasted it into the MiNiFi flow. Now, in a short amount of time, when the next version is released, the flow management, command and control, all of that stuff, will change the way this is done, so you won't have to do that manually at all. That'll be in your NiFi instance; you configure that key there. Because it's AES, right? It's a symmetric key, so you have to have the same key on both sides. You put that into your NiFi flow. You export that to MiNiFi. That process captures all that data and sends it out to the MiNiFi flow. So it's still encrypted. It's encrypted even in the MiNiFi flow, currently, on the file system. MiNiFi has its own sensitive-properties management key that it uses to decrypt that property and then use that encryption key to encrypt your content as it's being sent. Now, you can also do this with PGP support out of the box. So if you have key pairs, you can do that as well, and you would only need to put your server's public key on the MiNiFi device, so you're not exposing any sensitive key material that way. Anything else? Yes? Yes. So the question was, this came from the NSA; do you have a year in which the project was started? The project has existed for ten years, and it was provided to Apache through the TTP. It was actually the first project provided that way. So you can do the math on ten years from today. Anything else? Thank you guys very much. Enjoy the rest of the conference. Thank you.