to introduce Apache NiFi, or NiFi, and MiNiFi. Okay. Thank you. Thank you to Fauston for having me here. I apologize for my appearance. I'm from LA, and I forgot that it rained anywhere else in the world. So yeah, I am Andy LoPresto. I work at Hortonworks; I'm a Project Management Committee member and a committer on Apache NiFi and the sub-project Apache MiNiFi. And we're going to talk about intelligently collecting data at the edge.

So I don't know if you can quite read this. It's an introduction both to the application and to myself. I have been working on security topics for about 13 years. I joined the NiFi open-source team, got here, as I said, from LA, and that's a little bit about me.

So we'll talk very briefly about what dataflow is and what the challenges involved in collecting the data are. I apologize, this is a fairly long presentation that's been compressed down to combine multiple things. I will stick around afterwards for any questions, but I'm going to go through a little quickly today.

It's very simple, right? Dataflow is getting data from one point to another. We have producers: computers, log files, devices, user interaction. They send data, usually over the internet, and then somebody or something needs to make sense of that data. But it's hard. And this isn't the first time anyone has tried to solve this problem; many people have tried to do it in many different ways. So, the classic example: there were 14 standards, and now there are 15.

Why is moving data effectively hard? Because moving data itself is not that hard, right? You can just put a modem somewhere and you have dataflow. But doing it correctly, doing it effectively, doing it in a way that makes sense for the consumers and producers of that data is very difficult. Like I said, we have standards. We have different formats. We have exactly-once delivery, which, if anybody has worked with data before, is a really difficult problem.
We have the protocols, the veracity and validity of the information, ensuring security, overcoming security (which, as a security person, I object to, even though it's a completely accurate statement), compliance, schemas, et cetera, et cetera, and again, exactly-once delivery.

So we're going to connect A to B to C, and it's very simple and straightforward. We're going to read log files off of a system. We're going to ingest them into SQL. We're going to collect all that data from a bunch of different instances, put it into a data lake, and run all of our queries and machine learning and modeling on that data. Big data. Well, this is easy, right? You write a bash script and put it onto every computer. You have a little bit of Python, you script it together, and there you go.

Let's look at a model of a courier service. I guess DHL is the big one over here, right? We've got mobile devices; everybody has a scanner, and they're scanning the packages as they deliver them. There are registers in the store where we're accepting and delivering parcels. We have trucks. We have deliverers. And all of this goes to a gateway server in each store, and up to a distribution center where we have a cluster. And we even have a data center, a professional data center, where we can do all of our machine learning and big data.

So great, I get all this data. Now, what am I going to do with it? Well, I'm going to throw it into Kafka and use that to send it to Storm and Spark and Flink and Apex and anything else that's on Apache's homepage today. You can send it to any other consumer, right? As soon as you start collecting data, somebody's going to want it. That's a universal truth. Usually they want the data you don't have yet; that's even harder.

So now let's scale this up, because we're going to be an international courier service, and we have people and computers on every continent, in every corner of the globe.
Raise your hand if you want to maintain the Python scripts for that for the rest of your life.

So let's talk about Apache NiFi. I've definitely stated some problems; let's try and solve some. NiFi is an open-source project, part of the Apache Software Foundation. It provides a ton of features that are going to be very important for your dataflow and data collection. Data buffering is huge. NiFi comes from an environment where the importance of the data can't be overstated, and protecting that data, but also not losing that data, is really, really important.

One feature that I probably won't get into in too much detail is back pressure. It allows you to provide custom configuration so that if some piece of your chain starts slowing down, it can actually signal its predecessors and allow them to prioritize the data you need, instead of just sending everything first in, first out. It can start sending data to a different flow, backing it up, putting it into a buffer; there are all kinds of things you can do with it. That's really valuable.

Like I said, prioritized queuing. We have quality of service that can be customized per flow. Data provenance is something I will get into in a little more detail. You have both push and pull models. And it's a visual tool. It's not something where you have to have a master's degree to start the thing up; it's something that anybody with domain knowledge can use, which really makes it easy to get the right people involved. Because you're not always going to be able to say that everybody who has valuable insight into this problem is able to work in a terminal. So we're trying to open it up and make sure you get the right people involved in your process.

NiFi originated with a concept called flow-based programming, so there is some vocabulary, basically a glossary, here. You'll hear me talk about flow files; that's your atomic unit of data.
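The back-pressure idea above can be sketched in miniature: a connection with a configured object threshold stops accepting flow files once it is full, and the upstream component has to react. This is only an illustrative Python toy, not NiFi's implementation (NiFi is Java, and its back-pressure thresholds are configured per connection in the UI).

```python
from queue import Queue, Full

# Toy model of a NiFi-style connection with a back-pressure threshold:
# once the bounded queue is full, offers are rejected and the upstream
# component must buffer, reroute, or prioritize instead of blindly sending.

class Connection:
    def __init__(self, max_queued=10):
        self.queue = Queue(maxsize=max_queued)

    def offer(self, flow_file):
        """Return True if accepted; False signals back pressure upstream."""
        try:
            self.queue.put_nowait(flow_file)
            return True
        except Full:
            return False

conn = Connection(max_queued=2)
assert conn.offer("a") and conn.offer("b")
assert conn.offer("c") is False  # upstream now knows to slow down
```

The key point is that the refusal propagates: the producer gets an immediate, local signal rather than discovering a failure downstream later.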
That's the thing that's moving through your system. A flow file processor is some component, some black box; it doesn't matter what the internal implementation is. It's something you can connect data into, get data out of, and operate. Obviously, to do that, you need connections. The flow controller is essentially a scheduler; it's the overarching system that's running and orchestrating all of this. And a process group is just a collection of these processors and connections logically grouped together.

So NiFi is completely data agnostic. We really don't care what you're using it for. But it was designed with the understanding that users have to care about the specifics, and it allows you to do a lot of transformation and manipulation of formats and protocols within the tool.

A really good way to understand a flow file is the analogy of an HTTP message. This is an HTTP response: you can see that there are headers, and then there's some content below. A flow file is very similar. We have attributes, which are a key-value mapping, and then we have the binary content of the flow file. The reason it's important for these to be separated logically is that the attributes are maintained in memory. They're accessible all the time, they're usually pretty small, and they're used for routing, classification, tagging, and access control. The content could be anything from a couple of bytes of text to 10 gigs of video. When you're routing data, you really don't want to be moving all of that through the heap constantly.

So what we've done is split this into two different repositories. One is what we call the flow file repository, which stores all of these flow files with their attributes. Each flow file then has, similar to a pointer, a claim to content that lives in a content repository. Everything is dealt with through streaming interfaces.
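That separation, a small attribute map in memory versus heavy content referenced by a claim, can be modeled in a few lines. This is a hypothetical toy in Python; NiFi's real repositories are persistent, write-ahead-logged Java implementations.

```python
from dataclasses import dataclass, field

# Toy model: the content repository stores the bytes once; each flow
# file carries only lightweight attributes plus a claim (a key) into
# that repository, so routing never has to move the payload itself.

content_repo = {}  # claim id -> bytes

@dataclass
class FlowFile:
    attributes: dict = field(default_factory=dict)
    content_claim: str = None

def ingest(data: bytes, **attrs) -> FlowFile:
    claim = f"claim-{len(content_repo)}"
    content_repo[claim] = data
    return FlowFile(attributes=attrs, content_claim=claim)

ff = ingest(b"...imagine 10 GB of video here...", mime_type="video/mp4", priority="low")
# Routing decisions read only the small attribute map:
assert ff.attributes["priority"] == "low"
# The heavy content stays put until a processor actually streams it:
assert ff.content_claim in content_repo
```

Copying or rerouting a flow file in this model only copies the attributes and the claim, never the payload, which is why NiFi can route large content without dragging it through the heap.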
The content is read into a content repository and referenced from there, which means it's not always on the heap, whereas the attributes are always available for operation. This allows NiFi to process a ton of data very quickly while respecting the resources you have available. I'm not going to pretend you don't need resources if you're operating at that scale of data, but it makes things easier, and it's not unnecessarily wasting the resources you have.

Let's talk about the user interface. I like this, but in general, we want less of this and more of this. I'm actually going to go up to the board a little bit. So here we have what's called the navigation palette, and then the operation palette down here. Across the top, this header, these are components which you can drag onto the flow, the canvas. On the far right, the hamburger menu (I guess we're still calling it that) lets you get into some of the system maintenance and operations. And then on the graph itself you can see your components, connections, et cetera.

This is mostly to illustrate that people tend to think of data processing as happening just in some core data center somewhere, right? You have a team that only cares about doing something with the data. They really don't care about how the data got there (well, they care if it doesn't actually get there), but they don't care about how you're collecting the data or where it's coming from. On the other hand, you have people and resources out in the field, or at the edge, doing the data collection, and usually there's some other value they're providing or the tool wouldn't be there.

One of the things that NiFi allows is bi-directional dataflow. What I mean by that is we're extracting, or exfiltrating, some kind of important data from the edge and bringing it back to a central resource.
But we also need to be able to send commands and communication out to the edge, in order to update the flow, prioritize what we're getting, ask somebody to resend something, right? So, with this bi-directional flow, what we call the data plane and then the command plane or control plane, we can improve the data collection and make sure it's robust and stable as we move on.

We have over 180 processors available as part of the default NiFi installation. Everything from SQL and SQL-like data stores to big data, to all the different edge formats and protocols you might encounter. So, ListenTCP, log readers, plain file extractors, Kafka, database integration, all kinds of stuff. Web, so HTTP, email. Basically, if you can think of it and we haven't written a processor for it already, we can probably knock one out. Unless it's the new version of Kafka, sometimes.

We can do a ton of different operations on that data. Usually, manipulating or transforming data requires some custom knowledge of the format, right? I want to go from XML to JSON. Well, I have to understand XML, I have to understand JSON, and I may have to understand the different schemas involved. Here, you drag a box onto the canvas and you have XML to JSON. It really is that easy.

So, now let's talk about MiNiFi, and I am blazing through this, so I apologize if anything's getting a little confusing. MiNiFi is going to let us get out to the very edge. Everybody in this room has a computer that's capable of the calculations needed to put a man on the moon 50 years ago, right? That kind of data collection and data processing is unprecedented and will only continue to grow.
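As a rough illustration of what that XML-to-JSON box is doing for you, here is a Python sketch covering only a flat record; NiFi's actual record processors additionally handle schemas, namespaces, and nesting.

```python
import json
import xml.etree.ElementTree as ET

# Hedged sketch of an XML-to-JSON conversion for a flat record:
# parse the XML, lift each child element into a key-value pair,
# and serialize the result as JSON.

def xml_to_json(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    record = {child.tag: child.text for child in root}
    return json.dumps(record)

print(xml_to_json("<user><name>andy</name><city>LA</city></user>"))
# {"name": "andy", "city": "LA"}
```

The point of the talk stands regardless of the details: the format knowledge lives inside the processor, so the flow designer just wires the box in rather than writing this code.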
Data lives in the data center, where NiFi has brought it in, transformed it, and provided it to whatever follow-on system exists. But the data doesn't start there, and so the closer we can get to the creation of the data, the better quality we'll get and the better control over that data we'll have. We'll be able to prioritize it, we'll be able to secure it, and we'll get a granularity of provenance, of history, in a way that we've never been able to before.

Why not just put NiFi out there? Well, NiFi is big: 726 megabytes for the last release, okay? It's not a tiny thing. It is respectful of the hardware; like I said, you can run it on a Raspberry Pi, but you'd like to run it on the heaviest machine you can find. We have a friend who has a machine running right now with 768 gigabytes of RAM, and it uses it. What we'd like with MiNiFi is to put a client library on somebody's iOS or Android device, on every IoT chip with six legs that you can stick on a wall somewhere, on a connected car. NiFi is 726 megs. MiNiFi Java is 45 megs. MiNiFi C++ is 700 kilobytes, so that's roughly a thousand-times improvement in the space requirement.

I'll go into an example here of putting MiNiFi on a connected car. This is a project we did with Qualcomm. It allows us to tag data immediately. This is a model of the network inside of a car, and I think it's still a little new to a lot of people that your car has computers in it now and really can't run without them. Unless you have a 50-year-old car. You have the CAN bus, which is your main network within the car. There's usually Ethernet on board. There's also an interconnect network, and then there's whatever else has yet to be invented. Tesla is certainly not stopping on their cars; every car manufacturer is moving in this direction.
MiNiFi living in the car, literally on that chip in the central computer of the car, can ingest all of the data that's flowing across these different networks, everything from speed data to brake measurements to GPS. It can process all of that, tag it, prioritize it, maybe filter stuff out based on geographic location. If you're operating in China, you're not allowed to send any of that information to a computer that lives outside of China. For example, say Ford has one data center for the entire world: we get all the information in, we do some machine learning and modeling there, and then we send out our learnings. Now you have to have one for Europe, one for the US, one for China, one for Antarctica, whatever you want to do. So MiNiFi can start routing that information, encrypting it, prioritizing or filtering it while it's still on the car.

Taking data off of a car is very expensive. When Wi-Fi is available, the car will prioritize that and send more data. But when you're driving on a highway and there's nobody around, it uses an LTE modem that's in the car, and that's extremely expensive. So you really don't want to send a bunch of wasted data, a bunch of unnecessary data, or uncompressed data over that link if you don't have to.

So here's a little diagram. The map on the right and the boxes there are actually showing the ratio of the data that's getting exfiltrated live via LTE versus Wi-Fi. On the left you see the dataflow in NiFi that's ingesting data off of those networks, processing it, filtering it, and then sending it via the radios that are in the car.

One of the next things that we're really focused on developing is flow versioning. As you develop your dataflow, all of this, we like to use the analogy that you're not building pipes, you're a farmer digging irrigation ditches. That water's always flowing. That river's not stopping because, oh, I need to update this processor.
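The on-car decisions described above, regulatory filtering plus choosing between the cheap and expensive radios, might look something like this as hypothetical rules in Python. A real MiNiFi flow would express this as processors and connections in its configuration, not as code.

```python
# Hypothetical edge routing rules: data that legally must stay
# in-region is never exported, high-priority readings may use the
# expensive LTE link, and everything else waits for cheap Wi-Fi.

def route(reading, region, wifi_available):
    if region == "CN":
        return "local-only"   # regulatory: stays in-country
    if wifi_available:
        return "wifi"         # cheap link: send freely
    if reading.get("priority") == "high":
        return "lte"          # expensive link, reserved for urgent data
    return "buffer"           # hold until connectivity improves

assert route({"sensor": "brake", "priority": "high"}, "US", wifi_available=False) == "lte"
assert route({"sensor": "gps", "priority": "low"}, "US", wifi_available=False) == "buffer"
assert route({"sensor": "gps", "priority": "low"}, "DE", wifi_available=True) == "wifi"
assert route({"sensor": "speed", "priority": "high"}, "CN", wifi_available=True) == "local-only"
```

Because the decision runs on the car, the expensive LTE link only ever carries data that has already been filtered, prioritized, and, if needed, compressed.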
So what you can do is keep a dataflow running, build something new alongside it, and make sure the new one works on the same data that's coming through, without stopping anything or any follow-on system. And then, when you have a new flow that's better, you kill the old flow. You've never stopped moving the data from the source to the destination, but you've improved your system while doing it. Then you improve it, and six weeks later you find out, oh, I needed something from the other flow that I forgot to check. Well, how do you go back to that? That's where flow versioning comes in, whether it's on a six-week timeframe or six seconds.

Say I'm reading from IoT sources and I need to change the priority based on the available bandwidth. If I have the bandwidth, I might want everything; when the bandwidth starts to get spotty, I might want only top-priority data. So I can have my command-and-control system, NiFi, write those rules into my flow and then send them back out, and not just to one instance of MiNiFi, but to all 6.2 million instances of MiNiFi that were running that flow. That's all going to be taken care of seamlessly.

I think I am getting close. Okay, there's plenty more. Anybody who's interested is welcome to come talk to me; I'll hang out outside. Thank you very much for having me.

Sorry, yeah, question. Is it possible to make queues that are consumer controlled in NiFi? I don't understand the question fully, but my answer is yes. Yes, almost definitely. It depends on what specific consumer format you're talking about, but yes. Yes, it's the opposite of Kafka, yes. Yes. Yes.

Yeah, the question was about exactly-once delivery. NiFi uses what's called a write-ahead log to track all of the data that's flowing through the system, and so it will guarantee exactly-once delivery. It's actually copy-on-write: the data is not manipulated in place. The data exists permanently, as long as you have storage to hold it.
When it gets sent to a follow-on system, NiFi receives a confirmation; it's a two-phase-commit signal that the information was received. If it wasn't, NiFi can replay that information to the follow-on system.

Yes, in the back. Sorry, can you repeat that a little louder, please? Is it possible to make transformations on data on the fly? Is it possible to make data transformations on the fly? Yes, that's the whole point of NiFi. I mean, that's one of the whole points of NiFi. Like format XML to JSON, yes. CSV, parsing records into different atomic units, yes, absolutely.

Yes, the question was performance versus Kafka or other systems. That's a whole other talk. I mean, we could talk about that for an hour. I'll take it offline with you if you want, absolutely.

Yes, in the back. System requirements for NiFi on embedded devices. The C++ agent runs in about 700K. It's a completely new implementation of the same system: one on the JVM, and a C++ implementation on bare metal. So, RAM: I believe it's four megabytes of RAM. Yes.

The question was schemaless versus schema formats. Yeah, we have plenty of processors for Avro. There's one literally called InferAvroSchema, so you can bring in arbitrary data and it will basically build out the schema from that Avro information. Yes.

Can you comment on the runtime model and performance? You mentioned two-phase commit and write-ahead log, which suggests maybe it's a single, non-distributed kind of deployment, so I'm concerned how it scales. Sure. The question was two-phase commit, write-ahead log, cluster scaling. Okay. Yeah, it scales extremely well. It is built to be a clustered system; there's an entire cluster coordinator. It uses embedded ZooKeeper, so if one instance isn't available, another takes over. Resource management and allocation is still something that we're continuing to work on, like integrating with Mesos or YARN or some other resource manager.
But NiFi will encapsulate the resource management; it has the cluster coordinator and heartbeats and all that kind of stuff as well. The two-phase commit is for follow-on systems to acknowledge that they've received information. The write-ahead log implementation is, again, for the copy-on-write, so that NiFi is not manipulating data in place and you don't lose the record of what that data was.

Was there another one? Okay. Okay, one last question. Did you have a question? I'm just curious, are autonomous cars an appropriate use case for this? So, the next big thing will be autonomous? Yeah, sorry. The question is autonomous cars, is that appropriate for this? Technically. Technically, absolutely.
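The delivery guarantee discussed in those last answers, a write-ahead log plus a two-phase-commit-style acknowledgement with replay on failure, can be caricatured in a few lines. This is purely a conceptual Python sketch; NiFi's real repositories and site-to-site protocol are far more involved.

```python
# Conceptual sketch: log every outgoing flow file before sending; only
# mark it committed once the follow-on system acknowledges receipt,
# and replay anything left unacknowledged after a failure.

class Sender:
    def __init__(self, downstream):
        self.downstream = downstream   # callable returning True on ack
        self.wal = []                  # write-ahead log: [payload, committed?]

    def send(self, payload):
        self.wal.append([payload, False])   # phase 1: durably logged
        if self.downstream(payload):
            self.wal[-1][1] = True          # phase 2: acknowledged

    def replay_unacknowledged(self):
        for entry in self.wal:
            if not entry[1] and self.downstream(entry[0]):
                entry[1] = True

received = []
flaky = {"fail_next": True}

def downstream(payload):
    if flaky["fail_next"]:          # simulate one dropped delivery
        flaky["fail_next"] = False
        return False
    received.append(payload)
    return True

s = Sender(downstream)
s.send("record-1")                  # first attempt fails, stays in the log
s.send("record-2")                  # succeeds and is committed
s.replay_unacknowledged()           # redelivers record-1
assert sorted(received) == ["record-1", "record-2"]
```

The log-before-send ordering is what makes replay safe: a crash between the two phases leaves an uncommitted entry behind, and the entry is simply redelivered rather than lost.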