Hello everyone and welcome. My name is Sophia Izokin, and today I'll be walking us through log analytics using Apache NiFi and Apache Kafka. Before we get started in depth, I just wanted to properly introduce myself. Again, I'm Sophia, and I currently work as a data scientist within CISO at IBM in New York City. I've been working as a data scientist for as long as I've been at IBM, which is about two and a half years. However, I've been in this current role with CISO for about a year and four months, so I'm still pretty new, but very excited to learn something new every day. Part of my role as a data scientist involves working on data ingestion and building data pipelines. That's what I spend about 80% of my time doing, and I absolutely love it. Some other fun facts about me: I'm an avid gamer and sneakerhead. I just finished the new Spider-Man game, Spider-Man 2, and it's awesome, especially on the PS5, if you haven't checked it out. I'm also originally from Winnipeg, Manitoba. I moved to New York about two and a half years ago, but I'm originally from Winnipeg, Canada. And this is my very first time presenting at a conference or hosting any kind of workshop at a conference. So welcome everybody, and let's get into it.

This is how it's going to work today. We're going to start out by discussing log analytics: what it is, what the challenges are, and how the tools we're discussing today can help with those challenges. Those tools are NiFi, Kafka, and OpenSearch. We're also going to talk about the dataset that I selected for this demonstration. After that, we'll move into discussing the architecture very briefly, where we're starting from and what the end result is going to look like. Then we'll move into the demonstrations and setup, starting with NiFi and Kafka, followed by OpenSearch. So it's going to be a pretty packed session. Let's get into it.

So what is log analytics? At a very high level, you could say it's analyzing and extracting insights from machine-generated data that comes from a variety of sources. To go a bit deeper, think of the various things that generate log data: SIEM systems, which generate logs and events; servers; Internet of Things devices; mobile devices; and so on. All of these systems generate logs, which are incredibly valuable and quite complex data. Log analytics, then, is closely examining and analyzing all of these log files, which capture all sorts of information like events, errors, and transactions. The crucial bit to keep in mind about log analytics is that the data is usually generated in real time.

So log analytics is, I believe, an incredibly exciting field, but it's not without its challenges. What are some of those challenges? First, there's the sheer volume and variety of the data. You're getting data from all sorts of sources, it's coming in all at the same time, and it can be pretty massive. A lot of organizations find it challenging to handle that data and to scale properly for it. Another challenge is the lack of consistency in the data that's coming in. You might have data that's meant to be used for one particular purpose, but it comes from different sources and therefore arrives in different formats. That lack of consistency is also a challenge.
Different systems may produce the same category of log data, but it can look quite different depending on where it's coming from. So how can the tools we're looking at today, NiFi, Kafka, and OpenSearch, be leveraged when working with log data? The tools we're going to look at are meant to ease some of these challenges and simplify the processes we just talked about. Apache NiFi, for example, is an awesome tool for data ingestion and for helping data get from point A to point B. It's essentially a tool for data collection and processing. Then you have Apache Kafka, which is officially described as a distributed event streaming platform, and it excels at processing streaming data in real time. Like I said, we usually get log data in real time, so Kafka is a great fit for that sort of use case. When you have logs streaming in real time from multiple sources, as is often the case with log analytics, Kafka is the tool you'd want to use as the central hub for all of those logs. You can use Kafka to manage the high throughput, as well as to ensure that your data is reliably stored and made available to your Kafka consumers. And OpenSearch, last but not least, is the system where the log data can be stored, analyzed, and visualized, which is very important, because you're collecting all this data, but you actually want to do something with it. You want to make the insights you're collecting actionable. That's where OpenSearch comes into play.

Now let's get into the dataset I selected for this walkthrough: what it's about, where it's from, and all that good stuff. The dataset is from the NASA Kennedy Space Center WWW server in Florida, and it covers July 1995 through August 1995. I selected this data because, well, I thought it was quite cool. It's quite old, and it's often used in examples that combine NiFi and log data. Something else that's pretty interesting is that you can use systems like Apache Spark to generate your own synthetic log data from it: Spark will take this dataset and essentially mimic it, so you can play around with it more and do some really cool things. So I thought this was a really cool dataset. The logs come in an unstructured format in an ASCII file, one line per request, with a handful of fields: the host, which is usually a hostname but can sometimes be an IP address; the timestamp, in a typical day/month/year format with the time zone attached at the end; the request; the HTTP reply code; and the bytes in the reply. The total size is approximately 3.5 million requests.

All right, now let's look at the architecture of what we're covering today. We're going to start with the ingestion layer, which is where the logs are ingested from and where they're going to, which is Apache NiFi. From Apache NiFi the data goes into the data transportation layer, which consists of a Kafka topic and a Kafka consumer. The topic and the consumer are handled by processors within NiFi, which publish the data and then pull it back down. And then we'll close out with the data analytics and search layer, which is OpenSearch.
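Before we move into the demos, here is a minimal Python sketch of how one raw log line of the shape just described breaks apart into those fields. The sample line and the regular expression are mine for illustration, not something shown in the talk or the NiFi flow itself.

```python
import re

# One request per line: host, timestamp with time zone, request, reply code, bytes.
# The sample line below is illustrative of the format, not quoted from the dataset.
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

pattern = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '        # host (hostname or IP), plus two unused fields
    r'\[(?P<timestamp>[^\]]+)\] '     # day/month/year timestamp with the time zone attached
    r'"(?P<request>[^"]*)" '          # the HTTP request line
    r'(?P<status>\d{3}) '             # HTTP reply code
    r'(?P<bytes>\S+)$'                # bytes in the reply ("-" when there is no body)
)

match = pattern.match(sample)
if match:
    print(match.groupdict())
    # {'host': '199.72.81.55', 'timestamp': '01/Jul/1995:00:00:01 -0400',
    #  'request': 'GET /history/apollo/ HTTP/1.0', 'status': '200', 'bytes': '6245'}
```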
All right. When you're working with Apache NiFi or Apache Kafka, you have to make sure you have a compatible Java version installed on your computer. Depending on what you download, it might not play well with the most recent Java. Apache NiFi just released its newest version over Thanksgiving break, and that doesn't play well with Java 20, but if you install Java 21 it works much better. So you want to install Java, and then you want to install NiFi. You can do all of this through the website, and setting it up on your local machine is something I'll walk through very briefly, just so we can see how it works, how to set it up, and what the UI looks like. Before we go into the first demonstration, which just shows what Apache NiFi and publishing into Kafka look like, any questions? Okay. I'm just going to switch over now from the slides to the demonstrations.

Okay. All right. It's kind of difficult because I can't really see from the side. I'm sorry? Can we see your desktop? Yes, that's what I'm trying to show. Okay, great. So now we just have my CLI, pretty basic. What I'm trying to do is head over to where I've downloaded Apache NiFi, so I'll just ls and cd. Okay, now I'm in the directory where I've downloaded Apache NiFi, and it's as simple as downloading it from the website and unzipping it. Now you have these files on your laptop, and you want to go into bin; that's where Apache NiFi runs from. Once you're here, you run nifi.sh start. Oh, thank you. Okay. Once this is all started up, you can see whether the Java home directory is correct. Actually, it's saying JAVA_HOME isn't set, so I'm just going to go ahead and set that now. Okay, let's try this again. All right, awesome.

So that's essentially it. The only part I didn't show is the download itself, because it can take some time and I don't have the best connectivity right now. But once you download it and unzip it, it's really as simple as setting up your username and password, making sure your Java is intact, going into bin, and typing nifi.sh start. There are some issues that can come up; for example, if you want to give NiFi more memory to run with, depending on your RAM, you'd have to go into the NiFi properties and configure that. But that's really for much bigger use cases with a lot more data.

Okay. Now that we have NiFi all set up, I'm going to pull it up in the UI. NiFi usually runs on port 8443. Okay, so this is what it looks like from a UI perspective. It's pretty straightforward. When you go in here into this particular process group, what you want to do is, actually, let me just make a quick change to the setup. Okay, so you should be able to see my NiFi process groups. I'm going to go right into the testing group. This is what a basic NiFi flow looks like. Here we have the PublishKafka processor as well as the ConsumeKafka processor. What we're going to do here is generate some basic logs to see how the process works. ConsumeKafka, again, is for pulling the data, and PublishKafka is for pushing the data into a Kafka topic.
The cool thing about NiFi, besides its speed, is that once you have a built-in processor like this one, you don't have to create your topic in Apache Kafka from the command line like you usually would. You can simply do it here. For example, I want to name this new topic, which doesn't exist at all yet, Cassandra Conf. I just name it that and click Apply. Then I want to be able to pull from that particular topic, so I give it a group ID and configure the topic name format and everything that's currently in bold. You only have to worry about the properties that are in bold; the ones that aren't are other configuration properties you don't necessarily have to touch. So let's go. I'm going to start the consumer. If you notice, there's nothing coming out right now, because there's nothing there; that particular topic doesn't even exist yet. So I'm going to generate flow files. These are random flow files, nothing fancy, just me saying "hello world" as custom text, and I'm going to generate them and hit start. Once I start publishing, we can see that ConsumeKafka has begun to pull in what I'm publishing. That's essentially how NiFi works, and that's how the NASA log exercise is going to work as well. Notice that when you start consuming from Kafka and there's nothing there, it doesn't pull anything, but as you push data into Kafka, the consumer pulls it in. That's really the power of Kafka combined with NiFi. I could really do this with just three processors; the extra pieces are only there so we can store the output and I can show you what was pushed successfully. So you can do this with three processors, and it requires no coding at all. All you need to know is the Kafka topic you want, and that's basically it. I'm going to stop this. Are there any questions, or anything anybody wants me to go over?

Now we're going to get into the main process group. Like I said, we're using the NASA log data, and it's about 3.5 million rows. Although NiFi is a very powerful tool, we can't just pull in those 3.5 million rows and start viewing them like a data frame, similar to how, if you were using Python or pandas, you'd eventually have to reach for something like Spark. Here you don't have to use anything else, but you do have to break the data down. So I have two process groups here, and what they contain is the dataset in its original ASCII format. I'm going to split the text, and the reason is that, again, we have roughly 3.5 million rows, and at the end of the day we want to push them into Apache Kafka as one record per log line. Instead of pushing everything we have at once, we want to push the records one at a time, and splitting the text helps us do that. I start by splitting the file into chunks of 100,000 lines; if we do the math, that's about 35 flow files. I'm going to start this. All right. Then I split the text even further, this time down to one line each, so it takes a few seconds, as opposed to the first split, which was instant. But at the end of the day, we get the desired result. Okay, so now we have our first 200,000.
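For anyone following along outside NiFi, here is a rough Python sketch of the split-then-publish pattern just described, plus a consumer pulling the records back down. The talk does all of this with NiFi's SplitText, PublishKafka, and ConsumeKafka processors; the kafka-python library, broker address, file path, and topic name below are my assumptions for illustration only.

```python
# Rough Python equivalent of split-per-line -> publish -> consume.
# kafka-python, localhost:9092, the file path, and the topic name are assumptions.
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "nasa-logs-demo"                                 # hypothetical topic name

# Publish: one Kafka record per log line, rather than one giant file at once.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
with open("NASA_access_log_Jul95", "rb") as logfile:     # hypothetical local path
    for line in logfile:
        producer.send(TOPIC, line.rstrip())
producer.flush()

# Consume: pull the records back down, the way ConsumeKafka does in the flow.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="nasa-demo-consumers",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,                            # stop after 5s of silence so the script exits
)
for record in consumer:
    print(record.value.decode("utf-8", errors="replace"))
```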
So we know we're going to be publishing to Kafka, but we also want to be able to pull the data back out with ConsumeKafka and move into our data processing and transformation layer. All right, I'm going to publish, and I'm going to consume. I've named this topic nasa-logs, and that's what I'm pulling from with ConsumeKafka as well. There we go. As we publish and push data in, ConsumeKafka is pulling data out for us. You can think about it this way: imagine you have data coming from various sources. Say we have a GetFile processor, and we have another processor pulling from an API, say the InvokeHTTP processor. You want to take advantage of that speed, the speed at which NiFi pushes events into a Kafka topic and pulls them back down with ConsumeKafka.

Then we move on to our data processing and transformation layer. Just to give you a sense of what it looks like, the good news is that the data is now one log per line, so we can actually view it without the system crashing. Okay, so here we are. This is what a log line looks like. It looks like a log; there's nothing wrong with it, but it isn't something we can push into OpenSearch as is. You can think of OpenSearch like a database, or, for me, coming from a data science background where I see everything in data frame format, you can think of it like a data frame. You want it to be structured. We have our hostname here, then we have our date with the minus four, which is the time zone, then we have our GET request, and so on and so forth. When I look at this data, I see that there's so much we can do. First things first, yes, we have to parse it into individual fields, or columns, or what have you. But something else we can do is extract the end of these requests: we have .gif files, we have .html, we have all sorts of things like that. So we can extract those, and perhaps we can also add another field that tells us whether we have an IP address or a hostname. All of that happens here in the data transformation and processing layer.

As somebody coming from a data science background, it was very important for me to know that I could use a tool like NiFi and still keep up with the basic data transformations I need to do on a day-to-day basis, and NiFi lets me do that. So I use regex parsing to pull all the fields out, and then I use the UpdateAttribute processor to turn those initial field names into meaningful names based on what the dataset documentation indicates: content size, hostname, method, and so on. I'm going to start this so we can see what it looks like step by step. I'll run it once and then convert the attributes to JSON. Let's see what this looks like now. Now we have the data in a more structured format. Of course, we still have to change the names, because "log line seven" and "log line six" aren't going to be of any use to us when we get to OpenSearch. Sorry, I just have to plug in my charger. So we need to change the field names, but other than that, you can see it's in a much more structured format. Now we're talking about something we can actually work with, something we can analyze and use for visualizations as well. Going back to the flow, I use JoltTransformJSON to rename all of the fields.
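Here is a small Python sketch of what that renaming and JSON step amounts to: splitting the request into method, URI, and protocol, and giving the extracted fields meaningful names. The input values and the exact output key names are my assumptions; in the flow itself this is done with ExtractText, UpdateAttribute, AttributesToJSON, and JoltTransformJSON rather than code.

```python
# Rough Python analogue of the attribute-renaming and attributes-to-JSON steps.
import json

# Fields as they might look right after regex extraction (hypothetical values).
raw = {
    "host": "unicomp6.unicomp.net",
    "timestamp": "01/Jul/1995:00:00:06 -0400",
    "request": "GET /shuttle/countdown/ HTTP/1.0",
    "status": "200",
    "bytes": "3985",
}

# Split the request line into method, URI, and protocol, and rename fields to
# roughly the names used later in OpenSearch (the exact names are assumptions).
method, uri, protocol = (raw["request"].split(" ") + ["", ""])[:3]
record = {
    "hostname": raw["host"],
    "timestamp": raw["timestamp"],      # still raw here; converted to UTC in the next step
    "method": method,
    "uri": uri,
    "protocol": protocol,
    "status": int(raw["status"]),
    "content_size": 0 if raw["bytes"] == "-" else int(raw["bytes"]),
}
print(json.dumps(record, indent=2))
```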
So now we have what I would consider OpenSearch-ready data. It could do with more processing, of course, and even with some enrichment, but it's structured enough. Now, when you look at this, you'll notice something that's pretty crucial in OpenSearch, and really in log analysis generally: the timing. The timestamp really has to be correct, and it usually has to be in UTC format. So that's what we're going to do next. NiFi is a pretty powerful tool here, because not only can you use its processors to do whatever you need to do, you can also write your own scripts in scripting languages like Groovy, Python, and, I believe, Java. So I used an ExecuteScript processor and wrote my own script to properly parse the timestamp we were seeing. Now we have a timestamp that's in line with the original dataset; the data is all from 1995, and the timestamp is in the format we want. We have the hostname in the format we want, plus the method, the URI, the protocol, the status, and the byte size, or content size.

All that's left is something I really wanted to do: get an idea of how many of these host records are actually hostnames and how many are IP addresses, because the dataset just says "host" even though some values are hostnames and some are IPs. So I wanted to understand how many IPs we have, how many are hostnames, and whether we can add a Boolean field, if you will, that tells us: is it an IP or is it not? That's what the rest of these processors are doing. I use UpdateAttribute to extract the file extension, like I mentioned. Then I work on detecting whether we have an IP address, using the RouteOnAttribute processor, which is a very powerful processor: it goes through the attributes you have in the flow file and, based on the contents of those attributes and the rules you specify, routes the flow file to matched or unmatched. So it gives me exactly the binary, Boolean effect I was looking for: either there's an IP address in there or there isn't.

Let's take a look before we run everything through. So far we've changed the timestamp, and we've also extracted the file extension. Here it's .gif; it might be .html or any other kind of file extension, but it has to be extracted for the analysis later on. Now I'll run RouteOnAttribute and see what it says. This one goes to matched, which we expect, because we saw the host value and it was an actual hostname rather than an IP address. Now we can do this for the rest of the data, so let's just start all of these. You can really see the speed at which this works: if we leave this group, we already have 8,000 flow files, and if we keep refreshing, it just keeps going. The flow file queues do have limits, around 10,000 to 20,000 depending on the connection and the processor you're running for these transformations, so it's already reached that limit. But with minimal to no coding, just using the processors and some regex, I was able to do all of this analysis, the kind of thing you would usually do with a tool like Python, and do it really fast. So that's about it for the data transformation and Apache Kafka layer.
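In the spirit of the ExecuteScript and RouteOnAttribute steps just described, here is a minimal Python sketch of the two checks: normalizing the raw timestamp to UTC and flagging whether the host value is an IP address. The field formats and output shape are my assumptions; the actual script in the flow may differ.

```python
# Sketch of the timestamp-to-UTC conversion and the hostname-vs-IP (is_ip) check.
import ipaddress
from datetime import datetime, timezone

def to_utc_iso(raw_ts: str) -> str:
    # e.g. "01/Jul/1995:00:00:06 -0400" -> "1995-07-01T04:00:06Z"
    dt = datetime.strptime(raw_ts, "%d/%b/%Y:%H:%M:%S %z")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def is_ip(host: str) -> bool:
    # RouteOnAttribute-style check: does the host parse as an IPv4/IPv6 address?
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(to_utc_iso("01/Jul/1995:00:00:06 -0400"))   # 1995-07-01T04:00:06Z
print(is_ip("unicomp6.unicomp.net"))              # False
print(is_ip("199.72.81.55"))                      # True
```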
We still have the ConsumeKafka and PublishKafka processors running. It's not all going to go in at the same time, because again, there is a limit. But once you start the OpenSearch layer and actually start putting data into OpenSearch, it goes much faster, and you can actually visualize it. So we're going to move on to the next stage, which is the OpenSearch layer. But before we do, are there any questions? Sorry, I think she had her hand up first. Go ahead.

Yes, you can. If you wanted something like this automated on a daily basis, you could just use a cron job. You would do that by going here. Let me stop this one, for example, since it's the first processor, and then you go into Scheduling. You can either use the timer-driven strategy, meaning I want this processor to run every 10 minutes, every two seconds, or whatever the case is, or the cron-driven strategy, where you say I want this to run every day and write in the actual cron expression. That way it's automated, and it starts running every day like you want it to. Apache Kafka? Yes, it is. I would say NiFi has that ability, though I wouldn't call it unique, because other tools have it too: NiFi is able to work with other systems such as Apache Kafka or even something like Apache Flink, and it's also quite fast. We're specifically discussing log analytics here, and for log analytics I would say NiFi is a great fit, because you're dealing with multiple data sources, a lot of big data, and you're trying to scale. Apache NiFi is a very good tool for that. I'm sorry? It's there to do the minor transformations of the data, like I just showed, and also to ingest the data, because Apache NiFi is meant to be a data pipeline tool; that ingestion layer is really where it excels. So when you're ingesting data from different sources and trying to make it work with, let's say, OpenSearch, like we're discussing today, Apache NiFi works very well for that.

Yes, correct. But I'm not sure; I haven't... I'm not sure you wouldn't need to transform the data, at least from what I've seen. Oh, okay. Yeah. No, I understand that, because Kafka alone can definitely get that done. But NiFi also has that analytics, or data transformation, layer to it, and it doesn't have to be anything too robust. It could just be: I want to change this JSON in this dataset, or I want to select certain things and send them to certain places. Let me explain what I mean. When we talked about the RouteOnAttribute processor, we took all of the data we had and said: if it has an IP address, send it this way, and if it has no IP address, send it that way. Those could be going to different destinations: one could go to this OpenSearch index while the other goes to a different OpenSearch index, or to a completely different tool. So data routing, data transformation, and even a bit of scripting, as well as updating certain fields, are things Apache NiFi excels at. You had a question? Okay, so you're asking whether it can route the data and do the data processing as well? Yes. Oh, okay. Okay, okay. Okay, I see what you mean.
I don't have any experience with that per se, so I wouldn't be able to tell you. I'm sorry I can't go into more depth. All right, are there any other questions? Okay, so now I'm going to get into the AWS OpenSearch layer.

You set up OpenSearch the way you would set up any other instance or application on AWS, so it's nothing too different from setting up an S3 bucket or an IAM role. You open it up and allocate resources to it depending on what your budget is, and of course they have a free tier, as they often do. I'm just going to click on this. For this particular Amazon OpenSearch Service domain, I've named it NASA-logs-opensearch, just to keep it on theme with the naming convention of everything else. OpenSearch, again, is a tool for visualizing and analyzing data, as well as searching through data. It can work with transactional data, and it can also work with log data; today, of course, we're using it for log data. In my experience, OpenSearch is divided into two parts: OpenSearch itself, which actually stores the data, and OpenSearch Dashboards, which you use to visualize the data. There isn't a hard, quote-unquote separation between the two; it's all here, you can go to the overview and the dashboards are right there, along with Discover, but it's entirely possible to store data in OpenSearch without ever creating a dashboard.

The first thing we're going to go over is index management. In order to get data into OpenSearch, you first have to create an index, and before creating that index, the best thing to do is to create an index template. I'm going to go over that right now. What an index template does is essentially make sure that all of the data coming into the index follows the same set of rules. Any record that I put into this particular NASA log index is going to follow the rules of this index template. The rules of this template include mapping strings as keywords: anything that comes in as a string is also going to be a keyword, which keeps those fields from being duplicated. Another rule is that the field we created to tell whether the host is an IP address, is_ip, is going to be a Boolean. OpenSearch stores different field types differently: a timestamp is one kind of field, a Boolean is another, numerics are another, and they all have different capabilities. Our index pattern matches anything starting with nasa-, so when we push data in from our pipeline, the index name has to follow that nasa- naming convention in order to pick up the rules of our index template.

Okay, I'm going to stop right here, because before actually putting data in, I want to make sure I have the naming convention right. So: nasa-log, and we can call this one 03. And our index operation is simply index, because we're creating a new index; that index doesn't exist yet.
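For reference, here is a rough Python sketch of creating that kind of index template through the OpenSearch REST API, with strings mapped as keywords and is_ip as a Boolean. In the talk this is done through the OpenSearch console; the endpoint URL, credentials, template name, and exact field names below are placeholders and assumptions, not the actual domain configuration.

```python
# Minimal sketch: register an index template for nasa-* indices via the REST API.
# Endpoint, credentials, and field names are placeholders for illustration.
import requests

OPENSEARCH = "https://localhost:9200"   # hypothetical endpoint
AUTH = ("admin", "admin")               # hypothetical credentials

template = {
    "index_patterns": ["nasa-*"],       # any index named nasa-... picks up these rules
    "template": {
        "mappings": {
            # Map incoming strings as keywords so they can be aggregated without
            # ending up with duplicated text/keyword fields.
            "dynamic_templates": [
                {"strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }}
            ],
            "properties": {
                "is_ip": {"type": "boolean"},        # the hostname-vs-IP flag
                "nasa_timestamp": {"type": "date"},  # parsed request time, stored as a date
                "content_size": {"type": "long"},
            },
        }
    },
}

resp = requests.put(
    f"{OPENSEARCH}/_index_template/nasa-logs-template",
    json=template,
    auth=AUTH,
    verify=False,   # only for a local test cluster with self-signed certs
)
print(resp.status_code, resp.json())
```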
Something else: because we have certain fields, for example the timestamp field, the best practice is to make sure that the timestamp actually comes into OpenSearch as a timestamp, and that's what's going on here in this pipeline setup. This ingest pipeline processes the log data and sets the timestamp up as a proper time field. So now that we've set up our ingest pipeline and our index template, we can begin to put the data in.

I'm going to go here first and click on Index Management, and then Indices. In our indices here, we have all of these indices, and you'll notice nasa-log-03 is not one of them, because it hasn't been created yet; that will happen when we press play on this. I'm going to go ahead and do that now and start. Actually, I think I'm going to configure this to batches of 100 at a time. Okay, so now the data is going in, and if we refresh this, we'll see nasa-log-03. The data is already here: nasa-log-03.

If we'd like to visualize this data, we have to go into dashboard management, because, again, the viewing layer, how you view the data with dashboards or with the Discover panel, is a different layer from the actual storage. We're storing the data now; that's already done. But we actually want to view it. So we go here into Index Patterns, create an index pattern, select the NASA timestamp as the time field, and create it. And these are our fields. You can see is_ip is a Boolean, meaning it's following the rules we set up in the index template, and the method is there as well, along with the content size, the file extension, all of that. Now we can go into Discover. Okay, so the data we're storing is pouring in. We have to go way back in time, because the time field it goes by is the NASA timestamp, and all of the data is from 1995, so it's not going to show up in the default time frame of the past 15 minutes. And here we are, with the timestamp, the source, and all of the data that we need. We can expand a record to view it, and we can also create visualizations and conduct any type of analysis we want here.

For example, if we go into Dashboards and select Create a new dashboard, it could be interesting to see how many of the hosts are IPs and how many aren't, just to check out our metrics. We can split the chart by the count of the data. Okay, so this is a count based on the hostnames we have, and this one is the most common for now. I say for now because the data is, of course, still streaming in, and if we keep refreshing, things could change. And this one is the second most common. So this is just a very basic metric: what are the most common hostnames that appear in this data? You could also do things like looking at which file extensions are the most common, or taking the most common hostnames and breaking them down by the response code, whether we have a 200 status code, or a 400, or a 300. All of that goes on here in the visualization and dashboard layer.
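For anyone who prefers to see the query behind that kind of breakdown rather than the dashboard UI, here is a rough Python sketch of a terms aggregation against the search API, counting how many records are IPs versus hostnames and listing the top hosts. The endpoint, credentials, index name, and field names are placeholders and assumptions.

```python
# Sketch of the aggregation behind the "IPs vs hostnames" and "top hosts" charts.
# Endpoint, credentials, index name, and field names are placeholders.
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")

query = {
    "size": 0,    # we only want the aggregations, not the individual hits
    "aggs": {
        "ip_vs_hostname": {"terms": {"field": "is_ip"}},
        "top_hosts": {"terms": {"field": "hostname", "size": 5}},
    },
}

resp = requests.post(
    f"{OPENSEARCH}/nasa-log-03/_search",
    json=query,
    auth=AUTH,
    verify=False,
)
for bucket in resp.json()["aggregations"]["ip_vs_hostname"]["buckets"]:
    print(bucket.get("key_as_string", bucket["key"]), bucket["doc_count"])
```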
All right, any questions? Yes, correct, you can. So if you're trying to look at, let's say, data over a specific timeframe, then I would recommend perhaps a line graph to see the trends over time. Say we want to know why we're getting more responses, or more logs, in the month of July versus the month of August; yes, we can definitely visualize something like that. No, no, no. Just so I understand what you're asking: let's say we're breaking it down like this, and if this IP hits a certain threshold, we want to be alerted on that? Like setting up rules? Yeah, OpenSearch can do things like that. It also has machine learning jobs based on the data that's coming in, and machine learning jobs work great with streaming data, because they keep updating based on whatever rule you set up to be alerted on. So you can set up alerts, you can set up rules, and you can set up machine learning jobs in OpenSearch. Yeah, all right. So that's about it. Thank you so much, everyone. If no one has any more questions, then that's it. Thank you.