Hi, my name is Thuns. I'm a tech evangelist and an Oracle ACE Director, and today I'm going to be talking about Fluentd and its work with logs: how we can use it to make our lives a lot easier in operations and development environments.

So let me introduce myself properly first, in five quick bullets. I'm a father, a husband, a blogger and an author. I also happen to help run a developer meetup here in London, and I'm currently working on a book called Unified Logging with Fluentd. It's in the early access program at the moment, so about halfway through, and in it you can read a lot about the use and configuration of Fluentd and why we might want to use it in different scenarios. As I say, I work for Capgemini, and I'm fortunate enough to work with a very successful UK team; we've won a number of awards over the last couple of years. So that's us.

Before I get into the nuts and bolts of it, it's worth looking at monitoring and putting it into context. The idea of monitoring, if you go back to the English-language definition, is to observe what's happening over a period of time. That's led to ideas such as observability and the three pillars of observability, which is quite a useful vehicle. It's important because we all work from different perspectives, and depending on your viewpoint and your role in an organization, your view of logging or monitoring will be very different if you're an infrastructure expert compared to an application developer, or if you're working from a low-code perspective, or in business solutions and BPM and things like that.

So let's have a quick look at the stack and the three pillars of observability. One way of observing things is to look at them from a numerical perspective.
This is capturing your CPU usage, your memory consumption, your resources — a very numeric, statistical perspective.

The next way we can look at monitoring is around the idea of logs, and this is where Fluentd is really at its strength, but we'll come back to that in a little while. Logs are equally important as those statistical facts. When you have a problem with your server, it's likely to put information into an SNMP trap, or use another mechanism, to report the nature of the issue you're experiencing — a hardware fault and things like that. And of course in applications, most of us that work as developers use logs as our bread and butter. We use them for a number of different reasons, from helping us make sure the software we're developing is running and behaving as expected, through to creating audit trail events and information that meets security requirements, and tracking user activities to help address SIEM requirements.

Then you've got the latest evolution, which is around the application of tracing, where you want to see how events start on one node in a distributed environment and then track through different applications and components — the various microservices you might be running in Kubernetes, perhaps through a Kafka stream, through your API gateways in and out — and see where the delays are, where the performance issues are, and details like that. That's tracing for you.

The best way to examine that is, as I mentioned, through some examples. At the hosting and infrastructure level, monitoring activities tend to be heavily focused on the metrics side of things, with a little bit of the log characteristics and practically no real tracing. When we get to virtualization — VMs, or a container framework such as Kubernetes — it's still fairly strong on the numerical side of things, but logs become a lot more prevalent and we start to see a little bit of tracing coming in. Move up the stack a bit further into the application space and your tracing is probably just as important as your logs, and actually the stats drop off a little and change in nature, because you're interested in the stats from your virtual machine to make sure your application is tuned correctly, rather than raw CPU issues.

Above that we can look at business application monitoring: how many transactions you've completed from a business perspective, from purchase to completion to fulfillment, and processes like that. And on top of that we have security considerations, and security is looking at a lot of log things: what's a user done, when did they sign in, when did they sign out, how often are they interacting, how often do they fail to get their credentials right, and so on. On the other side of the spectrum, you're trying to capture the capacity insights so that you can scale and forecast demand, and potentially do chargeback if you're sharing the costs of running a back end amongst different teams.

So let's have a look at application logs more specifically. Why do we use them?
Well, we're looking for unexpected errors, as mentioned. When something doesn't behave as we expect, one of the things that will tell us what's going on, or why it's happened, is obviously our application log. We also want to look at things like performance issues — the classic measures there are looking at your database or your storage mechanism, at things like slow query logs and query performance, and pulling that out. And then of course, as I mentioned, we've got GDPR compliance and all sorts of other legislative requirements these days that mean we have to create audit trails and record what users are doing, so that in the event of an investigation, or something suspect happening, or you questioning a user's actions for some reason, you can go back through and understand what they've done, when they did it, and potentially even why they did it.

The bottom line in all of this is that it's always about feeding the business need, so that we can understand the value we're bringing to the business and succeed in resolving business problems, or confirm that the business is running smoothly, everything's behaving as expected, and we're not losing data. If anybody raises a question, you've got the evidence to support the answer. But it's all about the impact: if there's no clear business connection to what we're monitoring, then sooner or later the activity of developing monitoring gets shut down. That could be simply monitoring to optimize performance and make sure we're running our environment efficiently, or, at the other end of the spectrum, monitoring to make sure nobody's abusing the system.

One of the things we've seen over the years is changing complexity, and that's impacted our ability to monitor, whether through logs or through metrics, so it's worth understanding just how significantly that complexity has evolved. When the IT industry started, if you like, back in the 50s and 60s, we didn't have to deal with concurrency, we didn't have issues of distribution, and scaling was whatever the computer could do — if you could buy a bigger computer then maybe you could scale and do a bit more, but those were the limitations. As time has moved on, we've got into multi-threading on a single CPU, or multiple CPUs in a single server, as you'd see in large enterprise infrastructure and mainframe computing. With that we've seen things like Tomcat and servlet-based applications, and SOA platforms such as the Oracle Service Bus amongst others, coming in, but it's still largely focused on a single CPU with threads. Then over time we've scaled that out and started to distribute the workload. But typically when you distribute the workload, an individual process still remains on one server; you don't start jumping around the servers too much, unless you're calling a discrete component like a database, in which case that might be hosted on a different piece of infrastructure. It's still fairly easy to track what's going on.
It's still happening within one environment, normally. Then as we've evolved again, we've driven up the level of asynchronous behaviour. Node.js is a great example of that, where we've moved away from a threading model to something more single-threaded, but we're picking up and putting down workloads based on the I/O requirements; Kafka is another example of that kind of thing. And then of course we're into fully distributed mechanisms as well, where we've developed scale-out and — particularly with the advent of cloud and greater elasticity in our operations — we can spin up new servers at the drop of a hat to scale out part of our application space. Combine that with asynchronous behaviour and jobs can suddenly bounce around different servers as they process through their life within our application. So the complexity has escalated phenomenally, and we've got to respond to that with our ability to measure and monitor. A number of techniques have been brought to bear: twenty years ago we might have just looked at one log file; now we have many, and many sources of logs as well, and it's all distributed. Fluentd is a tool that can help us address that.

So let's introduce Fluentd. It's not the most well-known of products out there, but amongst those interested and involved in DevOps and operational activities it's something people are becoming increasingly aware of. Its legacy was actually in big data, with a company called Treasure Data, who open sourced it, and over time it's become an open source project under the governance of the Cloud Native Computing Foundation, or CNCF. That's really when it started to get a lot of traction, in many respects because it has open governance, it's become part of the Kubernetes ecosystem, and it's vendor neutral — but highly pluggable. There are other technologies in this space that people will have heard of, such as the ELK stack, which we'll come on to in a moment, but Fluentd is basically a very lightweight framework that makes heavy use of plug-ins. You can build your own plug-ins, and a lot of vendors have made their solutions compatible with Fluentd by building plug-ins, so you can incorporate or feed your Fluentd-monitored environments into their tools. We can capture data from many directions and feed it to Splunk — an industry-leading product for log aggregation and analytics, particularly for security use cases — through to integrating with the cloud-native solutions that Amazon and Google provide, and Oracle now as well, with their latest offering in this space. And of course, if you're building a bespoke solution, you might want to build your own custom plug-in, either to inject log data into Fluentd or to pull it out if you're trying to do some analysis.

This highly pluggable nature has made it very powerful and very flexible. There's something in the order of 500-plus plug-ins; not all of them are out-of-the-box Fluentd — there is a very lively and vibrant open source community contributing quite a few of these in addition to the core, governed part of Fluentd. The plug-ins allow us to do things like formatting and filtering, deal with different storage technologies from Elasticsearch through to S3 buckets and many others, build in caching and use caching products, and parse the payload so we can start to extract meaning out of the log events that we capture.
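To give a feel for how lightweight that core is, here's a minimal sketch of a Fluentd pipeline: tail a log file, parse each line, and send the structured events somewhere. The path, tag and output below are purely illustrative; in practice you'd swap the stdout output for whichever output plug-in your target tool provides (Elasticsearch, S3 and so on).

  <source>
    @type tail                       # follow an application log file as it grows
    path /var/log/myapp/app.log
    pos_file /var/log/fluentd/myapp.pos
    tag myapp.log
    <parse>
      @type json                     # assume each line is a JSON log event
    </parse>
  </source>

  <match myapp.**>
    @type stdout                     # swap for an elasticsearch, s3, etc. output plug-in
  </match>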
We can then start to apply rules about whether we need to pass an event on and who needs it, because, as we'll see shortly, some organizations will say that the security team want to use Splunk but the ops team want to use Nagios, and there will be an overlap of information, so we need to route it to the right place.

We can also look at the lifecycle of logs — and therefore what we expect and need from an overall monitoring solution and a log management and unification mechanism — as a lifecycle. At one end of the spectrum, log events are being generated by vast numbers of different components: from Kubernetes, from the server infrastructure, from your applications, from the business process layers. They all need to be captured and processed, or a determination made on their relevance and what to do with them. So we need to ingest them dynamically, evaluate what they are, and route them, potentially manipulating the structure of the log events so they can be consumed by the target product. Some systems just want the whole blob as text; others want a structured JSON payload so they can process it in a particular way.

As you feed events into a system, you might want to dynamically grab one and push an alert out because something critical has been spotted within the stream of log events you're capturing. Or you might want to aggregate and look for patterns in activity — this is how a lot of the SIEM tools work, looking at user behaviour over time to see if there are any anomalies in the way people are using systems. And then of course we might want to visualize the data: what does the demand profile look like, what's our consumption, when do we see problems occurring, what about performance — and correlate that, visually showing how it might relate to how other components in our infrastructure are working. Does our application suddenly throw an error when the database server shows it's under load? If you've got that visually represented, it's easy to spot those correlations, or you could incorporate AI and some clever tricks to find those patterns for you.

And of course there's no point in doing all that analysis if you're not going to act upon it. So we need to be able to notify or alert people to these things — particularly time-sensitive notifications when it's operationally critical: a server's collapsed, or your database is suddenly starting to grind because it's running out of storage for some unexpected reason. Along with that, perhaps even generating Jira tickets to say this bug, or this error, has cropped up so many times today that it needs to be investigated, because it's simply too frequent.
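Coming back to that routing point: in Fluentd terms, getting events to the right team is largely a matter of tag matching. Here's a minimal sketch, assuming the audit-relevant events have been tagged app.audit.* at source; the file and stdout outputs are just stand-ins for whatever the security and ops teams actually use.

  <match app.audit.**>
    @type copy                       # the overlap: both teams see these events
    <store>
      @type file                     # stand-in for the security team's tool
      path /var/log/fluentd/security
    </store>
    <store>
      @type stdout                   # stand-in for the ops team's tool
    </store>
  </match>

  <match app.**>
    @type stdout                     # everything else goes to the ops view only
  </match>

Note the order matters: Fluentd takes the first match block whose pattern fits the tag, so the more specific audit pattern has to come before the catch-all.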
So I mentioned the ELK stack — Elasticsearch, Logstash and Kibana — which is probably the best-known stack for monitoring. We have Logstash at the bottom of the stack, and it has a baby brother, if you like, called Beats, which is designed to have an ultra-light footprint so you can deploy it in IoT-style and highly distributed solutions; it has limited capability compared to the full Logstash. Then you've got the aggregation layer, where you've taken all of your log events and put them into Elasticsearch. You can start doing analysis there — trends, occurrences, what's happened over time, your overall history. It means local servers don't have to retain the logs for very long, and it gives you a central repository that people can interrogate to see what's going on. And of course, as I say, we need to visualize all of that, and that's Kibana. All of these come from Elastic as a vendor, and there's a lot of commonality with Fluentd in this.

So let me bring in the EFK stack. EFK is coming in because Fluentd is starting to be seen as an alternative to, and potentially displacing, Logstash. That's been driven from a couple of dimensions. First of all, as I say, Fluentd has a lot more richness in terms of plug-ins, which gives you a lot more facility in terms of things you can capture natively without doing any work. It is also a natural part of Kubernetes, which means it's becoming a first-class citizen in the infrastructure. And it's a very powerful utility that has some flexibility around caching in ways that Logstash is not quite so free with. So people are looking at Fluentd rather than Logstash, but keeping Elasticsearch and its analytical capabilities, and the visualization of Kibana — and as a result we have EFK. Fluentd can also behave like Logstash to the rest of the stack, as one of its plug-ins provides a Logstash-style behaviour towards Elasticsearch.
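That Logstash-style behaviour is largely down to how the Elasticsearch output plug-in (fluent-plugin-elasticsearch, installed separately) writes its indices. A rough sketch, with an illustrative host and prefix:

  <match **>
    @type elasticsearch
    host elasticsearch.example.com
    port 9200
    logstash_format true             # write date-stamped, Logstash-style indices
    logstash_prefix applogs          # optional: change the index name prefix
  </match>

The practical effect is that Kibana index patterns and dashboards built against Logstash-shaped data can keep working when Fluentd takes over the collection side.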
So I'm going to do a little demo in a moment. It's going to show you just a hint of the art of the possible, and a little bit of fun. What does the demo do? I'm going to have a server running which is going to take two log files, do a small transform on one of them, and then pass that log data on to what's known as a forwarder — we'll see that in a minute and understand its role — and also simply push logs back out to file. I could put filters into that pipeline to say, okay, these are the events of interest and I want to write those to the log file, but for simplicity we've avoided all of that at this stage. That could occur on one node or on many nodes; in fact, whilst in the demo we'll only run it on one node, there's nothing stopping us running this on lots of servers and simulating the sources that way. Of course, in a highly distributed environment we want to centralize and aggregate that together, so we've built a second node with another Fluentd configuration. This is an aggregation point, which is going to take the forwarded events and apply a common process. That common process is simply examining the log events so we can have a quick stdout peek of what's going on — just letting them stream through — and then specific log entries we're going to tease out and send to Slack. So if I were configured to look for specific errors, I could grab those and send them to Slack. I could even get very clever and say, right, that exception is in this part of the application, therefore I'm going to send a Slack message to Joe the developer, because that's his responsibility and he should know a production error has occurred.

Right. So I'm not going to use the screen grabs; I'm actually going to jump into a real development environment. What we've got here is two Fluentd configurations — you can see node one and node two — and let's quickly walk through node one. First I can tell Fluentd how to log itself, and I'm just saying: okay, Fluentd, I want you to report at info level. Then, in a fairly declarative manner, I define one or many sources — in this case two: one called basic_file and a second called basic_file2. In the real world, if you were running a legacy app server — perhaps WebLogic or WebSphere, or a Red Hat Fuse container — it would be generating lots of different log files for different applications. You'd potentially separate out the log files from your application WAR or EAR file from the logs being generated by the core engine of that Java EE framework or MicroProfile container. However it's set up, you can collect them, and I could just as easily have infrastructure collection going on here as well, collecting things like an SNMP trap on my machine.

So I declare those sources, and then I've defined a very, very simple pipeline which is going to do a simple transform and manipulate the structure of the log event, so that when it's passed to the central node it appears in the same structure irrespective of its origin. I can show you how the two sources differ: I have two configuration files here that I use with an open source tool we've built, available through GitHub, which lets us take data and create log events, or replay an application log file — re-stamping it and playing it back in real time so the time intervals are controlled. If you're trying to use it to test your monitoring configuration, you might want to iterate through your logs multiple times, so we can do things like that; there's lots of flexibility. But the bottom line is that in here I've defined how I'm going to structure the payload and how I'm reading the source. As you can see, I'm taking the basic file event, then taking the event message and adding a counter of how many times I've iterated through the test log data set. If I go to my second stream, you can see the message is different: it carries the same value, but I've given it a different name, and I've given it attributes as well, so I can simulate class paths and things like that.

So that's going to get picked up by node one and processed, and the important thing is what happens here: once I've done the transform, I'm going to send it to two places — to the file I mentioned, and then forward it on to the other node, and that's declared with this declaration here. This bit says, okay, I'm going to pass it on; I want another pipeline, which we refer to as a label, which is a series of activities. It will then match all the events that come in, and every five seconds I'll push the log events over to the consolidated second node.
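As a rough idea of what that node-one configuration looks like — this is a cut-down sketch rather than the exact demo file, so the paths, tags, port and field names are illustrative:

  <system>
    log_level info                   # Fluentd reporting on itself at info level
  </system>

  <source>
    @type tail
    path /var/log/demo/basic-file.log
    pos_file /var/log/fluentd/basic-file.pos
    tag demo.basic_file
    <parse>
      @type none                     # take each line as-is; structure is added below
    </parse>
  </source>

  # A second <source> block for the second file would look the same,
  # with its own path and the tag demo.basic_file2.

  # Normalise both streams into the same structure before they leave this node
  <filter demo.**>
    @type record_transformer
    <record>
      source_node node1
      original_tag ${tag}
    </record>
  </filter>

  <match demo.**>
    @type copy
    <store>
      @type file                     # local copy of everything
      path /var/log/fluentd/node1-out
    </store>
    <store>
      @type forward                  # pass events on to the aggregation node
      <server>
        host node2.example.com
        port 2880
      </server>
      <buffer>
        flush_interval 5s            # push to the second node every five seconds
      </buffer>
    </store>
  </match>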
Then if we look at the second node, which is here, you can see my source coming in, which is now the forward input, and I'm saying: on port 2880, accept events in. Then I can look for log events that reference "computer" — with a capital C or a lowercase c, of course. I could be clever with my regex and say it's going to be a capital C or a lowercase c, or just ignore capitalization in the expression altogether. Once I've filtered those out — not all of my simulated logs are going to talk about "computer" in any way — those that do will come through the stdout I mentioned and then go on to Slack. And yes, by the time we finish this presentation I'll have changed my Slack token. You can then see I've configured which aspects of the log message should be used — I'm just using all of it here, but it would be very easy to say I only want these attributes, and pull specific attributes out of the message and traverse the log event, because I've applied some structure and meaning to it.
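Again as a rough, cut-down sketch of that aggregator side — the Slack output is a community plug-in (fluent-plugin-slack), so treat its parameters as indicative and check the plug-in's own documentation; the port, channel and webhook below are illustrative:

  <source>
    @type forward                    # accept events forwarded from node one
    port 2880
  </source>

  <match demo.**>
    @type copy
    <store>
      @type stdout                   # quick peek at everything streaming through
    </store>
    <store>
      @type relabel                  # hand a copy to the Slack pipeline below
      @label @SLACK
    </store>
  </match>

  <label @SLACK>
    # Only keep events whose message mentions "computer" (capital or lowercase c)
    <filter demo.**>
      @type grep
      <regexp>
        key message
        pattern /[Cc]omputer/
      </regexp>
    </filter>

    <match demo.**>
      @type slack                    # community plug-in: fluent-plugin-slack
      webhook_url https://hooks.slack.com/services/XXX/YYY/ZZZ
      channel demo-alerts
      username fluentd
      message %s
      message_keys message
    </match>
  </label>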
So let's fire things up. Here we've got a few shell environments on my machine, and we're just going to fire up the Fluentd nodes. There's one, and there's the other, and we can see it's processing the configuration; it's just telling us about some buffer configurations. Then I'm going to fire up the log generation process, and you'll see it telling us about the logs it's building. What we'll see is that it's now processing these, and if I bring up my Slack environment, in a moment it starts to nudge me, telling me that I've got events coming through. If I bring this over, you can see that Fluentd is now getting messages, and you can see it's built the message up, telling me which node it is — well, I've hardwired that into the configuration, as we saw — and it's displaying the messages and changing the headers based on the origin of the payload. That's entirely my configuration. It will keep ticking away, sending more messages to me as we go.

And that's really a very simple example of Fluentd in action. Yes, it's very Mickey Mouse, but through a few dozen lines of code — or configuration, should I say — if I come back here, we can see that I'm consuming and routing the messages very easily, and I'm filtering out messages over there. So let me stop Slack, because it's pinging away, and we'll kill the processes.

In the real world, we've talked about the need for highly distributed models, so let me show you some options and ways you can deploy Fluentd. We can think about how we react and support the tracing considerations as well, although you might want to use something like Jaeger if you're in a microservices world and want to do tracing. We've talked about the challenge of having to distribute to different tools for different teams to use for different purposes. And whilst I've talked a lot about Kubernetes here, there is also a lot this will do, and it's equally valid, in a legacy environment. You don't have to be a microservice in a container; this can be on a real physical machine, because at the end of the day you're just harvesting log files or log events. You can even wire your applications directly to Fluentd: if you're using Java, for example, with Logback or SLF4J, there are configurations you can apply that mean they talk directly to Fluentd, cutting out the file. That gives you an efficiency improvement and a performance gain, because Fluentd hasn't got to parse the lines out of the file to say, right, I've got this record, now I'm going to apply some meaning to it — it isn't having to understand the structure of your log file at all.

But let's look at some possibilities in terms of scaling the deployment. This is obviously a little bit like the mocked-up scenario I've shown you: we have three servers running one application, aggregating to a central server, and they could have a secondary or DR node ready and available. You could then have some other servers with Fluentd instances aggregating to another central one, and the central servers perhaps filtering out the most critical messages and sending them to an alerting tool — I've used Slack, but that could just as easily be PagerDuty or Teams or many other tools in that space — and then pushing the logs on for more substantial long-term analysis, by putting them into a persistence layer such as Elasticsearch, or just dropping them into S3 buckets to be pulled into a tool of choice, and so on.

That could equally be a Docker setup managed by Kubernetes. On the left-hand side here I have a worker node where it's assumed all the microservices are just writing to stdout, which means Kubernetes will pick the output up on the worker nodes and be able to process it. Obviously that's less efficient, because you've got to take the stdout feed and then apply the meaning — slice the line of text up to extract the time and date, the node that generated the event, and then the message, which may contain structured data as well. If you can avoid that, then you might want to use the model on the right-hand side, where I've put in Fluentd's smaller brethren, Fluent Bit, which can do a subset of what Fluentd does but has a very, very small footprint, and it can route the log events through so we don't have to re-parse things to get that meaning — Fluent Bit has done that, or rather the interaction with Fluent Bit means we've not lost the meaning from the outset. We aggregate and send to a central node, and we can do things like failover, or, as Fluentd would talk about it, node discovery, so it can go and find a node, and we can put the concentrators into pods if we like, and so on. We could also address it using the sidecar pattern: there are pre-built Docker images with Fluentd, so you can just inject the configuration file for it to use and off you go. So again, the options and possibilities really come down to how you want to work, and the pros and cons and benefits of the different configuration approaches.
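That failover and node-discovery behaviour is, on the forwarding side, just more configuration. A small sketch, with illustrative hostnames, of a leaf node that prefers one aggregator and only uses the second if the first stops answering heartbeats:

  <match **>
    @type forward
    <server>
      name primary
      host aggregator-1.example.com
      port 24224
    </server>
    <server>
      name secondary
      host aggregator-2.example.com
      port 24224
      standby                        # only used when the primary is unavailable
    </server>
  </match>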
To wrap up, I just wanted to share this: it's the basis of a real-world use case we've built. As you can see across the bottom, I've shown the lifecycle of the log events, and Fluentd sits in the middle, as its real strength is to structure and route. We've got the information sources on the left, varying from virtual machines to Kubernetes and Docker images, Log4j on our Java application; Istio is involved; and of course, if we want to deal with multi-cloud, we could even harvest the log events being captured by a cloud provider — that could be AWS or Google or Oracle or Azure and so on. What Fluentd did in this case was feed those events through to a number of different tools. We sent any security-related activities to Splunk, but we didn't send everything to Splunk, as it was deployed in a different location and we didn't want to generate vast amounts of data transfer from cloud to cloud, because you start incurring costs there — so we were pulling out the relevant log events and pushing just those across. We were also building some very simple metrics with Fluentd, which were then shared with Prometheus, routing on certain log events that were just showing stats coming out of the JVM. A subset of that was being sent to Elasticsearch for visualization through Kibana, and we had some rules on Elasticsearch so that if it found a certain number of errors with a particular common characteristic, we would raise a Jira ticket. And of course we were nudging people through email and Slack on specific events of interest to them.

In line with "the right tool for the right job", despite having Fluentd there, for the tracing activities we still used Jaeger to collect the trace data and provide the tracing analysis and visualization, because it's not a case of one tool fits all. Fluentd works well with semi-structured and structured data, but pure metrics are often best handled by dedicated tools, particularly for things like trace. You could do it if you were really perverse, but it'd be a lot of work when there's a tool out there that does a better job and is designed for that very purpose.

So thank you for your time; I hope you found it useful. A lot of this information is available online, as I mentioned at the start, and we'll take any questions. Thank you very much.