Hi, can you hear me? Okay. First of all, a few words about me. I'm Peter Czanik from Hungary, working as community manager at Balabit, syslog-ng's upstream developer. I do packaging, support, and advocacy for syslog-ng in Hungary and around the world. If you haven't heard of Balabit, it's an IT security company headquartered in Budapest, Hungary, but it also has offices in other European countries and, since last year, in Manhattan, New York. Today I will talk about syslog-ng, which is not your typical big data tool, but we are working on making it one, and since last year we have added quite a few possibilities for using syslog-ng in a big data environment.

First of all, what is logging, and what is syslog-ng? Logging is the recording of events on a computer; a typical log message on a Linux system looks like this one. And syslog-ng is an enhanced logging daemon with a strong focus on central log collection. That was the focus for quite a long time, and it still is, but we have many more features and possibilities now. It can collect not only system messages but all kinds of application data, it can process and filter these log messages, and it can not only store messages in a central location but also forward them to a wide variety of destinations, since last year including many big data solutions. It's a kind of C-3PO: while C-3PO knows about six million forms of communication, we still have to work to reach his abilities, but we are working on it.

So how can you use syslog-ng in a big data environment? It can facilitate the data pipeline to big data in many ways: it can act as a collector and a data processor, and with data filtering you can make sure that only relevant messages reach your big data systems. So let's talk about these in detail.
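For reference, the typical Linux log message I mentioned looks roughly like this; the host name, process ID, and user details here are invented for illustration:

```
Jan 24 09:32:01 myhost sshd[1234]: Accepted password for jdoe from 192.168.1.5 port 51874 ssh2
```

It's a timestamp, a host name, a program name with a PID, and then free-form English text with variable parts in it (the user name, the client address, the port).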
First, data collection. With syslog-ng you can collect both system logs and application logs, which provide very interesting and important contextual data for each other. It can collect messages from a wide variety of platform-specific sources, as it runs on all the major Linux distributions and most UNIX variants, so it can read from /dev/log, the systemd journal, Sun STREAMS, and so on. It's a central logging solution: it can receive messages from the network using both the original legacy syslog protocol and the new RFC 5424 syslog protocol. But not only these: you can use any data format you like to send messages to syslog-ng, as long as you can separate the messages, for example with a newline. What is also very important is that with syslog-ng you can collect not just system messages but any kind of application messages: it can read the log files of other applications, collect data through sockets or pipes, and you can also collect the output of an application if it is started by syslog-ng.

The next step is processing all of this data, and it's very important that with syslog-ng you can process your data quite close to the source.
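The kinds of sources just described might look like this in syslog-ng's configuration syntax; this is only a sketch, and the block names (s_local and so on), the port numbers, and the application log path are illustrative choices of mine:

```
source s_local {
    system();    # platform-specific local logs: /dev/log, journal, etc.
    internal();  # syslog-ng's own internal messages
};

source s_network {
    syslog(transport("tcp") port(601));   # new RFC 5424 syslog
    network(transport("udp") port(514));  # legacy BSD syslog
};

source s_app {
    file("/var/log/myapp.log" follow-freq(1));  # another application's log file
};
```

Once defined, these source blocks can be referenced by name from any log path.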
So if you have many machines, the processing of data can be distributed across all of these machines, making the life of your central infrastructure easier. You can classify, normalize, and structure log messages using the built-in parsers in syslog-ng. You can also rewrite messages, and you shouldn't think of this as falsifying messages: for example, it's often required to anonymize log messages. Messages can also be reformatted using templates. SIEM solutions or analyzers often need log messages in a specific format, like a specific date format, or the whole message in JSON, and that can be done by syslog-ng. Data can also be enriched: for example, if you have an IP address, location data can be added to the log messages using GeoIP, or, as we will see with patterndb, additional fields can be added based on the message content.

The next step is data filtering, and it has two major uses. First of all, filtering makes it possible to throw away surplus log messages: you don't want to store debug messages, unless it's really necessary for debugging purposes, so the rest of the time you discard them. It's also used for message routing: if you have a SIEM system, you only need security-related events routed to the SIEM, and the rest can be stored locally on the syslog-ng server. Filtering has many possibilities, based on the message parameters or, thanks to parsing, based on the message content. There are many ways to find the right messages: comparisons, wildcards, regular expressions, or the different filter functions in syslog-ng. And the best thing is that all of these can be combined with Boolean operators, so the possibilities are practically endless when filtering messages.

Finally, a few words about which big data destinations are in syslog-ng right now. We support Hadoop and some NoSQL databases like MongoDB and Elasticsearch; actually, Elasticsearch is our most popular destination right now, and the
second most popular big data destination right now is Kafka.

Next I would like to say a few words about log messages. I already showed this typical log message from a Linux machine, and if you look closer you will see that it's a date, a host name, and some text. The text part is usually a complete English sentence with some variable parts in it, as you can see above. It's very easy for a human to read, but once you have not just your workstation's log messages but log messages from hundreds of machines, or even just one single busy server, it's very difficult to find anything, create a report, or do any further processing with your log messages. People often feel lost when they first look at the amount of data.

There is a solution for this problem: structured logging, where events are represented not with free-form text messages but with name-value pairs. Coming back to my favorite SSH example, you can create name-value pairs for the application name, the user name, and so on, and describe the same event with name-value pairs instead of free text, and this is much easier to search once it's stored in a database. The good news is that syslog-ng has had name-value pairs inside from the beginning; it was necessary for flexible filtering. The date, facility, priority, program name, and so on were all stored as name-value pairs inside syslog-ng and could be used for filtering. It was just one step further to add parsers to syslog-ng, and this way any unstructured data, and some of the structured data formats, can also be turned into name-value pairs and used for filtering and message routing.

There is a JSON parser in syslog-ng, as this logging format has become quite popular recently, and it can turn JSON messages into name-value pairs.
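Declaring the JSON parser takes only a line or two; a minimal sketch, where the parser name and the prefix are my own choices:

```
parser p_json {
    json-parser(prefix(".json."));  # {"user": "jdoe"} becomes ${.json.user}
};
```

A parsed field can then drive filtering or routing, for example naming a file destination with file("/var/log/apps/${.json.user}.log"); the user field here is hypothetical.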
So any data stored in these messages can be used in filtering, or you can store just part of a field, or create routing based on a field value, and so on.

The next one is the CSV parser. CSV stands for comma-separated values, and that was the first type of columnar data implemented in syslog-ng; that's how the name was born. But any kind of columnar data can be processed with the CSV parser in syslog-ng. The most popular use is parsing Apache access log messages, and in this configuration snippet all the fields of an access log are described with column names. At the bottom of the screen you can see that the user name-value pair parsed from the access log messages is used for naming the file destinations where messages are stored.

The most interesting parser in syslog-ng is patterndb. It's a message parser which can extract useful information from unstructured messages into name-value pairs, although it has to know the message format to be able to parse it. It can not only extract values: it can also add status fields based on the message content, and it can classify log messages, just like logcheck does on a typical Debian system. To get it working, one needs XML files describing the log messages; some of these are on GitHub, ready to be used. As an example, coming back to the SSH login, or rather login failure: the user name and the source IP address are actual fields extracted from the log message, while the fact that it's a failure and that the action is a login are fields added based on the message content, and it can also be classified, in this case as a violation.

In the upcoming version of syslog-ng there will be some additional parsing possibilities: one is for parsing name-value pairs out of log messages, and another one is for parsing the audit log format into name-value pairs. That way you will be able to create alerts
by reading your audit logs, for example.

Anonymizing messages has become a hot topic recently. There are many regulations and compliance requirements which declare what can be logged and what must not be in log messages. For example, under PCI DSS credit card numbers are not allowed to be logged, and in Europe there are many different privacy regulations, so often IP addresses or user names are not allowed to be logged.

Locating sensitive information can be done in multiple ways. One is using regular expressions: for example, credit card numbers or IP addresses can be located using this technique in any kind of log message, not just in known messages. On the other hand, it's quite slow; it's not an efficient way to find information. One can also use patterndb for locating sensitive data, which is very fast; on the other hand, you need descriptions for all of your log messages, or at least for those where sensitive data can occur.

There are also multiple ways to anonymize your log messages. The simple way is to overwrite sensitive information with a constant, which is simple and fast; but if you need to analyze your log messages and follow sessions in your logs, then it's better to use hashing, so the original data is always overwritten by a hash. This way you don't see the user name or IP address or any other sensitive data, but you can still follow sessions, as the hash will be the same if the same data appears in your logs again.

syslog-ng is originally implemented in C. C makes it possible to be a high-performance application, so it can process many more logs than logging solutions written in interpreted languages. On the other hand, not everything is implemented in C, and rapid prototyping is much easier in interpreted languages.
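To give an idea of how such an interpreted-language extension can look: with the Python binding, a destination driver can be prototyped as a simple class. This is a hypothetical sketch, not the definitive API; syslog-ng looks these methods up by name, and the exact signatures have varied between versions:

```python
# A hypothetical syslog-ng Python destination driver. A real driver would
# write to a file, a database, or a network service; this one only buffers
# messages so the shape of the API is visible.
class MyDestination:
    def init(self):
        # Called once at startup: open connections, allocate resources.
        self.buffer = []
        return True

    def send(self, msg):
        # Called for every message; msg behaves like a dict of name-value pairs.
        self.buffer.append(msg.get("MESSAGE", ""))
        return True  # returning False would signal a failed delivery

    def deinit(self):
        # Called at shutdown: flush and close resources.
        return True
```

On the syslog-ng side this would be wired in with something like destination d_py { python(class("mydest.MyDestination")); };, where mydest is the Python module name.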
So last year we started to implement language bindings in syslog-ng, so that destination drivers can be written in non-C languages. The core of syslog-ng now supports Python and Java, and Lua and Perl are in the syslog-ng incubator. The syslog-ng incubator is a sibling project of syslog-ng: if someone writes a module for syslog-ng, the first step is to include it in the incubator, where experimental modules are available, and once it has matured, it can be moved to the syslog-ng core. With the language bindings, the interpreter is embedded into syslog-ng. This has some speed advantages, and it also makes proper error handling possible. It's also possible to use external applications on the destination side, but in that case there is no feedback towards syslog-ng if anything goes wrong; if the interpreter is embedded, proper error handling is possible.

As you might be aware, most of the big data applications are written in Java. C or Python clients usually exist, but not all the time, and even if they exist, Java is the official client for big data solutions, and that's what is maintained together with the server component. This is why we decided to develop the big data destinations of syslog-ng in Java. Using these destinations takes a bit more effort than usual: our Java-based destinations cannot yet be included in the different distributions, as most of the JARs used by us, and also the build tool, Gradle, are not yet in the distributions, but hopefully in the coming months this will be fixed. If you would like to try them, my blog goes into detail about how the different Linux distributions work from this point of view.

If you want to use syslog-ng, you need to configure it, and my first piece of advice is: don't panic. The syslog-ng configuration looks quite a bit scary at first sight, but if you take just a few minutes, you will learn that it's not that scary
and that it's actually simple and logical. It has a pipeline model: there are many different building blocks, like sources, destinations, filters, and so on, and once you define these blocks, you can connect them with log statements into a pipeline.

Here I will show an example configuration. First of all, some global options. syslog-ng.conf always starts with a version declaration; in this case it's 3.7. Then, usually but not necessarily, there are some includes. scl.conf stands for the syslog-ng configuration library: we have some configurations prepared and bundled with syslog-ng which you can use in your own configuration. For example, for locating credit card numbers there is a long and ugly regular expression which you don't have to copy and paste into your configuration; just use the SCL to find credit card numbers. Many similar features are implemented in the SCL. Next there are some global options which affect all the rest of syslog-ng; many of these settings can be overridden in different parts of the configuration. For example, if you have a low-traffic server, then flush-lines(0) means that each log message is written to disk as soon as it arrives; but if you also have an SMTP server with many incoming e-mail messages, then you might want to increase this value to a larger number to make sure that logging performance is not affected, and do that only for the given destination and not for the rest.

The next step is defining sources, where you are collecting messages from. The first one is for local messages: the system() source is the solution for hiding away the differences between the various platforms. So if you have Linux machines with System V and systemd, you have FreeBSD, you have Solaris, and so on, you don't have to keep track of the system-specific log sources, but can use the same configuration on all of the machines, as system() will find the right log source on each of them.
Then there is internal(), which is for syslog-ng's own internal messages. In most cases it's okay to collect these together with the system logs, but on some occasions it's better to log them separately. Next you can see a network source; in this case it's UDP, listening on all IP addresses of the host on port 514.

The next step is to configure some destinations. At the top of the screen you can see a file destination, in this case for local log messages. The other one is more related to big data: it's an Elasticsearch destination, where you can set the index name, the cluster name, and the template for how you send your messages; in this case it's a JSON template, and the fields are filled in from the legacy syslog format.

The next step is to configure some filters and parsers. The first one is a filter which discards debug messages and lets through the rest. The second one is typical for local log messages, and you can see here that many different filtering possibilities are used: filtering out debug messages, not allowing mail-related messages, and so on, and all of these are combined with Boolean operators. At the bottom of the screen you can see how a parser is defined: just add patterndb and the XML file which you use for describing your log messages.

And here comes the most important part of the configuration: the log path, where you connect all of the building blocks together. The first one is a typical line for local log messages: it reads the system source, applies the filter which I showed on a previous screen, and stores the messages to a file. The next one is a bit more interesting: it's the log path for Elasticsearch. You can see that we utilize both the local log source and the network log source, filter out the debug messages, use the patterndb parser on the remaining messages, and then store everything into Elasticsearch. And here you can see a screenshot from Kibana. You cannot really read it, but you can see that all of
the data arrives safely into Elasticsearch. Here are some graphs for the distribution of priority and facility, and in the upper right corner you should be able to see some results coming from patterndb: a top list of source IP addresses, parsed out of SSH login messages using patterndb. At the bottom you can see the distribution of log messages over time.

An upcoming and very interesting technology is Kafka. It's publish-subscribe messaging, and it's becoming more and more important in data-driven organizations; it acts like a data backbone. syslog-ng can send messages to Kafka, and we are also working on implementing a Kafka source in syslog-ng, so that we will be able to collect messages from Kafka as well, and not just send to it.

Finally, I would like to summarize the benefits of using syslog-ng in a big data environment. First of all, it's high-performance and reliable log collection. It can also greatly simplify your data architecture: a single application can be used both for system logs and application logs, and just about any application's logs can be forwarded using syslog-ng. It can also significantly lower the load on the destination, on the processing side, as syslog-ng can process log messages close to the source, in a distributed way: it can parse messages, forward only the important information, and also format these messages to be ready for processing, as everything can be done on the syslog-ng side.

If you would like to join our community or get more information, the main entry point is syslog-ng.org. The source code of syslog-ng is on GitHub, and if you have any problems, you can report them on GitHub; we also have a mailing list, and we are also on IRC, on the #syslog-ng channel on Freenode. If there are any university students among you, we have open positions, and we are also creating the syslog-ng university.
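The big data destinations and the log path described above might be sketched like this; every name, address, and path is illustrative, and the log statement assumes that blocks named s_local, s_network, f_no_debug, and p_patterndb have been defined elsewhere in the configuration:

```
destination d_elastic {
    elasticsearch(
        index("syslog-ng_${YEAR}.${MONTH}.${DAY}")
        type("messages")
        cluster("es-syslog-ng")
        template("$(format-json --scope rfc3164)")
    );
};

destination d_kafka {
    kafka(
        client-lib-dir("/opt/kafka/libs")  # where the Kafka Java client JARs live
        kafka-bootstrap-servers("localhost:9092")
        topic("syslog-ng")
    );
};

log {
    source(s_local);
    source(s_network);
    filter(f_no_debug);
    parser(p_patterndb);
    destination(d_elastic);
};
```

The log path is where everything connects: both sources feed the filter, the parser runs on what remains, and the result is stored in Elasticsearch.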
The syslog-ng university is something brand new; more information is on the website. You can get small exercises which you can use in your programming assignments, we can give you all the help you need to implement something, and we can also merge it into syslog-ng. So it's good for us, as we get some new code, and it's good for you, as you learn how to do something and you can also point at the features you created.

Do you have any questions?

syslog-ng is in practically all Linux distributions I know of, yes, but not always the latest version. It's in Fedora, it's in SUSE, it's in Debian, Ubuntu, Gentoo, Arch Linux, so practically all of the major Linux distributions. It's also available for the BSDs: it's in FreeBSD and OpenBSD, and there are packages for Solaris. It's always only a question of how up to date these packages are: if there is a distribution release right before we have a release, then they will carry an older syslog-ng for a while, but often there is an external repository which carries the latest syslog-ng for the given Linux distribution. It's the default in some distributions, and mostly it's an optional one.

Yes, it works on both sides. It's absolutely the same; you only have to change the configuration.

High availability is not built in, but it can work together with load balancers; it can work together with any high-availability solution. Yeah, sure.
If you have any further questions, not now but later, you can reach me by e-mail, and I also have a blog where I regularly post information and updates about syslog-ng, interesting use cases, and so on.

I don't know how well you can see it, but here is a sample XML file; it's the one I used for parsing SSH messages, or at least part of it. At first it looks ugly, as usual, but here you can see the actual pattern used for parsing the message: you can see that it's the actual log message, with parsers inserted where the variable parts are. You can see the parser names and the field names which are used to define the message, and here is an example message, so you can verify immediately whether the parser you created is all right. At the bottom you can see that, based on the message text, some additional fields are created, like that it's a login event and that it was accepted, so it was a successful login.

Actually, there is an application called ELSA, Enterprise Log Search and Archive, which does exactly this, and it has many built-in parsers for different IDS systems like Snort and Bro, and parsers for iptables, Cisco, Juniper, and other firewalls. It can also store all of this into MySQL and index the messages, so you can easily search for any IP address in the database and see what happens on your network. It has some built-in tools for looking up information: it can call whois on the IP addresses, or do many similar kinds of magic, so it's often used in the security part of network operations centers.

Sorry? ELSA, as in Enterprise Log Search and Archive. With raw data you don't really know, except that if you use GeoIP, you can create additional fields from the IP address, either at country level or at city level, and add this information to your log messages and store it together with your log messages.
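A patterndb ruleset for the successful SSH login example might be sketched like this; the IDs, field names, and value names are all illustrative, not the ones from my slide:

```xml
<patterndb version="4" pub_date="2016-01-24">
  <ruleset name="ssh-demo" id="demo-ruleset-1">
    <pattern>sshd</pattern>
    <rules>
      <rule provider="demo" id="demo-rule-1" class="system">
        <patterns>
          <!-- parsers (@ESTRING@, @NUMBER@) stand in for the variable parts -->
          <pattern>Accepted password for @ESTRING:username: @from @ESTRING:client_ip: @port @NUMBER:port@ ssh2</pattern>
        </patterns>
        <values>
          <!-- extra fields added based on the message content -->
          <value name="action">login</value>
          <value name="status">accepted</value>
        </values>
        <examples>
          <example>
            <test_message program="sshd">Accepted password for jdoe from 192.168.1.5 port 51874 ssh2</test_message>
          </example>
        </examples>
      </rule>
    </rules>
  </ruleset>
</patterndb>
```

The embedded example message lets pdbtool verify immediately that the pattern matches what you expect.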
So I think this is the same as what is used in Enterprise Log Search and Archive as well. Archive, yes; sorry, I'm not a native English speaker.

Yeah, sure, that was the original purpose of syslog-ng. Although it's called next generation, it's actually 18 years old.

We can remove duplicates, but only if the same message appears right after the other.

By the way, my openSUSE packages are in the openSUSE Build Service. If you send me an e-mail, I can provide you with the URL, or I think it's also on the web: the syslog-ng.org website links to all of the different package sources. Any other questions?

Yes, I already sent my slides to SCALE, so they should be on the website soon. Thank you for your attention.