All right, let's get started. So who am I? I'm a maintainer of Fluentd. You could also say I'm a community manager, or a mailing-list person, so the preceding talk about managing open source really resonated with me. I work for this company called Treasure Data. I won't talk much about it, but if you take one thing away about Treasure Data, it's that they're cool enough to have me focus entirely on open source, so cool folks. I'm also a former quant at a hedge fund; to use the term du jour, I was a data scientist in finance for three years. And last but not least, I'm fairly new to Ruby, so Tom's talk, well, the whole monad thing aside, was quite a bit of Ruby for me. Just to show how much of a newbie I am, here's a conversation between me and Zachary Scott, who's a maintainer of Ruby: I didn't know Ruby had a lambda until a few weeks ago. So please go easy on me. So, what is data science? Whenever I don't know the definition of a term, because I'm a mathematician, I do what all mathematicians do, which is go to Twitter for definitions. "Data scientist: a data analyst who lives in California." 137 people resonate with that feeling. But we can do better, right? We can be a little more serious, so here's a pretty good one, actually, by Josh Wills of Cloudera: a data scientist is a person who's better at statistics than any software engineer, and better at software engineering than any statistician. Pretty smart; 896 people liked that one. Those two are both really good, and I honestly didn't think I could get more mileage than that, but my personal favorite is the third one, from Big Data Borat. And a little spoiler of sorts: this talk is not going to be about data science proper.
I'd love to just talk about statistics, and it really behooves you to have a solid understanding of statistics even if you're programming back-end or databases or distributed systems or front-end day to day, because it puts numbers in context. But what I want to talk about today comes before that, before you get to do all the number crunching and model building and hypothesis testing: I want to talk about log data. How many of you have grepped through logs, or paged through them with less? All right, cool. We all love that stuff, right? So I want to share two anecdotes from my own professional life. The first story is me, myself, and log data. Before I came to the bright side of the industry, that is, Silicon Valley and startups, I worked on the dark side as a quantitative analyst at a hedge fund. My job was supposed to look like this: I get data, I do a whole bunch of number crunching and model building, I ruminate on the outputs, and I share my findings, or lack thereof, with my manager and the partners and other traders. But in reality, I spent about 80% of my time on just the first step. I don't know if it was because the fund was small, but it was acquiring data, reformatting it, cleaning it, rinse and repeat. I like flowcharts, so here's a flowchart. Acquire data: this is, like, 9 a.m. Then you realize there's an encoding issue. And I would understand if I were doing sentiment analysis on tweets, right? No, I'm looking at stock market data; this shit should be in ASCII. I have no idea what was going on, but it's fine, it's fine. I'm going to write a script, and because I was a quant, I would use Python and not Ruby.
Then I'd realize there's some deeper issue: say, the stock market was crossed, meaning the price at which you could buy was actually above the price at which you could sell, so you could supposedly buy and sell and make a profit for half the day. That makes no sense, so you realize there's some issue with the data. So you go to the database and sort of work with an engineer, and then you get very, very frustrated, and one time you say, screw this, I'm going to try to do this on my own. So I talked to Kurt, who's a really good engineer: Kurt, just give me access to all the logs, I'm going to do it. He's like, are you sure, kid? You're going to do that? I'm like, yeah, I'm going to do it. And: holy shit, Kurt has a very, very interesting job. That really made me think about why it's so hard to make data useful. Part of it is that it's not entirely clear when you start out what's useful and what's not, but humans have this tendency to just shove everything into something, right? And this was shoved into something, but it's impossible for a business person like my former self to work with it directly. So I would work with other people, and those people are not exactly the people who get business value from the data, and you sort of pass the hot potato around, and nothing gets done. The second story is from my time at Treasure Data and Fluentd, working with people who really want to use log data to do stuff. They're like, yeah, we're all about logs and metrics. I call this M&M, not the chocolate, which I got earlier here, but metrics and monitoring. Everyone loves metrics. And everyone also loves monitoring, because if you're monitoring, you can see all the trouble coming in your direction. So it's not a bad idea to subscribe to the mantra that you should log everything, right?
Log this data, log that data. I'm spinning up a new service: oh, is logging wired in everywhere? Are you sure we'll have all the information later? The problem is, if you're a three-person startup, sure, that works. But usually it's like this: there are different people in the organization, and they have different needs for log data. The VP probably wants the top-line numbers. Operations folks want to know, for example, HAProxy's number of concurrent connections. And if you ask product management about HAProxy, they'd be like, HAPro-what? So what happens is everyone's like, you know what, I'm going to start collecting my own logs, just like my former self. Kurt, get out of my way, I'm going to do it. And when everyone on the team starts doing that, what you get is something like this. It sort of works. Day to day, as long as nobody touches it, it works. Because that one Perl script, and it's all Perl, the 200 lines of Perl that Dan wrote six years ago that nobody else has touched, works. As long as you can bug Dan: hey, the format changed, can you change that regular expression? Yeah, no problem, I'll go to line 120 and change that regex. And it would all work. But basically what I'm here to convince you of is that there are ways to do this better. And with some modesty, I would say that Fluentd is one such solution. And I really do mean that Fluentd is just one such solution: there are many other ways to do it, and different paradigms, and I could talk about those, but I want to do a demo later. So if you want to talk about different ways to centralize logging and build a unified logging layer, come find me at the party or tomorrow. Here, I'll just talk about Fluentd. So how does it work? One misconception about mathematicians is that we love the abstract. Like, we love monads.
But the thing is, mathematicians are trained to anchor discussions to concrete examples, just like Tom did this morning. That was very easy to understand; I finally have an educated guess at what a monad is. So I'm going to start with a very simple example: an Apache access log, because everyone's at least seen one, right? It looks kind of like this. And whoever came up with this format thought it was a brilliant idea, because it's so obvious that 200 is a status code and not just a random number. Seriously, for the longest time I didn't even know what the different fields were, other than the last one, which is the user agent. So what Fluentd does is process semi-structured or ill-structured messages like this and parse them into JSON. Of course, you have to supply the logic for how to do that; it's usually based on a regex, but you can also write your own parser in Ruby. Once you do that, all your data sources are parsed into JSON. But then the question is: you're getting this Apache log data, and you don't know yet where you want to put it, or you might want to put it into two systems, or all you want is to put it into a centralized log server. Fluentd has to know how to route the data to the right destination. And to abstract away from Fluentd for a moment, any such system needs an intelligent and scalable way of knowing where the data originates and which data should go to which system. The way Fluentd does this is based on two things. One is the timestamp; but more important is the idea of tagging each event. Each event is actually a triple. The first part is a timestamp. John talked a lot about timestamps; I used to live in Illinois, and I didn't know Indiana had, like, two bajillion time zones. But the second part is more important: the tag, which is assigned by the input source.
Then there's the bulk of the data, the record: in this case, the method and path, where 200 is the status code and 140 is the size. So to answer the question of how you get data into different systems: you tag it. If you're sending data to MongoDB, you'll probably tag it mongo; if you're sending data to Hadoop, you'll tag it hdfs. And that's pretty much it, but I know I have about 15 minutes left, so what I'm going to do is a live demo. One disclaimer: I've never given a talk to an audience this large, and I've definitely never done a live demo to an audience this large. So we'll see. But hopefully it's interesting; at least you get to see me fail before you all go to the party. "Ah, that last speaker, his demo died," right? All right. As some of you might know, or obviously all of you are cool enough to be at this one and not the one in Japan, there is RubyKaigi going on concurrently, which is a pretty big Ruby event. I love both events; my co-worker is actually at RubyKaigi. But in terms of logistics, I must say GoGaRuCo has the upper hand, because I saw this tweet this morning as I woke up. They're missing the big dad, right? And I don't know if this was intentional or not, but the title of Matz's talk was "Coming Soon." There are so many ways to interpret that: he was coming soon, or he wasn't sure what the title was going to be, or maybe it's the features of Ruby that are coming soon. But what I want to do, and I need to put in a disclaimer here, is this: I do a lot of demos for Fluentd, and every time I use Nginx or Apache logs as the example data source, half the people are like, oh, that's cool, you're solving my problem, and the other half are like, it's just web server logs, I don't care.
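To make the parsing and the tag-timestamp-record triple concrete, here's a minimal sketch in plain Ruby. This is not Fluentd's actual internals: the regex, the `Event` struct, and the `route` helper are all made up for illustration, and Fluentd's real match-pattern language differs slightly (`*` matches one tag part, `**` matches zero or more).

```ruby
require "time"

# Hypothetical regex for an Apache access-log line, the same idea as
# Fluentd's built-in apache parsing (simplified for illustration).
APACHE = /^(?<host>\S+) \S+ (?<user>\S+) \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d+) (?<size>\d+)/

# A Fluentd event is conceptually a triple: tag, timestamp, record.
Event = Struct.new(:tag, :time, :record)

# Parse one log line into a tagged event with a structured record.
def to_event(tag, line)
  m = APACHE.match(line) or return nil
  Event.new(
    tag,
    Time.strptime(m[:time], "%d/%b/%Y:%H:%M:%S %z").to_i,
    { "host" => m[:host], "method" => m[:method], "path" => m[:path],
      "code" => m[:code].to_i, "size" => m[:size].to_i }
  )
end

# Tag-based routing, roughly like <match mongo.*> in a config file.
def route(event, routes)
  routes.each { |pattern, dest| return dest if File.fnmatch(pattern, event.tag) }
  nil
end

line  = '127.0.0.1 - frank [19/Sep/2014:10:00:00 -0700] ' \
        '"GET /index.html HTTP/1.1" 200 140'
event = to_event("mongo.apache", line)
dest  = route(event, { "hdfs.*" => :hadoop, "mongo.*" => :mongodb })
# event.record["code"] is now a number you can filter on, and dest
# is :mongodb, because the tag "mongo.apache" matched "mongo.*".
```

The point of the triple is exactly this separation: the input source decides the tag once, and every downstream destination is chosen by matching on that tag rather than by re-parsing the data.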
So I thought I would come up with a data source everyone here would care about. I'm going to collect the Twitter data streams for the hashtags #gogaruco and #rubykaigi, and I'm going to use Elasticsearch and Kibana to plot them, so that if we have fewer tweets, you can stop listening to me and start tweeting frantically so we can out-tweet them. With that out of the way, wish me luck. So, one thing we did recently with Fluentd: it's about three years old, and for the first two and a half years of its life it had this incredibly intuitive command-line interface for everything. And some people liked it. A lot of people liked it. But some people prefer something a little more visually driven, so we made one. It's called fluentd-ui, and it's actually a Rails app. I've started it, and when I go there, I'm supposed to sign in. There you go. So it's a way to launch, stop, and configure Fluentd inside a web browser. How many of you are familiar with Splunk and Splunk Forwarder? The idea is similar, except you don't have to pay a dime for this thing. Splunk is a great piece of software, by the way; I have nothing evil to say about them. So there are a bunch of items here. Let's look at installed plugins, because plugins are what enable Fluentd to listen to many sources. There are three that I have pre-installed: one for Elasticsearch; one called record-reformer, which does lightweight filtering in-stream; and one for Twitter, so that I can listen to the Twitter data stream. And if I go to the config file, well, here's one missing feature: I'm waiting for a pull request to colorize this thing. So here's an uncolored but larger-font version. The stuff inside source is a data source, a data input. So it says type; there's a little bit of a DSL for Fluentd configs.
And there's my consumer key and OAuth token; you can steal those now and do evil things on Twitter on my behalf. I'm sampling the timeline, looking for the keyword gogaruco, and tagging it, which is the concept from earlier: that stream is going to be tagged twitter.gogaruco. If you go down, there's another one for RubyKaigi. Now, match is how Fluentd knows which output system to send events to, and it takes a wildcard, so it matches everything whose tag starts with twitter. Then there's a little bit of bookkeeping, which is purely a technicality, because I want to show the Twitter streams for RubyKaigi and GoGaRuCo in different colors. Then you route these messages into Elasticsearch, and here are a few parameters for that. So I'm going to go in here and start this thing. Well, before I start that thing, I should stop that thing and start Elasticsearch. So I'm going to launch Elasticsearch; as always, it's very easy to start, and I have no idea what its logs are saying most of the time. And I'm going to start Fluentd. Hooray, it's running. If I go to the logs, it should have logs, but it says the log is empty, so I'll just go ahead. And because I'm at GoGaRuCo, naturally I'm going to use Python's SimpleHTTPServer to serve Kibana. Let's see if this thing's going to work. All right, I have the default screen. There's a little bit of... ha, it's there. There are seven hits per minute. And nobody seems to be tweeting with #gogaruco right now. So hey, everyone, take out your smartphones and start tweeting about GoGaRuCo. I'll be here; I still have nine minutes. I'm going to tweet myself: at GoGaRuCo, on stage, trying to make my demo work. Oh, it could be anything; it should be looking for just this string. Sorry, my day job is marketing, so I have a tendency to gloss over technical details. All right, cool, that looks pretty good. Oh, I need to encapsulate this thing. Hooray.
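For reference, a config along the lines of what the demo describes might look like the sketch below. This is a hedged reconstruction, not the actual demo file: the parameter names come from the community fluent-plugin-twitter and fluent-plugin-elasticsearch plugins and may differ across versions, and the credentials are placeholders.

```conf
# Hypothetical reconstruction of the demo's config
# (parameter names may vary by plugin version).
<source>
  type twitter
  consumer_key        PLACEHOLDER_KEY
  consumer_secret     PLACEHOLDER_SECRET
  oauth_token         PLACEHOLDER_TOKEN
  oauth_token_secret  PLACEHOLDER_TOKEN_SECRET
  tag                 twitter.gogaruco
  timeline            sampling
  keyword             gogaruco
</source>

# A second <source> with tag twitter.rubykaigi and keyword rubykaigi
# would look the same.

<match twitter.**>
  type elasticsearch
  host localhost
  port 9200
  logstash_format true
</match>
```

With Elasticsearch running on its default port 9200, Kibana can then query the indexed tweets; in the demo, Kibana's static files are simply served with `python -m SimpleHTTPServer`.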
At least my tweet should show up. No? Why not? Gogaruco? OK, at least there's data. There is data. So this is basically my presentation; I could go on and on wondering why gogaruco isn't showing up here. Going back to the slides: yet another reason I shouldn't use PowerPoint. Can you guys see it? It says it crashed. So basically, what I wanted to convey is that log data is everyone's concern in an organization. And that's very hard to feel, especially when your team is big, and especially when you have a messy-but-working data pipeline. But as someone who has experienced how inefficient that can be, I can tell you it pays huge dividends to revisit how you log and filter and use data. There are lots of tools out there; Fluentd is one option. And I'd love to hear how you're doing things. And I don't know why that one is showing up, because mine is still caching. Lesson number whatever it is. Let me see. Config file. No, I think that's right. So this is one of those things where I expected my demo to completely not work, and it sort of worked, so that's pretty good. And my parting message, obviously, is: do not use Microsoft PowerPoint when you're presenting on... no, no, no, let me finish my sentence: when you're presenting on a MacBook Air. Thank you very much, and have a great party.