Okay, so, who am I, right, kind of a new face to Ruby? I'm Dana, just Dana, I don't get a last name, so y'all better remember that. Before I came to Ruby, I spent eight years in the corporate world, and I was responsible for munging an incredibly insane amount of ridiculous data. I did that every day for eight years and decided I'd had enough of that. So I left the corporate world to come to Ruby, where I get to munge a ridiculous amount of insane data every day. So, I develop Rails applications. Some of you people might know my wacky husband, James Gray, better known as JEG2, you know, just some guy.

So, why is data munging important? I mean, it's kind of a boring task and it's one of those insidious things that you constantly have to do over and over again. And the fact is we live in a very data-driven society. Companies feed on reports. They just love data. And if you don't give them enough data, they start going into some kind of weird data shock and they have to come back for more. Clients have data that they wanna get at. They need it put in databases or they need it put into some other kind of structure. So, what basically happens is it's really, really important to know what data you have, what you need to do with it, and how you need to get it there. So there's kind of three parts to that. The more you know about your information, and the more you know about your final output and what you need to do with it, the easier you can manipulate that data and make your life a lot easier.

So I'm gonna kind of discuss the process that I kind of live by after having spent years working with data. I use the rule of three: in, munge, out. You read data into some construct. And for me, I kind of follow the general principle that if it understands each, you're good. Then I transform the data. I munge it. I mix it up. I change it around. I do whatever I need to do with it. And then I spit it out. I output that transformed data using something that understands puts.
And that's, again, that's just kind of my rule of thumb for managing and handling data.

So we're gonna start with rule number one, reading, which really is the hardest part to kind of pull together. So here's a real basic, very simple munging script. It has all three parts in it. You've got the output file. You've got your input file. And you've got your transformation. So you've got one, two, and three. And this little script just opens a file to write to. Opens a file that it's gonna read from. It capitalizes the information and then it sticks it back out, okay? Real simple, real basic stuff. But it kind of sucks. What happens when your client or your boss comes back to you and says, oh yeah, we got this other report over here that doesn't look a single thing like the one you just screwed around with, but we need it put into this format. And oh by the way, the format sucks. We need something else. Well shit, you just wrote this program. Now it doesn't work. So you have to write another one, okay?

So we're gonna simplify. The first rule is: don't confuse reading with munging. They're not the same thing. They should be separate. Because reading information means you may or may not be reading that information from the same kind of file. You may have various files, and you have to pull all this data from different places into one place, transform it around, and then spit it into one file or two files or a database or whatever your output construct might be. So if you separate reading and munging, then suddenly you have the ability to handle those two different ideas separately.

So here, I've simplified it. I'm gonna make a method called munge, which is my actual transformation. And I'm gonna pass in my input and my output. So here I pass in some object. I don't know what it is yet. And I'm gonna pass out to some object, doesn't matter what that is either, because I can define those later. And then I'm gonna munge it.
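The slide isn't reproduced in this transcript, so here is a minimal sketch of the idea rather than the speaker's actual code. It assumes the transformation is a plain upcase (the talk only says it "capitalizes"), and the sample lines are made up. The point it shows: the method only asks its input for each and its output for puts.

```ruby
require "stringio"

# Hypothetical transformation: upcase every line.
# input  only needs to respond to each
# output only needs to respond to puts
def munge(input, output)
  input.each do |line|
    output.puts line.chomp.upcase
  end
end

# Same method, two different inputs and two different outputs.
munge(StringIO.new("alpha\nbeta\n"), $stdout)   # an IO-like input, printed to the screen

captured = StringIO.new
munge(["alpha\n", "beta\n"], captured)          # an array input, collected in memory
```

A File opened for reading responds to each, and a File opened for writing responds to puts, so the same method would work file-to-file as well.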
And because I control that munging process, and I know what the input is gonna be, and I know how the output's gonna work because I'm gonna control those separately, I can count on each and puts in my transformation. And this is better. I think it becomes pretty obvious. I've got exactly the same thing going on here, but I've got two ways of passing the information in and two kinds of outputs that do the same thing. Both of these capitalize the file that was passed in, and they stick it out in two different ways. The first one prints it out to the screen, the second one writes to a file. I didn't have to rewrite my munger though, see? So all of a sudden you've gained a lot of flexibility. Okay, and because you control your system here, each and puts don't have to be reading from or writing to a file. See, I didn't write that to a file, if that's not interesting enough. Okay, so you have control over your input and your output. You can write each, you can write puts.

So let's take it a little farther and reach our ultimate munging power, if you will. I'm gonna make a class called Munger, and I'm going to pass into that class input and output, and I can define those later. And then in my munge method over on the left, I'm going to yield to a block what I wanna do. I check to make sure that it's not nil, and then I shouldn't ever have to come back into this munge method again, because I'm passing my block to this method. So you can kind of see on the right how it is in action. I make a new Munger object. I pass in the file that I want to read from, and I pass in the output file I wanna write to, and then I call munge on that object, and in this case I play with the data that's in that file, and I spit it back out.

So let's talk about data for a minute. There are so many different kinds of data. I could spend an hour up here talking about just the data structures that you might run into, and I don't want to put you completely to sleep before dinner.
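Going back to the Munger class for a second: the slide isn't in the transcript, so this is a sketch under assumptions, not the speaker's code. In particular, it assumes the nil check applies to what the block returns, so a block can drop a record by returning nil.

```ruby
require "stringio"

class Munger
  def initialize(input, output)
    @input  = input    # anything that responds to each
    @output = output   # anything that responds to puts
  end

  # Yields every record to the caller's block and writes whatever comes back.
  # Assumption: a nil result means "skip this record".
  def munge
    @input.each do |record|
      result = yield(record)
      @output.puts(result) unless result.nil?
    end
  end
end

out = StringIO.new
Munger.new(["keep me\n", "# skip me\n"], out).munge do |line|
  line.start_with?("#") ? nil : line.chomp.upcase
end
```

The transformation lives entirely in the block, which is why the munge method itself never has to be touched again.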
So we're gonna kind of recognize that there are about two massive types. There's structured, which is record-oriented data, and in our modern society that's really the most common you see: record-oriented data, stuff from databases and Excel and that kind of stuff. And then there's unstructured, which you do find, and it's also the most difficult to work with because some of the things you expect to be true may not be true. And when you talk about data and pattern-matching and munging, data is about pattern-matching. 90% of all munging happens in pattern-matching, which is why I love data munging, because I get to play with regular expressions.

Okay, so basically what we're gonna do from here on out, this is a real-world example. This is a very small subset of a 55,000 line report that I would have to play with every day in my old job, okay? Names have been changed to protect the weak, but this is basically the data that I got to play with, right? So let's kind of look at this for just a second. What's going on in this file? Well, first of all, it looks pretty structured, right? You've got columns and headers, except you've got two headers that repeat every page. All right, so we're gonna have to deal with those. Not a big deal, right? Still, you've got these pretty little spaces here, so we can kind of see where the columns fall, until you realize that they don't exactly fall straight on. You have these nasty hierarchical categories that you're gonna have to mess with. Okay, so that's not gonna be very fun, but we're not panicking yet, we can deal. And then, well, okay, so what's happening with these little lines? These are totally for humans. We don't care, computers aren't gonna care. And there are the subtotals that are right above those lines. I'm starting to get nervous. And then if we want to add insult to injury, this report doesn't write out thousands when the number is too big for the column, so it sticks a letter on. Well, now I can't total it, because it's not an integer.
Ah, okay, so this is ugly data, right? If I told you I used to have to type this stuff into Excel because our IT department wouldn't give me an FTP program, would you guys, like, run screaming from the room? Yeah, yes, yeah. I did. For the first year, I would type this stuff into Excel, and I finally told them, I said, you must, must get me an FTP program so that I can at least play with it in Excel. So.

All right, so I have my munger. I've already written that. I don't have to worry about it. So I'm gonna make my reader. And on this report, that's a real chore, right? But it's not impossible. The trick is to break it down and think, you know, what is it in this report that I really wanna get to and what is in this report that I don't care about? And the first thing I don't care about is anything that has a report total, anything that has a subtotal, any line that's blank, that doesn't have any data on it at all, and any line that starts with a dash, right? I mean, this is useless information to us. We can't use it. Let's get rid of it, okay? So that's what this does. And if anybody needs me to go through those regular expressions, let me know. I certainly can. Okay?

So, all right, all of a sudden we have a much easier, less obnoxious piece of data to look at. We've kinda gotten rid of a few things. But the headers are still there. We still have these nasty headers that we have to deal with. And part of the problem with these headers is not only that they repeat, but that you have to have some of this data. You have to know what your headers are, and you have to get this categorical information out of there as well. And those aren't in the headers. So you've got two things that have to happen, and that means that we have two processes. So I'm gonna back up. We're gonna just do one thing. I wanna get rid of these headers, but I need to collect that information first. So I'm gonna introduce you, if you don't know about this little method already. This is unpack.
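Before getting into unpack, the line filter described a moment ago could look something like the sketch below. The regular expressions and the sample lines are assumptions for illustration; the real report's wording for totals and subtotals isn't shown in the transcript.

```ruby
# Assumed patterns for lines that carry no usable data.
NOISE = [
  /report total/i,   # the grand total line
  /subtotal/i,       # subtotal lines
  /\A\s*\z/,         # blank lines
  /\A\s*-/           # rule lines that start with a dash
].freeze

def noise?(line)
  NOISE.any? { |pattern| line =~ pattern }
end

# Made-up report lines.
lines = [
  "Widget A        12",
  "----------------",
  "  Subtotal      12",
  "",
  "Report Total    12"
]

kept = lines.reject { |line| noise?(line) }
```

Keeping the patterns in one list means a new kind of junk line is a one-line change.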
It's a great little method. It's designed for breaking up binary data, but it's a really awesome little method for handling this kind of data. Because most of it's fixed-width, it's just those categories we have to deal with. We're not dealing with those categories yet. So it works for this particular thing. Basically, unpack takes in a format string, and you tell unpack in your format string what you want it to do, what does your data look like? And before I go on with this, unpack has tons of other things that you can do with it. I'm just giving you the couple little pieces that you need to know to make this format string work to break this data down. So in this thing, A means ASCII, x means skip, and then you use a number right after one of those two letters to tell unpack, count this many characters, or skip this many characters, and then it gives you your result. So here I'm describing this little string, cookies and cream, and I said there's seven ASCII characters followed by a skip space, followed by three ASCII characters, followed by a skip space, followed by five ASCII characters, okay? That's not too hard, right? This doesn't make anybody go, ah, like they do when regex comes up, okay? And it returns an array, and here are your headers, right? So unpack is a great little tool for that.

And then in this next example, I'm doing it the other direction. I'm passing in a string that has the information. I'm gonna split it apart. I'm gonna count out the length of each piece because they're variable, and I don't necessarily know what they are, and it would be ridiculous for me to have to go back to the paper and go one, two, three, four, right? So I let Ruby do that for me, and then I join them on x, so it skips right over the spaces, and then I get my result. So I let Ruby build up my format string, and what do I have? Great data: those dashes, remember the headers on my report?
Those are static, they don't ever change, so I can count those dashes in the header to figure out how big those headers are, and then anything inside that space, I can just capture.

Okay, so to my reader class here that I'm building, I add a couple variables, headers and format, and then I add a parse header method. Oh, oh, go back, go back. I add a parse header method that takes in the format, kind of exactly what I just described on the other thing. And I use these magic numbers, wait, let me back up. In my parse header call, I pass in the first four lines of my report, and I do that as the very first thing, and I know those first four lines don't ever change in the report. I know every page starts with those four, or every report starts with those four lines, so I just pass those four lines in, and then because they're always in the same order and I can count on that, I just use the indexing ability of arrays to get the line I want, and so I pass in the format that I'm going to use, and headers uses unpack to get the actual data, okay? Does that make sense, everybody kind of following that? So now I have my pattern, and I can pull my headers into an array that I can then use later, okay? But that doesn't help me get rid of the headers themselves. I've gotten rid of the first four, but in 25 lines I'm gonna get them back, and I don't want to have to call that parse header again and blah, blah, blah.
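The two unpack examples just described can be sketched as follows. The "cookies and cream" string comes from the talk; the rule line and header line are made-up stand-ins for the report, and the sketch assumes columns are separated by exactly one space.

```ruby
# "A<n>" reads n characters and strips trailing spaces; "x" skips one character.
words = "cookies and cream".unpack("A7 x A3 x A5")
# words is ["cookies", "and", "cream"]

# A made-up rule line of dashes, like the one under the report's column headers.
rule = ["-" * 10, "-" * 5, "-" * 8].join(" ")

# Count each run of dashes, then join the pieces on "x" to skip the single spaces.
template = rule.split.map { |dashes| "A#{dashes.length}" }.join("x")
# template is "A10xA5xA8"

# A made-up header line laid out to the same widths.
header = ["Customer".ljust(10), "Code".ljust(5), "Amount".ljust(8)].join(" ")
fields = header.unpack(template)
# fields is ["Customer", "Code", "Amount"]
```

The same template then works for every data line under that header, since the columns share the widths of the dashes.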
So I wanna set a flag that says whether I'm in a header or not, and since the data's very static, I know that the first line of a page is the first line of my header, and it starts with that S-A-A-R, and I can count on that to tell me I've started the headers. If that's where I'm at, I set the flag to true. If it's already true on the next line, then I know I'm still in the header, and it will continue to stay true until I get to a line that starts with a dash, and then I know that the next line I read in is not going to be a header, it's gonna be actual data. So I can kick out of that header, I reset in header to false, and now I'm into my data again, and then I can go through and do the rest of my report, which basically gives me this. Oh gosh, I just got rid of 90% of the crap that was in there, and I have something that I can actually start to work with.

But that's only one part of the two-part header problem, right? I still have these categories that I have to deal with, so I'm gonna use another fun little method called assoc. It's a lookup method. You call it on an array of arrays, you pass in the data you wanna look up, it walks through the outer array, it returns the inner array, and something to kind of be aware of: it's much slower than a hash, so don't run this on 10,000 records, it will freak out. So essentially what assoc gives you, because I have to keep the order, the order's very important, so I don't wanna just use a hash, although that's maybe a little different in 1.9, since 1.9 now has insertion order on hashes. Assoc becomes a poor man's ordered hash, okay? So I can use that. I have this array of names, I pass in James and it passes out the small array I'm looking for. Something to note: it will continue to look through the list until it finds the first thing that matches and then it stops.
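Two pieces from this stretch can be sketched in a few lines: the in-header flag and the lookup with Ruby's Array#assoc. All the data below is made up, the helper names are hypothetical, and the sketch assumes a page header starts at a line beginning with SAAR and ends at the first line beginning with a dash.

```ruby
# Part 1: drop repeating page headers with a flag.
def drop_page_headers(lines)
  in_header = false
  lines.reject do |line|
    in_header = true if line.start_with?("SAAR")               # header begins
    skip = in_header
    in_header = false if in_header && line.start_with?("-")    # dashes end it
    skip
  end
end

page = ["SAAR page 1", "Name  Qty", "----  ---", "bolt    4",
        "SAAR page 2", "Name  Qty", "----  ---", "nut     9"]
data = drop_page_headers(page)
# data is ["bolt    4", "nut     9"]

# Part 2: Array#assoc as a small ordered lookup table.
names = [["James", "first"], ["Dana", "second"], ["James", "third"]]
found = names.assoc("James")
# found is ["James", "first"]; the search stops at the first match

# Set-or-replace keeps the original insertion order.
def remember(pairs, key, value)
  pair = pairs.assoc(key)
  if pair
    pair[1] = value          # key seen before: overwrite in place
  else
    pairs << [key, value]    # new key: append at the end
  end
  pairs
end

categories = []
remember(categories, "sort code", "4442")
remember(categories, "group", "waffles")
remember(categories, "sort code", "4443")
# categories is [["sort code", "4443"], ["group", "waffles"]]
```

Because assoc scans the outer array from the front each time, this stays cheap only while the list of pairs is short.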
So if I had another James in there, it wouldn't find it, it would stop after the first one, okay? So I add another variable, categories, that I'm gonna pass some information into. The system goes through, it identifies if categories has a particular item in it. If it doesn't, it assigns it. If it finds it, it reassigns the value because it means I've moved into a new category, and now I have my categories, okay? And so this is the construct that's passed back out. Our report goes from being something a human can read to something maybe we don't read as well, but a computer can shoot through this in a heartbeat. And now I have this very simple, very common structure that I can then pass into my munge system, my munge method, and I can count on it. Kinda make sense? Everybody sort of following along there, okay? And the reason I have these lines here is you can see at the bottom how the sort code changed. There's the SA sort code 4442 and then the 19-1 waffles. So it shows that it's changed categories even though the customer and the salesperson are still the same.

Since I broke it up to make it all fit on slides, this is what the final reader looks like. And this is pretty complicated data. A lot of readers aren't gonna have to be this complicated. You're not gonna have hierarchical data inside of fixed-width. You're not gonna have repeating headers and all that kind of stuff. You might, you might not, but this is what the ultimate reader looks like, and it's by far the most complicated part of the entire process.

Okay, so I'm gonna skip munging because I've already kinda dealt with building the munging thing. And I'm gonna go right into writing. And I use FasterCSV. Somebody I know wrote FasterCSV, so it's kinda cool to use that. The only thing I really have to deal with in the writer, seriously deal with, are the headers. So I get my headers.
If they haven't been assigned, if the headers are still nil the first time it reads in, which means it hasn't found any yet, it pulls the headers in and prints those out as the first line of data. Then because headers are no longer nil, they're true, it skips to the next section and prints the data. And that leads us to munging. Did anybody have any questions on the writing? I mean, it's really that simple.

And here's my munger. This is where I actually am going to play with the data that I've gotten, the data structure itself. And I pass in the reader. I pass in the writer to my Munger class. I run through it. I check to see if I have the last part of the array, because you've got your header and then you've got your data. Remember those arrays inside arrays? I make sure that I'm in the last one. If it has a comma, I get rid of it because there were commas in that file. And then I go through and I get rid of K's and transform those into a number that I've multiplied by a thousand. Basically, I call to_i, since nothing happens on that K. Nothing happens to it. It just becomes a number. It's multiplied by a thousand. And now I actually have numbers, and you can do a hundred other things in here that you wanted to do, okay? So there's my munger.

So we went from having to rewrite this system that was doing everything in one place to having all three parts kind of broken out. And why that's cool is these are all the things I can do with it. I mean, these are just the things I thought of off the top of my head. I mean, God knows. There's a hundred other things you could do with it. I can now put that data into Excel or Numbers or whatever spreadsheet you want. You can put it in a text editor and do something with that. You can import it into a database. I mean, there's just tons of things that you can do. So let's see kind of what just, whoops, go back, what that did.
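The writer and the number cleanup described above can be sketched like this. It is an illustration under assumptions, not the code from the slides: it uses Ruby's standard csv library (the successor to FasterCSV), and the record layout, an array of header and value pairs, plus the names clean_amount and CsvWriter are hypothetical.

```ruby
require "csv"
require "stringio"

# Turn "1,234" into 1234 and "12K" into 12000; leave anything else alone.
# "12K".to_i is 12 because to_i stops at the first non-digit.
def clean_amount(field)
  digits = field.delete(",")
  return digits.to_i * 1000 if digits =~ /\A\d+K\z/
  return digits.to_i        if digits =~ /\A\d+\z/
  field
end

# Responds to puts: writes the header row once, then one CSV row per record.
class CsvWriter
  def initialize(io)
    @io = io
    @headers = nil
  end

  def puts(record)   # record is an array of [header, value] pairs
    if @headers.nil?
      @headers = record.map(&:first)
      @io.print CSV.generate_line(@headers)
    end
    @io.print CSV.generate_line(record.map(&:last))
  end
end

out    = StringIO.new
writer = CsvWriter.new(out)
[
  [["customer", "Acme"], ["amount", "12K"]],
  [["customer", "Bolt"], ["amount", "1,234"]]
].each do |record|
  writer.puts(record.map { |name, value| [name, clean_amount(value)] })
end
```

Swapping the StringIO for $stdout or an open File changes where the rows go without touching the reader or the munging step.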
So in the first thing that it can do, I just send it to standard out and it runs across the screen. Like I said, the next one, I'm gonna put it into a CSV file. And I'm gonna open that CSV file. Kind of cool, right? Bosses like this kind of stuff. They get excited when they see reports like this. If you have some weird need to see it in TextMate or whatever, you can pipe it to your text editor. I don't know why you would do that, but you can. Or the most common use I expect a lot of us in here would use it for is dropping it into a database, right? That should be recognizable to anybody who programs in Rails. Get a SQLite thing opened up. Let's run a little query real quick to show that I actually did it. So, I mean, just running your data through those couple little things that I did, you can really kind of tell, I hope, how much easier and how much more powerful your ability to control your data and maneuver it and do things with it becomes.

And for those of you who are interested, this is my database writer. When I dropped it into that SQLite database, I basically establish a connection to my database. I create a table. If it doesn't exist, I give you the ability to migrate it, and that's really about it. No, no, no, go back. I have no slide. So then the actual munging code. I don't even have to make a reader for this particular data because I know that it's in a series of arrays. I just pass it to FasterCSV, and then my writer is my database writer, and then I create the part codes because I need to create that table, if I need to create that table. And then I munge through it. I pass in the reader, I pass in the writer, and because CSV has a to_hash method on it, I just call to_hash on it and I'm able to drop it directly into Active Record.

So, congratulations, you too are now mungers. That's it.

Unpack, because I can use it to describe whatever pattern I have and I can control that better, I think. So that's why I use unpack.
Yeah, I mean, it's Ruby. You can do that a thousand different ways.

The else. In my munger, I just found those K's. I just used a regular expression and looked for those K's and replaced the K with a thousand, multiplied by a thousand. Yeah, do you guys want me to go back to that? Yeah, there it is. Yeah, now here in the last part, it starts at the beginning of the line. It looks for a K at the end of that field, and then if it finds one, it calls to_i on it and multiplies by a thousand. Okay, \A means beginning of the line and \z means end of the line. So I'm telling it, I want to make sure that I'm starting my line, in this case with a digit, so that I know that I have a part code.

They are different, yes. One means start at the beginning of the line and one means start at the beginning of the string. Not the same thing. They can, yes. But it's very important to know that they aren't the same thing, because especially with the \z and \Z, what's the other one, is it question mark? Dollar sign. Because they all do something a little bit different. Yeah, so if you use the wrong ending, then you can definitely get some pretty wacky data.

Yes, yes, that's very true. I had probably one of the worst corporate firewalls you could imagine. Every time my IT department at the company I used to work with, true story, found out I had a text editor on my computer, they would take it off. Okay, hello, Notepad. Yeah, it was crazy. I mean, the IT department that I had to work with was insane, so. I'm not kidding about the year getting an FTP program out of them. You know this wacky guy named JEG2, maybe you guys have heard of him? I would call him periodically and go, you must, you must help me get past these people. I need something to do this.

The real report, if you'd like to see it, is really something pretty amazing to see. Cancel, hang on. This is the real report. It's 58,000 lines. I typed that by hand on many, many occasions.
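On the anchors question from a moment ago, a small sketch shows how the endings differ in Ruby's regular expressions. In Ruby, \A and \z anchor to the start and end of the whole string, ^ and $ anchor to the start and end of any line inside it, and \Z is like \z but tolerates one trailing newline. The sample strings are made up.

```ruby
two_lines = "abc\n123K\n"

# ^ and $ work per line, so the second line matches (at index 4).
line_match   = two_lines =~ /^\d+K$/

# \A and \z pin the pattern to the whole string, so nothing matches here.
string_match = two_lines =~ /\A\d+K\z/

field = "123K\n"

# \Z accepts a single trailing newline; \z does not.
big_z   = field =~ /\A\d+K\Z/
small_z = field =~ /\A\d+K\z/
small_z_chomped = field.chomp =~ /\A\d+K\z/
```

For validating a single field, \A and \z are usually the safer pair, since ^ and $ will happily match one good line buried in a multi-line string.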
So you can kinda see, I mean, this is just nasty data. What I did, I started out typing it in, and that was kind of a good learning experience. You get to know the information that you have to deal with, but when you're dealing with this absolutely every day, you begin to realize that this blows. I don't wanna do this anymore. So the first thing I did was I got them to give me an FTP program, but as you can tell, that's not gonna help a whole lot in Excel either, because it's not delimited and it's not fixed-width. It's neither, it's both, it's in between. And so I was able to get it into Excel and then I'd spend my day cleaning it instead of typing it, which isn't a whole lot better and maybe even more tedious, but less prone to typing errors. And so I did that for about, I don't know, what, three weeks and I said, James, this sucks, help me figure out a better way to do this. And so he actually is the reason I program now. He got me started and taught me how to write in Perl. I know, weird, but.

And I wrote some Perl scripts that started to munge this data and play with it, and what happened was someone came into my office one day and I had a little program called Extras, and all it did was turn this into a CSV file that I could then open in Excel. And I was outputting reports like crazy. I mean, people in the office were like, okay, your boss comes to you and you have that report done before lunch and I'm still there trying to get it printed out so I can type it in. What are you doing? And so I gave them Extras. I would just put it on their computer. Never told my IT department. I knew that was fruitless. I just installed it. I know I could have been burned in effigy for that, but whatever. And I put it on their computers. Well, pretty soon there were 10 AAs in there, 10 administrative assistants in there, that were all getting really good work done. And so then we're all just sitting around chatting in the office, you know, because we're done.
It's before lunch, we have our entire work done and they're like, okay, so why are these people all sitting around talking? What's going on? And management loved it. So then IT came back to me and said, you gotta take that off of everybody's computer. And we said, okay, no problem, we'll take it off. And then of course our productivity went through the floor and management went, what the hell happened? Well, I said, well, IT wouldn't let us do this anymore. And they went to IT and said, no, no, you let them do this, this is good. Okay, so that's how I beat that corporate thing. I just did it. And they realized what a great productivity boost it was. And then they had to give it back. Well, of course that did make me very popular with IT. And so we kind of had that love-hate, I-love-to-hate-them relationship for the next, I don't know, five or six years until I'd finally had enough and said, James, I gotta get where there are real programmers. And I came here. You know?

A couple. Now it's a great community. It's a lot of fun. I get to do a lot of really groovy things. It's really kind of an inspiration and a relief to go into work every day and have your client call you and say, can you do this? And you're like, yeah, I can do that. No one's gonna stop me. Yeah, that's pretty groovy. So, all right, well thanks, guys.