Okay, I'm going to go fairly quickly because I have a little less time than I thought I did, but I'll be putting everything up online, all the examples and everything, so you can play with them later. I'm going to talk about something kind of goofy that's not my day job, so keep that in mind. I should just warn you, this is actually rated TV-MA LSV. And it also has spoilers, but everything is text, so don't worry about the spoilers too much. I'm going to try to be very, very careful about the sex. The reason is because I'm using Fifty Shades of Grey. So these are the book stars of today. We're going to do a couple of Dan Brown books and only a little bit of Twilight and the Fifty Shades books. I'll let you guess which of the books on here I read and liked.
All right, why am I doing this? I actually sort of grandiosely think it's important to study what's popular because it tells us something about people. Unfortunately, this particular topic, genre bestsellers, is not something I get paid to look at, but I read trashy books and I enjoy it and I'm unapologetic about that. So let's take a little bit of time to look at them. I also wanted to do some fun statistical tricks with text data that I haven't gotten to do and build some visualization types that I haven't gotten to build. So that's why we're doing this.
So first, what inspired me to look at Fifty Shades of Grey was this infopornographic from, believe it or not, The Economist. A graphic design firm called Delayed Gratification did this about six months ago. They had some poor person sit down and chart out where the sex scenes were in all of the books and their level of kinkiness, what sort of apparatus is involved in them, and all of it. And what's funny is that I was kind of attracted to the very basic displays here, and was paying very much attention to this. And also I had not read the books.
So I was curious to see that there was a little more sex in this one and then it decreased. And I think it got kinkier, although I didn't read all this in detail. It doesn't matter. My main question was: they paid some poor person to do this by hand, to flip through the books and score where the sex was. And I was like, can't you just do this automatically? Because that's what all of us data scientists really think. Like, let's just do this stuff automatically.
So I immediately thought of various text classification methods in machine learning. Commonly, text classification is done as a sort of bag of words. It ignores structure in a document; it just looks at the words that you care about, counts up frequencies, and uses those as essentially feature vectors in a classifier. And so, just to get into this topic, I thought, well, sex scenes, that seems like it ought to work. There are only certain words and behaviors involved in sex scenes. It ought to be perfect for a simple classifier. So I'll show you whether I was right.
In supervised learning of the type that I was aiming at, you take a data set and you label each item as a positive or negative example of the thing you're looking for. So in this case, is it a sex scene or not? And then you feed that data into some algorithm, which is often described as a black box, especially by people who don't know anything about data science. And the algorithm learns, from what you gave it, what the properties are of a thing that is, in this case, a sex scene or isn't a sex scene. To do this properly, you split up the input data that you've labeled into a training set and a test set, you run on the training set, then you evaluate on the test set, and you know how you did at the end of it. You can just count up how many it got right. And then you tweak parameters in the black box until you get a score that's good.
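As a rough illustration of the workflow just described (not the code from the talk), here is a minimal sketch using only the Python standard library. The toy sentences, the labels `"scene"` and `"not"`, and the `keyword_rule` stand-in for the "black box" are all hypothetical; a real classifier would learn its rule from the training set.

```python
import random
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens; word order and document structure are ignored."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def split_train_test(examples, test_fraction=0.25, seed=0):
    """Shuffle the labeled examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, examples):
    """Fraction of examples where the predicted label matches the human label."""
    hits = sum(1 for text, label in examples if predict(bag_of_words(text)) == label)
    return hits / len(examples)


# Hypothetical hand-labeled chunks (real chunks would be about 500 words each).
labeled_chunks = [
    ("He kissed her slowly in the elevator", "scene"),
    ("She moaned and pulled him closer", "scene"),
    ("They kissed again and she moaned", "scene"),
    ("He kissed her shoulder", "scene"),
    ("She read the contract twice", "not"),
    ("He sent another email about clause nine", "not"),
    ("They argued over breakfast", "not"),
    ("She drove back to the office", "not"),
]

train_set, test_set = split_train_test(labeled_chunks)


def keyword_rule(features):
    """Hypothetical stand-in for a trained model: fires on two hand-picked words."""
    return "scene" if features["kissed"] + features["moaned"] > 0 else "not"


print(accuracy(keyword_rule, test_set))
```

The point of the held-out `test_set` is that the score is computed on chunks the model never saw while being fitted or tuned.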
And then ultimately, you use that as a predictor in the wild. Spam classifiers are the classic example of this. That's data science 101: you build a spam classifier.
So for the 50 Shades problem, I bought Fifty Shades on Amazon, because no, I did not own it. I unlocked the text in Calibre, which is an open source EPUB conversion application, which is awesome actually, and saved it as a text file. I cut it up into 500-word chunks using Python, my machine learning language of choice, or language of choice. And then I sat down one night and I thought, how hard is it going to be to label each chunk in a big Excel spreadsheet? I thought, well, I'll have three categories, because I might want to have sort of a fuzzy classification at some point. So I had not sexy; maybe steamy, like little sort of romantic stuff; and then actual sex scene. And doing this was actually horrible. I realized that I did not like the book. I didn't really enjoy going through manually and doing the scoring. And I already knew that I probably needed to do two of her books so that I could really test on the second book. So I went on Twitter and I complained about it and people were like, outsource this. So, because you guys are my friends.
Anyway, this is an example of the text of Fifty Shades of Grey. If you're close enough, you can read some of it. But basically, it's not exactly my style. I read trash, too. So yeah, I outsourced it to Mechanical Turk. This was fun. This was actually the first time I'd played with Mechanical Turk, even though it's a classic machine learning platform where you get people to classify things and then you build your algorithms. I hadn't actually done it on anything. So this was really fun. What I'm showing you is not actually 50 Shades text, but it's essentially the size of one of the text chunks I gave them. And then they got a category to pick. Those are my three categories.
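The 500-word chunking step mentioned above could look roughly like the following. This is a sketch, not the speaker's script: the function name `chunk_words` is made up here, and it assumes whitespace-separated words are a good enough unit, so original line breaks are lost when chunks are rejoined.

```python
def chunk_words(text, size=500):
    """Split text into consecutive chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Usage with a hypothetical 1,200-word string standing in for the book text.
book_text = " ".join("word%d" % n for n in range(1200))
chunks = chunk_words(book_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

The last chunk is usually shorter than the rest; whether to keep it, drop it, or merge it into the previous chunk is a choice the talk does not specify.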
And then I had examples, under those little blue "i" things, that told them the kinds of things I thought were a sex scene or not. But you should know, they're getting a small 500-word chunk. And it's funny, even though I thought this would be a really easy problem, obviously it's a little problematic to say whether this is part of a sex scene or not when it's 500 words. So I was sort of torturing them, because it was already hard for me. If you think about it just visually, some people find this sexy and some people don't. Now this, which is actually what Fifty Shades of Grey is about, I don't find sexy, and that's by the way a kit available on eBay in the UK if you're interested. So this was one of the things about classifying what sexy is. It's highly dependent on yourself, essentially.
Now suppose you're reading chunks and there's a shower scene like this. Actually, she looks a lot more interested in that awesome shower than she does in that guy. I just have to tell you, stock photos, what are you going to do? Anyway, it turns out that in the book there are a lot of these scenes where they start sort of making out or whatever and then they stop and they do something else. So that's hard for a person classifying it to know. Is that actually part of a sex scene or not? It gets problematic. And in particular, they spend a lot of time in the first book arguing about a contract, as you might imagine. It's in fact a sexually explicit contract, about things they will or won't do together, that she's supposed to sign. But it's still a damn contract. It's actually legalese text for pages. And there are email messages about this contract. So if you're given a 500-word chunk of this contract out of context, you might not know whether that's actually talking about sex, or is it sex, or whatever.
All right, so this is what you get back from a Mechanical Turk crowdsourcing job. You get a bunch of labels. On the left-hand side are my chunks, and they've been randomized. People got them in random order. I paid for two raters so that I could have two opinions on everything, which was the minimum; I mean, I was paying for this myself. You get the original data, then what answer one was, what answer two was, and a little bit of stats with a code for who the raters were. So I scored those numerically and took a brief look in Excel at rater one versus rater two. It seems that rater one thought the beginning of the book was a lot steamier than rater two did, and the big spikes are where they agreed or didn't agree on what was actually a sex scene.
Okay, but since I graphed it in this nice, stupid, simple bar plot, we can compare it to the infopornographic. There's a surprisingly good match. The person who made The Economist's infographic and went through this process matched pretty closely what my raters said, even though that person was probably reading it in order and mine were looking at a 500-word chunk out of context. So this was for me more evidence that it might be possible to actually do this automatically without context, right? And on the second book, which I did as well so I'd have more data (sorry, I didn't fix that up there; this is part of the infopornographic), it lines up.
So here, essentially, I got only 68% accuracy using Naive Bayes in Python's Natural Language Toolkit (NLTK), which is not so great. There are plenty of examples where you can do better with Naive Bayes. The thing that I want to show you is that it does output the words that it thinks are the good indicators. So there's a list: eases, moan, raw, beg, packet. They actually use a surprising number of condoms in these books. So correlated with sex is the word packet.
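To make the "words it thinks are good indicators" idea concrete, here is a small from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing, using only the standard library. It is not NLTK's implementation and not the code from the talk; the four toy documents, the labels, and all function names are hypothetical. Words are ranked by how much more likely they are under one label than the other.

```python
import math
from collections import Counter


def train(docs):
    """docs: iterable of (list_of_words, label). Returns per-label word and document counts."""
    word_counts, doc_counts = {}, Counter()
    for words, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    return word_counts, doc_counts


def vocabulary(word_counts):
    """All words seen under any label."""
    return set().union(*word_counts.values())


def log_likelihood(word, label, word_counts, vocab_size):
    """log P(word | label) with add-one (Laplace) smoothing."""
    counts = word_counts[label]
    return math.log((counts[word] + 1) / (sum(counts.values()) + vocab_size))


def classify(words, word_counts, doc_counts):
    """Pick the label with the highest log prior plus summed word log-likelihoods."""
    vocab = vocabulary(word_counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        score = math.log(doc_counts[label] / total_docs)
        # Words never seen in training are skipped rather than penalized.
        score += sum(log_likelihood(w, label, word_counts, len(vocab))
                     for w in words if w in vocab)
        scores[label] = score
    return max(scores, key=scores.get)


def most_informative(word_counts, positive, negative, n=5):
    """Words ranked by log P(word | positive) - log P(word | negative)."""
    vocab = vocabulary(word_counts)

    def ratio(word):
        return (log_likelihood(word, positive, word_counts, len(vocab))
                - log_likelihood(word, negative, word_counts, len(vocab)))

    return sorted(vocab, key=ratio, reverse=True)[:n]


# Hypothetical toy training data.
docs = [
    ("she moaned as he tore the packet".split(), "scene"),
    ("he kissed her and she moaned".split(), "scene"),
    ("they argued about the contract over email".split(), "not"),
    ("she read the contract at the office".split(), "not"),
]
word_counts, doc_counts = train(docs)
print(classify("she moaned softly".split(), word_counts, doc_counts))
print(most_informative(word_counts, "scene", "not", n=2))
```

With real data the ranking surfaces corpus quirks as well as obvious vocabulary, which is how a word like "packet" can end up near the top.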
So I switched to Python's scikit-learn toolkit, which is much more developed and has a lot of people working on it who are really interested in text data. They have this one chart, the details of which don't matter, but it's basically all the sorts of classifiers you can use easily on text: they take sparse data, where you have only a few observations of each feature, that kind of thing. And they show you in this little chart how long it takes to train each one, how long it takes to run, et cetera. So I picked one somewhat randomly, although it has good reviews, which is the stochastic gradient descent classifier, in the middle of this chart you can't read. And you should also notice there's a passive-aggressive classifier, which I think is the best name ever for a classifier. I want to try that one next.
In Python, in sklearn, they've made it super easy to do machine learning routines, do lots of iterations, do a feature search, get a ton of data very, very quickly, and compare things like this. This is the only dense code slide I have. Basically, they've created this pipeline concept. I don't know why it's cut off; in any case, all the code will be up there. But it's only a few lines to set up a pipeline, feed it your data and the target, and then get the results of how you did on a test set. It's that few lines. And I got to 88% pretty quickly just doing that one switch. I didn't even tweak any parameters. I took out-of-the-box parameters. So that was pretty good.
Now, at this point, though, I was like, well, I think I want to go back to the data and figure out where it made mistakes. And that's when the tool building really started. I made a tool to look at the false positives and the false negatives against the truth, to be able to see. And this is where the D3 skills really come in. It's super dumb, because it was just for me. Wait, so hang on.
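The slide itself is cut off in the recording, so the following is only a guess at the general shape of such a pipeline, written against the current scikit-learn API (`Pipeline`, `CountVectorizer`, `TfidfTransformer`, `SGDClassifier`, `train_test_split`). The choice of steps, the toy texts, and the 0/1 labels are assumptions for illustration, not the speaker's code or data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for the 500-word chunks and their crowd labels.
texts = [
    "he kissed her slowly in the elevator",
    "she moaned and pulled him closer",
    "they kissed again and she moaned",
    "he kissed her bare shoulder",
    "she read the contract twice",
    "he sent another email about clause nine",
    "they argued over breakfast",
    "she drove back to the office",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = sex scene, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Bag-of-words counts, TF-IDF weighting, then a linear model trained with SGD,
# all left at their default parameters.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Because every step lives in one `Pipeline` object, swapping the final estimator for another classifier is a one-line change, which is what makes quick comparisons cheap.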
This is where I'm afraid I'm going to roll over something really embarrassing. You see this badly formatted text down there? That's because, essentially, I was starting to do tokenizing on it, so I'm sort of showing you bad-looking text. The text from Dan Brown later looks a little better. But up on top, I've got the target class. All the purples are where it agreed with the judgment of the raters. And then the blues are where we missed. This is a false negative, and this is a false positive. A lot of times, the mistakes are at the edges of the actual sex scenes, which figures, right? You're out of context. You don't know: are they just making out and steamy, or is it part of a sex scene? So I wasn't too surprised, and I wasn't too disappointed. And when I do roll over some of the ones that are false positives, there are things like they're getting undressed or something. So it's pretty understandable to me why it made the mistakes it made. I wasn't that upset with these results. And anyone who knows D3 knows how stupidly simple this was to build. This was just for me to visualize the results, because I couldn't figure it out, and I needed to be able to see the text next to the actual score that it got. OK, that was tool one.
All right, and then I have a surprising PS that was just out of curiosity. I took this a step further to see if I could do better than 88. And I paid for a bunch of fan fiction to be coded. Now, I didn't want to spend arbitrary amounts of money on steamy romances that I wasn't going to read. So I went to friends of mine who read sexy erotic fan fiction, and I said, give me things that are long, that have a lot of sex scenes. I don't care beyond that. I got a list of six things for free, threw them up on Mechanical Turk, got ratings, and fed that in as more training data. And you should remember, 50 Shades started as Twilight fan fiction. So this wasn't a stupid thing to do. The genres are actually pretty similar.
Fan fiction that features sex scenes and 50 Shades come from essentially the same background. I got an astounding 97% accuracy after that. If only someone had paid me to do this. It's, like, amazing. All right, so anyway, I was right. It was easy to do the sex scene detector, essentially. 50 Shades and the sequels just happened to have this extra kink component and discussion of contracts that I didn't expect when I started this.
All right, now in my proposal, I also said I would talk about story arcs. So let's talk a little about that as the next section. Initially I was thinking about this: if you're writing a screenplay, say, you get this advice to have a three-act structure, and you can see on the lower right a sort of graph of rising tension, with various crises that up the tension. Then you have some relief and you go up again, and then at the end there's this climax and a denouement and things cool down again. And I thought, well, I bet I can detect that too. So the rest of this talk is me trying to detect that. And since it was just for this talk and not for a paying client, I'll just show you the process I went through, with all the visualizations and techniques, and let you decide if I got somewhere at the end.
But I should tell you first: while I was doing a search for story arcs, I very quickly hit this awesome article by, you know, the Hulk movie critic. He is awesome. Anyway, he has this rant about the myth of the three-act structure and what garbage it is. And he goes on and on and on. And this is absolutely my favorite paragraph of it: "Of course shit has a beginning, middle and ending, you insufferable turd." Which is true. You could draw a line anywhere about what the beginning, middle, and end are. It's hilarious. Yes. Okay, but back to story arcs. So at that point, I'm like, okay.
So even though I think Dan Brown is really effectively trying to write for the movies in book form, and I think that's arguable, I went back to basics. So these are Vonnegut's story arcs. His story arcs, when he's talking about fiction, are essentially a sentiment thing, not about tension; the Y axis runs from misery to ecstasy. And these are some arcs that he thinks exist out there. You especially want to notice this one, which is me writing this talk, essentially. The thing to point out here is that it's not just a rising peak and then a fall. For him it's more complicated, in terms of what's interesting to graph. And he's talking about structural differences in stories. So I think that's the gist to take away from this. And then this is his graph of real life, which I saw after looking at the other stuff. And I go, my God. So here's a picture of a cute little bird, taking a page from Dominicus, which should perk you up right here.
All right, so yeah. This is what we're really after, and we're gonna see if Dan Brown has it, essentially. This is the generic version. So I did the same thing, because I had had so much luck with Mechanical Turk. I took Dan Brown books, two of them, chopped them up into 500-word chunks, and sent them to Mechanical Turk. And, I mean, it's a factoid, but I got ratings for Fifty Shades of Grey in two to four hours. It took half a day to a day to get Dan Brown read by Mechanical Turk. Make of it what you will. Here I was asking them to use exciting scenes, like exciting action scenes and revelation scenes, that kind of thing, as a proxy for the action in those arcs. I thought, well, there's probably something different about those scenes. Let's try to build the action scene detector. And action scenes are actually a little difficult, too. This is obviously action, in Indiana Jones, right?
Now, this is an interesting one that most of us remember from Jurassic Park. It's in the middle of a very tense action scene, but it's technically a close-up of a rear-view mirror. So imagine you're a reader with 500 words out of context, and it's a description of the mirror, et cetera. That's not super exciting. It's just that we know, in context. And the other thing about this shot that's interesting is that it's a joke. Everybody laughs when they see it, which breaks up the tension and the action and gives you a little bit of relief. So this is not exactly the equivalent of talking about a contract in the middle of a shower scene, but it's in the middle of an action scene and it changes the tension of that scene.
So yeah, this is Dan Brown, and what The Da Vinci Code looks like. I worked a little bit to find a classic action scene, but I'll just let you skim the style that we're talking about. Yeah, okay, there are giggles, right? As my next art project, I'm going to do some kind of visualization of the last sentence of every chapter of Dan Brown's books, because they almost all are three words and end with an exclamation point. It's just amazing. And there are a lot of chapters. He has 108, 130 chapters. Yeah.
Okay, so text worked really well on the other problem. Although I was really interested in structure, text still worked quite well. So I thought I would do the same kind of classifier: I took the stochastic gradient descent classifier and tried to figure out exciting scenes, and it didn't get anywhere. So I switched briefly to topic analysis. I'm not gonna spend a ton of time on this, because it's a complicated thing to understand the results of when you run it. It's an exploratory process that comes up with groups of words that effectively are trying to describe a document. Let me show you this picture.
David Blei, when he made this popular a few years ago, produced this graphic; it doesn't matter that it's cut off. Essentially, you have a document over there. Associated with it, you have probabilities of content belonging to latent topics. The different colors correspond to different topics here, and it looks like the yellow topic, which has that many yellow words, is winning in terms of assignment for that document. So it's a probabilistic method. It produces a large amount of data, if you wanna look at it, and it's pretty complicated to understand the output. So digital humanities people who are wrestling with it are all trying to do visualizations, and it's something we could help with. This is Elijah Meeks doing a visualization. Networks are the obvious way to represent this, but it ends up being this kind of thing: there are topics and there are documents and they're associated, and there's a lot of "so what" in the outputs of this algorithm. I thought maybe topics would help me describe things, like I'd see action-y words in some topics and boring words in another topic. I'm not sure that I got a whole lot out of it, but I went pretty far down the trying-to-visualize-it route.
The obvious first step was this arc diagram, which we've had many people at this conference already show, try in a Mike Bostock thing, and then dismiss and reject in favor of something else. But see, I didn't wanna do a straight network diagram, because I have ordered data. These are chapters going from one to whatever, and I wanted to see relationships across chapters. Since it's network data, essentially, I decided, okay, I'll just do it as an arc diagram and see what the structure of the topic assignments is. Down at the bottom, in this ugly kind of bar, I have how exciting that chapter was according to a rater on Mechanical Turk. Okay, I looked at this and I can't make anything out of it. There's some structure going on; I don't know what it is.
So I outsourced chapter descriptions to a friend who was looking for work, and had her write 108 chapter descriptions of The Da Vinci Code, which she hated, by the way. She hated that book. She earned her money. Then I stuck them underneath to see if there was any relationship between the chapter descriptions she wrote and the topics. At this point it's getting hard to read, so I'm like, hey, I'm in SVG, I'm in D3, I'll just rotate that sucker with one line of code. I mean, I fiddled a lot with where to hinge it and setting it up, and Doug's like, ha ha ha, SVG. So it becomes a lot more readable and definitely more interesting to play with. Then I added some stuff to help myself understand it, like fade-out. So if I'm on a topic there, those chapters were all assigned to that topic, and over there is a tooltip saying what the words in that topic were. I'll briefly show it to you; I'm not gonna be able to do all of these. All right, so essentially it works the way you'd expect it to work. The main thing is that here we have chapter zero, topic 19, with these words, including "the Louvre". The beginning of The Da Vinci Code, if you've seen the movie (I cheated and I saw the movie), has stuff about the Louvre. But this one at the top is super exciting and this one isn't. So in a quick browse of this data I could tell that there wasn't a tremendous relationship between topic and excitement of the chapter. You can see it right there.
I even added a switch to color everything by excitement on the text as well, to help it be more salient for me, and essentially there doesn't look like a close relationship. But it's fun to play with, so whatever. This was officially the Crayola color scheme. I'm not a visual designer; I literally went for the biggest list of colors that are very bright and divergent, and that was the one I used.
So what do I do next? The obvious thing with what looks like network data (please advance): I made a network diagram. I borrowed my friend Jim Vallandingham's code and did a network diagram of the words in the topics and their relationship to chapters and excitement. I'm worried about time, so I'm not actually gonna play with this, but I'll put it up online. I made a filter on the top for exciting chapters and dull chapters and looked for any relationship among the words in the topics, and I didn't find very much. So this is a case where I had to use visualization tools to actually understand the outputs from LDA, and then I realized that it was a dead end. So, bummer. I probably needed another cute owl here. But here's an accident in what I did: I didn't prune word-to-topic relationships, and when I fade out over these nodes I ended up with these interesting constellations, which turned out to be a cool trick. In the unlabeled state, when I'm rolled over one of these, you see these little constellations; those are words that are shared across multiple topics. It's kind of cool, actually. It's a nice trick.
All right, so on to the totally geeky but slightly more profitable path. For another talk earlier this year I was using NodeBox, which is a Python graphics programming toolbox that's a lot like Processing. I was doing many sorts of stuff with books, and I plotted dialogue to exposition in these books just as this simple bar. At the time I also did Angels and Demons because it was on sale on the Kindle daily deals. And what I found is that there were a lot more non-dialogue bits (the red bits are dialogue). When I put the slide up I was like, is this lots of running around and stuff, is that why there's no talking? And I think there was something in that.
Anyway, so I went back to Python. I cut all the books up by chapter, got part-of-speech tags, got punctuation counts, got word counts, counted everything I could count for each chapter, imported the scores from the Turkers and incorporated them, and then did some graphing. These are rater disagreements about whether something's exciting or not. There were tremendous disagreements about excitingness in Dan Brown, so that could be a source of problems. Now, this is me plotting in Python just all the nouns by chapter. Look how much variation there is in that. That in itself doesn't tell us anything interesting. So then I decided, okay, I just need to smooth out this data. With the pandas data frame library I did a bunch of quick smoothing tricks with rolling means until I got something that started to show there's some structure over time. This is nouns, and I'm not sure, this essentially probably correlates with something else. I'll show you the rest of the analysis and you can decide what I should do next. So I threw it all on the same axes, including, down at the bottom, the thin black line that's the excitement score, which you can't compare to anything. So this is totally uninterpretable. So I standardized to get them all on the same scale, one line of code in Python, and re-plotted. I'm working in an IPython notebook.
At this point I was like, well, this is the point in the investigative process where you want something interactive again. But I had just spent all this time building cool network diagrams, even though I was using some of Jim's code, modified heavily, and I was like, there's gotta be an easier way for me to get a quick interactive out of this. And that week my pal in Portland, Rob Story, posted Bearcart, which is a little tool that lies on top of Rickshaw, and you can call it from Python to create a little D3 line chart in a few lines of code. So this is how tools for data mining are going: we're heading towards these push-button solutions to get something interactive that's built on top of D3, so that you don't have to do all the programming that I did. He had to fix a bunch of bugs; I was his first beta tester. But essentially you get this out of it. This is Twilight, actually, and all of the things I was counting are along the right with check boxes. I can turn them on and off and see what seems to be correlated with what. Obviously I could do a giant correlation matrix or any other stats as well, but I wanted to just play with this and see the structure visually.
So I did find a few oddities. Nouns and verbs seem to go in this cyclic thing through the course of Dan Brown, and in the Twilight book there's sort of an inverse relationship in terms of which one is the higher proportion per chapter, and it's very cyclic. I need to investigate a little further what's going on there. These are the excitement or action arcs, based on what the Turkers said was exciting in these books. Angels and Demons follows a fairly classic rise towards the peak action scene and then a fall. For The Da Vinci Code, they thought the beginning section was more interesting, and I actually think, having rewatched the movie recently, that that's because there are all these revelations, all the stuff about Mary Magdalene sitting to the left of Jesus in the picture and all this stuff. All of that stuff is pretty exciting, and there's a lot of running around early on, so I think it's actually pretty accurate. Now, it did turn out that the amount of talking was inversely proportional to the action scenes in a bunch of places, and only one book was statistically significant in a logistic regression, but I need to do a couple of other explorations. So I think I was right in my initial stupid little bar chart, where I said lots of running around and stuff where people weren't talking. So the dumb little picture wasn't totally wrong. There you go. And Twilight has a bunch of talky bits, and if you invert that you get maybe what's the action. I need to go check what's happening in chapter six; I don't know, I don't remember.
So I'm totally out of time, so I'm not gonna bother showing you this; it's based on the same pattern. But my next stage was, oh God, I do need another tool, because I don't know what's happening at these peaks where the excitement is. So I loaded in the chapter-by-chapter text for each of those little blocks, and I can color the bar underneath the red bars with the metric that I'm interested in and then roll over the block to see what's happening in that chapter of those books. So it's a visual way of browsing the text to figure out how the metric corresponds to what's in the actual text.
So basically, for me this was a fun investigation of some essentially potboilers, but ultimately it was also a test of how awesome D3 was for doing very quick visualizations of data that was really too complicated to understand. The machine learning output especially was too complex for me to actually see, or even figure out how to visualize, without something interactive. So we, as data vis people and as data scientists, should be thinking hard about visualization tools to help people with these algorithms. I think that's a big frontier for us. And the other thing I learned in the course of doing this, of course, was that you can put an entire novel easily in a browser and have a little interface around it, and it looks like you're working. All right, anyway, thanks a lot, guys.