Hi, thank you all for coming. I'm really grateful to be able to share this story with you, so thank you to CSVConf for having me.

The year was 2016, and I was playing with some crossword data, and I discovered hundreds of crosswords that were near duplicates of previously published crosswords. I showed this evidence to the crossword construction list that I was on, and a tweet was posted: "Had a crossword in USA Today in 2004. They tweaked it in 2008 with a pseudonym, then ran it again in 2015. CD business indeed." His crossword was titled "CD Business," so it's, as always, kind of a play on words. And that was from the editor of the American Values Club crossword, Ben Tausig. Then soon after, an article was published in 538 spinning it into a plagiarism scandal, and that blew up into hashtag gridgate. It was in the mainstream news for a cycle, and it turned out that a major crossword editor had been engaging in plagiarism on a grand scale for several years.

Now, these crosswords had been publicly available, including historical ones. And since the crosswords were republished, why did this go undiscovered for so long? Well, the 538 article hinted at it with something that I said, a quote that I gave them: that when you get the data into a nice, clean, dense form, stuff just falls out of it. I said this off the cuff, but it's actually a fairly accurate statement of the thesis I was testing at the time: that the format and organization of the data were instrumental, maybe even essential, to the discovery and end result. And so this is how a file format led to a crossword scandal.

So my original motivation for looking into crossword data was actually fairly benign. I had tried my hand at crossword construction, and it turns out that that is pretty frustrating and tedious. I wanted to get better at it, and as a software engineer, what else would you expect? I decided I would just get some data and do some analysis. And there is a site.
It's called XWord Info, which makes all of the New York Times puzzles available. They have over 24,000 puzzles dating back to 1942. And this is thanks to three years of work by the Pre-Shortzian Puzzle Project, led by David Steinberg, who was a teenager at the time, and a group of several hundred crossword enthusiasts. They spent three years digging into microfiche and transcribing, "litzing" as they call it, every New York Times puzzle that had ever been published.

This is a great source of data. And so I emailed and asked to download the data in bulk. I got a very nice reply from Jeff Chen, who runs the site and is very active in the crossword community. The reply was very polite, but it said, in essence: no. So as trivial as crosswords are, the collection of historical crossword data is exceedingly non-trivial. And even in this case, with the lion's share of the work already done, concerns about copyright were going to make it unnecessarily difficult for me.

Well, I am not one to be so easily discouraged. After all, I can just scrape it, right? It's already on the web, in public. But scraping is a pain, and I was sick of doing one-off data collections, and it really would be too much work just for my own little experiment in crossword construction. So I decided to double down and do it once and for all, and I decided that I wanted to archive this data for posterity. XWord Info only has New York Times crosswords, and there are many other publications that have their own crosswords. That data is scattered across many different sites, and it's in many different formats, some of which are difficult to parse. And so this would be a gift to future crossword scholars.

So my plan, as usual, and this is the thesis that I mentioned in the opening, was to get things into a high potential format, and then, well, who knows what, and then of course, as always, profit.

Well, so what's this high potential format that I'm talking about? It's an HPF, as I like to call it.
Well, first, it's neat. It's tidy and it's organized. There's a place for everything, and everything is in its place. And remember, it's not hoarding if it's organized. A high potential format is also clean. It's been properly decoded, it's trustworthy as a source, and it's structurally correct. It's ready to be grouped and sorted. And finally, it's dense. It exhibits high data locality, so that on any axis along which the data can be sliced, it's reasonably easy to slice it that way.

So, like a good engineer, I looked into existing crossword formats, including the granddaddy of all crossword formats, the Across Lite .puz format, which is ubiquitous and a de facto standard. But it's also binary and proprietary and not very easy to work with. Across Lite also has a very old text format, which has HTML-like tags inside of it, but it's very fragile: the clues aren't numbered and they don't include answers, so any extra line in the clues will render the puzzle indecipherable. XPF is a more modern format, but it's based on XML, which is quite heavy. And ipuz is fairly new and based on JSON, also kind of heavy.

So, like a bad engineer, I decided to make my own format, specifically for archival and analysis. And this is the xd format that I designed. It's really just a carefully constructed text file, but it's very easy to parse. The structure and format should be self-evident to anybody who's done anything with crosswords. And so if you look at the top, the id is mnemonic.
It's in the file name and it's easy to type. It's a short and recognizable abbreviation for each publication, along with the date. So here, nyt means New York Times, and the date is in year-month-day format, which makes it easily sortable in time. So overall, this id has some very good shelving properties.

And then inside the file, crosswords basically have three sections. At the top there's metadata, which is in key-colon-value pairs, which again is easy to parse. It uses the standard crossword keys: title, author, editor, date, etc. And it's extensible by using other metadata keys, like rebus, for puzzles which have grid squares with a symbol or more than one letter in them. If you didn't know they could do that, they do sometimes, and it screws you up totally. Then there's the grid, which is the actual solution to the puzzle, with one character per grid square and an octothorpe, or hash mark, to indicate black squares. And then finally there are the clues. Here, unlike in the other formats that I mentioned, the clues are self-contained, one per line, including the direction and the number, and then the answer after a tilde.

This format is very stable. If multiple people litz, or transcribe, the same crossword from the same source, they should get identical xd files, which then also shelve under an identical id. And this format is also robust to hand editing: if a structural mistake is discovered after an edit, the original puzzle should be able to be reconstructed.

So then I put the format into action, and if they make a movie about this, then this is where the montage would go. I don't know what song they would use. But I made a janky pipeline to scrape and convert and shelve the crosswords, and I got the 24,000 puzzles from XWord Info, and I found some other public crossword sites and I scraped them too. Most crossword sites actually only go back like six months or a year. And there are many syndicated sites to wade through; the same puzzles are sometimes
published on a staggered schedule. But I found a site that had the past 10 years of USA Today puzzles, and then I found a site that had crosswords going back 10 years from Universal Uclick, which is the major crossword syndicator. This took about two months of regular work on my evenings and weekends.

So now I had the start of God's own crossword corpus. I had 30,000 crosswords organized on the file system in a simple text format of my own design. This is all I really wanted in the first place, right? So I should be happy. But instead I was pretty tired, and I had gotten pretty busy at work, and I started getting bored with the project, and honestly I kind of just wanted somebody else to take up the analysis. So I posted a link to the data on Reddit, in r/datasets, to, quote, "share it with people who can do better analysis than I can." And I said they are organized and cleaned and reduced to their utter essence in a carefully designed bulk text format. I was pretty proud of the work that I had done. I just didn't have the energy to do anything more with it.

Or so I thought. Because sending it out into the world apparently inspired me enough to play with it just a little. And it turns out that, of course, a clean data set makes for easy exploration. So it's like nerd sniping; I kind of nerd-sniped myself. And I decided, okay, I could look for an hour or two. And so the first question I had goes right back to what I was looking for in the first place: are there any patterns in grid fill?
I mean, I don't know, maybe there are identical corners or sections. And I thought, hey, you know, it'd be really easy with this format to find out if there are any duplicate full rows across crosswords. So I used my trusty Unix utilities, grep, sort, and uniq, and if you went to Nick's talk yesterday, he showed how to do some of this stuff. They come standard on any Linux or OS X system and work great with simple text. And so I wrote a query against my database, and this is actually it. This took about five seconds to run, and this is literal output from that command.

So if you look at these, these are actually theme answers. Themes are like the backbone of most crossword puzzles. They're the punny grammatical overreaches that make people groan. They're like an interesting slice of culture. You can see some of the familiar idioms here: flash in the pan, tooth and nail. So I looked into some of these duplications, and it looked like themes are reused more often than you might think.

Well, okay, how about identical non-theme rows? I changed the query slightly, and it took just another few seconds to complete, and it turned out that many lines had duplicates in other puzzles. And so I used another standard tool, diff, to compare the xd files, and that showed that they were not actually reprints. Only some of the puzzle was changed, and even more, the attributions for the puzzles were different.

So I pulled on that thread, which kept me going, and that led to another and another, and that weekend, honestly, is kind of a blur. By the end of it I had found many interesting anomalies. For instance, this one, excuse me. This was the largest timespan difference.
This was originally published in the New York Times in 1955 and credited to Jack Lasado, and then a nearly identical grid was published in the New York Times in 1984, 29 years later, and credited to Daniel Gerardi. In this diff, colors indicate changes. Only two letters in the grid have actually changed, but nearly every clue in the entire puzzle was rewritten.

That diff output, by the way, was made with just two lines of Python code using a built-in standard library module called difflib. And this was enabled by having the data in a presentable text format in the first place. It's not really the prettiest, but it is completely functional, and it's accessible to everyone, including non-technical people.

So here's another anomaly that I was very curious about. It turned out that many of the duplicated puzzles had something in common: they were all edited by a Timothy Parker, who was the stated editor of both the Universal Uclick and USA Today crosswords. At the time I knew nothing about the crossword industry. I thought it might be a fake name, like a pseudonym for a syndicate of constructors, until I found his Wikipedia entry and his claim of a Guinness record for most syndicated constructor. I spent more than a few brain cycles trying to come up with reasons why his puzzles might be legitimately reattributed. But this puzzle here was the real clincher, because the puzzle on the right is from USA Today in 2013, by Elizabeth Gorski, who is a prolific crossword constructor. She's very well known and well loved in the crossword community. And the one on the left, published by Universal Uclick in 2006, seven years earlier, was attributed to somebody named Tim Burr. So what was going on? Why would a major newspaper republish an old crossword attributed to an obvious pseudonym?
And then change the attribution to someone who really existed?

So I wrote a crossword grid comparator and ran it over all the puzzles. I threw this list together to enumerate the anomalies. It was comprehensive, but actually minimal effort: I just simply sorted by grid similarity, and I threw it up on my website, linking to those scrappy diffs that you just saw. So three whole days after I posted the data sets, I sent this to the crossword list. I had pulled out three examples, the two that we just saw and one other one, and I said I found some pairs of crosswords that are strikingly similar, and I wondered if, quote, "Timothy Parker might be a little loose with attribution," and I presented this as a curiosity. I was very careful not to use the P word, because crosswords may be trivial, but the crossword community takes them very seriously.

Well, suffice it to say that people took an interest. The first response was "you may be onto something here," and there was much discussion on the list and off. And this is where Ben Tausig from the intro got his information and posted to Twitter, as we saw.

As the hubbub grew louder, I took advantage of people's interest in my work. Someone on the crossword list actually publishes a huge clue-answer database once a quarter, and so I asked him for the collection of puzzles that I knew he must have. I knew that he would not have given it to me before this hashtag gridgate blew up, but since it had, he asked his source for permission and then did give it to me. Again, it's a shame that this data does exist but is so difficult to obtain.
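A grid comparator of the kind mentioned in this part of the talk can be quite small. The sketch below is an assumption about how such a comparison might work, not the code from the project: the function names `grid_similarity` and `rank_pairs`, the cell-by-cell scoring rule, the threshold, and the tiny example grids and ids are all made up for illustration.

```python
from itertools import combinations


def grid_similarity(grid_a, grid_b):
    """Return the fraction of squares that match between two grids.

    Each grid is a list of row strings, one character per square,
    with '#' for black squares (as in the xd grid section).
    Grids of different shapes are scored as 0.0 (an assumed rule).
    """
    if len(grid_a) != len(grid_b):
        return 0.0
    if any(len(ra) != len(rb) for ra, rb in zip(grid_a, grid_b)):
        return 0.0
    total = sum(len(row) for row in grid_a)
    if total == 0:
        return 0.0
    same = sum(a == b for ra, rb in zip(grid_a, grid_b) for a, b in zip(ra, rb))
    return same / total


def rank_pairs(grids, threshold=0.5):
    """Compare every pair of puzzles; return (score, id_a, id_b), most similar first."""
    results = []
    for (id_a, ga), (id_b, gb) in combinations(sorted(grids.items()), 2):
        score = grid_similarity(ga, gb)
        if score >= threshold:
            results.append((score, id_a, id_b))
    return sorted(results, reverse=True)


# Tiny made-up 3x3 grids; the ids are hypothetical.
example = {
    "aaa2006-01-01": ["CAT", "A#O", "BEE"],
    "bbb2013-01-01": ["CAT", "A#O", "BET"],
    "ccc2010-01-01": ["DOG", "O#O", "TIP"],
}
for score, id_a, id_b in rank_pairs(example):
    print(f"{score:.2f} {id_a} {id_b}")
```

Comparing every pair like this grows quadratically with the number of puzzles, which is consistent with exhaustive comparison over tens of thousands of grids needing long batch runs.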
So anyway, I got the data set, and holy forking shirtballs. I thought 30,000 crosswords was quite a lot, but this collection had almost 80,000 puzzles in it. And I want to give a big shout-out to a guy named Barry Haldiman, who collected crosswords very diligently for over 20 years and saved everything in the source .puz format.

So I spent the next week, before the 538 article was published, ingesting thousands more crosswords with my janky pipeline, discovering what directories and file names meant in the motherlode, shelving crosswords properly in my corpus, and figuring out the overlap between his collection and mine, which was substantial but not total. I was comparing grids exhaustively, night and day. I would start a batch before going to sleep and another one before going to work. And I discovered that week that managing 50,000 of anything is pretty challenging, especially under time pressure.

So this is the second iteration of results that I pushed online. Now there were 50,000 crosswords, and in this version you can click through to each publisher. This is what was linked to from that 538 article. All these old URLs are still available, by the way, because I take my responsibility as a data nerd very seriously.

Now, at this point it had been 10 days since I posted to Reddit's r/datasets, trying not to do the analysis work. And for my failure in that regard, I got my 15 seconds of fame, which I have to say was stressful and time-consuming. Many reporters contacted me looking for a choice quote to put in their summary of the 538 article, because everyone loves a scandal, and the narrative had already been set. So they weren't interested in the process nor the nuanced facts, and I was not prepared for the onslaught. I don't have many regrets, but I do actually kind of regret that I didn't have something to sell. And I don't actually mean merchandise or stickers or anything to hasten my retirement, although that would have been nice. I mean actually
my own career and story. The local NPR station actually figured out that I lived in Seattle and asked me to come on one of their radio programs. But at this point, having seen a bit behind the curtain, I had become very distrustful of the media, and I was not comfortable with the attention. I hadn't groomed my online persona in quite a while, and by the time it happens, it is too late. So my advice, if you think this might happen to you, is to be ready for the stampede: have a website with a coherent presentation and some way for people to engage on your own terms, even if it's just "please sign up for my newsletter."

So after that died down a little bit, I kept working for another six months, because this kind of data stuff is important to me. I imported another 20,000 crossword puzzles from the Haldiman collection, I improved and automated my janky pipeline, and I came up with a workable visualization. All of my code for this project, by the way, is open source and on GitHub, and it's written mostly in Python 3 with no dependencies. I also figured out how to host the entire project on Amazon Web Services for less than 10 bucks a year. Thank you.
I'm proud of that.

So this is my visualization overview. It shows the scope of the entire collection as well as the scope of the plagiarism, so you can see them by comparison. Along the x-axis is years, and along the y-axis is publications. The color indicates the kind of duplication: yellow indicates a reprint, where the author is the same; pink generally indicates a theme copy only; and red indicates a grid duplication with different authors. Within each publication-year square there are little bar graphs broken out by day of the week, and this way quantities can be directly visually compared.

So, some takeaways from this, and here's a zoom in to the damning portion of the visualization. You can see that the scope of the plagiarism was egregious, but actually mostly finite. There are plenty of other potential instances of plagiarism, which I think are actually quite interesting, but they are completely overwhelmed by Timothy Parker's malfeasance. And interestingly, he wasn't even doing it that much anymore when the scandal erupted. Although I say not much: it was only nine puzzle-scale plagiarisms and three theme reworkings in that year, which for anybody else would have been a big deal.

Here you can also see an odd yellow-red line starting in 2012 for USA Today. That is when Timothy Parker and USA Today started running pure reprints of puzzles on Saturdays. So why is it half red?
Well, it turns out they're copies of other plagiarized puzzles. And this was not obvious before they were broken out by day of the week.

You can dive into this visualization too, first for a list of puzzles by publication year, and then into a specific duplicate puzzle comparison. For instance, this is that same puzzle comparison we saw earlier, with the USA Today puzzle on the right from 2013 by Elizabeth Gorski, and on the left by Tim Burr. Note that the individual puzzle comparison now looks more polished, but the basic xd structure and format is still evident. The clue format is identical to the raw xd file, even with a tilde to separate the answer from the clue.

So here is more of the full story behind that odd pair. The puzzle that triggered my spidey sense turned out to be one of those legitimate reprints. The original puzzle was published in 2000 and attributed to Elizabeth C. Gorski. The grid is identical, the title is the same, and the author is correctly attributed in the reprint, although Timothy Parker did include himself as an author also. But there's more. The puzzle was first republished in 2005 with the upper left corner of the grid changed and the puzzle author listed as Lucha Coal. Then it was republished in 2006 under Tim Burr, as we saw. And then in 2009 the grid was gutted and the theme reused. So in total, at least as far as I found anyway, the puzzle was reprinted not just once but four separate times. And you can tell these were not accidental reproductions, because in every instance the black squares in the grid are also placed identically. And this is just one puzzle out of hundreds that were similarly hacked up and reattributed.

So the simple xd format was indeed instrumental in discovering the scandal. If I had thought to go looking for it, I suppose I could have found it with other formats. But I wasn't going to go looking for it. This format and structure made it so trivial to do these explorations and comparisons that it literally just fell out of
the data.

These days the xd code is still running nightly, collecting and comparing crosswords, but I don't have a whole lot of time to do more with it. So if anyone is interested in playing with an epic data set, peeled and cleaned like so many baby carrots, this is the largest collection of crossword data on the planet. You should especially talk with me if you want to play at being a crossword librarian, because there are still hundreds of puzzles that need to be imported.

Finally, I'd like to encourage you to explore your own data sets in a less directed fashion. When you don't know what you're looking for, try a format-first approach. Make the data exploration easy and frictionless. There's some really low-hanging fruit out there just waiting to fall out. And if you find some, please let me know.

I'd like to thank Anya for her support in putting together this talk, and I'd like to thank you for listening. Thank you.
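As one possible starting point for a format-first exploration, here is a minimal sketch of reading the xd layout described in the talk (metadata as key-colon-value pairs, a grid with `#` for black squares, one clue per line with the answer after a tilde) and then looking for full-width grid rows that appear in more than one puzzle. Much of it is assumption rather than the project's actual code: it supposes the sections are separated by blank lines and that clues look like `A1. Clue ~ ANSWER`, and the sample puzzles, ids, and function names are invented for illustration.

```python
import re
from collections import defaultdict


def parse_xd(text):
    """Split xd-style text into (metadata dict, grid rows, clues).

    Assumes sections are separated by blank lines: metadata first,
    then the grid, then every remaining block holds clues.
    """
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    metadata = {}
    for line in blocks[0].splitlines():
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    grid = [row.strip() for row in blocks[1].splitlines()]
    clues = []
    for block in blocks[2:]:
        for line in block.splitlines():
            # Assumed clue shape: "A1. Clue text ~ ANSWER"
            label, _, rest = line.partition(". ")
            clue, _, answer = rest.rpartition("~")
            clues.append((label.strip(), clue.strip(), answer.strip()))
    return metadata, grid, clues


def shared_rows(puzzles):
    """Map each full-width row with no black squares to the ids that contain it."""
    seen = defaultdict(set)
    for puzzle_id, text in puzzles.items():
        _, grid, _ = parse_xd(text)
        for row in grid:
            if "#" not in row:
                seen[row].add(puzzle_id)
    return {row: sorted(ids) for row, ids in seen.items() if len(ids) > 1}


# Two tiny made-up puzzles; the ids and contents are hypothetical.
SAMPLE_A = """Title: Example One
Author: A. Constructor
Date: 2006-01-01

CAT
A#O
BEE

A1. Feline ~ CAT
A3. Honey maker ~ BEE

D1. Taxi ~ CAB
D2. Foot digit ~ TOE
"""

SAMPLE_B = """Title: Example Two
Author: B. Pseudonym
Date: 2013-01-01

CAT
A#O
BET

A1. Mouser ~ CAT
A3. Wager ~ BET

D1. Hired car ~ CAB
D2. Small child ~ TOT
"""

puzzles = {"aaa2006-01-01": SAMPLE_A, "bbb2013-01-01": SAMPLE_B}
print(shared_rows(puzzles))
```

Because every puzzle has the same plain-text shape, the whole exploration reduces to splitting strings and counting, which is the property the talk attributes to a high potential format.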