Hello, and thank you for coming. My name is Matt Levin. Matthew Levin, but you can call me Matt. I'm going to be speaking today about teaching and doing digital humanities with Jupyter Notebooks. On slide one here I just have my affiliation, my Twitter handle, @MJLevin80, and my username on GitHub, which will come up later, so I thought I'd share that now in case anybody wants to GitHub-stalk me as well. Feel free to jump in if you have a question. This is the first time I'll be giving this particular talk, so I actually do invite you to raise your hands as we go. There will be questions at the end as well, but I think it will be helpful if people jump in when they hit a point of confusion. And if you're a little bashful about doing that, that's okay: feel free to tweet a question to me and I'll check at the end. That's totally fine.

To begin, I want to introduce myself a little. This is my first time here. As I said, my name is Matt Levin. My background is in literary studies and the history of the American publishing industry. I did a doctorate at the University of Iowa, finished in 2012, became interested in digital humanities that same year, and did a postdoc the year after at the Center for Digital Research in the Humanities at the University of Nebraska-Lincoln. In my current position I'm director of the Digital Media Lab at the University of Pittsburgh. That's in the English department, so it's an interesting mixture of coding and literary studies. I support four programs: literature, writing, composition, and film studies. I also teach digital courses. I've only been seriously coding for about three or four years, mostly working in Python that entire time, and I really only drank the Jupyter Notebooks Kool-Aid this past March, so I want to thank Matt Burton at the University of Pittsburgh for convincing me to try it out. I'm going to talk mostly about the way I have used Jupyter Notebooks so far, what I think they represent, and why you might be interested.

First, maybe a little about you. You don't have to belong to one of these categories; I just tried to anticipate who might be interested in this and why. Maybe some of you are programmers or developers who at some point have to explain or teach code to non-coders. Is that fair to say? Some of you might be Python beginners who are trying to learn more code, and Jupyter Notebooks might represent a way of doing that. You might be a data scientist, a data munger, a wrangler, et cetera, who wants to share results more effectively; Jupyter Notebooks can do that. You might just be a Python lover who's wondering what the digital humanities are. I get that question a lot. And finally, although I didn't necessarily advertise this, you could just be someone who loves H.P. Lovecraft, and if you do, you are in for a treat, because I'm going to talk about H.P. Lovecraft. Come on in.

So here's my basic outline, which I will break very soon. I'm going to talk a little about Jupyter Notebooks and what they are; only a little, because it's pretty Google-able, and some of you probably already know it. Then a little about the advantages of the technology, and the potential drawbacks and challenges that I see; I have some solutions to those, and feel free to jump in. Then a little about how we might share Jupyter Notebooks.
Many people are working on very interesting things there, and I'm basically going to talk about three things in that camp. Then most of what I'm going to talk about is how I have used Jupyter Notebooks for the digital humanities; that will be the bulk of my presentation. And then what I'll do is try to very deftly switch from this PowerPoint presentation to an actual Jupyter Notebook context and then switch back so seamlessly that everyone is super impressed and it seems like I'm just really good at that sort of thing. And then I'll outline some future goals.

Okay, some of you may know this, but Jupyter Notebooks bill themselves as open-source, interactive data science and scientific computing across 40 programming languages. The website is jupyter.org. Some of you have probably done something with IPython or IPython Notebooks, and that's really the roots of this project, but the group working on it rebranded it by taking the names Julia, Python, and R and pushing them together: Jupyter. It's really cool. Basically what you get is a main dashboard with control panels, these things called kernels, and then these notebook files. They use a web browser as the graphical user interface: you run a notebook server from the command line (from Terminal for me, I'm on a Mac), open up the graphical interface, and you can just write Python in this context, but you get something a little more interactive and a little more graphical.

So, the advantages, and a little about how it works. For the most part you start by running it locally. It works with the pip-install workflow I'm guessing most of you are used to if you use Python: pip install jupyter, start up a virtual environment, keep a requirements.txt file in that environment, and then run the notebook server. Once you're in Jupyter, you have these different cells, which can hold straight Python code, or Markdown, or raw cells that hold any kind of raw text or code. And then you can use these things called magic commands. This is all pretty Google-able, but one really cool thing about Jupyter Notebooks is that you add a small flag at the top of a cell and you're writing bash instead of Python. For just a handful of things, if anybody has had this experience, the Python way of doing it is a lot of work and the bash way is one line of code, so you can just switch back and forth between the two.

These things are also very easy to share, right on GitHub, and I'm going to show you that in a bit. And because you can write Markdown and other formats like that, there's the potential to mix code with narrative cells, text that explains what you're doing. A lot of you are probably used to doing that with comments, but here you can have a paragraph-long text block that explains not just the code but the narrative running through it. So it's actually a hybrid narrative form; I'll show you an example in just a second. The general perks: if you're trying to introduce Python to a beginner, it's better to be experienced yourself if you're setting this up for someone, but using a Jupyter Notebook is great for beginners in my experience.
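Before I get to the second perk, here's a minimal sketch of the bash switching I just mentioned. The file name corpus.txt is hypothetical; the first cell is plain Python, and the second is flagged with the %%bash cell magic, so its body is bash rather than Python:

```
# Cell 1: plain Python -- count the lines in a (hypothetical) corpus.txt
with open('corpus.txt') as f:
    print(len(f.readlines()))
```

```
%%bash
# Cell 2: the same job as one line of bash, thanks to the %%bash cell magic
wc -l corpus.txt
```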
Number two, if you're teaching code, you can set all of this up ahead of time and eliminate the setup phase. If you've ever run a workshop where you're trying to teach Python, the first few things you do are: install Python, then probably install pip, then deal with needing sudo or some higher level of access, then start installing your libraries, and then discover that a handful of people in the room have a PC, so everything is different for them. There are all kinds of ways you can use up half an hour of your workshop time doing all that stuff. The way around it is to start with a Jupyter Notebook. And for me the big thing here, for beginners using code (and I do a lot of this work), is the mindset shift: is it graphical, or is it a command line? You just can't get around it; at some point you have to get people thinking in a command-line, line-by-line way. And I think Jupyter Notebooks are a great way to get people thinking about that interaction between the graphical and the command line.

So, some potential drawbacks and challenges to be aware of if you're thinking of running some stuff in Jupyter. There are issues with scale: these kernels can get overwhelmed and freeze up, for example when you're running a server. And right now it's hard, because a notebook is basically run live: everything can be changed by the user. If you want some sections to be immutable and frozen while others are playable by your users, it's probably not the best tool right now for that type of work. I've seen a lot of chatter about this: you just want to write a notebook and hand it to a group of users, and right now setting that up is harder than I think it should be. I also really do think these things work best when you have access to your own system; you're going to encounter differences, for instance differences with PCs. And then there's deployment: if you want people to have notebooks they can look at, you have to decide what kind of context you want to plan for. That's something that's hard. I'll show you what I did; it is a challenge.

First, though, and I already mentioned this, to share these you have to use GitHub, so you need the absolute basic knowledge there: how to create a repository, how to clone a repository, how to add, commit, push, and synchronize, all that kind of basic knowledge.
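For reference, that handful of commands looks roughly like this. The repository name is hypothetical; you can run these in a terminal, or even from a notebook cell flagged with %%bash:

```
%%bash
# one-time: copy the repository to your machine (hypothetical URL)
git clone https://github.com/yourname/workshop-repo.git
cd workshop-repo

# the everyday cycle: stage, commit, push
git add lessons.ipynb
git commit -m "add workshop notebook"
git push origin master
```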
You also need to be familiar with the virtual environment and requirements.txt methodology. If you're a beginner and haven't done anything with that, it's pretty easy to learn how to do, and it's just a wonderful thing to have under your belt. And then you have things like JupyterHub, things that would need to be installed on a web server and could then be run by a user. You can do things like setting up OpenID login for these Jupyter instances, so a user logs in and has their own little Jupyter instance. That's pretty cool, I think.

So, the first time I used these with any sense that other people besides me would need to use them and look at them: I ran a digital humanities workshop at Carnegie Mellon University. It's a week-long seminar for humanities graduate students with no digital humanities or coding experience; some of them might have had some, but it was not a requirement in any way. I came in and gave a talk introducing literary studies in digital humanities, and then I led two afternoon workshops, each with about six students. Relatively small, but still substantial, I think, for something like this. What I did was set up a Jupyter Notebook server for my workshop on the Pittsburgh Supercomputing Center virtual machine I had been assigned, and I also created a backup option using this thing called mybinder.org. I ran my session using this, and I also supported the computational linguistics professor, Na-Rae Han, in using the same architecture for her computational linguistics workshops. Then I basically took everything and pushed it to GitHub. The address is there, and it's also pinned to the top of my Twitter page if you want to look at it. You're more than welcome to go ahead and start looking at what Jupyter Notebooks look like as I talk.

So this is the part where I gracefully minimize PowerPoint and hopefully get into the CMU workshop GitHub page here. I'm going to zoom in just a little so you can see it. This is a fairly typical GitHub repository, with probably a lot more documentation in the readme than would be necessary in the software engineering world, but it has lots of instructions for the participants, as well as for someone who wanted to run this exact workshop themselves or base a workshop on this one. What we have here is a partial introduction to digital literary studies. It has a workshop schedule with instructions for each item. The session lasted, I think, about two hours with a little break in the middle. I started off by giving the students excerpts of literature and had them guess what year they thought each was published. I thought that would be interesting because the work I'm doing right now is on this subject called chrono-stylometry, which is the computational analysis of the year of origin of a text: using computational techniques to ascertain what year a text is probably from, but also asking what happens when a text looks like it's from an era it's not from. "What happens when literature refuses to act its age" was the name of a paper I gave. So I had students go through these literary excerpts and guess, based on their own intuition.
What I was trying to get at is that there are actually lots of contextual clues in any average paragraph that would give you an impression, probably within 50 years, of when something was published, right? Unless it's doing a really good job of faking it. You get contextual clues like references to technology, or references to things that just feel very modern, and we actually have pretty good intuition for that type of thing. Then I did 20 to 40 minutes of absolute bare-bones basics: here's what a variable is in Python, right? Just getting people acquainted with the feeling of working in a Python environment. Then we launched the server and had them create a string and print it, change that string to an integer, see what happens, try dividing the string '5' by 10 and see what you get back, that type of stuff. I'm sure a lot of you had a moment in your own training where you did exactly that. What the Jupyter Notebook lets you do is fiddle, and it gives you feedback. You can do that in the regular interpreter if you just go into Python, right? But here you get a mixture of permanence and that mutability. It's really cool.

So this is the GitHub, sort of static, version of this. If you click on an .ipynb notebook, you're going to get a rendered version that's static, and I'll show you that in a second. If you click on this link down at the bottom, for those of you following along: you create this, by the way, by going to mybinder.org and then pasting the badge it generates into your readme. When you click that badge, it launches something that looks like this. And I'll warn you, if you're interested in doing this, it takes a while. What it's doing when you click is taking your requirements.txt file and actually building a virtual machine that runs a Jupyter Notebook server around what you've created in your GitHub repository. So everything you've asked for in terms of dependencies gets installed. You get something that looks like this: a Jupyter Notebook server. You'll notice immediately that you have a kind of folder-structure view here, with these notes that say "running". You also have this upload button, so if you want to add files or materials to your Jupyter Notebook, you can do that graphically if you want; you can also do it non-graphically by pushing them to GitHub. And you have this "new" button where you can launch a new Jupyter Notebook file. Inside my repository I have these notebooks that are already created. I have one that was for me, and one called "for snippets" that's just empty. When you go into one of these (and hopefully this will work, because these do expire; it's launching a temporary server, and I set this one up probably 20 minutes ago, so it'll basically expire within that window), I just refresh, and it launches a notebook. Notice right at the top that I have import numpy as np, I have import matplotlib, and then I have this thing that is a magic. It's the only magic that I currently use.
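For anyone reading along rather than watching, that opening cell looks roughly like this. This is reconstructed from what's on screen, so treat the details as approximate; note that the % line is an IPython magic, not plain Python, so it only runs inside a notebook:

```
import numpy as np
import matplotlib.pyplot as plt   # the slide shows "import matplotlib"

%matplotlib inline                # the one magic I use: render figures inline
```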
That magic, %matplotlib inline, means that when I create a visualization using matplotlib, it actually prints into the notebook file instead of opening up a PNG or something like that. I think that is a really cool feature. And then it's just Python, right? We print the type of md_text; it's Python 3, so we need our parentheses around print. We could write a function in here and reference it later. What we do with these cells is write Python, hit the play button, and it executes. We get warnings about Anaconda, and then we get our <class 'str'> back. So basically it creates this really cool interactive experience where a user can see the code you've already written, execute it bit by bit, and also change it and have it actually change, right? And remember, I launched this just now, but I don't want to run type anymore. This will take a second; again, we have issues of scale. This might be gigantic, actually: it's the entire text of Moby-Dick from Project Gutenberg. So there it is in a cell, and I can scroll through and actually look at it.

I think what this really comes down to, for these introductory students, is a tremendous advantage: being able to interact in a context where you have everything you would want with Python, including all the libraries you'd want to import, and to produce that experience in a way that is partly graphical and partly command line. A huge aspect of what I'm interested in and what I do is this buzzword, code literacy, and this is a fantastic tool for that: introducing people to, and making them more familiar with, elements of code literacy.

So I'm going to go back here and talk about my next use case, which is a little different but has to do with H.P. Lovecraft. I gave a paper at a digital humanities conference about a month ago, and it was on this. Here's H.P. Lovecraft, in a letter, in which he says: "When I was 10 I set to work to delete every modern word from my vocabulary and to this end adopted an old Walker's dictionary from 1804 which was for some time my sole authority. All the Queen Anne authors combined to form my literary diet." He also said: "I am certainly a relic of the 18th century both in prose and in verse. My taste in poetry is really defective for I love nothing better than the resounding couplets of Dryden and Pope." He wrote this in the 1920s, and mostly the 30s, before he died. He's a modern author. He's thought to be kind of the godfather of the horror genre before horror was called horror, and he's very much a modern writer in that sense, writing for a lot of pulp magazines. But here he is saying: I really belong to the 18th century in these particular ways. So what I thought is that these letters are a provocation to look deeper.
So I wanted to think about Lovecraft individually, but also to raise questions about horror itself, because horror so often (for those of you who like it) deals with ancient evils, these unwritten and folkloric histories, the return of the repressed sins of the past. For those of you who've read Poe, it's all over Poe. And then the question is: is there a dominant set of terms from a particular era that horror authors employ to create that feeling? As in: we're writing horror in the 20th century, but we're trying to look and feel more like Frankenstein or Dracula, these 18th- and 19th-century documents. Lots of people have actually written about this; many, many scholars have. I'm not going to talk about them here, but I wanted you to know I'm not making this up from scratch. I'm continuing a tradition of thinking about what the Gothic novel is and how it relates to horror. People are wondering that.

One thing digital humanists like to do, or at least that I like to do, is find a quantitative edge to a humanities question. So my questions were: is there an observable presence of archaic, or classical, or pre-1800 terms in Lovecraft and/or the horror of Lovecraft's time period? Is there an observable absence of neologisms in these texts, words that came into the language more recently? Did they avoid those words on purpose? And can a machine learning approach be employed to better explore the relationship of time period to genre? I wanted to get into machine learning to see if we can train a model to ascertain the date of a text, and then run that model against Lovecraft to see if he can fool it. I thought that would be interesting.

In terms of presence, I computed what is basically the ratio of pre-1100 words divided by 1100-to-1700 words. I also scanned the Walker's dictionary and computed a ratio of the terms in that dictionary: do these authors have a higher ratio of terms that appear in that dictionary than other authors do? Kind of a weird thing you wouldn't do in any other context, but Lovecraft said he studied that dictionary, so I wanted to do the same. I also wanted to check whether neologisms were absent, so I needed a data set of neologisms for that. And then for machine learning, I did supervised learning: I took these ratios and trained a supervised model in a bunch of different ways, all in Python, all using scikit-learn (for those of you who've done scikit-learn machine learning, it's very easy to use), and I just wanted to see how this would perform. The next step is more sophisticated machine learning, but I haven't done that yet.

What you get is things like this, which shows that over time, paradoxically, literature actually becomes more Germanic, which is to say newer documents use older words with greater frequency. The reason is that Germanic words are more common and everyday, while Latinate terms are more erudite and sophisticated. So as texts become more modern, or modernist, they actually start using older words, which is an interesting paradox. This is the exact same visualization using Oxford English Dictionary data instead of dictionary.com data. The next one describes the actual data set of literature: 950 works of fiction from 1750 to 1989. We have basic metadata and term frequencies for these.
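Since the code is on GitHub anyway, here's a sketch of the shape of those ratio calculations. Everything here is hypothetical scaffolding rather than my exact code: term_freqs maps each word in a text to its count, entry_year maps a word to its approximate year of entry into English, and walker_terms is the set of headwords from the 1804 Walker's dictionary.

```
def germanic_latinate_ratio(term_freqs, entry_year):
    """Ratio of pre-1100 ('Germanic') tokens to 1100-1700 ('Latinate') tokens."""
    old = sum(count for word, count in term_freqs.items()
              if entry_year.get(word, 9999) < 1100)
    middle = sum(count for word, count in term_freqs.items()
                 if 1100 <= entry_year.get(word, 9999) <= 1700)
    return old / middle if middle else 0.0

def walker_ratio(term_freqs, walker_terms):
    """Share of a text's tokens that appear in Walker's 1804 dictionary."""
    total = sum(term_freqs.values())
    in_walker = sum(count for word, count in term_freqs.items()
                    if word in walker_terms)
    return in_walker / total if total else 0.0
```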
The works are tagged as detective, Gothic, and science fiction, and the person who did this is a scholar named Ted Underwood, so I basically grabbed his data set off GitHub so that I could have a comparison. And this is basically Germanic-to-Latinate ratios by year and genre. In red we have horror and Gothic; in blue we have not-horror-and-Gothic; and this black dot right here is Lovecraft's Germanic-Latinate ratio. What we see is that he's really not in any way an outlier in terms of Germanic-Latinate ratios. He's actually fairly typical of his period. It's not that he couldn't still be in the swarm down here, but he's not in the middle of the 1800s pack; he'd be at the top end of that.

These are Walker ratios for fiction, so how does this work? You'll notice the smoothing, by the way: the Lovecraft dot is here, the more modestly smoothed Walker ratios are in blue, and the heavily smoothed combined line is this gray one. Lovecraft is right on the line, so he's basically as typical as you can get in terms of his Walker ratio. What I was hoping we would find is that Lovecraft's Walker ratio is super high, that he's literally going through the dictionary and making sure every word is in there. That is not the case; he did not do that, as far as we can tell. These are Walker ratios again, basically the same thing, showing Lovecraft dead center. As you can see, it's not quite a normal distribution, but it's close, and nevertheless Lovecraft is right at the mean.

This one is the test of absence for neologisms. What we see is that these post-1700 words behave just as you would expect: their presence climbs up and up as you approach 2000, which is exactly what you would hope would happen with neologistic terms. It's fairly noisy data, though. The neologism percentages are never quite zero, because of things like OCR errors, or because the neologism data itself is a little bit wrong. But as you aggregate all these neologisms, you get this upward climb, which is what you hope to get if you have good data. This is the same thing except it splits the terms: the blue line is words that came into English between 1700 and 1750, the red is 1750 to 1800, et cetera. These dots are Lovecraft, right here; they're hard to see, so my next visualization will show what they are. Basically, Lovecraft has more 1700-to-1750 neologisms in his texts than average; he's at the high end there, though probably within one standard deviation of the mean. The second you go to 1750-to-1800, he drops. Go to 1800-to-1850, and he drops even more; 1850-to-1900, even more; and by 1900-to-1950 neologisms, he is in the absolute bottom 10 percent. So we have a not-quite-normal distribution, but pretty close, and strong indications that Lovecraft was either deliberately avoiding these neologisms, or his style was so affected by his early training that he tended to avoid them instinctively.

So then I did the machine learning, the supervised approach I already outlined, and what I ended up with was a Gaussian naive Bayes classifier. And this is not good, right? Basically, I was able to train a model that can guess the year of a text, plus or minus 35 years, with 74 percent accuracy. That's not good by machine learning standards; you would want something like 90 percent within 10 years if you were really doing this.
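As a rough sketch, assuming the features are the ratios above and that a prediction counts as correct when it lands in roughly a 70-year window (plus or minus 35), the scikit-learn side looks something like this. The file and column names are hypothetical:

```
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

df = pd.read_csv('novel_features.csv')            # hypothetical features file
X = df[['germanic_latinate', 'walker_ratio']]     # plus the neologism ratios
y = (df['year'] // 70) * 70                       # bin years so "correct" means
                                                  # landing in a ~70-year window

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))                # accuracy on held-out novels
```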
But I was really interested in this anyway, because it still produces a set of outliers that don't fit, and I wanted to see if Lovecraft was one of them. So here's what I found: these inside the band are off by less than 35 years, the stars are texts assigned the correct date, and these reds are off by 35 or more years. And here's Lovecraft, as one of the most marginal of that period. So using this more aggregated standard, he actually does come up: the model places him as more like early 1800s, pre-1850. Obviously that's not definitive; it just gestures at, hey, I wonder if there's more we can look at here. But I thought it was interesting enough to share. It's all in another GitHub repository called horror-genre. This is very sloppy, I apologize; it's a work in progress, but I wanted to show it to you. And I'm running this locally as well, right here; this is essentially the same thing. What I was able to do is create these Jupyter notebooks that produce and save the visualizations. They let you look right at my code and then see the result, all in context.

So if you're interested in, say, the neologism ratios: I don't know if anybody noticed, but the ratios we're dealing with in those texts are like 0.04 percent. They're absolutely minuscule; roughly 0.04 percent of the words in a text are neologisms. That's really low. Why is that the case? Well, it's because for most authors of the 1900s, the absolute majority of their words are either Germanic or Latinate. Most authors, most people writing, just don't use a lot of those more recent terms. And as you scroll down here, you can see how I would produce a visualization using a library like matplotlib or seaborn. Someone can actually go to GitHub, and this is the live version, so you can change it and hit play and hit stop and all that kind of stuff; but if you go to the GitHub version, you can look at something like this, which is the rendered version. This is a notebook called OED normalize, and it has a narrative, as well as code, that explains how I retrieved the Oxford English Dictionary data for those neologisms. The OED has a big database of terms and their year of origin in the English language. A lot of the years would be like 1805; for others it would be 180 followed by a question mark; and for some it would be "early Old English", a text field. So that's very much a standard data wrangling problem: I have to decide what to do with the words "early Old English". Do I turn them all into integers, or do I take all the integers and assign them to categories, or something like that? In turning that into really structured, CSV-style data, I basically wrote a narrative explaining the choices I made. And the thing is, for students, for my colleagues, for anybody with a humanities background, but even for data scientists, a huge aspect of this is that I had to make choices. Some of them were good choices, some were questionable, but many were just judgment calls, and they changed the result. So rather than do something arbitrary, hide that fact, and make my data look really clean, I subscribed to the absolute opposite philosophy: complete transparency about what I did, and an emphasis on the fluidity of that process.
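To give a concrete taste of the judgment calls I mean, here's a sketch of the kind of normalizing function that notebook narrates. The specific rules here are illustrative stand-ins, not necessarily the exact choices I made:

```
def normalize_year(raw):
    """Normalize the OED's mixed year-of-entry field to an integer or None."""
    raw = raw.strip()
    if raw.isdigit():                     # e.g. '1805'
        return int(raw)
    if raw.endswith('?') and raw[:-1].isdigit():
        return int(raw[:-1] + '0')        # '180?' -> 1800 (a judgment call!)
    if 'old english' in raw.lower():      # free-text values like 'early Old English'
        return 900                        # another judgment call
    return None                           # leave truly unparseable rows out

print(normalize_year('1805'), normalize_year('180?'), normalize_year('early Old English'))
```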
So I have a kind of open data philosophy, but also an open data-process philosophy, and I think a lot of people would agree with that as a general way of doing things; this does that. I just wanted to point out a couple of other things. I don't know if anybody noticed, but here I'm importing an SQLite database. You can do that right in your Jupyter notebook, and it's pretty cool to be able to jump into SQLite, a super powerful, lightweight database engine, and just use it in your notebook. Same thing with matplotlib; I find that really helpful. I did all the machine learning in a different notebook, so all that scikit-learn is being integrated into a Jupyter context as well. Back to this, for the machine learning models: basically, I created an SQLite database that has the variables for all 750 novels in that set, and the Jupyter notebook itself divides that into a training set and a test set. You can retrain the model right in the Jupyter notebook, then run it and get the results. The one thing that's not happening in the Jupyter notebook is the really heavy work of going through each novel and running the actual ratio calculation: going novel by novel, taking the full term-frequency table, and getting a ratio. All it does is loop through all the words and check whether they're in the dictionary, but to do all 750, against a MySQL database of term frequencies with something like 18 million rows, probably takes a couple of hours, so I ran that outside a Jupyter notebook. You could probably write something that does it a lot faster; I didn't use parallelization, for example, and that would speed it up. But I'm in a world where good-enough code plus waiting around is not really a problem. I know that if you're in software engineering, that's not okay. But for me, if I can just leave my computer running overnight and go get dinner, I'm fine. It's lazy, I know, but it's all a question of what the return is for that kind of work. Does that answer your question? Okay, cool.

So I just wanted to wrap up with conclusions and future goals. There's something called JupyterHub with DockerSpawner; I don't know if anybody's ever used it. What it does is, for every user in your JupyterHub, when they log in, it spawns a Docker image. So instead of actually writing to your web server, it creates a Docker container, which solves a lot of problems if anybody's had that type of experience. I haven't done that yet; if anybody has, I'd love to hear how you got it going.

[In response to an audience question.] Yes, absolutely. What I do to set up JupyterHub: people use DigitalOcean or something like that, some kind of virtual server; yes, AWS, same thing. You create your Ubuntu server environment and do a whole bunch of configuration, and then your JupyterHub user is actually writing data right to your hard drive as an Ubuntu user; they're assigned an Ubuntu username. There are any number of basic security measures to prevent them from having access to vital areas of that server, but you still have to basically pre-clear those users to be allowed to write data at all.
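If you want to see where that pre-clearing lives, it's in JupyterHub's config file. This is a sketch based on the dockerspawner and oauthenticator projects as I understand them, so treat the details as assumptions rather than a verified recipe:

```
# jupyterhub_config.py -- a sketch, not a verified recipe
c.JupyterHub.authenticator_class = 'oauthenticator.GitHubOAuthenticator'
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'

# the pre-clearing step: only listed GitHub users may log in at all,
# and admin users approve or add new ones from the back end
c.Authenticator.whitelist = {'an-approved-user'}   # hypothetical usernames
c.Authenticator.admin_users = {'the-admin-user'}
```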
So even though they can go in there and sign up using GitHub, through an OpenID-style framework where you log in with GitHub, somebody still has to log into the back end and approve that user. And for issues of scale, you don't want to have to do that. You want the whole thing to run on its own, with open sign-up or some kind of sign-up credentialing. I know there are other ways besides DockerSpawner to fix that problem, but I think DockerSpawner is one way, because once you're in that Docker instance, you're never touching the actual configuration or hard drive of the Ubuntu machine; you're in a completely containerized virtual space. Does that help? Okay.

So I would like to do that. There's also a way to do this where, if you're working at scale, you're connecting to a database in the cloud, on AWS for example. I haven't done that yet, but I'd like to. I would like to build a public collection of digital humanities Jupyter notebooks so that people can just jump in and do stuff; I think that would be really cool, and I'd have it be relatively open. And the last thing is that I really want to get into JavaScript and jQuery integration, because I think Jupyter notebooks would be very cool if they had basic design functionality. I would like to minimize an entire section of a Jupyter notebook and go to the next part: I'm not interested in the background information, give me a basic accordion, and I'll get down to the next thing. Or a table of contents at the beginning would be kind of cool. Just stuff like that; if you could do it in jQuery, it would make things simpler. So that's the end of my prepared talk, and I believe we have time for questions.

[In response to an audience question.] Yes, I totally agree. You can use matplotlib for a lot of that stuff, but maybe you just prefer (sorry, it's a Python conference) to use D3 for certain things, or you like the way those look better, or you want to write some CSS on top of it, whatever it is. So yes, exactly.

[In response to an audience question.] Yeah, any Python library that you have installed in your virtual environment would then be accessible from the Jupyter notebook. Exactly. One big exception: does anybody use NLTK for natural language processing? You know how you have to download the data, which gets stored to a folder, for the big data sets and corpora that are in there? That presents a challenge when you're using mybinder, because you put NLTK into the requirements file, so it automatically installs NLTK, but it does not automatically install the data. What you have to do is create a local copy of that data in your GitHub repository. The way I did it was with a pickle, but you could also just have a data folder, as long as it has the same structure as the folder the NLTK data lives in. So there are ways around it, but it is a complication; I'll sketch one version of the workaround in a moment.

[In response to an audience question.] I believe so. I'm not 100 percent sure, but if I'm understanding you correctly, I think it's trivial, very straightforward. Yeah, totally. It would be set up with a basic folder structure, and when you start the service from the command line, you just specify the folder where the notebook is. And if you were to, say, upload any file, it doesn't matter whether Jupyter can even understand it: I could upload my PowerPoint presentation to that folder using that upload button, and it would then be on the server in that folder. So again, that's the problem: potential security risks, exactly.
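Going back to the NLTK question for a second, the workaround looks roughly like this; the corpus name 'punkt' is just an example:

```
import nltk

# run once locally, then commit the resulting nltk_data/ folder to the repo
nltk.download('punkt', download_dir='./nltk_data')

# in the notebook itself, point NLTK at the bundled copy
nltk.data.path.append('./nltk_data')
```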
Any other questions?

[In response to an audience question.] So, JupyterHub: it doesn't work like GitHub. JupyterHub is something you install yourself, and it creates that user framework where people can log in. If you have set up JupyterHub on your server, it runs exactly like a Jupyter notebook does locally, so you could take a GitHub repository and clone it into your JupyterHub context, and it would run exactly the same, as long as the dependencies are all installed. That's the key. So you really do need someone watching that server: if my colleague Na-Rae needs pandas and I haven't pip-installed it into that virtual environment, when she hits import pandas, it's going to come back with an error. You can't actually pip install from inside the Jupyter context; you have to do that separately. Maybe? Yeah, I haven't actually done that, but for all I know it would totally work. I'm just checking to see if anybody tweeted a question; they didn't, so yeah, go ahead.

[In response to an audience question.] This is something I was talking to my colleague about just a week ago, because I wanted to do that. These things are built on top of, I believe, something like what you have if you're using Flask and Jinja: these decorated templates. Jupyter, I believe, is built on something like that. The problem is that if you change the template, it changes for every notebook across the entire kernel. As for doing it so it only changes one: I believe there is a project where people are trying to create exactly that, where the templates are separate, but I don't know what it is. So if you want to tweet me that question, I'll ask my colleague. Yes, I think that's right, if that was your question; I probably didn't understand your question. Cool.

[In response to an audience question.] Yes, I'm so glad you asked that, because I was supposed to mention it and forgot. Basically, if you have multiple collaborators working on one notebook on your server, you're going to have conflicts, and it's going to be a problem. What we did for the workshop was let everybody share a space but very carefully explain to them not to do that: we had everybody create their own .ipynb file named after themselves, and so we avoided that problem. And that's really one of those trade-offs: if you have JupyterHub, you're all separate, so you can't accidentally create conflicts, but then you can't have that shared code space unless you put it on GitHub or somewhere else. So again, it would be really nice if you had notebook files that only one person can get into, but then a shared folder that everybody can see. If someone is working on that, I don't know; I'm not claiming to know everything, there are many, many things I don't know and this is one of them, so feel free to tweet or talk to me. Other questions? We're fine on time, right? Like ten minutes. Or comments? Yes, that's fine.

[In response to an audience comment.] Hmm, yeah. And also, you would think a dictionary from 1800 would be a pretty good proxy for the year 1800, and it's not, because people who wrote dictionaries, especially back then, were writing an ideological document. They were imposing which words you should use, so they were actually very mistrustful of new words. Yes, I don't recall off the top of my head, but I do have a full list of all of that somewhere if you're interested; you're more than welcome to email me, or even dig into the data on the GitHub repository.
[In response to an audience question.] Yeah. Well, this is the cool thing, I think, about using something like Jupyter notebooks: it makes the content more accessible, to you and to me. There are a couple of big interventions in the history of the novel and things like that, but I would really hesitate to say that anything was caused by one thing, right? Still, there are these upticks in Germanic, for example, and one of the big interventions people talk about, the early one, would be The Spectator, this magazine, and Addison and Steele, who really emblematized, if not led, a movement toward a more common type of language: the language that, not the absolute poor, but maybe middle-class people would favor. That would be a move away from the Latinate. Austen and others of her period, by the way, would be heavily Latinate; that would be kind of the distinguishing feature of that style. And then the next big intervention, sometime in the late 1800s, probably has to do with the periodical marketplace, the rise of what we think of as the contemporary popular fiction movement, and novels having much more dialogue. Once you have more dialogue, you have more Germanic terms, so that's a huge implication there, and I think it's fascinating to see that just absolutely explode in probably the 1870s or 1880s with what's called literary realism.

The other thing I wanted to add, about this approach of your friend, the author who's trying to write Regency... oh, sorry; well, you shouldn't have corrected me, we could have just created the rumor that she's your friend. This method of matching dictionary terms, which I'm fascinated by, would do a really good job of fooling a computer in one way. But if you were doing, say, a term-frequency approach, it would not fool the computer in that way. For example, your ratio of high-frequency function words, "the", "of", and those types of things: that's what we call latent style. Almost no matter how hard you try, you're going to have those term frequencies that betray your time period, or your genre, or, most often, just your authorial identity. That's a field I've done a fair amount of work in, authorship attribution, and I find it very fascinating, because we have these very deliberate ways we mask our style or try to pretend to be somebody else, and then the computer can measure something else that's very hard to control. So I think that's an interesting interplay.

[In response to an audience question.] Yes, so right now term frequency is by far the most dominant feature set, because it's the most resilient against OCR errors, for example, and also because of copyright issues: you can put the term-frequency tables for all these novels on GitHub, but you can't necessarily put up all the novels in word order. The competing model might be something like a Markov model over term order, some kind of deep learning with that; I haven't done it, but I have heard of other people doing it. And then of course you could take n-gram features. My colleague Ted, I corresponded with him about this, and I was actually able to send him the GitHub repository with the Jupyter notebooks to collaborate and get his feedback, which I think is awesome. But one thing he mentioned to me, for anybody who does machine learning, is that n-grams, bigrams or trigrams, usually don't actually work as well as single tokens, because there just aren't as many repeats across the documents.
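If you want to see why, here's a quick sketch with scikit-learn's CountVectorizer, using toy texts, purely for illustration:

```
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Call me Ishmael.", "It was a dark and stormy night."]  # stand-ins

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(texts)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(texts)

# bigram vocabularies are far sparser: most bigrams never repeat across texts
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))
```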
So apparently that just typically does not work as well, but those are options. And then, okay, anybody who does machine learning knows that everything is just vectors, right? So you could do this with, say, fonts: you could train a model on typefaces. I don't know that anybody has done it, but you could in theory, and that would be kind of interesting.

[In response to an audience question, roughly: has anyone tried this on translations, to see whether a translation reads as out of its time or as contemporary with the original?] Again, I don't know; I don't think so, but it sounds awesome. I think that would be great. What do you all think the hypothesis would be? I would say a translator inevitably echoes the language of their time without even realizing it; that's my hypothesis. Yes, right, just like interpretation. So, I'm out of time; if there's one last comment, we can get it in, but go ahead. Oh yeah, it's actually a JSON tree. Basically, the entire IPython notebook is just a JSON file, so the diff would just look like what you get when you change any JSON. Yeah, thanks for that. Thank you, everyone.