one of the most important scholars leading the field at the intersection of digital humanities and theater and performance studies. He has his PhD from Stanford and I believe did his undergrad in English at Yale, right? So he's really keeping it at the lesser institutions. His work is really quite interesting: the digital humanities has been a kind of methodology for him, but his emphasis is predominantly in the 19th century, and he's worked on the intersections of art and industry, particularly looking at law and economics. His forthcoming book is Copyright and the Value of Performance, 1770 to 1911. And I believe that's with Cambridge University Press. Can we pre-order it? Is it out already? You can pre-order it, I think, yeah. And we were just discussing yesterday afternoon or last night that it is actually going to be listed with a law Library of Congress number. So if it does not immediately show up in your theater collection, fret not. It might be in your local law library, so you can go in and collect it there. He does a lot of work on Victorian theater, but the work that is, I think, most interesting to many of us is the outstanding article award that he received last year for his article "Average Broadway," which appeared in Theatre Journal in 2016, talks about the value of big data in the analysis of theater history, and is the topic of his talk today. So without further ado, welcome Derek Miller. Thank you so much, Sarah. This here is a program from a concert by the New York Philharmonic in 1974. Five years before this concert, in 1969, the decade-long music directorship of American whiz kid Leonard Bernstein had ended. Bernstein had done much to revitalize the Philharmonic, which was founded in 1842 and performed Beethoven symphonies, Mozart concerti, and Tchaikovsky ballet suites for an audience of primarily wealthy, elderly New Yorkers. 
Bernstein was not, however, in tune with the cutting edge of modern music, nor with the musical lineage that inspired the contemporary musical avant-garde. Given the orchestra's seemingly hardwired aversion to modernist music and its progeny, the conductor chosen to replace Bernstein surprised many observers: Pierre Boulez, who came to the music world's attention as the enfant terrible of late modernist music. Born in France but resident in Germany since the 1960s, Boulez quickly developed a reputation as a superb composer and the leading intellectual of post-war modernist music. And by the way, after his post at the New York Philharmonic, he went on to run a research institute in computer music, IRCAM, in Paris. So there's a connection here with the technological history of performance in music. Both as a foreigner and as a founder and symbol of that apex of musical modernism known as total serialism, Boulez was as far from Bernstein spiritually and aesthetically as the Philharmonic could get. Yet in June 1969, the New York Philharmonic triumphantly announced a three-year contract with Boulez beginning in the 1971-72 season. He ultimately led the orchestra through six seasons, a tenure judged by many observers as an overall success. What, though, precisely did Boulez do at the Philharmonic? Well, I am going to attempt to answer that question today, but along the way, I'm gonna step back from this specific case of Boulez's tenure at the Philharmonic to describe some aspects of the process by which one studies the history of performance through data analysis. First, I do wanna thank David and Sarah for inviting me to participate in this wonderful program you have, and thanks too to Jason Woodworth-Ho and to Aaron McDermott for their extensive logistical efforts. 
I'm really honored to talk about data-driven research and to represent, in whatever small, weird way I do, the work of so many colleagues, present and past, who are using data to better understand the history of the performing arts. Two preparatory comments here. First, I expect you've read in some fashion my essay Average Broadway and Debra Caplan's piece on the Vilna Troupe. Though I'm talking today about the Philharmonic, I do welcome questions or comments about the Average Broadway essay, or about some of the things Debra says, during our conversation. Second, my examples focus on the Philharmonic, but then I'll show some other stuff that's almost entirely, in fact I think it is entirely, from Europe and the United States. I'll say a bit more later about the inequities and ethics of data-driven research, but my own limitations as a scholar are very much in play here, so please do make a mental note that much of what I say can be extended, with necessary adjustments, to non-European and non-American performance traditions, and where it can't be so extended, it's worth thinking about and discussing why. Finally, if you have some questions as we go along, shoot up a hand, I'll find a stopping point and try to answer. I have a lot of prose here, but it'd be nice if I didn't talk for the entire 50 minutes of text that I have. So take a moment to interrupt me. All right, let's go back to our orchestra program and Maestro Boulez. This particular concert was one of a famous series led by Boulez, the Rug Concerts, and it says it there in big letters at the top. Boulez initiated the Rug Concerts to change both the repertoire and the audience for the Philharmonic's work. They're among his most famous innovations as music director and have come to represent the impact he had in bringing contemporary music and contemporary audiences to the orchestra. 
But if we want to understand this Rug Concert's place in the Philharmonic's work, we need to look beyond this single event. How did Boulez make space for the late modernist music of which he was exemplar and champion, and for the predecessors in early modernism whose work the Philharmonic had played but little? How did the Philharmonic adapt to these pressures while also keeping faith with its most important supporters, its subscribers? And how did Boulez balance his admiration for a European avant-garde against the orchestra's great symbolic importance within the American musical scene? To answer these questions, we'll need more information, more data. So drawing from the New York Philharmonic's open-source database of their programs, I'm going to assess Boulez's impact on the orchestra by looking at the types of concerts that they played, the popularity of the composers Boulez performed, the familiarity of works, and composer nationalities. Along the way, I'll generalize from this example to show you how and why so-called big data might be useful for understanding the history of the performing arts. And I'll be doing that by exploring a few topics. First, I'll outline the phases of a digital humanities, or DH, research project. Second, I'll address what data might mean in the humanities. Third, I'll show you some other data projects, many digital but some analog, to demonstrate the range of this kind of research and to root what might seem a faddishly modern enterprise in an old and deep strain of performance historiography. And then finally, I'll turn to the specific problem of organizing performance data and so-called database ontologies for the performing arts. All of this, of course, leads up to our workshop after lunch, when we'll take a look at a pre-compiled data set and make a first rough attempt at our own data analyses. But let's think now about the shape of a DH project with the help of Maestro Boulez. 
So data-driven research begins, like most other research in the humanities, with a question. From there, we define an archive, a set of information in the world that will help us to answer our question. In the case of Boulez's tenure at the Philharmonic, my archive consists, well, theoretically consists of the entire record of Boulez's tenure at the Philharmonic and any other available information about the orchestra. So I'm talking concert reviews, newspaper and magazine profiles, biographies, pages upon pages of administrative documents in the Philharmonic's digital collection, and that's all in addition to the database that I mentioned earlier. I didn't go through all of this, obviously, but at least theoretically my archive is quite vast. Having defined an archive, one then transforms that archive from this messy collection of things in the world into an ordered collection of data, a term that I'll define more closely a little later. Let's call this organized collection of data a data set. And I want to distinguish the data set, even loosely defined, from the archive, because the data set is first and foremost curated from the archive. And secondly, because it is curated, it is itself an argument about the data. In the case of Boulez's tenure at the Philharmonic, I'm concerned primarily with what the orchestra played. So my data set is the set of concert programs from Boulez's tenure. By deciding to focus on concert programs, I am arguing implicitly that the music director's most important effects are best heard in what the orchestra plays, rather than, say, which musicians he hires as replacements when people retire. Now this argument could be wrong, but we have to be clear with ourselves and with our audience of readers and students that we're making these arguments. The process by which you include and exclude parts of the archive, whether by conscious decision or ignorant omission, determines the kinds of questions you can ask and the answers you can discover. 
You've begun making arguments about your archive in the process of selecting your data. Now at this point in a digital project, most people shape their data set into a formal database. The database further refines and delimits how the data in the data set are organized. That process of organization is yet another argument about the data. I'm gonna talk later about databases and the kinds of problems they pose for researchers. The solutions to these problems are arguments about data. And for the most part, my analysis of Boulez relies on a database built by the Philharmonic's own archivists, but I have adjusted it when I disagreed with their categories. That is, when I disagree with their arguments. Now, digital projects then often use some kind of code to analyze the database. Some researchers write bespoke scripts, as is the case for my work here. A script is a set of instructions for a computer. It's written in a language such as Python, which I use, or R. And these languages sort through the data and parse it in different ways. In other cases, as we'll be doing this afternoon, researchers rely on pre-written packages or on graphical applications, anything from Microsoft Excel to the software Gephi, which we'll be using this afternoon, and that software does the work for them. In either case, though, there's a code base, whether you wrote it or Microsoft wrote it, that's acting on the database so as to summarize and organize the data in a particular way. Again, the code is an argument. For example, say we want to see how important Stravinsky is to Boulez's programming. We can start by counting the number of works by Stravinsky that Boulez conducted. But do we want to count the number of unique works by Stravinsky, measuring the size of Boulez's Stravinsky repertoire? Okay, that's one piece of code. By the way, I'm the worst programmer on earth. Just flag that. Or maybe we want to count the total number of performances of Stravinsky that Boulez led. 
That's a different piece of code. But perhaps we'd do better to calculate what percentage of the works that Boulez conducted were by Stravinsky. Third piece of code. In all of these cases, we're parsing the information in the database slightly differently. Our arguments are in the code itself, in how the code interprets the information that we gave it. Now, what kinds of analysis are involved here? I tend to use rather simple quantitative methods, also known as counting. I work with trends over time and I focus on central tendencies and distributions, the kinds of things you learn in your first-semester statistics classes, and sometimes in the first few weeks of those classes: means, medians, modes. If you're analyzing texts, you might use techniques evolved from stylometry, such as word frequencies or type-token ratios, or you could use more mathematically sophisticated methods, topic modeling or word embedding. You might use geographic analysis and make maps. You could create networks and study the connections between individuals and among communities. Or you might combine any and all of these approaches, and dozens of others, to suit your particular object of study. At this point, it's time to share your results with the world, and digital projects usually summarize the results in some visual form, either a standalone image or an interactive interface. These images are arguments. The kinds of information you make available through an interface, or how you scale a graph, or what variables you present all alter how people understand your data. There's a book called How to Lie with Statistics, right? Visualization is its own art, about which I'll say nothing today because I'm bad at it. Suffice to say that producing a clear, legible summary of complex information without letting the visual effect overwhelm or overdramatize your argument is a very particular skill. And then finally, you put all of this work in the service of your argument. 
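Just to make those three Stravinsky measures concrete, here's a minimal sketch in Python, the language I mentioned using. The performance records and their layout here are hypothetical toy data, not the Philharmonic database's actual schema.

```python
# Hypothetical performance records: (conductor, composer, work).
performances = [
    ("Boulez", "Stravinsky", "Petrushka"),
    ("Boulez", "Stravinsky", "Petrushka"),
    ("Boulez", "Stravinsky", "The Rite of Spring"),
    ("Boulez", "Ravel", "Bolero"),
    ("Kostelanetz", "Stravinsky", "Firebird Suite"),
]

boulez = [p for p in performances if p[0] == "Boulez"]

# 1. Size of Boulez's Stravinsky repertoire: count unique works.
unique_works = {work for (_, composer, work) in boulez if composer == "Stravinsky"}

# 2. Total number of performances of Stravinsky that Boulez led.
total_perfs = sum(1 for (_, composer, _) in boulez if composer == "Stravinsky")

# 3. Stravinsky as a percentage of everything Boulez conducted.
pct = 100 * total_perfs / len(boulez)

print(len(unique_works), total_perfs, pct)  # 2 3 75.0
```

Same database, three different arguments: the repertoire measure treats repeat performances as one fact, the performance count treats each downbeat as a fact, and the percentage measures Stravinsky against everything else Boulez led.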
An argument that one hopes is of some interest to one's fellow scholars. So to summarize, then, the stages of a digital humanities project are these: conceive a research question, select an archive, create a data set, develop a database, analyze the database with code or a program, design interfaces and/or visualizations to summarize your analytic results, and use the analyses as evidence to support a larger argument. Each of these phases is, as I said, an argument in and of itself. And I should add that though I've presented these stages sequentially, of course, this entire process is iterative at any given moment. So you might be well into your coding only to find that you need to answer a question about material that's not in your database. Ta-da, back to the archive you go. And you'll also note that this isn't that different from any other kind of research project, except there's this particular set of tools and methods in the middle there. All right, let's return to Boulez and the Philharmonic and consider the kinds of concerts that the orchestra played under Boulez's baton. And please, while I'm talking about this, bear in mind the steps I've just outlined, which you can imagine my working through for each of these charts. So when he took over in 1971, the Philharmonic played a few basic types of concerts, which broke down like this during Bernstein's recently completed music directorship. Each concert type here is shaded a different color, and here's the same chart for Boulez's tenure. And the most obvious difference is the expansion of the other category, that yellow at the top, during Boulez's tenure. During Bernstein's tenure, the subscription, tour, stadium/parks, and promenade concerts make up 90% of the programming. That's the bottom four. This is a percentage bar chart. Can you move it? For Boulez, they make up around 85%. So there's that 5% change in what they're doing. 
The remaining 5% mostly fall into that other category, which includes Boulez's two famous innovations in concert formatting, the Prospective Encounters series and the Rug Concerts. Played always outside the Philharmonic's main hall and usually by smaller configurations of the orchestra, Prospective Encounters were Boulez's venue for hearing music written right now. All but three of the 58 works played at those concerts were Philharmonic premieres, and a number were world premieres. The Rug Concerts were programmatically more like subscription concerts, but they reconfigured the physical relationship between musicians and audience. Both formats received critical praise for their innovation, and they helped combat the orchestra's reputation as a stodgy old persons' club. But both series also had a very limited reach. The programs for the Prospective Encounters concerts and the Rug Concerts were performed only once. Subscription programs, by contrast, are performed four times. Together, the 17 Prospective Encounters and 30 Rug Concerts amounted to 4% of the concerts played during Boulez's directorship. By comparison, in the same period the Philharmonic played 90 promenade concerts, the summer light-classical series conducted by André Kostelanetz, which amounted to over seven and a half percent of the orchestra's performances. Interestingly, the physical layout of the Rug Concerts was inspired by the promenade concerts' reconfiguration of Philharmonic Hall. You can see people are sitting at tables in some of the spots of the hall. Prospective Encounters and Rug Concerts, which together account for almost the whole difference in Bernstein's and Boulez's programming, thus did not reshape what the Philharmonic was fundamentally doing. We can also look at how much Boulez himself conducted the orchestra, and in what venues. Here's a season-by-season bar chart of concerts, absolute numbers this time rather than percentages. 
And this time it's further divided into those concerts conducted by the music director, which is the bottom piece with the slight shading, and those conducted by other conductors, guest conductors. Three points here. First, you can see that Bernstein conducted a lot more concerts, particularly in his first seasons, than Boulez did for the most part. Second, Boulez did relatively little touring with the orchestra. Touring is in the dark blue, and it's much smaller than it was during Bernstein's tenure. Third, he led zero young people's concerts, which are in red. Those were the hallmark of Bernstein's tenure. They were televised; you can still see them. They're amazing. He did none of that. So statistically, the difference in Bernstein's and Boulez's approach to the orchestra looks like this. All of Boulez's growth and impact with the orchestra, that's the green at the bottom, was in the expansion of other concerts and his regular leadership of them. So you've got the percentage here, as the percent conducted by that conductor, and you've got the absolute numbers in parentheses there. Otherwise, he stepped back compared to Bernstein. In short, despite some new and successful concert formats, the Philharmonic under Boulez remained a subscription orchestra that ran a popular music series, summer free concerts, and tours with minimal disruption from its modernist leader. Meta-commentary. First, you'll note I've chosen to compare Boulez's music directorship to that of Leonard Bernstein. I did that because, to understand what Boulez did with the orchestra, I needed a baseline, and it made sense to make the baseline the previous tenure, to see where the orchestra was when he came in. To add Bernstein to my study, I had to expand my archive, and thus my data set and my database, from Boulez's tenure to Bernstein's. Second, notice that I used percentages in my discussion of concert types but counts in my discussion of how many concerts the music director conducted. 
In the latter case, I tried it first with percentages but then decided that total count, despite being harder to compare across seasons, was worth including because it gives you a proper sense of scale. Third, note that in the end I avoided a chart altogether and used a data table with some minor formatting to summarize differences. In each case I was trying to select the visual form that would best convey the argument I'm trying to make with the data. So that's the shape of the project's stages; you can see them there. I've been speaking about concert types, which is a specific category that's meaningful for the Philharmonic because each concert type implies different venues or audiences and therefore different programming. And when we think about analyzing humanities data, we have to attend really closely to what our variables mean, that is, to the real world that our data represent. Some other disciplines are better at ignoring that fact. Let's dwell for a moment on this concept of data, particularly within a humanities research project. When we talk about data in DH, what do we mean? How should we think about data? I recommend to you Lisa Gitelman's edited volume "Raw Data" Is an Oxymoron for a wonderful consideration of this question; my discussion draws on that book, the essence of which I'm gonna boil down to a few points. First, data are different from facts. Facts are ontological. They exist in the world. They are part of the endless stream of phenomena, of real things, that we in our work seek to understand. Facts are also by their nature true, because facts are. For example, it is a fact that Pierre Boulez conducted orchestras. It is a fact that he led the New York Philharmonic. It is a fact that the orchestra performed music by Igor Stravinsky. Data are not facts. Data are not ontological but rhetorical. Data are not necessarily true. False data: still data. 
For instance, some data points from the Rug Concert program might be: orchestra, New York Philharmonic; location, Lincoln Center; conductor, Peter Boulez. Even in this short list, you can see the rhetorical work that data do, and also data's relationship to truth. First, the rhetorical. It's factually true that the concert took place at Lincoln Center, but it's equally factually true that the concert was at Philharmonic Hall, a part of Lincoln Center. By listing the location as Lincoln Center rather than Philharmonic Hall, I've chosen one from among many equally true facts. I could also have written New York City, right? And each one of those carries a particular set of information about the orchestra's relationship to a larger cultural institution, or to a specific acoustic space, if I say Philharmonic Hall, or to an urban environment, if I say New York City. Second, I named Peter Boulez rather than Pierre Boulez as the conductor. It's still data in my data set that Peter Boulez was the conductor. It just happens not to be true. As my wonderful middle school math teacher, Vern Williams, used to say about calculators: garbage in, garbage out. So you can see already the work involved in turning facts into data. Gitelman actually suggests we shouldn't even refer to data, a word that derives of course from the Latin for something that is given, but rather to capta, because you're going out and taking things from the world; there's nothing given about data. Data are always already selected, always already rhetorical. Second point about data: disciplines do a lot of work in defining what counts as data. I was trying to think of a good example: audience reactions to a performance might not have counted as data prior to the advent of a reception-studies mentality. You wouldn't think to care about or record those. 
You have to first conceive of the rhetorical possibilities of a set of facts before recognizing that those facts themselves might be worth translating into data. And if you've ever done any work transcribing from an archive, like taking notes on your computer, all the stuff you leave out, that's your rhetorical work there, right? Everything you're saying, oh, that's not important, I don't need to note that, I don't need to note that. And then of course, you know, two years later, you're like, oh no. That's my experience, anyway. Third, data give you a really false sense of objectivity and innocence. Well, I'm just telling you what the data say. This is of course a statement of pure sophistry. It should be dismissed out of hand. Data are themselves already rhetorical, already arguments, and must be treated as such, and therefore anything they're telling us is also rhetoric. Fourth, data usually undergo multiple significant transformations. For instance, data often abstract a set of more concrete facts. In my example, Lincoln Center abstracts multiple true facts about location. Data are often also aggregated. By this I mean that data get placed into categories that eliminate the specific underlying fact that they represent. For example, in my analysis of concert types before, I had aggregated Prospective Encounters and Rug Concerts into an other category in those bar charts, just to make the visual image easier to take in. Furthermore, in DH projects, the data we analyze are not actually documents, videos, sounds, or images. They're digital representations of those things. Digitization transforms data into a particular digital form, that is, into bits storable on a computer, which further delimits and defines the object of study. 
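Those two transformations, abstraction and aggregation, can be sketched in a few lines of Python. The venue mappings and category names here are illustrative stand-ins of my own, not the Philharmonic's actual data.

```python
# Abstraction: collapse a concrete fact (the specific hall) into a
# broader, equally true one (the institution that contains it).
venue_abstraction = {
    "Philharmonic Hall": "Lincoln Center",
    "Alice Tully Hall": "Lincoln Center",
}

# Aggregation: fold the small concert types into an "Other" category,
# erasing the specific underlying type each record represents.
MAIN_TYPES = {"Subscription", "Tour", "Stadium/Parks", "Promenade"}

def aggregate(concert_type):
    return concert_type if concert_type in MAIN_TYPES else "Other"

print(venue_abstraction["Philharmonic Hall"])  # Lincoln Center
print(aggregate("Prospective Encounters"))     # Other
print(aggregate("Subscription"))               # Subscription
```

Notice that both moves are lossy by design: once a record says "Lincoln Center" or "Other," the hall and the series it came from are gone unless you go back to the archive.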
So to summarize that: data are selected from among facts; not ontological but rhetorical; not necessarily true; defined by disciplines; likely to give a false sense of objectivity; and usually transformed by abstraction, aggregation, and/or digitization. Back to Boulez. And again, think about how my discussion of data might play into this specific analysis. Given Boulez's limited influence on how and for whom the Philharmonic played, his influence might instead be audible in what the orchestra played. Let's consider, then, Boulez's repertoire. One of the major expectations when Boulez took over the orchestra was that he would bring more variety to the orchestra's programs. But Boulez quickly earned a reputation for being too innovative, with concerts that inspired walkouts and angry letters from subscribers, which are all in the digital archive. You can go read some of them. The orchestra's annual report at the end of Boulez's premiere season argued that this was just a PR problem, not a repertoire problem. I quote: the very favorable press reaction to Mr. Boulez's programming perhaps overemphasized the innovative aspects of the season and produced worry on the part of some subscribers. The actual programs themselves did not warrant the concern generated by the press. So who's right? Did the Philharmonic become a rag-and-bone shop of musical curiosities? Or did it remain a museum of romantic musical greatness? Let's look first at composer popularity, with help from a visualization inspired by the work of John and Kate Mueller. This is a graph of the cumulative frequency of composers' works during Boulez's music directorship. Each time a conductor gave a downbeat for a work by Mozart, I added a tally to the Mozart column. Then I divided each composer's total by the sum of all the composers' tallies. And then I stacked them up, from most performed to least. At the bottom is the most popular composer, Stravinsky, accounting for around 7% of works performed. 
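The tally-divide-and-stack procedure I just described reduces to a few lines of code. These tallies are toy numbers of mine, not the Philharmonic's real counts.

```python
from collections import Counter

# Toy tallies: one count per downbeat of a work by each composer.
tallies = Counter({"Stravinsky": 70, "Beethoven": 50, "Ravel": 40,
                   "Tchaikovsky": 40, "Mozart": 30, "Haydn": 20})

total = sum(tallies.values())

# Convert each count to a share of all performances, then accumulate
# from the most performed composer downward.
cumulative = 0.0
for composer, count in tallies.most_common():
    share = count / total
    cumulative += share
    print(f"{composer:12s} {share:5.1%}  cumulative {cumulative:6.1%}")
```

Stacking those shares vertically, most performed at the bottom, gives the cumulative-frequency bar: you read off how few composers it takes to reach a quarter, a half, and so on of everything played.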
Next comes Beethoven at 5%, then Ravel and Tchaikovsky at 4% each. I think Mozart should be down there, too; something went wrong with my chart. Cumulatively, with a mere five composers, we can account for a full quarter of all the works that the Philharmonic played from the 1971 through 1976 seasons. Something went weird with my chart; I'll have to look at that later, and I'll fix it before it gets posted online. But you can see the gist here, right? We've got a small group of composers that forms over a quarter of the whole. It's not that wrong. Compare the same graph for Bernstein's tenure. That's definitely right. And here, seven composers account for a quarter of the works performed. So we can say immediately that Boulez, whatever else he's doing, shifting the balance among some composers or what have you, is not reversing the inequities of the symphony orchestra model. A few composers remained heavily represented, with a large mass of composers performed at only a single concert or concert series. Let's get a better look at the composers that Boulez privileged by re-sorting these data and plotting each composer individually. The X-axis represents Bernstein's tenure and the Y-axis Boulez's tenure. Because an orchestra formulates repertoire most directly through its subscription programming, I have included only subscription concerts for this graph. And for visual clarity, I've plotted everything on a log scale and included only composers that measured at 1% frequency or higher. So you have to imagine a thousand names clustered right here. Well, 500. Now, the line represents where both conductors performed a composer's works at an equal rate. Near that line, we find names such as Schumann and Tchaikovsky. Some composers, though, were much more associated with one director or the other. For example, Beethoven and Mozart, although they're both in the top echelon under both directors, swap places, right? 
So you can see that Beethoven's definitely not as high for Boulez's tenure. Bernstein's directorship favored Brahms, Mahler, Berlioz, and Sibelius, along with Shostakovich and the Americans, Barber and Ives. Interestingly, Ives was the composer that Boulez conducted the most at non-subscription concerts. Boulez's subscription programs featured Stravinsky, Haydn, Ravel, and Bartok. Subscribers heard more Liszt and Weber. We can understand Boulez's own tastes a little better, though, if we divide his tenure into the subscription concerts that he conducted and those conducted by others. So here's that same graph, but this time it's all within Boulez's music directorship: Boulez conducting on the X-axis and others conducting on the Y-axis. Note first, Tchaikovsky. Boulez literally never conducted Tchaikovsky. So Tchaikovsky's continued presence in the repertoire shows us where the orchestra, despite the preferences of its music director, carried on doing what pleased its audience. Weber and Bruckner, too, were entirely creatures of other conductors. Well, whom did Boulez conduct? We find some romantic names on his side of the line: Schumann, Wagner, Mahler. Boulez conducted a surprising amount of Bach for subscribers. What has gone on with my chart? That'll teach me to try to remake it ever again. Among the core Boulez composers, notice too the so-called Second Viennese School. The Second Viennese School, that's Arnold Schoenberg, Alban Berg, and Anton Webern, are important because they, and particularly their serial compositions, represent the origins of the modernist music that the New York Philharmonic mostly avoided. Boulez had come to New York to change that. He told journalist Joan Peyser in a New York Times interview, the job of a conductor is to bring an audience to realize it's as important to hear Berg as to hear Mahler. And he did conduct Berg about as often as he conducted Mahler. 
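The computation behind these scatter plots, each composer's share of one group's concerts against his share of the other's, can be sketched like this, with invented toy records rather than the real subscription programs:

```python
from collections import Counter

# Toy performance records: (conductor_group, composer).
records = [
    ("Boulez", "Berg"), ("Boulez", "Mahler"), ("Boulez", "Berg"),
    ("Boulez", "Mahler"), ("Boulez", "Stravinsky"),
    ("Others", "Tchaikovsky"), ("Others", "Tchaikovsky"),
    ("Others", "Bruckner"), ("Others", "Stravinsky"),
]

def freqs(group):
    """Each composer's share of the given group's performances."""
    counts = Counter(c for g, c in records if g == group)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

boulez, others = freqs("Boulez"), freqs("Others")

# Plotting boulez[c] against others[c] for each composer c gives the
# scatter; points off the diagonal mark one group's favorites.
for composer in sorted(set(boulez) | set(others)):
    print(composer, boulez.get(composer, 0.0), others.get(composer, 0.0))
```

In this toy set, Berg and Mahler appear at equal rates under "Boulez," echoing the remark to Peyser, while Tchaikovsky shows up only under the other conductors, so he sits on one axis of the plot.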
But the most strongly associated Boulez composers are not those of the Second Viennese School. Rather, his key composers on the subscription programs are Ravel, Stravinsky, and Bartok, each modern in his own way, but none as aurally foreign as the Austrians with their tone rows. So to the extent, then, that Boulez freshened the repertoire, he did so not primarily with the core trio of modernism, the Second Viennese School, whose musical logic undergirded Boulez's own wing of the contemporary musical field, but with a far more palatable trio of composers. The annual report, then, was not wrong. Boulez's subscription programs were far less innovative than his reputation suggested. Okay, again, a meta-commentary. First, you can see how disciplines define data in this section's focus on the musical work. Music's work ontology exists for many reasons, but I've accepted it entirely in the preceding analysis. Second, I don't account here at all for the duration of different works. It should be very relevant that some works take much longer to play, and to rehearse, than others, but I didn't deal with that at all in this analysis and instead treated every work as an equivalent entity. So you can see how the data here, though apparently neutral facts, are much, much messier than my analysis shows. Okay, that's a lot about Pierre Boulez and the New York Philharmonic. I know that later this week you'll spend some time brainstorming your own projects, and to spur you a little bit, I wanna show you some extant projects and data sets, some analog, some digital, to give you a sense of the range of things that people have used as data. Consider first analog lists of plays. You all know, of course, about the Lord Chamberlain's licensing of plays for performance in Britain, and there's a new project, Great War Theatre, that made a database of all the plays licensed in England during World War I. Government records like that are superb information repositories. 
There's actually an amazing two-volume set of all plays filed for copyright in the US between 1870 and 1916, and it's online for free on Archive.org and Google Books and all that stuff. So I've been trying to work with that a little bit. I'm fascinated right now with lists of plays, and you will find lists of plays all over the place: publishers' catalogs, library collection recommendations, company season lists, and other surprising sources. Of course, playbills and newspaper advertisements are fantastic resources. Here is The London Stage, a massive 11-volume project, Sarah mentioned it earlier, from the 1960s, that aimed to represent a complete daily calendar of theatrical performance in London from the restoration of Charles II to 1800. This was, as Sarah said, actually translated into a database in the 1970s, a very problematic but early DH project in performance history. Mattie Burkert at Utah State University is currently revising and revisiting that work to make a new, modern, and more correct digital London Stage. This afternoon we're gonna work from Theatre in Dublin, which is a similar performance calendar, but it was made entirely, before it went to the publishers, by this one guy, John C. Greene, over, I think, 30 years, he was working on it for 30 years. And it was published a few years ago. And then, of course, there are narrative collections of performance history, such as T. Allston Brown's History of the New York Stage. That's full of data, it's just written out in sentences. Now modern research often creates digital versions of very similar information. For example, the Internet Broadway Database and Playbill Vault, which I use in my Broadway research, record every production on Broadway. They're not, however, a daily register of performances. For example, their coverage of replacements and understudies is generally weak, so you can't trace casts past opening night.
You can extrapolate a great deal of information, though, about what's happening on Broadway over time, as I argue in "Average Broadway." The Lucille Lortel Foundation runs a similar archive for Off-Broadway. You've seen a bit today of my use of the New York Philharmonic's performance database, based on their own program archive. A similarly extensive in-house archive at the Comédie-Française undergirds the Comédie-Française Registers Project, or CFRP. The CFRP is built not on program archives but on nightly receipt registers, a state- and company-mandated economic record series that over time includes increasingly more information about performances. In the industrial era one is even more likely to find in-house economic data, and there's a bit of a boom in economic theater history in the 1960s and 70s, where you'll find an outpouring of articles, books, and dissertations tallying productions and interrogating the theater market's operations, particularly in the United States. And I definitely see my own research very much within that trend. Geography is another great subject. Some of Debra Caplan's research on the Vilna Troupe includes geographic data. Kate Elswit and Harmony Bench have been exploring dance company touring routes. And Frank Hildy's Theatre Finder is unfortunately now in some disarray, but it was an early and exceptional online resource. Play texts themselves are collected in all kinds of formats. Many pre-1920s plays are online for free in some form or another, though their editions might make them less than entirely useful for analysis. One exception, there are others, is the Folger Library's free online texts of Shakespeare, which are fully marked up, and I mean fully. Every word and character, including the spaces between words, is its own XML tag in there. They also have a growing series of other early modern plays that they have marked up as well. Otherwise, you can dive into Project Gutenberg or HathiTrust, download the OCR text, fix it, mark it up.
You're gonna learn more about this later this week with Amy Hughes, so get to it. Government records are fantastic, as I said. Taxes, censuses, building permits, lawsuits. Individuals or companies might keep daily, weekly, or yearly financial records or diaries. There may exist probate records that show you what books an actress owned upon her death, or wardrobe inventories. Anyway, data are everywhere, and gathering that data to understand performance history is a long and, I think, noble tradition of performance scholarship. One reason that I think it's worth thinking so capaciously about what counts as evidence for theater and performance history when we're thinking about data is that while some data are attempts merely to translate a fact in the world, such as a list of works performed at a Philharmonic concert, other data are implicit within the database. It's why I always get excited when I see theater data of any kind. I know that if I just think about it long enough, the data set has something to teach me about performance, and such extrapolated or implied data can be another kind of evidence. So let's consider the familiarity or unfamiliarity of Boulez's repertoire. The Philharmonic's database does not include any information about when a given work was last performed by the orchestra, but you can calculate it from the whole database. In later interviews about his conducting, Boulez noted that even when conducting familiar composers he always aimed to play unfamiliar works. When I arrived in New York, he told one interviewer, I had them check back through several decades in the standard repertoire to find out what had not been played. He found, he said, enormous gaps. A fresher repertoire would thus feature not only a different coterie of composers, which we saw to some extent, but also unusual works by familiar composers. So when were the works that Boulez conducted last heard at the Philharmonic?
Let's start with a box-and-whisker plot, which gives the distribution of the number of years since a work was last performed in each of Bernstein's and Boulez's subscription seasons. I've left premieres off here because everything then would be a weird infinity or zero. You can see at least two things, I think, clearly here. First, Bernstein's most repetitive year was 1958, in which a full quarter of the subscription works, and that's the bottom line here, the bottom of the box represents 25%, so 25% of the things he played, had been performed the previous season. Boulez always did better than that, and sometimes significantly better. Works at Boulez's concerts were a year or two older overall than those played by Bernstein. Second, his mean is significantly higher than Bernstein's, that's the small black box, particularly in his first three seasons. So Boulez's programs, in other words, contained fewer works heard very recently and a larger set of works not heard in a long time. Interestingly, however, Boulez's own conducting was not as big a factor in this unfamiliarity as one might think. Here's the box plot for just Boulez's seasons, with concerts he conducted in red and other conductors in black. Only in 1972 does Boulez's own repertoire clearly beat other conductors' in terms of its novelty. In other seasons he actually leads a more familiar repertoire among non-premieres than do guest artists. Now some of that is certainly due to Boulez's repeating his own repertoire. In his second season he repeated no works from his first season. But in his final season, a full 23% of the works he conducted for subscribers, those subscribers had already heard under his baton. We can add premieres, defined broadly as any work, new or old, never before played by the Philharmonic, to our observations, and this time we'll include all concerts. Here's a count of premieres as a percent of total works performed during Bernstein's and Boulez's tenures.
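The recency calculation described here can be reconstructed from the performance list alone, even though no single record stores it. A minimal sketch, assuming a hypothetical list of (work, date) records rather than the Philharmonic's actual data:

```python
from datetime import date

# Hypothetical performance records: (work, performance date).
performances = [
    ("Ives: Symphony No. 2", date(1958, 10, 2)),
    ("Ives: Symphony No. 2", date(1972, 3, 9)),
    ("Berg: Violin Concerto", date(1972, 11, 16)),
    ("Ives: Symphony No. 2", date(1975, 1, 23)),
]

# Sort chronologically, then track each work's previous performance.
performances.sort(key=lambda rec: rec[1])
last_seen = {}
gaps = []  # (work, date, years since last performance; None = premiere)
for work, when in performances:
    prev = last_seen.get(work)
    years = None if prev is None else (when - prev).days / 365.25
    gaps.append((work, when, years))
    last_seen[work] = when

for work, when, years in gaps:
    label = "premiere" if years is None else f"{years:.1f} years"
    print(when, work, label)
```

With per-season grouping added, these per-performance gaps are exactly the distributions a box-and-whisker plot summarizes.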
As you can see, although Boulez's final season was the orchestra's, excuse me, most novel since 1958, his overall numbers don't differ significantly from Bernstein's. Again, though, we can refine this data to look only at the music directors, and here you see that Boulez had three seasons in which over a third of the works he conducted were New York Philharmonic premieres. Bernstein never reached that level. Indeed, Boulez's minimum premiere percent was around Bernstein's maximum. Boulez's subscription concerts, then, absolutely programmed more unfamiliar work than Bernstein's, and he premiered more work than his predecessor. But Boulez himself was not the sole or, in some seasons, even the primary conductor of such unfamiliar work. So I hope you can see from this example the way in which the dataset implies other calculable information about performance that goes well beyond the bare information in the dataset itself. Excuse me. I wanna say a few words now about how to organize data. When you take a dataset and transform it into a database, you must decide how to organize that data. The structure of a database is also known as the database's ontology, a great word for performance studies, of course: how does the database define categories and entities within it? Whoa, whoa, whoa, what's a database? A database is a structured collection of data. It is, as Lev Manovich and N. Katherine Hayles have described, a spatial or paradigmatic form, as opposed to narrative, which is temporal or syntagmatic. Every datum within a database is in some sense equal, and they're all present simultaneously. Databases aren't in the process of unfolding; databases are, and in this sense they acquire and exude a kind of facticity. You can use databases to support a narrative, but databases do not in and of themselves imply a narrative. Furthermore, you only ever encounter a database by means of interfaces and algorithms.
Interfaces, how users interact with a database, and algorithms, the way in which the interface queries and retrieves data, remind us that accessing a database is also not a neutral act. So as with the stages of a DH project I discussed a little while ago, interfaces and algorithms are themselves arguments that reframe and redefine the information in the database. Okay, a few words on the kinds of databases out there. If you use a computer at all, you interact with what looks like a hierarchical database, also known as your file system. You have folders on your computer, folders called Documents or Music. And then within the Music folder, you might have folders for each musician, Beyonce, David Bowie. And then within each musician's folder, you have album folders, Lemonade, Ziggy Stardust. And then within each of those folders, you have the atomic entities, the MP3 tracks, each with an order in which they are to be played. File hierarchy. By the way, your file system is not actually hierarchical in the computer's operations; it just looks like one on your desktop. You can organize other information that way too. Let's look at our Philharmonic program and think for a moment about what the categories should be in our database. How might we structure this information in a useful way? So if we were to write up this information in an organized fashion for a hierarchy, we could do something like this. The program consists of basic concert information first, orchestra, performance number, season, music director, et cetera. And then we have a list of all the works performed, in order. Pretty straightforward, yes? And we can make this outline for every other Philharmonic program, filling in each line as we go with its appropriate answer. That's actually more or less the way the Philharmonic database is built. It's a flat XML file. Conductor, Leonard Bernstein; work, Second Symphony by Charles Ives; et cetera, et cetera.
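A flat XML file of this kind can be read into one-row-per-work records with a few lines of standard-library code. A minimal sketch; the element and attribute names below are invented for illustration, not the Philharmonic archive's actual schema:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a flat program file; the element names here are
# hypothetical, not the real archive's schema.
xml_text = """
<programs>
  <program>
    <season>1958-59</season>
    <conductor>Leonard Bernstein</conductor>
    <work composer="Charles Ives">Symphony No. 2</work>
    <work composer="Ludwig van Beethoven">Symphony No. 5</work>
  </program>
</programs>
"""

root = ET.fromstring(xml_text)
records = []
for program in root.findall("program"):
    season = program.findtext("season")
    conductor = program.findtext("conductor")
    for work in program.findall("work"):
        # One row per work performed: the flat file repeats the
        # concert-level fields for every work, exactly the redundancy
        # a hierarchical layout produces.
        records.append((season, conductor, work.get("composer"), work.text))

print(records)
```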
So this outline, then, the form of this outline, becomes our ontology, the form of the database into which we're gonna pour the information from all of our programs. Of course, however, there are a billion problems with this ontology and also with a hierarchical database. First of all, hierarchical databases are insanely redundant. Every single concert would have every single piece of this information. Secondly, it's pretty hard to search for a given entity. How do you know how many times Stravinsky's music was performed? You have to go look at every single program and check. This is the place where the technological performance of the computer meets the ontologies of the history of performance we're describing. Today, there are a lot of different types of databases that avoid some of those problems in hierarchical databases. There are document databases, which store structured entities; graph databases, which record entities and their relationships to each other; and relational databases, which I primarily use in my own research. I don't want to get into the nitty gritty of these forms, not least because I don't understand them all. But I really want to note that each of them suits different kinds of data better or worse than others, and makes certain ways of thinking about that data easier or harder. That is, these different database forms require and imply different ontologies, which might represent the world better or worse, depending on which corner of the world you are aiming to represent. Of course, though, once we start classifying things, we run into serious problems about our classification. So let's consider the nationality of the composers whose music Boulez conducted. The New York Philharmonic was and is a major voice for American classical music.
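The contrast with the hierarchical outline can be made concrete with a tiny relational sketch. All table and column names here are my own invention, and the rows are placeholders, but the point stands: the "how many times was Stravinsky performed?" question becomes a single query instead of a scan of every program:

```python
import sqlite3

# A minimal relational sketch (schema invented for illustration):
# concerts and works live in separate tables joined by a foreign key,
# so concert-level facts are stored only once.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE concert (id INTEGER PRIMARY KEY, season TEXT, conductor TEXT);
CREATE TABLE work (
    id INTEGER PRIMARY KEY,
    concert_id INTEGER REFERENCES concert(id),
    composer TEXT,
    title TEXT
);
""")
db.executemany("INSERT INTO concert VALUES (?, ?, ?)", [
    (1, "1971-72", "Pierre Boulez"),
    (2, "1971-72", "Guest Conductor"),
])
db.executemany("INSERT INTO work VALUES (?, ?, ?, ?)", [
    (1, 1, "Igor Stravinsky", "Petrushka"),
    (2, 1, "Maurice Ravel", "Daphnis et Chloe"),
    (3, 2, "Igor Stravinsky", "The Rite of Spring"),
])

# One query, no per-program scanning.
(count,) = db.execute(
    "SELECT COUNT(*) FROM work WHERE composer = ?", ("Igor Stravinsky",)
).fetchone()
print(count)  # 2
```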
Bernstein, as the orchestra's first American music director, claimed a natural affinity with his countrymen's music: around 29% of the 500 works he conducted at the Philharmonic were composed by an American. The avant-garde that Boulez admired, however, was almost uniformly not American. He kind of hated the American avant-garde, the contemporary composing scene. Boulez disclaimed any interest in worrying, though, about a composer's nationality. I am always fighting the nationalistic point of view, he told the Times, but it's hard not to hear a plea against national favoritism as an excuse to conduct modern European music. However, ignoring American composers wasn't an option for Boulez at the Philharmonic, and he actually personally conducted 84 works by American composers. That comes to 25% of the works he conducted, which almost keeps pace with Bernstein's numbers. So here's a breakdown of composer nationalities during Boulez's tenure. And this is broken down by concert types. You can see one thing, another meta comment. You can imagine, I made every chart I made for all the concerts, just subscription concerts, concerts with Boulez conducting, concerts with other people conducting, concerts with everyone conducting, iterating through those options for everything to see what gave me any information. Digital analysis allows you to do that by changing a few lines of code. If I were doing this by hand, I would just make a choice, and that would be the end of it, right? So here's the breakdown of composer nationalities. The numbers are generally comparable to Bernstein's, except when we compare the subscription concerts conducted by the two men, where we see clearly that Boulez preferred the German and Austrian, that's the bottom, along with the French repertoire, to the American, which is the dark blue. So he shrinks the dark blue there, right? We've got another problem, though. Boulez's favorite American composer? That's Igor Stravinsky.
Stravinsky was born in Russia. He worked in Switzerland. He lived for two decades in Paris, and then he became an American citizen after leaving Europe in 1939. If we recalculate Boulez's subscription programs and assign Stravinsky to Russia, his native land, American-born composers drop down to fourth place in Boulez's hierarchy. Moreover, the Stravinsky that Boulez lionized at the Philharmonic was almost strictly not the American Stravinsky; rather, Boulez focused on the European Stravinsky of The Rite of Spring and Petrushka. During Boulez's tenure, other conductors actually ventured farther into Stravinsky's repertoire, including a lot of the works composed while he was living in the United States. Boulez only conducted two works Stravinsky composed in the U.S., and one of them was a revised version of a work from 1920. So the Stravinsky of Boulez's New York Philharmonic programs may have been literally an American citizen, but he was musically Russian and French. The question of Stravinsky's nationality is just one example of the kinds of questions that arise from any given archive when you create a database. Cumulatively, these are questions about how to describe and define the world in categorizable ways. Geoffrey C. Bowker and Susan Leigh Star have written this wonderful book, Sorting Things Out, which discusses the stakes of classification. A classification, of which the database ontology is one example, segments the world in space and time. It creates unique categories that are mutually exclusive, and it aims to be a complete system. And then the ontology you make becomes a kind of standard, a set of rules about how objects should be sorted, rules that must be enforced and that have their own kind of inertia. Of course, the problem is that the ontology represents in an orderly fashion a disorderly world. It's impossible to represent reality precisely, completely, and usefully. Organizing information requires in some way eliminating information, trimming information down to its essentials.
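The mechanics of that recalculation are simple, which is part of the point: a single classification decision reshuffles the whole ranking. A sketch with invented counts, not the actual Philharmonic numbers:

```python
from collections import Counter

# Hypothetical work counts per composer, purely for illustration.
works_by_composer = {
    "Ravel": 30, "Stravinsky": 28, "Haydn": 20, "Copland": 12, "Ives": 10,
}
nationality = {
    "Ravel": "French", "Haydn": "Austrian",
    "Copland": "American", "Ives": "American",
}

def totals(stravinsky_is: str) -> Counter:
    """Tally works by nationality under one classification decision."""
    tally = Counter()
    for composer, n in works_by_composer.items():
        nat = stravinsky_is if composer == "Stravinsky" else nationality[composer]
        tally[nat] += n
    return tally

print(totals("American"))  # Stravinsky's works count as American
print(totals("Russian"))   # the same works now file under Russia
```

Nothing about the performances changed between the two calls; only the ontology did.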
So should Stravinsky be Russian or American? Should we catalog where a work was composed? Such questions also risk becoming ethical and political. Consider how we might record gender or race or, as in the Stravinsky example, nationality. Classification of any kind, including creating a database ontology, constrains the representation of reality while also tending to reify the reality it does represent. Therefore, we must remain hyper-vigilant about seeing the reality behind our classifications and remain particularly sensitive to our system's exclusions. Excuse me. How and why should we bother with all this data analysis? What can such an approach help us understand? Fundamentally, I am interested in and excited by this work because working with data like this helps us to see context and change at a scale that clarifies the actual movement of cultural history. And that movement is far slower and more conservative than our histories, which are filled with anecdotes and exceptional inflection points, usually convey. One cannot, however, simply write a data-driven history in isolation. In my example today, I've given you a brief overview of some of what the data tell us about Pierre Boulez's attempts to modernize the New York Philharmonic, but I have left out many important influences on Boulez's tenure, influences that will not show up in a database of concert programs. These include Boulez's concurrent music directorship of the BBC Symphony Orchestra, a musicians' strike in 1973, the renovation of Avery Fisher Hall in 1976, the same year as the American Bicentennial, which affected their programming, festivals devoted to contemporary music and to composers such as Charles Ives, and a shrinking subscriber base filled in by non-subscriber purchases. And as I said earlier, the data I described don't account for factors such as the duration of various works, or the fact that some composers simply wrote fewer works, so the orchestra had fewer to perform.
Nevertheless, the picture that we do have is not, I think, entirely without merit. So what can we say about Boulez's time leading the New York Philharmonic? On the one hand, he definitely tried to build a new musical culture, particularly around the Prospective Encounters and Rug Concerts, and his subscription programming, particularly under his own baton, emphasized a different set of composers than those his predecessor had stressed. The specific modernist heritage that Boulez famously championed, however, did not loom as large in his programming as one might imagine. If we look at a network of music directors and their subscription programs from 1911 through the 1976 seasons, we find him comfortably nestled among his peers. So the blue nodes are music directors, and they're connected to the gold, which are the composers, by how many times they conducted that composer's work, or rather the percentage of times. And Boulez's network is in red. And you can see every music director has his group of weird composers that he was into promoting, as did Boulez, but he still sits amid most of that central core. He's not sitting out here in la-la land, right? So I've spoken today in somewhat broad but also rapid fashion about a range of topics. I outlined the shape of a DH research project. I emphasized the way in which each stage is itself an argument. I then addressed how we in the humanities should think about data, and then I turned to examples of data sets, indicating where you might look for data. I discussed database ontologies and how translating a data set into a searchable database can bring us to reexamine the categories that we're using. Three quick topics before I close: the inequities of DH work, the ethics of DH work, and how to begin a DH project. I mentioned at the outset that many of my examples are American and European drama. This is, as I said, very much a function of my limitations as a scholar.
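A bipartite network like the one described, directors on one side, composers on the other, edges weighted by programming counts, can be sketched without any graph library. The counts below are invented placeholders, not the real subscription data:

```python
from collections import defaultdict

# Invented placeholder counts: how often each director programmed
# each composer (edge weights in the bipartite network).
programming = {
    "Bernstein": {"Mahler": 40, "Tchaikovsky": 25, "Ives": 15, "Ravel": 5},
    "Boulez":    {"Stravinsky": 35, "Ravel": 30, "Mahler": 12, "Berg": 12},
}

# Build an undirected adjacency list; a graph library such as networkx
# would do the same thing with less code.
adjacency = defaultdict(dict)
for director, composers in programming.items():
    for composer, count in composers.items():
        adjacency[director][composer] = count
        adjacency[composer][director] = count

# Composers linked to more than one director form the shared central
# core; composers with a single edge are one director's enthusiasms.
shared = sorted(c for c, links in adjacency.items()
                if c not in programming and len(links) > 1)
print(shared)  # ['Mahler', 'Ravel']
```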
However, it's also true that data-driven performance history favors a certain kind of performance tradition, one that has institutions or organizations that will create and archive data. Data-driven history is easier to do and better suited to studying archives than repertoires, to use Diana Taylor's dyad. This is not to say, however, that performance traditions without a written archive are impossible to study with DH methods. But the methods are really hard even with good archives, so creating an archive from scratch multiplies your difficulties. By all means, though, you should go and reimagine how to do DH work to study a performance tradition that's more repertoire than archive. But you should also recognize that peers working with the same methods will be having a far easier time of it, and they might be ill-equipped to understand the challenges that you're facing. Second, the ethics of DH research. I want first to underline something I mentioned above about the ethics of classification. Every system of classification codifies power relationships. It defines the world, and in so doing makes present the things that we name and erases the things that we don't name. Of course, we all know that the politics of visibility carries its own risks. So I don't really have anything to prescribe about classification other than to say that we have to be open and honest about the limits of our classifications and about the power that classification grants. The second ethical issue touches on the collaborative nature of DH work. As DH research goes, I've become fairly independent. That's my nice way of saying I'm really controlling. I write my own code, for example, which, while not entirely unusual, is also probably not the norm. I've also relied on part-time research labor to develop datasets. And in the early phases of my work, I did draw extensively on advice from librarians and DH specialists on my campus and elsewhere.
DH research, I think, calls attention to the highly collaborative nature of all of our scholarship, and how best to honor those collaborations and to give and receive proper credit remains challenging for a set of methods that still feels strange to most humanists. This afternoon, I'm gonna talk you through work with two datasets curated from the Theatre in Dublin project. We'll do some basic quantitative analysis on one dataset, and then you're gonna make a network graph, like that one I just showed you, of another dataset. Perhaps you'll actually learn some skills, but more likely, you'll just get a clearer sense of what it's like to do this work, and, I hope, of the satisfactions that can come from such research. How, though, if you do want to do DH research with data, should you actually go about learning it? I think the only way to do it is to have a project, a question, a set of data, and just dive in. Yes, you should probably find a way to get comfortable with the command line and the basic concepts of programming, but barring taking the time to do an entire year or so of computer science, you're gonna learn how to do this work best by just doing it. Find a problem, a set of data that might help you solve that problem, and then go ask people questions about where they'd start. There are a lot of materials out there on which you can draw: books, blog posts, and pre-made tools. And there are, I think, a seemingly infinite number of unanswered and fascinating questions out there. Fundamentally, as I said, I think working with data helps us to manage the limitations of a cultural history that's focused on novelty and innovation. In Boulez's case, the story has been that of a world-famous musical troublemaker come to stir up an old institution and make it new, but the data show us how the institution, while giving him some room to express himself, mostly carries on its lumbering way, doing more or less what it had been doing for the past 130 years.
It's a reminder, then, that the unfolding of cultural history is a larger process than a set of case studies permits us to understand. Cultural change often happens more at the margins than at the center, and this is as true of Broadway or the Restoration theater as it is of modernist classical music. Schoenberg may be important, but what orchestra audiences really want to hear is Tchaikovsky. Thank you. Yeah, we have a half hour for discussion, questions, comments, thoughts, great. About anything. Yes, please. Remind me your name. Christine. Christine. So this is maybe just a trivial question. I'm just fascinated by the data set of conductors, and I wonder, was there a single work by a single woman composer in any of that material? You know, they don't record gender, because basically everyone is male. So the short answer is probably kind of no. Yes, you do start to get female composers performed more frequently. There was a recent article out; the American Symphony Orchestra Association, or whatever they're called, does do studies of their members' programming, and it's still, I think it might even still be in the single digits across the United States. It's crazy. It just so elegantly makes your point about the selection of documents that it doesn't even rise to the level that we have today. Yeah, absolutely, yeah. In fact, I had an example about this earlier, when I wasn't talking about this data set. I was talking about a friend in economics; I'd heard her give a talk once, and she has collected data about interpersonal relations and microloan repayment, and all of her participants in the study are women, and it never gets mentioned once in the entire talk. The economists just are like, homo economicus does not have gender, and so, yeah. So it's entirely unmarked here. Yeah, absolutely. And if I could ask people to really project, so you can ask questions for the video. And, or, I'll repeat the essence of the question.
Yes, and your name? Rita. Thank you so much. This was such a great presentation. I had a question about something that I was trying to understand about the way that you're doing this work: the labor. Could you speak a little bit about where the majority of your time is going? How quickly are you producing this research compared to maybe other forms of traditional scholarship that we're used to in this field of theater and performance studies? Sure. I think that would be really interesting to hear alongside the images. Absolutely. Most of my time goes to Twitter. But then, that's kind of true. So one thing, what I actually mean by that is that it's really hard for me to know how other people manage their research time and energies, because I only see my own horribly slow, weird pace of scholarship and rhythm of work. So it's hard for me to think comparatively. What I will say is that what I'm mostly doing is finding a data set and then doing what one could call exploratory data analysis. Just to say, I have a few kinds of things I know I want to ask. What's the mean or median or mode of this thing? What are the categories here? Okay, how do they break down? Do they break down differently over time? Do they break down differently if I control for certain categories, if I sort them by conductor or something like that? And then I think about, well, what can I extrapolate further from that? So I do spend a lot of time writing code, thinking about what kind of a chart I want to make, making it, thinking that it's ugly, changing the colors, thinking of a really cool version of it that I then spend like four hours trying to figure out how to do, this one thing that I don't know how to do in Python, and then doing it and then deciding it's actually too complicated a visualization and going back to the simpler thing. So there's a lot of just messing around, frankly. So I'm doing less charting through the arc.
Less of my work is sitting down and writing out an interpretation of a passage of text, I'm in an English department so I can say that's a thing, and trying to think about what that means, and going back and revising that, than it is simply poking and trying to create things that might count as evidence, in some sense, for a larger argument. Does that answer your question a little bit? But I spend a lot, too much, time poking around and trying to find the perfect way to show something, and then I screw it up anyway, as I did with that composer missing from the chart. That's bothering me to no end. But that's what I do. The programming stuff is interesting. Programming's a really weird process if you've never done anything like it, which is that you spend most of your time copying what other people have done somewhere else and then changing the variables to be whatever your thing is. It can be tedious work, particularly for a beginner, because the mistakes you're gonna make are things like, you're gonna miss a space in what you're typing, and the computer is like, I don't know what's happening, oh my gosh. And it throws off some error, and you're like, I don't, I can't. I still write code sometimes, I do this less than I used to, but I still write code sometimes that I stare at for like an hour trying to fix. And then I come back two days later and I'm like, oh my God, I forgot a capital letter in that variable. Why are you so stupid, you machine, right? So there's a lot of that too. But you can teach yourself to program if you get rid of the mentality that it's crazy and weird. There's like six things you need to understand, and then you can do basic stuff very quickly. Also, as you're gonna see this afternoon, Google Sheets and Microsoft Excel are magic. They are magic. You will see, you can type a question into Google Sheets and it will make a chart for you.
So there's a lot you can do even without the kind of exploratory stuff that I spend time on. So yeah, that's my best answer. Jason, and then we'll go across. I really appreciate how careful you are to say that everything is an argument, because it's so terribly important. But I did have a twofold question for you: how does your impression of your access to data guide the first question that you ask? And then, what were some of the most compelling questions you've had where you realized, I don't have the data, I'm not even going there? Oh, okay, well, I'll start with the latter one. So the questions are, what are compelling questions I don't have data to answer? And the other, excuse me, the other one is, how does thinking about working with data change the kind of questions I'm asking? I think it means I'm fundamentally interested in questions that are about scale in some sense, right? So that's what I usually have in mind. I also, you know, as I was writing this, I think, well, where do I start? To be honest, I also have been starting lately just by finding archives. So I'm working on a paper right now, which I'll be giving a first version of at ASTR in November in one of the curated panels, on nightly stage managers' reports. I have a 579-performance data set of a Broadway run of a show, with comments on everything that went wrong and the run time of every act. And I saw that and I was like, ch-ching, because it's just an amazing data set. And now I'm doing exploratory analysis and trying to figure out what on earth is interesting, and trying to find a way to create categories of the kinds of errors that happen in performance and the things that get noted.
One of the things that's been noted in the reports I'm transcribing right now is that there are mice in the theater, and every time the stage manager catches one with a trap, he draws a little mouse on the stage manager's report, which is hilarious. But I'm like, is this a facilities problem? This is a facilities problem, right? But who are the participants in this error? Vermin and the stage manager, and this is a facilities problem, I guess. But is it different from the temperature problem, which the actors complain about? So I'm thinking about these things, and that's the work that I end up doing. So in some ways I'm going back and forth all the time between question and data. What's an interesting question? Things I don't have the data to answer. So here's a question, and one day in the next year I will do this, because it's been driving me crazy and there's no data on this, but I'd like to make what's called a Fermi estimate, in which you estimate the order of magnitude of things. The classic version is the question they used to ask students applying for consulting jobs: how many piano tuners are there in New York City? So you have to think, okay, what's the population of New York City? And you get it generally right: oh, it's around eight million people. Okay, how many households are there in New York City? Oh, let's say two million. And so you're really just guessing the magnitude of the size of things. So here's the question that I want to do a Fermi estimate for, and I need to sit down and think a lot about it: what is the amount of theatrical activity in the United States in a given year? And I have some data points against which I can verify things, like census records of how many adults see a show in a given year. So 20% of American adults see a play in a given year. Think about how sad that is for what we do.
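The piano-tuner logic can be sketched in a few lines of arithmetic. Every number below is an illustrative order-of-magnitude guess of mine, except the roughly 20% attendance rate mentioned in the talk; the point is the shape of a Fermi estimate, not the answer.

```python
# A Fermi-estimate sketch of annual U.S. theatrical activity.
# All inputs are order-of-magnitude guesses except the ~20%
# attendance rate, which the talk attributes to census-style data.

us_adults = 250_000_000        # rough U.S. adult population
share_seeing_a_play = 0.20     # adults who see a play in a given year
visits_per_attender = 2        # guess: plays seen per attending adult
avg_audience_size = 200        # guess: average house size per performance

total_visits = us_adults * share_seeing_a_play * visits_per_attender
performances = total_visits / avg_audience_size

print(f"~{total_visits:,.0f} audience visits per year")
print(f"~{performances:,.0f} performances per year")
```

Under these made-up inputs the estimate lands around half a million performances a year; changing any guess by a factor of two changes the answer by a factor of two, which is exactly the kind of sensitivity a Fermi estimate makes visible.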
So that kind of information can be a check, but the rest is the kind of data which just doesn't exist. So I'm interested in a question like that. Inferring how many plays we're missing from the early modern period, or something like that; those are other things that I'd be interested in discovering. Yeah, I'll try to ask this really loudly. Okay, I'll repeat the essence; I'm not so good at it. Okay, so as you went through each of these areas, you kind of put in parentheses this sort of, what I'm calling, margin chart. Because you say, well, this is what's available, but I can't get to this. And so, for example, with the graphic displays, I was thinking, well, there's the network one, there's the side-by-side one; which kind of chart is not there? A moving one? Yeah, yeah. Or one that shifts while you're displaying it according to an order, a random order, okay? That might change how you're thinking. Another one was, you kept asking, so what does this say? But that was more in the margin chart of what's not there. Someone brings up gender, race, ethnicity, class, accessibility; those are in that margin chart of data that's not there. So as you're going through this, do you think of having, maybe, a ghosted or phantom side to what you can show us, that hasn't been able to be mined? In other words, that creates not just the argument but in some ways a kind of counter-intuition that's also data. So how many things were missing from what I did? And what is that missing stuff due to, in how I'm thinking? Yeah, well, there's one version that has a very direct impact, which isn't so much missing as just on the cutting-room floor, which is that there are a ton of charts and other things that I've made where I say, I can't do anything with this, or this isn't relevant, or I only have an hour so I'm not gonna talk about this today. So there's that marginal work.
The margins of stuff that's left out because it's not in the data set in some fashion, I mean, there are interesting ways in which, so, another example: for colored girls who have considered suicide / when the rainbow is enuf, right? I think I have the complete data set of the performance reports from that show's runs at the Public and then on Broadway and then on tour, and I'm working on getting it digitized and transcribed. And every now and then something about race comes up in the stage manager's reports. I've just flipped through them; I haven't been through them in detail. But otherwise, race is pretty much unmarked in this show that starred Black women on Broadway and was a hugely revolutionary piece of theater in so many different ways. So that's gonna be a fascinating case to really think about that in, I hope, a useful way. Again, there are limitations to my own scholarship, I wanna say, right. So one of the things, though, about this kind of work is that, in theory, so this New York Philharmonic data is all online. You can go download it right now. You can go to the New York Philharmonic website; there's a link to the digital archives, and then there's a link to download the complete program data set in XML or JSON form. You can download it right now. And in theory, you could remake everything that I made. If I were a somewhat better citizen scholar, I would post all my code online, so you could see exactly what I was doing, and you'd probably find a few errors where my code ruins my analysis. I've found those in work that I've done, right? So that's a terror of this work. And it's not so different from misquoting a source or something like that, but it's a version of that. So in theory, it's reproducible. And some of the stuff that I leave in the margins, other people should be able to discover.
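As a sketch of what remaking such an analysis looks like: the Philharmonic's downloadable program data is JSON with, roughly, a top-level list of programs, each listing the works performed with composer and conductor fields. The tiny inline sample below only mimics that layout; the exact field names are my assumption and should be checked against the real download.

```python
import json

# Inline stand-in for the downloaded program data set; the field names
# ("programs", "works", "composerName", "conductorName") are assumptions
# modeled on the archive's layout, not verified here.
raw = """
{"programs": [
  {"season": "1971-72",
   "works": [
     {"composerName": "Stravinsky, Igor", "conductorName": "Boulez, Pierre"},
     {"composerName": "Berg, Alban", "conductorName": "Boulez, Pierre"}
   ]}
]}
"""

data = json.loads(raw)
for program in data["programs"]:
    for work in program["works"]:
        print(program["season"], work["composerName"], work["conductorName"])
```

Swapping the inline string for the real downloaded file is the only change needed to start poking at the actual archive, which is the reproducibility point being made here.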
One of the reasons I think I've been able to do what I have with this kind of research relatively quickly is that so many people just haven't bothered with this kind of work yet, so there's a lot of room to answer simple questions. I'm also interested in the simplicity of this. There's a weird simplicity to it, right? Literally, how many times did he conduct a work by Stravinsky? It's an interesting thing to know. And maybe I can figure out what's interesting about it in terms of an argument. There are more complicated versions of that discussion to be had, arguments to be had. But we need to know the first stuff first before we can start doing the other stuff later. So those are a few different answers. Is that the sort of thing that you were getting at? Yeah, please. Kimmy, thanks. I haven't taken a math class in nearly two decades. Yeah, yeah, yeah. Where to start? Because I know that MIT offers an online open course in baseline Python programming, which I've taken. But I'm curious, as someone who's not a statistician but has a data set that they could parse: where did you start? Because it seems like you're almost entirely self-taught. I was thinking back to when I started doing this: gosh, this seems okay; why does this seem okay? And I remember I had taken two programming courses in my life before I did any of this stuff. And I had a specific problem. And the first code I wrote was even worse than this stuff. So my badness manifests itself in various ways here; it just might not be legible to someone who doesn't recognize what good code would look like. The math: I don't use that much sophisticated math. I wish I knew a little more; then I'd be more comfortable using more rigorous statistical techniques. I also think that rigorous statistical techniques are a bit of a trap for us, because they can mislead you about this kind of material. So for example, okay, the running time. Ooh, I can use this, ha!
So this is the runtime chart for Morning's at Seven, which was on Broadway in 1980 to 1981. It starts off super long in previews; it's really slow. And then they get their act together, and then they open, and then they get fast. And they run fast for a while, and then they start to slow up a little bit. This is just a trend line of what's happening. They start to slow up a little bit, and then they stay a little slower for a while. And then four cast members leave at once near the end of the run, and the show gets basically eight minutes slower for the rest of its run. That seems meaningful and interesting to me. But if you run a basic statistical test on this, which asks, basically, is this information or is it noise, statistics says this is noise. This isn't noise. These are people doing things. Of course the show starts slow in previews, gets fast, and then starts off with a lot of energy; and then people get tired, and the show slows down a little bit; and then you get four new cast members in, and the show is totally different and takes longer. Obviously that's true. So the statistics can be a bit of a trap. All I've done here is literally add up the durations of the acts. I mean, we can all add and subtract, so that's not complicated, right? Means are just sums and division. Medians are just finding the middle of a series. And moreover, computers do all this stuff for us, so you don't even have to worry. You use a module called numpy for Python, and you type np.mean, and then the name of your variable, your list of runtimes, in there, and it spits out a number. You're done. And it does that for really complicated stuff too. So there's a lot you can do with math if you trust and are confident that you'll sort of understand what's going on. That doesn't require you to go back and spend a year relearning all of the math you've forgotten, or maybe were never taught.
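That np.mean call, with made-up runtimes standing in for the nightly reports, looks like this:

```python
import numpy as np

# Hypothetical nightly runtimes in minutes; the real numbers would come
# from summing the act durations recorded in the stage managers' reports.
runtimes = [148, 145, 139, 138, 140, 141, 143, 149, 150, 148]

average = np.mean(runtimes)    # the sum divided by the count
middle = np.median(runtimes)   # the midpoint of the sorted series

print(average, middle)
```

That really is the whole operation: hand numpy the list and it hands back the number, for means and medians and much more complicated statistics alike.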
Most of us, I think, weren't taught much statistics in school, although it's a more important literacy, I think, than calculus. I started with, I think, Codecademy, which has this funny little way to learn to program. What do you need to learn to learn to program? Actually, here's what I think you need to learn. You need to understand what a type is. There are different types of information in a given program, and different languages have different types. For example, integers: whole numbers, right? One, two, three, four, five. There are floats, which are the numbers in between: 1.5, 1.25, et cetera. There are strings, which are sequences of characters. There are lists, which are just collections of things. These are all the Python names for these things; some of them are called arrays in other languages, and so on. There are dictionaries, which is a great name for them, because that's really what they are: lists of key-value pairs. So the key will be, like, Hamlet, and the value will be the actor who played Hamlet, right? So that's a dictionary. You learn what the types are in your language. Then you're gonna learn what a variable is and how variable assignment works. So you say X, which is a variable, equals eight; now X is equal to the integer eight, and you can do things to X. You'll learn iteration, how to loop, because that's fundamentally what I'm doing, right? I'm saying: go through every program, and if there is a Stravinsky piece in there and Boulez conducted it, then add one to this counter I have, right? You'll learn conditionals; that's the if part. If you know types, variables, conditionals, and loops, plus maybe one other thing, input and output, you can do almost anything. And you can do that in almost any language.
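Those few concepts cover the whole Stravinsky example. A minimal sketch, with a hypothetical hand-made list standing in for the real program data:

```python
# Types, variables, conditionals, and a loop in one place: count how
# many works by Stravinsky were conducted by Boulez. The program list
# is hypothetical stand-in data, not the real archive.

programs = [                                    # a list of dictionaries
    {"composer": "Stravinsky", "conductor": "Boulez"},
    {"composer": "Mozart", "conductor": "Bernstein"},
    {"composer": "Stravinsky", "conductor": "Bernstein"},
    {"composer": "Stravinsky", "conductor": "Boulez"},
]

count = 0                                       # variable assignment: an integer
for work in programs:                           # a loop over every entry
    if (work["composer"] == "Stravinsky" and    # a conditional on two strings
            work["conductor"] == "Boulez"):
        count = count + 1                       # add one to the counter

print(count)                                    # output: show the result
```

That loop-plus-conditional-plus-counter pattern is, as the talk says, most of what this research method amounts to computationally.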
There are books: there's Humanities Data in R; there's Matthew Jockers's introduction to text analysis for humanists; there's Nick Montfort's book about exploratory programming. There are a million ways to learn this. There's open stuff online. The Programming Historian is a fabulous website with tools to talk you through things. So, any of those ways. If you can get comfortable with that and you have an actual project, then you can ask people to help you solve any given problem you run into, and/or search online for the answer. And one of the most important skills of DH research is learning to Google well. So you figure out what your question is and Google it, and some poor sap online has asked a question on this board called Stack Overflow and then got yelled at for asking the wrong question or duplicating someone else's question. But there's the answer for you. I have never asked a question on Stack Overflow; too scared, I couldn't take it. So yeah, does that sort of answer it? Yeah, okay, good. Yeah, Mike. I wanna switch from methodology and how to do something to a question that might be more about curriculum and labor. So here's the issue I'm thinking about right now. I'm a graduate faculty member, and among the various requirements in our doctoral program is that students have to learn one foreign language. It is of variable use to our students. I continue to feel passionately that it's a valuable thing for a student, particularly an American student, to learn a second language. And I've been thinking a lot recently about the value of learning a language that we wouldn't normally think of as a language: the various kinds of software coding.
So maybe this is too speculative to ask, but where do we find, in the US or anywhere else, a department, say, in the humanities, particularly in English, that is enabling students to sub out the traditionally defined languages, Italian and Spanish, for example, for coding? Off the top of my head, I can't recall; I think there are some right now, and the change has been in the past couple of years. These are the kinds of things I've seen in my Twitter feed and then followed the link to the announcement and then forgotten. There are some. There are institutions that have strong digital humanities laboratories. UVA is an older one. The Stanford Literary Lab is another. The University of Nebraska has an amazing one. Northeastern has an amazing digital lab, which is a combination of faculty from computer science and the humanities, and they really work together. It's incredible. I don't know how all those programs are structured and what the requirements are. There are also a growing number of digital humanities certificates of various kinds being offered. In terms of subbing out for a language, I think that's part of the larger story of the way that language requirements are shifting in graduate programs. A lot of programs have dropped their language requirements entirely, so they're not gonna be replaced by something like this. I'm based in an English department, so we have different ideas about language requirements, and we've had some of those conversations, but not about computer programming. I do list them among my languages on my CV, though. Not that I'm so amazing that you'd put me in a classroom to teach those languages, but I am comfortable working in them with students at a certain level. The one thing I'm terrified of, though, is teaching a course with computer science students in it.
Because there are many computer science students interested in using data to study the humanities, but I'm really scared of being in a class with them by myself, because my code is bad and I'm gonna do stuff and they're gonna be like, why did you do that? And they're right. So I'm scared to have to learn that way, in public, about a subject in which I'm really not expert. So I think there are a lot of tricky questions there. There are a few people working in the field who I think are really amazing programmers in addition to being incredible scholars. I'm not sure of anyone in theater and performance who's an amazing programmer; well, Doug Reside at the New York Public Library is an actual, real, great programmer, but otherwise not in our field. Ben Schmidt at Northeastern is an honest-to-God great programmer and can really write programs. Matt Jockers wrote a whole package for R to analyze sentiment in novels. There was then a great argument about the statistical package that he used there, an argument that took place slowly online as people said, well, wait a second, what you're getting here is just an artifact of the mathematical formula you've used. And then: well, what does that mean? Is that right, or is that not right? It was fascinating. But it was a little weird also, because it felt like we should just call a statistician and maybe they can tell us. But it's hard. It's hard to think about counting this as a foreign language in the same way. We wouldn't count music theory as a language that students use, or Labanotation. So I think it's used much more instrumentally, in that sense, rather than as opening up a discourse. Because if a student came in and said, I want to do performance theory and code studies, that's a student I would be happy making a case for, to say, you need to count the languages they're learning. When they learn Lisp and FORTRAN and BASIC and C, then you need to count those. That counts.
That's what they're doing. But for someone who's using it instrumentally, I'm still a little more cautious. Code studies is amazing, by the way; people do, like, rhetorical studies of code. Amazing. Yeah. So, something your presentation really made me think about, that I think you really highlighted, and it's sort of piggybacking on what Catherine was asking, is all about what's missing. But more significantly, I think, what you're showing is how you're finding your questions and then how you're charting those questions. And it seems like the humanities really offer us the ability to think very reflexively about the data and about metadata and how our own process is working. So I wonder, have you charted at all your line of questioning, and how your questions have changed through time as you've done this work? Is that something that you've thought about doing? Because it would be really interesting to see not only what's missing, but what is the path that your thinking follows as you look at these different sites. Again, here's me being terrible. I haven't, and it should be more clear in the history of the files on my computer, and it kind of is there. There is a rich tradition, because of our interest in reflexivity, of people talking and thinking a lot about methods. I've done that for you today, in some sense. I have a piece coming out in the Routledge Companion that Nic Leonhardt is editing, about databases and ontologies. So I've done a little bit of that. But because fundamentally I'm just using computers to count things, I don't think I yet have as much to say to people about methods as those who are discovering ways to use new statistical techniques or trying new statistical things. I think I've made some cool charts every now and then, but that's not something I feel like I need to go announce: hey, everybody, look how I did this. I mean, it's for the field that I'm saying that.
It doesn't need to be a huge methods conversation every time. So I've been less attuned to that. I also think there's still enough skepticism or confusion about what DH means, for so many different people, that I don't have much invested in that. If I can show you that what I've studied has something interesting to teach us, then I'm happy. And if you hate the method, if you refuse to believe what I'm saying because of the method, then that's a different conversation and we can have it, but that's not the one I want to fight first, I think. Does that sort of answer what you were asking about? Thanks, John. Christine, again. I have a question about rhetoric. Yeah. So I take your point that organized data is always rhetoric. And I'm really wondering about the process of turning those rhetorical forms, which you organize, into language. Because a lot of our primary training is in how to narrativize what we're talking about. And yet, as you say, the way that history is narrativized tends to look at the exceptional moment and create drama; and a lot of training in writing is about not just organizing arguments in a meaningful way, but about a forward-moving, linear process of expounding them. So I'm wondering, in your own practice, about the relationship between how you organize and conceptualize the rhetoric of data, and then the move in your brain, or how you see the process of translation, into the linear form of verbal rhetoric in which you're obviously highly trained. Yeah. Two answers to that. One is that, fundamentally, that doesn't seem different to me from any other writing, right? If you just give me a play and say, write a close reading of this play, I have the same problems, right? I need to make a linear narrative of a set of evidence that I've grabbed. So it's fundamentally the same, although then I'm thinking about, do I want to talk about the percentage here or the absolute number?
But that's still a back-and-forth of selection, right; the heresy of paraphrase is always a problem. At the same time, though, one thing I have thought about is when you see the visual versus when I'm just telling you what's going on, and how long to let you dwell on certain aspects of the visual and not. I think this actually didn't end up happening, but in Average Broadway, near the end, there are two charts of, what is it, the durations for which shows ran, and I do the first box-and-whisker plot without outliers and then the next one with outliers, and the boxes disappear. And I wanted that to be on the next page, right? Because you needed to see it after you understood what I was saying by leaving out the outliers. So there are moments of that sort of delay of the impact, and I think about that a lot. But honestly, I'm also trying to use that more in my writing that doesn't involve visuals too, because it's about finding a way to just say what we're seeing. And that's still super hard, for me at least, super hard. So, yeah, that's my best answer for that. I think we're done, yeah? All right, so we're gonna have to stop there. Yeah, we're done. Thank you, thank you very much. Thank you guys so much. Thank you very much for coming, congratulations. So, we are now gonna go down to get your IDs so that you're officially convened. The day will lead you away. Thank you.