My name is Dan Chudnov, aka dchud, and to my left in order are Dan Kerchner, Bergis Jules, and Laura Wrubel, each of whom you'll hear from in a moment. We're going to talk to you a little bit about what we're doing at George Washington Libraries to capture social media, to support research, teaching, and collection development in libraries. It's a real treat to be able to stand here in this great room with so many friends and colleagues, and to actually have a turnout that's a lot better than the Redskins got yesterday. It's very exciting to be able to share this with you, and I have to first of all thank the IMLS, who have funded us under their Sparks grant program to do the work we're going to be talking about today, in addition to Joan and Diane and Clifford and everyone at CNI who helped get us here. So we're going to tell you about a traditional project here. It's actually really easy, once you get your head around it, to think of what we're working on as a very traditional library project. We're focused on the kinds of things we've been doing for decades quite well. Maybe the material is different here, and that's why it's worth thinking about, but in a lot of ways it's a lot of things we know how to do. We're trying to save the time of our researchers. We're working with at-risk resources that are online, and we're looking at licensing scenarios for them. And we are essentially expanding the scope of what we're collecting. So through that lens, here's a little bit about what we're up to. I worked at the Library of Congress for a couple of years on the Twitter archive project there, up until 2011 when I went to work at George Washington University. And in addition to the technical challenge of collecting dozens of billions of tweets, which they have a pretty good handle on there now, fortunately, it was very interesting to see the demand come in from all over the internet for access to this data. There were hundreds and hundreds of requests from researchers in any discipline you could imagine to get access to historical Twitter data. Now, I don't know the latest about what the Library of Congress is up to in terms of giving that data away, but as far as I know they're not yet, although I do know they have the data safely. And there's a lot of news in terms of corporate, commercial access to the data that you'll hear more about from Laura in a moment. But when I went over to the George Washington University libraries in the fall of 2011, it was really interesting that the first thing I saw in the GW faculty research newsletter that goes around was a report on a study led by Kimberly Gross of the School of Media and Public Affairs at GW on how mainstream news outlets use Twitter. This was work done in coordination with the Pew Project for Excellence in Journalism, and the primary finding of this work, which was quite excellent and I encourage you to read it (you should probably Google it if you can't quite catch the URL, sorry about that), was basically that news agencies use Twitter and things like it to adopt and promote the same message they're putting out on their mainstream channels like the web and their video stream and all of that. Which in and of itself is not a dramatic finding, but nobody had studied this yet, and it's a question that gives them a foundation to do more layers of work on top of it. That's the way research works. So it was very exciting to see people at GW who were working with some material that I knew a little bit about.
So I reached out to Professor Gross and asked, number one after congratulating her, how do you collect your data? And the answer was clear: they were, as social scientists, collecting the data quite literally hand by hand, tweet by tweet. Now when I tell you the steps, you'll get a kick out of this if you know what I'm talking about and what I'm getting at. Step number one was they subscribed to Twitter's RSS feeds. Okay, there's a couple of people who know that Twitter no longer has RSS feeds. Number two was they subscribed to these RSS feeds using Google Reader. All right, now I see more heads shaking; Google Reader no longer exists. They would then fold, spindle, mutilate, literally copy and paste the data over into Excel. And when I say they, I mean her students in her class were assigned copying and pasting of data into Excel, where they would then do some additional coding of it, as social scientists do, and ultimately pull the data into SPSS and Stata and their more familiar tools. And this was a very painstaking process, and what I found out when I asked if they needed help was yes: they will take any help they can get getting this data. They're doing too much work for too little data. We're literally talking about thousands of tweets, not billions, not millions, thousands. That study was done on a slice of data taken over a couple of dozen accounts over a couple of weeks if I'm not mistaken, maybe even one week. So you're looking at about a 4,000 or 5,000 tweet data set that took all semester to gather and code so they could actually do their analysis of it. It doesn't scale, and it's not something we can expect our researchers to do, and we certainly don't want our people teaching classes to assign to their students. If you want to know how well this went in terms of the learning experience, you just need to look at Professor Gross's student evaluations at the end of the semester. The students didn't like copying and pasting tweets by hand, one by one. So this is not, as I mentioned before in terms of the demand the Library of Congress has gotten, an isolated case. We have several people at GW doing work in this area, not just on Twitter, not just in the social sciences. We have computer science researchers, just like many of you do, and people in other disciplines as well looking at different slices of what social media gives us. And if you look at ProQuest's theses and dissertations database, you'll find over 5,000 unique items about Twitter data in one way or another. And if you click through, that's work seriously considering Twitter data, not just computer science; it's really a cornucopia of disciplinary looks. In just the last few years, there are a lot of people at our institutions doing this, and they can use our help. And if we don't save their time, we're putting them behind the eight ball. So we like to think of this service we're providing, which we'll tell you a lot more about in a moment, as saving the time of researchers as a strategic advantage, getting them to do their work more readily and more efficiently. Laura? Thank you. So we found that we had researchers not just in the School of Media and Public Affairs, but across our campus doing research on social media and Twitter: political science, business, computer science, public health, even the English department. And there were certain kinds of data that they were interested in. They were looking at how specific Twitter users might tweet, so members of Congress, or how news organizations might tweet.
They were also interested in looking at particular keywords, so how keywords were used during the presidential debates, or how other hashtags were used on Twitter. The basic values they were interested in, in addition to the user and the time period, were the text itself of the tweet, links that were in those tweets, and then also retweet and follower counts at various points in time. And they're interested in data at the size of tens of thousands, not tens of millions. So we're not talking big data here by any means, but they are data sets that may be larger than social sciences and humanities researchers in particular may be accustomed to working with. So they need delimited files to import into their software, maybe Stata or some other kind of coding platform. And they're interested in historic data. An event or topic might not emerge as being of research interest until some time after the fact; they don't always know when it's happening that it's something they're going to want to research later on. Now getting historic data usually requires licensing it from a Twitter-certified data reseller. These are companies that have exclusive rights to license access to the Twitter firehose. These are the four companies that currently have a relationship with Twitter to do this: DataSift is a UK company, NTT Data is in Japan, and Gnip and Topsy are both here in the US. And it's a changing marketplace. You may have heard just in the past couple of weeks that Apple announced that they've purchased Topsy. So things are changing. There are some other companies that have relationships with these companies to provide access to Twitter data. So for example, Texifter is a company that provides a platform geared for researchers to do some coding analysis with Twitter data sets. And they all offer bulk data sets of historic data, so they're who you would need to go to if you wanted to purchase a historic data set. Their pricing generally factors in the number of days you're searching, the time period you're searching, so a longer time period is a higher cost. And then the size of your resulting data set is a factor in the pricing as well. We've found in speaking with them that their data's not cheap. The commercial sector is their main market. And while that's the case, they're very friendly and receptive to working with researchers. So we've been talking with them a little bit about our needs, and they've been very helpful in talking that through with us. And I think we'll be hearing some announcements in the next few weeks of some products they've been able to develop more for the academic community. At this point it's been a little too pricey for our researchers to actually purchase data at the commercial price level. They're also used to working with customers who can handle very large data sets and who want to work with the firehose directly. So there's definitely an area there where we can help our researchers deal with these challenges. So what can we do in libraries to help? We can help them get the data they need at the scale that they're interested in, and Dan's going to show you the tool we developed here in a moment. We can also help them navigate this vendor space and understand the options that are out there, and help them proactively collect the data that they need.
One of our media and public affairs faculty members, David Karpf, has written about the problem for social science researchers that the data they want to work with, blogs and social media, is so transitory and disappears so quickly. So he calls for setting up what he calls lobster traps: just as you might drop a lobster trap in the ocean, leave it there for some time, and pull it up from time to time to see what you've captured, he calls for setting lobster traps of our own to capture social media data. So we can proactively capture this kind of data now, before it disappears, whether or not we're sure exactly how it might be used by our researchers. And we do know that Twitter and social media data disappear all the time. When Congress turned over, when the 113th Congress started at the beginning of January, we noticed that the outgoing members of Congress were shutting down their Twitter accounts, as the rules of the House and Senate require them to do. They had been tweeting as members of Congress during that time, but those tweets were no longer going to be available. So we were able to go out very quickly and try to capture as much as we could so that the researchers in our media and public affairs and political science departments would have that available for them to use. So at this point, I'm going to turn it over to Bergis to talk about the archives perspective. Hi, Bergis Jules, university archivist at George Washington University. So this is probably the least technical part of the presentation. But I wanted to start you off, since we're all tweeting these days, with a perfectly well-crafted tweet that's ready to go about how we're using Social Feed Manager to collect records for university archives. For us, when Dan told me about this tool, I had just started a process to engage student organizations at the university about how to transfer their records over to university archives. So this tool, this app, came about at the perfect time for us. For university archives, we're finding that it's a very practical tool. It offers us instant value to address a major collection development gap for university archives specifically, which is documenting student organizations. Why did we choose student organizations? We thought it was a good test case because it's a well-defined group and they're very highly active on Twitter. And they are also a great representation of student activity on campus. University administration is increasingly getting interested in student life, student culture, and how to document those, and I think that's mainly because of marketing and promoting the university message, which is good for us because it makes the university archives relevant. It's also a difficult collecting area because these groups come and go, their leadership comes and goes, and there's not really a culture of documentation in student organizations. So as a result, a lot of student organizations, and student life in general, are not well documented in the university archives. What you'll find mostly are administrative offices; administrators are the ones who are really well represented in the university archives, and students are really kind of non-existent. So we hope to use this as a way to bring students back into the university archives and to document their life and experiences there.
So, GW on Twitter: it's a very highly active user community. Back in 2009, I think some local newspaper here named us the most highly active Twitter university around, or something like that. It's a mix of students, administrators, and offices. There are also 400 student organizations at the university, including Greek, cultural, social, political, and activist organizations, and most of them are exclusively on Twitter; most of them don't have any other web presence. So when we think about this, this is really the only way a lot of them exist to the public, and this tool is going to be really helpful in bringing some of these groups back in. As for what we've collected so far: since March of 2013, we're currently tracking 329 accounts (that number hasn't been updated in a while), with over 200,000 tweets in SFM specifically for this university archives project. Dan and I did a test the other day just looking at tweets for the month of October, and there were over 10,000 in just one month. Now, are all of these tweets historically valuable and significant? No, but the point is we're capturing them and we've figured out a way to pull them in. What happens next? We're still working on that, but the point is we wanted to make sure that we were getting this information and capturing these accounts and these tweets before, say, the student groups dissolved and it all disappeared. So that was for university archives. For special collections, there are multiple ways this could help in general, but two really important ways we see it helping special collections are by creating new types of collections, because these are new types of records, social media records, and by enhancing existing collections. So let's say you get an organization's physical papers, 2,500 linear feet or something; if that organization is prolific on Twitter or other social media, now we can go in and grab that part of their collection. It helps us get a more complete record of the organizations and people that we document in our archives. Beyond special collections, the National Archives and Records Administration has pretty much decided that social media is a record, because government business happens there, which makes it a place where records are created. One of the goals and responsibilities of the presidential records management directive issued in 2011 asked the National Archives and other government agencies to come up with innovative ways to address electronic records management. The National Archives has decided that social media, not just Twitter but all social media platforms, is one place where they want to work to figure out how to capture some of these records, and they're currently thinking about several options for doing that. Some include commercial options, which are really the only thing available right now other than doing this manually. So we think that SFM, Social Feed Manager, has a role to play in helping government organizations, university archives, and special collections build these collections. The NARA bulletin on guidance on managing social media records that came out this year really makes it clear that this is a place where the National Archives wants to be. So we hope maybe they'll contact us to see how we could work with them on that.
Moving forward: like I said, as far as university archives and special collections are concerned, this tool is not very complete. All it does right now, basically, is capture the records. We really need to start thinking, hopefully in the next phase, about collection development policies, about how to roll this tool and these social media records into general collection development within our organizations. I think Dan Kerchner will show later a screenshot of all the unique metadata that's attached to each tweet you send out; it's a very rich piece of metadata attached to every one of these records. So how are we then going to fold these into our catalogs? How are we going to help with discovery? We need to start thinking about some of these things. SFM does not, if I understand correctly, collect full links, right? Sort of. Okay, so we need to figure out how to pull those things in: links, images. And also some of these groups do have websites, right? So we need to start thinking about how we could roll SFM out to a wider audience and fold in web archiving and all of that, so we can really try to capture the full record. And now Dan Kerchner is going to give us a demo. Now that Bergis and Laura and Dan have talked about this mysterious Social Feed Manager tool, hopefully it's not anticlimactic, but I'd like to show you what we've been talking about. So I am one of the software developers who's contributed to this tool, as is Dan; actually, Laura, I think you've put in a ticket too. There are other software developers, and we actually have a part-time graduate student, but we're looking to go beyond our institution and have others participate. In fact, we have a meeting on Wednesday where we've invited other institutions, not just software developers, to help us develop ideas for where we want to take it going forward functionally, and I think on the second day we're going to have a hackfest for those who want to stay around and contribute to the software itself. So we're looking to leverage beyond just our team and make it a truly collaborative effort. You'll notice here, and I'm going a little out of order, that our code is on GitHub, and the repository is open to everybody. We like to use GitHub as an integral part of our software project management at GW Libraries, and that also makes it easy to collaborate with others outside of GW. As far as the basic stack, we use the Django framework, which is built on the Python language, and a couple of libraries that make things a lot easier for us. The django-social-auth library helps with authentication, in this case with Twitter credentials. And Tweepy is the Python (that's the "py" part) wrapper around the Twitter API, which is kind of implicit there; so we're using the Twitter API through the Tweepy library. And as I mentioned, you can go see it all yourself on GitHub, and there's a LICENSE.txt file, which you'll find to be a very MIT-style software license, so it's freely distributable and things like that. So I think on that note, let's just go to the actual application; we have a local instance running here. And let's see if I can do this. Okay, that's a little bit wide, but let's see... no? Okay, well, we'll do our best here. So as I said, for those who have built basic Django framework applications, this type of styling should look fairly familiar. It's actually Bootswatch as well.
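[To make that stack a little more concrete, here is a minimal sketch of the kind of call SFM makes against the Twitter REST API through Tweepy. The credentials and the screen name are placeholders; in the real application the Twitter credentials come in through django-social-auth rather than being hard-coded, so this illustrates the library, not SFM's actual code.]

```python
import tweepy

# Placeholder credentials -- in SFM these would come from the
# django-social-auth flow, not be hard-coded like this.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Pull recent tweets for one tracked account (the screen name here is
# just an example, not necessarily one SFM tracks).
for status in api.user_timeline(screen_name="gelmanlibrary", count=200):
    print(status.id_str, status.created_at, status.text)
```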
So the homepage right now, and again this is in its very early stages, so you'll be able to say that you saw SFM back when it was just this. And this is probably the biggest screen it's ever been on, so we're setting new records today. The homepage, as far as the UI, really just provides a random sample of some of the Twitter user feeds that we're tracking; here we have a setup with over 1,300. And then off to the right a little bit, though we're kind of lacking the display here, is a basic graph of the number of tweets per day. We're looking into other data visualizations that might be useful. And a sampling of recent tweets. So I can't seem to get to the right there, but I just want to show you how you would set this up. Again, this is just the out-of-the-box Django administrative page. You add Twitter users that you want to follow here. So if you look in this column, these are the names that we've set it up to follow, to gather the tweets. The actual tweets here are the user items. And once this loads up, one thing you'll notice is that we are capturing the entire JSON structure that Twitter gives us, whether or not we present all of the different elements. We've chosen certain data elements right now to present in the Excel download, which is the next part, but we do archive the whole JSON that we get. So what I'd like to show you is, I'm not quite sure what this Twitter feed is about, but how about WRGW? That's our radio station at GW; I have a history in radio myself. This is probably one of the more useful features here: if you click on this link, it will generate an Excel file, let's see. Okay, all right. So these are all the tweets that we've gathered for the WRGW Twitter feed. We assign our own ID, that's the object ID in our table, the primary key. Some of these columns are straight from the JSON that Twitter provided back. Some of them we've cleaned up a little bit to make them easier to work with; as an example, if there are multiple elements, like in the hashtags column, we'll just comma-separate them. We capture the URL of the tweet, whether it's a retweet, the text of the tweet itself, of course, and then we pull out at least the first two URLs, which is obviously something we could expand in the future. And given that the URLs are usually link-shortened, we try to grab the unshortened original URL as well. So that's pretty much what we're aiming for here: a more useful form for researchers to work with. And let me go back to the application here. Okay, some other conveniences, which are kind of off to the right, are that you can link to the actual tweet itself and to the entire JSON object. We have a slide here with one of those, so let's see if we can go to that. Okay, good. Oops, okay. Yeah, those links that I was referring to: if you click on the cached version, you will get something that looks more like this. This is actually one of the first tweets ever, I believe. So this is the nicer layout of what we get back, and you'll see all the fields that we're capturing in that one field in the user item table. Okay, so again, if you go to our GitHub repo, you'll be able to see not only the code, but a window into the types of issues that we are looking to work on.
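[To illustrate the kind of flattening just described, here is a rough sketch of reducing a full tweet's JSON to a spreadsheet-style row, with comma-separated hashtags, the first couple of links, and an unshortening helper. The field names follow Twitter's REST API v1.1 tweet structure, but the function names and exact columns are ours for illustration and are not SFM's actual export code.]

```python
import requests


def unshorten(url):
    """Follow redirects (e.g. t.co) to recover the original, unshortened URL."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url


def flatten_tweet(tweet):
    """Reduce one tweet's JSON dict to a flat row suitable for Excel/CSV export."""
    entities = tweet.get("entities", {})
    hashtags = ",".join(h["text"] for h in entities.get("hashtags", []))
    links = [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])]
    return {
        "tweet_url": "https://twitter.com/%s/status/%s"
        % (tweet["user"]["screen_name"], tweet["id_str"]),
        "is_retweet": "retweeted_status" in tweet,
        "text": tweet["text"],
        "hashtags": hashtags,
        "url_1": unshorten(links[0]) if len(links) > 0 else "",
        "url_2": unshorten(links[1]) if len(links) > 1 else "",
    }
```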
Some of the other items that we're looking to attack in terms of longer-term enhancements: just capturing the feed from an individual user doesn't really capture the whole conversation. Really, we want to get the interactions and the other tweets that reference that user, let's say, so we have to grapple with the problem of how far out we want to branch from that stream. There are obviously many social networking sources besides Twitter: there's Flickr, there's... I mean, we could go on here. The way that we've built the application is hopefully general enough that it should be fairly straightforward to expand it so that we can do this for more than just Twitter; it's the same idea, just applied to other sources. The libraries that we've used, for example django-social-auth, are pretty much ready to use with more than just Twitter. You'll notice that we do not, at this point, capture the items that those links refer to. So for example, if a tweet has a picture in it, all we capture is the link to the picture, not the image file itself; that's something we're going to look into. This really just provides a raw dump of the streams that we've opted to capture, but there are some possibilities that we could include some analysis tools, some searching tools, to make it easier to work through the data that we have captured. You're probably wondering how the mechanism that gathers the tweets works. Right now, we have some cron jobs set up that run every couple of hours, go through each of the IDs, and call out to Twitter. That presents some problems, because Twitter has rate limiting, where you can only call certain API functions a certain number of times in a given time period for an individual user, actually for an individual authentication token. So there are some ways we're looking to improve the application so we can do more things in parallel and not be so limited by the rate limiting. Also nicer ways of importing: let's say you have already captured some tweets outside of this tool, we'd like ways you can import them and make them part of the same database, and perhaps other ways we can export the tweets that we have captured besides just the form that I showed you. And while I personally think the application is not that hard to install, there are always better ways we can package it up. We realize that not everybody's a software developer, but if you do want to check it out, the instructions are there on the GitHub page for the project. So we certainly encourage everybody to check it out, and if you feel like participating, you can send us a pull request. So are we due to break at 3:15 or 3:30? 3:30, okay, great. Every one of these topics is a lot bigger than three or four words, so if any of them interest you, we only have one or two slides left and we can dig back into them; we'd be happy to answer questions or spell out any of these a little more. But just to wrap up, our immediate next steps: under the aegis of this IMLS grant, which will run through next summer, our goal is to improve SFM, the app, so that it meets diverse research, teaching, and collection development needs. And when we say diverse, we mean multiple institutions using it.
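[As a rough illustration of the polling and rate-limiting concern just described, here is a hypothetical sketch of the kind of job a cron entry might run every couple of hours. It is not SFM's actual harvesting code; the save function is a placeholder, and the exception name reflects Tweepy releases of that era (later versions rename it and can also wait out rate-limit windows themselves via wait_on_rate_limit=True).]

```python
import time

import tweepy


def save(statuses):
    """Placeholder for persisting each tweet's raw JSON to the database."""
    for status in statuses:
        print(status.id_str)


def poll_timelines(api, screen_names):
    """Poll each tracked account's timeline, backing off when Twitter's
    per-token rate limit is hit."""
    for name in screen_names:
        try:
            statuses = api.user_timeline(screen_name=name, count=200)
        except tweepy.TweepError:
            # Each OAuth token gets a limited number of calls per 15-minute
            # window; sleep out the window, then continue with the next account.
            time.sleep(15 * 60)
            continue
        save(statuses)
```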
We'd like to see as many of our peers as possible get into the business of extending their own collection development and support of research, using tools like this to provide an advantage to their researchers in their own environments, whatever their institutions are. We are also, under IMLS, working Wednesday and Thursday this week with a group of people who have kindly agreed to come in and join us to talk about the scope of what the app does and how it relates to use cases they've run into at their own institutions. That's everybody from people like us, who have similar research requests and are trying to capture their student groups and that sort of thing, to archivists from a few different institutions who've had to do something like this and reach out; we've got somebody coming from UVA who was involved in trying to capture social media around the hullabaloo with the president's situation last summer. I don't mean to refer to it so lightly, but I think you know what I mean. They very actively went out to try to get things from Twitter and blogs and Facebook and found it quite an interesting set of challenges. So hopefully institutions like that, with immediate problems like that as well as the sort of long-term goals that can be met with the lobster trap strategy we heard about earlier, will be able to use this app in a way that is reliable. That's the number one thing: we want you to be able to take this, run it, define what you need to do with it, and trust that it's going to work; and if you need to get a lot of stuff at once, to have a way to scale up, as Dan mentioned, and a pretty easy path to get started as well. And of course we're looking for more people to work with. The code is on GitHub, and we work in a GitHub way. If you want to improve something, feel free to reach out to us on Twitter, in person, over email, whatever works for you; send us a pull request. Pull requests are very welcome. Our license is, as you heard earlier, an MIT-style license. The only thing we'd ask is an explicit assignment of copyright. Other than that, we'd welcome your feedback in the form of questions, comments, code, and suggestions. So thank you.