Thank you for the opportunity to share our work at George Washington University with you today. We'd like to tell you about a library ecosystem that has developed organically since we first shared this work with CNI in 2013. This program may be unique in certain ways, and it's serving as a rich source of research data for researchers not only at our institution but beyond as well. My name is Laura Wrubel, I'm a software development librarian, and I'm joined by my colleague Dan Kerchner. Hi, I'm Dan, and I'm also a software developer and librarian at GW Libraries. As Laura mentioned, we first presented this work to CNI seven and a half years ago, and a lot has happened since then. What is now an entire suite of resources, systems, and services at GW Libraries began as a small tool to help a faculty member avoid manually cutting and pasting tweets from the Twitter web interface. As word spread, we began helping other researchers at GW, and the more word spread, the more we realized we needed to scale the service we were providing to collect data on behalf of researchers. So with the help of grants from IMLS, NHPRC, and the Council on East Asian Libraries, we rebuilt the Social Feed Manager tool with more of a self-service interface. Several years later, we realized we needed to add consultations to help each new researcher refine their research question, learn how to use our systems, and refine their data collection strategy. We've added recurring workshops on collecting and working with social media data, and in fact, as we'll show, some of our work with faculty and students has led to more in-depth research consultations and even external collaborations. Finally, we realized we needed another system to facilitate access to, and subsetting of, data sets we've already collected, and we now have TweetSets for that.
We've also taken some of those data sets and published them on Dataverse. Here's another view of our whole ecosystem, which, by the way, also includes the ability to collect from a few other platforms: Tumblr, Flickr, and Sina Weibo. And here are some of our researchers. We're now going to describe some of the components of this system, starting with Social Feed Manager. Social Feed Manager, which we'll call SFM, is a web application we created at the beginning of this whole process; this is the more self-service version, which allows users to create data sets using the Twitter API without having to do any programming themselves. We created it for users within our institution. Here we're looking at a sample view from inside SFM, where a user specifies the Twitter accounts, hashtags, and so on that they want to collect from, and where the user can then download the data after it's been collected. We developed SFM at GW, and we're describing the role it plays in our ecosystem, which also includes other systems through which researchers can get data without using SFM directly, such as TweetSets. But SFM serves researchers beyond GW as well, and in fact SFM is being run at other institutions. The code for SFM is open source on GitHub, and as other institutions around the world have started running SFM for their own projects, we're starting to see code contributions from developers at those institutions. We use two main platforms to facilitate access to data sets that we as librarians have collected and opted to make available. TweetSets is one of those platforms: it allows users to derive subsets from the data sets that we as librarians have collected.
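To make concrete the kind of Twitter API plumbing that SFM spares its users, here is a minimal, hypothetical sketch, not SFM's actual harvester code, of paging through the v1.1 search API for a hashtag. The bearer-token authentication, `next_results` pagination, and parameter names follow Twitter's v1.1 search API documentation; the function names and structure are our own illustration.

```python
import json
from urllib.parse import parse_qsl, urlencode
from urllib.request import Request, urlopen

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def next_page_params(next_results: str) -> dict:
    """Parse the API's `next_results` string (e.g. "?max_id=123&q=%23gwu")
    into request parameters for the next page."""
    return dict(parse_qsl(next_results.lstrip("?")))

def search_hashtag(hashtag: str, bearer_token: str, max_pages: int = 5) -> list:
    """Page through recent tweets matching a hashtag via the v1.1 search API.

    A sketch of what a harvester automates: building queries,
    authenticating, and following pagination cursors."""
    params = {"q": hashtag, "count": "100", "tweet_mode": "extended"}
    tweets = []
    for _ in range(max_pages):
        req = Request(SEARCH_URL + "?" + urlencode(params),
                      headers={"Authorization": "Bearer " + bearer_token})
        with urlopen(req) as resp:
            body = json.load(resp)
        tweets.extend(body.get("statuses", []))
        next_results = body.get("search_metadata", {}).get("next_results")
        if not next_results:
            break  # no further pages of results
        params = next_page_params(next_results)
    return tweets
```

A real harvester also has to handle rate limits, retries, and ongoing scheduled collection, which is much of what SFM provides on top of this.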
Users on the GW campus network can download the full data of the subsets they derive. We're looking at images here from a screen where they select the source data sets they want to work with, and then where they set parameters for their own subset, perhaps a date range or matching certain terms, things like that. Users who are not on the GW network can still access this resource, but they're only able to download the tweet identifiers, one identifier for each tweet. They would then need to use a tool (there are free tools out there) to rehydrate the IDs, which means requesting from Twitter directly the full data corresponding to each tweet identifier; Twitter will provide that data back to them if the tweet is still posted. For a smaller set of our data sets, we've also chosen to publish them on Harvard Dataverse, where again users only get the identifiers, but many people are finding those useful as they go and rehydrate those data sets. There's no subsetting available on Harvard Dataverse, but Dataverse offers preservation features and also generates a DOI, a digital object identifier, so that if a data set is used in research it can be properly cited, and we are in fact seeing citations of our data sets in published research. Through these two platforms, TweetSets and Dataverse, we're able to provide access to data sets that we've created and enable them to reach a much wider audience. Now that we understand the role Social Feed Manager plays in doing the actual collection, and the roles TweetSets and Harvard Dataverse play in facilitating access to the collected data sets, let's talk about data flow. Social Feed Manager is the component that interacts directly with Twitter, the source of all the data.
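Rehydration, mentioned above, boils down to batching tweet IDs and asking Twitter for the full tweets. Here's a minimal sketch of that logic; the batch size of 100 follows the documented maximum of Twitter's v1.1 `statuses/lookup` endpoint, while the `fetch` callable is a placeholder for the caller's own authenticated request code, not part of any real API.

```python
from itertools import islice

LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def batches(ids, size=100):
    """Yield tweet IDs in batches of `size` (the lookup endpoint's maximum)."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def hydrate(ids, fetch):
    """Rehydrate tweet IDs into full tweets.

    `fetch` is a caller-supplied function that sends one comma-separated
    batch of IDs to the statuses/lookup endpoint (with the caller's API
    credentials) and returns the list of tweet dicts Twitter sends back.
    Deleted or protected tweets are simply absent from the response."""
    for batch in batches(ids):
        for tweet in fetch(",".join(batch)):
            yield tweet
```

In practice, researchers typically use an existing tool such as twarc (e.g. `twarc hydrate ids.txt > tweets.jsonl`) rather than writing this themselves.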
We as librarians use Social Feed Manager to create collections, but as mentioned, we also invite students, faculty, and researchers within GW to create their own collections in Social Feed Manager, and they can download the data from their collections at that point in the flow. Of the collections that we as librarians have created, we post the ones we feel would be of interest on TweetSets, where users can derive subsets; if they're on the GW network they can download the full data sets, and if they're not, they can still download the identifiers. Of the data sets we post on TweetSets, we've chosen to publish a subset on Harvard Dataverse. That's just the identifiers, of course, but we get a digital object identifier there, and we're seeing plenty of public usage, which Laura will talk about more as well. We meet with each researcher as part of their entry into using Social Feed Manager or TweetSets, if that's what's needed. We conduct workshops, and we've been invited to guest lecture several times in GW courses, including first-year University Writing Program courses on topics like fandom audiences in the age of new media. We've spoken to data analytics and text mining courses, because they find social media data to be excellent material for learning those techniques, and we've lectured in political science courses. And of course, other services we provide include helping with research collaborations. A bit about the nature of the content we collect: here you see a variety of topics. Our largest set right now is on the coronavirus, tweets related to that. We make a point of consistently collecting around major US elections and government accounts, which is something of an emphasis at GW, being in DC, along with events such as hurricanes and other events that we've found to be of interest to our researchers.
Now, there are more data sets you can view in the TweetSets application, which we'll link at the end. And this is only what we as librarians have chosen to collect; we have a very thorough list of all the different topics that our students, faculty, and researchers at GW are collecting, and it's quite diverse. To get a picture of the usage of Social Feed Manager and our data sets, and how they've been used, we've been tracking usage with several approaches. We keep a running spreadsheet of GW SFM usage, and we're seeing usage across campus departments, from business, international affairs, and computer science to religion and even the medical school, so we know we're reaching campus broadly across disciplines. This tracking of support interactions also helps us share support responsibilities as a team. This is a screenshot from TweetSets; TweetSets keeps statistics about which source data sets are downloaded and in what format, whether CSV or tweet IDs, broken down by on-campus network versus non-GW users. Some of the most popular data sets have been the news outlets data set, the coronavirus data set, and the climate change tweets, as well as numerous elections data sets. We've recently added Google Analytics to TweetSets to get a better sense of how people are finding it and where they are in the world. This is an example of one of our data sets up on the Dataverse platform, the data repository where we've posted the tweet ID data sets. Dataverse automatically collects download statistics, which is helpful at a high level in understanding reach. While we don't have more detailed usage data from Dataverse, we do get roughly weekly emails from someone who has come across the data sets and has a follow-up question.
Sometimes they're requesting access to the full tweets, which we can't share under Twitter's terms of service; other times the questions are about the collection criteria, the collecting process, or gaps in the data, or they're wondering when we might next update the data sets. We include a lot of this kind of information in the data set description, but data set users often want further context. We've heard mostly from grad students and faculty from around the world: the Netherlands, Germany, South Korea, Australia. The point I want to make is that posting data sets is not the end of the process; you should expect to provide further support and communication. What's particularly interesting is tracking how the data sets support published research. These are some example citations from recent research published by GW faculty and researchers. Sometimes we hear directly from GW researchers that they've been successful in getting their work published; other times we come across the research much later, after our interaction, as their work has made it through the publication pipeline. Here, for example, are citations of work by researchers beyond GW who used one or more of the data sets we posted. Because we don't usually hear from those beyond GW who make use of Dataverse or TweetSets data sets, we monitor for publications that mention our site or data sets: we have a few Google Alerts going on relevant keywords, and we also periodically do some database searching. This all feeds into a group Zotero library. Researchers aren't always consistent in how they cite data sets, so this kind of tracking has a manual aspect to it and does require some time and effort. However, understanding the reach of SFM and our data sets helps keep us and our leadership informed as we discuss this program and decide what resources to put toward it.
So this also helps us have a better understanding of where topical interest in the data sets lies and what we might collect in the future. Finally, it makes the point that this data is proving to be reusable: we know it's being adopted by researchers here at our institution and worldwide. As we meet with most users of Social Feed Manager, we get to participate in the initial planning of their projects, so we're supporting researchers early in their work as they consider where and how to get their data, which research questions they might pursue, and how they might do their analysis. While some of the faculty who use SFM are experienced social media researchers, others are new to working with Twitter data, so they may find SFM useful for exploratory research, trying out methods and data that are newer to them. Students who are looking for social media data are often working on capstone papers, dissertations, or undergraduate honors theses, so they're digging into this data for what is often a multi-semester project, and learning some new methods as well. It's great to position the library early in this process. It has also opened the door to participating more deeply in research on campus. As an example, Dan has been involved in work to analyze tweets by candidates for US Congress, which aimed to examine the claim that COVID-19 was a misinfodemic by looking at URLs in tweets about COVID. He facilitated the data workflow around this large data set, collaborated on data analysis coding in R and Python, including the data visualizations, and was a co-author on the paper. I've also been part of conversations and work by the Program on Extremism and the Institute for Data, Democracy and Politics here at GW. I was recently involved in helping an American Studies professor track the origin of a Facebook meme, which you see on the left, relevant to the argument of her paper.
She connected with the library because of our social media consultation service. I worked with her to use reverse image search, web archives, and Social Feed Manager to confirm some Tumblr metadata around the meme. It turns out the origin of this meme was Tumblr, not Facebook, and the faculty member was then able to speak to the woman who originally created the meme and interview her about it. So it was a really interesting research support activity that came out of our involvement in this space. There's a broader social media data community that includes librarians, archivists, developers, and social media researchers, and developing expertise in this area in the library has also provided an opportunity to share with that professional community beyond GW. We've provided workshops through CEDWARC, the IMLS Laura Bush 21st Century Librarian-funded continuing education project to advance web archiving, and given some webinars for the Digital Preservation Coalition in the UK. The Archives Unleashed project is doing amazing work to develop platforms and educational resources for web archives users, and we've contributed some data sets to past datathons they've run. The Documenting the Now community is made up of people from all facets of social media archiving and research; they're doing great work leading development of software and other resources to support ethical approaches to this work. So providing both the tools to collect social media data and the services to help users be successful working with the data has led to benefits for campus researchers and for the library. For researchers, most obviously, they get the data they need for research questions from a range of disciplines. They get free access to otherwise unavailable historical data. We're opening up access to data to users of all technical backgrounds.
So regardless of their role at the university, whether they're first-years, grad students, or faculty, SFM and TweetSets help them get data. Students get hands-on experience learning about data collection considerations, and faculty now have data sets easily available for teaching data skills to students. And finally, faculty can move into research areas that may otherwise have been difficult to pursue. For the library, this work positions the library in a unique way in the university environment: as a data creator and as a data publisher. We're modeling good practices around publishing data that is well documented and supports reproducibility and other FAIR principles; the data is findable, accessible, and in common formats expected by researchers, such as CSV. We're leading in collecting data sets of cultural value as events of significance occur, whether it's an election, a natural disaster, or a campus event, and archivists and special collections librarians are key partners in this work. Social media data comes with important privacy and ethical considerations, and we embed these concerns in our conversations with students and faculty, so the library becomes part of the conversation on campus around data ethics and privacy. Librarians and developers get firsthand experience with research support using larger data sets, and that helps the organization build relationships in this area on campus. This work can lead to opportunities for library staff to become more deeply involved in research projects, to support grant-funded research on social media data, and to use our related skills working with data as partners in these kinds of projects. This type of work also has its challenges. We are a small team, essentially three developers.
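As a sketch of what delivering data "in common formats expected by researchers, such as CSV" can involve, here is a minimal, hypothetical example of flattening a few fields from a tweet's JSON into CSV rows. The field names follow Twitter's v1.1 tweet object; the choice of columns is our own illustration, not SFM's actual export code, which includes many more fields.

```python
import csv
import io

# Columns chosen purely for illustration.
COLUMNS = ["id_str", "created_at", "screen_name", "text"]

def tweet_to_row(tweet: dict) -> list:
    """Flatten one (v1.1-style) tweet JSON object into a flat CSV row."""
    return [
        tweet.get("id_str", ""),
        tweet.get("created_at", ""),
        tweet.get("user", {}).get("screen_name", ""),
        # Extended tweets carry their text in `full_text`.
        tweet.get("full_text") or tweet.get("text", ""),
    ]

def tweets_to_csv(tweets) -> str:
    """Render an iterable of tweet dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(COLUMNS)
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))
    return buf.getvalue()
```

The nested-to-flat mapping is where most of the judgment lies: which of the dozens of nested JSON fields a spreadsheet-oriented researcher actually needs.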
And that's on top of several other software development projects, responsibilities like teaching workshops, being available for research consultations, and other duties. We find that all the activities we've described, assisting researchers at GW with using SFM and TweetSets, and of course with the data itself that they get through those systems, take time and effort, not to mention the inquiries we receive from external researchers about our data sets and about the software. Although the bulk of our time is, as it probably should be, focused on serving our immediate community of GW researchers, it's still quite time-intensive. If you go to our GitHub repositories for SFM and for TweetSets, you'll see that we have plenty of tickets for feature enhancements, fixes, and other issues. Maintenance may not be flashy, and it takes time and dedication, but it's an ongoing, necessary investment for us to be able to continue providing support and data to the many researchers who depend on us. With not only the software maintenance but especially the services we provide, ensuring that the time and effort are both visible and measurable can be difficult. We spend considerable time, as Laura described, keeping records of exactly who we help and on what project, but even that is more of a list, and it's still hard to quantify the hours spent providing that value. Laura also described what we do to try to capture the end effect and impact of our assistance in terms of research output. Many of the projects that we support don't ultimately turn into publications we can cite, but we've still contributed to the education of the students we work with. And although we, for our part as data providers, do our best to educate the users we're in contact with about appropriate use of social media data, there are limits to what we can control.
So we've had to ensure that the overall activity of providing the service and collecting the social media data sets is something our institution is comfortable with from a legal and policy standpoint, given the privacy issues inherent to social media data. Another consideration is the fact that the Twitter API is evolving. That evolution provides opportunities, with whatever new affordances come out, but we also have to keep up with changes that might impact the parts of the API we're already using. An important consideration is that TweetSets itself, with its ability to let a user generate custom subsets of each data set, can be quite computationally intensive. The data sets are very large, and generating subsets from them can take hours or in some cases even a day or two. We've made some improvements, but we're providing that as a public service that reaches even beyond GW, and we need to balance the value of providing that public service against the fact that it runs on our limited infrastructure, for free. In terms of a few ideas for future directions: especially over the past year, we've started to see more interest from software developers at a few other institutions already using SFM, who propose improvements and submit pull requests for the SFM software, and some of them have expressed willingness to participate in a potential community software development sprint. We've taken part in sprints like that for other open source software, and we feel that that model, where other adopters of SFM are truly stakeholders, might be a good way to improve the long-term sustainability of the software that really is the workhorse: it collects the data that flows through and forms the substance of all the other services and research output that we enable.
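To give a sense of why generating custom subsets is computationally intensive, here is a minimal sketch, not TweetSets' actual implementation, of filtering a line-delimited JSON tweet file by date range and search terms in a streaming fashion. The date format is the one used in Twitter's v1.1 `created_at` field; everything else is our own illustration.

```python
import json
from datetime import datetime
from typing import Iterable, Iterator, Optional

def matches(tweet: dict, start: Optional[datetime], end: Optional[datetime],
            terms: Optional[list]) -> bool:
    """Check one (v1.1-style) tweet dict against a date range and search terms."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    if start and created < start:
        return False
    if end and created > end:
        return False
    if terms:
        text = (tweet.get("full_text") or tweet.get("text", "")).lower()
        if not any(term.lower() in text for term in terms):
            return False
    return True

def subset(lines: Iterable[str], start=None, end=None, terms=None) -> Iterator[dict]:
    """Stream over JSON-lines tweets, yielding only those that match.

    Streaming keeps memory use flat, but every request still scans the
    entire file, which is why subsetting very large collections takes
    hours unless the data is indexed up front."""
    for line in lines:
        tweet = json.loads(line)
        if matches(tweet, start, end, terms):
            yield tweet
```

The trade-off this illustrates is the one we face in production: per-request scans are simple but slow, while maintaining an index shifts the cost to ingest time and to the infrastructure the index lives on.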
Another potential idea we've had, as a library, is to start thinking from an archival perspective about accessioning some of these data sets, perhaps including them in our ArchivesSpace instance, and really viewing them in that way. Last but not least, we'd encourage you to reach out with any thoughts, questions, or ideas you have; we have an email address for this project, sfm@gwu.edu. The two main links we'd like to leave you with are the link to our Social Feed Manager project site (not the software itself, but everything about the project) and TweetSets, where you can view and even download some of the data we've posted. Thanks for the opportunity to speak with you, and we hope to see you in person at future CNIs.