Good evening everyone. Thank you for joining us this evening here at Carnegie Mellon University. My name is David Scherer. I'm with the University Libraries; I'm the scholarly communications and research curation consultant. This evening we have our first of many Open Access Week events: "Open Data as a Teaching and Learning Resource," a panel discussion with open data faculty champions. So first off, I'd like to say thank you to our panel for joining us this evening, and thank you to my colleagues at the university libraries of Duquesne University and the University of Pittsburgh for co-hosting this event. I'll now turn it over to our panel moderator, Bob Gradeck, from the University of Pittsburgh. Bob is the project manager of the Western Pennsylvania Regional Data Center at the University of Pittsburgh's University Center for Social and Urban Research. So now I'll give you to Bob.

Thanks. Glad you're here today. So we're gonna talk a lot about stuff that I deal with every day, and that's open data, and it's gonna be really great for me to listen to what our panelists have to say, because we're also really trying to understand a little bit more about how to bring open data into the classroom as part of the project that we have. So we've got some great panelists here, and I'm gonna introduce them as we go, looking at the front from left to right. We've got Dr. Christopher Warren, associate professor of English from Carnegie Mellon. Next we've got Rebecca Nugent, director of undergraduate studies and teaching faculty in the Department of Statistics, also here at Carnegie Mellon. Then we've got Dr. Gibbs Kanyongo, director of the M.S.Ed. program in educational studies and associate professor of educational statistics and research at Duquesne University. And finally, on your far right, Dr. Jamie Booth, a colleague that we work with a lot at Pitt, assistant professor of social work. So glad to have you all here today.
And I think what I'm gonna do is sit down in a second, but the lead question is really just for each panelist to go ahead and, in about 30 seconds or so, give us a sense of how they've actually used open data in the classroom. So, Chris.

Is this on? Can folks hear me? Good. So it feels just a little odd for a humanities professor like myself to be talking about using open data in the classroom. In fact, the word "data" itself is one that I have had a hard time sort of using, and the humanities have had a hard time using. However, I started my career as an early modernist focusing on the literature of the 16th and 17th centuries. In the last five to ten years there has been a deluge of linguistic data and historical data that my colleagues and I have really been trying to figure out how to work with, and teaching has been an incredibly important context, both for us to learn about how students might themselves use some of this data and also to sort of try to figure out what questions we can ask. So most of the books published in English between 1480 and 1700 have now been digitized in one form or another, and students in my classrooms regularly consult the digital editions, oftentimes consult the full-text transcriptions, and are asking big questions with deep import for the stories we tell about literature, history, theology, and culture. So, you know, that's the 30-second version.

Thanks, Chris.

Well, thanks again for having us. It's wonderful, and I'm looking forward to hearing everybody's perspectives as well as all of your questions. So as director of the undergrad program in statistics, we use data of course all the time. I have kind of the opposite relationship with data that Chris does, in that I think about finding open data sets every single day and developing curriculum for our program.
In fact, one of the hallmarks of the undergrad statistics program here at Carnegie Mellon is that we don't rely on textbook data sets, and after the intro levels we don't actually have textbooks. I mean, it sounds odd, but most of our work in the classroom and research projects for our undergraduates are all driven by real interdisciplinary research problems that exist, that are being done in our department or by other people that we know, or that have been posted online where we can read the papers and get the data, etc. And we take those and create products for the classroom, in a way. So how do we take a real research project and turn it into something for the students to be able to do? That relies extremely heavily on open data access, so that we can get all the materials that we need and create an appropriate project, and since we have hundreds of students in our statistics program, we find ourselves doing this quite frequently. So that's kind of what we're doing with respect to open data.

Good evening. Yeah, so for me, I'm based in the School of Education at Duquesne University, and I'm responsible for teaching graduate courses in statistics and research methods, and again, you know, by nature in statistics you need data to teach students the different statistical analyses. So one of the things that we have been encouraged to do, actually, I would say in 2010, I was fortunate enough to be one of the fellows sponsored to go to a training on how to use large-scale data sets to teach in classrooms. So I, among like 24 other fellows nationally, went to Stanford for a week-long training, and we were exposed to large-scale data sets that are collected by the National Center for Education Statistics, which is a federal agency. So they sponsor these large-scale data collections, and they encourage professors to use these large-scale data sets to teach in the classroom.
So that's how I got introduced to large-scale data sets. Ever since, actually, I've been using these data sets to teach, like, you know, some abstract statistical concepts, because, you know, when you teach, for example, a sampling distribution to students, for them to understand exactly what you're talking about might be difficult, but with these large-scale data sets, it becomes like a population. It's a sample, but you can use it as a population and you can draw samples from it, and then, you know, it can make sense to students. So in that regard, it has been very useful to use these large-scale data sets for instructional purposes.

Hello. So I teach a community and organization class in the School of Social Work to macro practitioners, and we use open data to do community analysis and assessment. So we pair access to city-wide data about social services and about populations, looking at the change over time, to make inferences about specific neighborhoods, and we pair that with things like windshield surveys and going into neighborhoods and talking to people who are working in social service agencies or community organizations, or residents, to try to paint a holistic picture of a community or a neighborhood. And then we try to move the students, based on their understanding of that data, into thinking about what would I do as a social work practitioner to address the needs that I have assessed, both using some of the qualitative data that they collect, but also some of the data that is available through the City of Pittsburgh that is open. And so that's how we've been using open data to inform practice: informing what we would think about as an appropriate way to approach a community or address certain needs that practitioners are able to identify.

Great. And I think one of the things that would benefit everybody up here would be just a show of hands: how many of you are librarians in the room?
So we're about half librarians, half non-librarians. Anybody else want to ask questions of who's in the room? Okay, that's great. So the first question I'm going to throw at the group here is one that we've been really involved with in some training that we've been doing with the Carnegie Library of Pittsburgh. We've done a lot of guest lectures around open data, coming to different classes, and we've noticed a whole wide array of data literacy skills among the students: you know, you come into one class and you've got a ton of experts, and you go into another class and you have to explain what data is. I wonder if you could assess that based on your work, and it's neat that we have such a wide array of practitioners here, from statistics across other disciplines too. So maybe we could talk a little bit about that for a few minutes. Who wants to take that first? How would you assess your students' data literacy? What do you think they do well, and what do you think they still need when they walk in the door the first day of class?

I'll go ahead and start, partly because I suspect my students might be some of the least data literate, although I don't know that for sure. But that's actually, for me, one of the most important reasons for teaching with open data. So my students are very good at thinking about our objects as texts, as books, as arguments. They're less good at thinking about our objects as data. They might think of the things that we study as, you know, things that they access via computer, but they're not as literate as we might hope about sort of thinking about the way that these data sets, you know, these books, sort of became the corpus in which we found them, the ways in which they were assembled in this way rather than that way, and also the questions that we might ask once we start to think of them as data.
So part of the teaching that I do is to sort of ask students, you know, what they think humanities data is, and where can we find humanities data? In many cases it's the first time students have been asked these questions, but it's also a kind of illuminating moment, I think, for them, as they start to travel down the intellectual journey of wondering what kind of questions they can ask and what kind of answers they might be able to produce once they start to think of these objects as data.

So we can go in order again, if you want. One thing too, we could maybe think about: if we're going to provide our students with skills, what would we put in the curricula? That's maybe another way to phrase this whole question.

So, just at the other end of the spectrum, I would think that the statistics students that we have, the students who already know that they want to be statistics majors when they start our program, consider themselves fairly data literate. Their definition of data literacy might not be exactly what we would consider data literacy as professors, but they feel sort of comfortable around data. Now, we also provide introductory statistics courses for all different kinds of majors in the Dietrich College of Humanities and Social Sciences, of which Chris's department is a member, and also all over the campus. So we certainly do see a wide range of sort of levels of data literacy, and I think Chris's description is a really good one: the notion of what data actually can be is kind of a new one for our younger students. With respect to the statistics students, what we find for data literacy is that they're not very good at filtering what is useful data or good data versus what is not.
So there's this idea now, right, that we could just have data streaming off the internet, just grabbing everything, and, you know, we'll store all of it and we'll run crazy models. But they have a hard time with the idea that you need to find the good data. And so how do you define good data? You need to find the important features and extract information from that data set, and a lot of data that exists is garbage with respect to the problem that you're working on. So teaching those kinds of skills: not data literacy so much in thinking about what data can be or how we might get it, they're actually pretty good at the getting data. It's trying to discern what would be good data and what would be bad data. And bad data just means it doesn't have information in it; not that it's corrupted or wrong, but just that it's not useful for the problem that we're working on. They have to kind of learn that filter.

So for us in the School of Education, one of the challenges, actually, is that most of our students are not statistics majors, but they are required to do statistics for the purpose of completing their dissertations. So you have students coming with all sorts of backgrounds in terms of their quantitative skills. So that's actually a challenge: how do I help these students understand, when we talk of data, you know, what's meaningful data? And luckily, actually, one of the data sets that I use is the Educational Longitudinal Study of 2002, as I said, collected by the federal government. You know, these are complex data sets with, like, you know, some complex sampling designs. So for students, obviously, they are going to have some challenges in making sense of this data. But luckily, though, usually these data sets come with codebooks.
So there's a codebook that comes with the data, explaining each variable that is in the data set, how it was measured, I mean, the definition of each variable. So in that regard, it helps students. I found that to be very helpful for students to understand: okay, when we say we are looking at this variable, what is it? How was it measured? What is the scale of measurement of the variable? So in that regard, the use of codebooks helps students to have a better understanding, or to improve their literacy of the data. But again, you know, I'm glad you raised that point, because it's something that I struggle with. As I said, our students have a wide spectrum of skills. Some will get it right away, but others you kind of have to walk through, and also, again, help them understand what is meaningful data, right? It has to be something that is appropriate for trying to answer your research question. So you might have this research question and you might have this data, but it might not be meaningful data because it's not appropriate to answer what you're trying to find, you see? So again, making those connections helps students understand the nature of the data and improves their quantitative literacy.

Yeah, so social work students are more like humanities students in that they don't get any really formal statistical training and don't have a lot of comfort with data. And so I think a lot of my students look at the open data part of the assignment and think, oh, I'm just going to check this off the list. This isn't going to be meaningful for me. I'm not going to be able to get anything from it. And we're not even asking them to do statistics. We're just asking them to look at some averages and look at some changes over time. We're not asking them to do anything complex.
But what they walk away with is a better appreciation for what those numbers can tell them and what that can add to other methods that they might feel more comfortable with. So they might feel more comfortable talking to a community group, because that's more within the skill set that we're teaching them, but adding this piece to it, they appreciate it more at the end. But they are intimidated by it, and they don't know how to navigate the websites to access the data. They don't know how to navigate FactFinder on the Census Bureau's site. And so a lot of what we do is just working through those sites to locate the data that they need in order to understand what they need to understand about the neighborhoods.

Thanks. So, Rebecca, you brought up the point of good data versus bad data, but in the work that I do I also think about data with context and data without context. How do you bring context to the data? Understanding the processes that were behind collecting the data, why the data was collected, and the budgets behind data collection, which matter a lot with public data. So how do you bring that context into the classroom? This isn't necessarily a question just for you, but you brought up a point that I want to get into.

Our most successful projects that we create for our classrooms have several pieces. So there would be information available for a data set that could be easily distilled by, say, a professor or a research group. Could we take this paper and essentially talk about the types of things that Bob just mentioned? So how were the data collected? What's the mechanism behind it? Why did we do this in the first place? What was the application? And we provide the students with maybe, say, a one- or two-page digest that summarizes that. And then we have the data set, of course, and usually, depending on the level, we've done some amount of cleaning. So it doesn't matter if the data set is totally cleaned up online, for example, in open access.
When I get it, I need to clean it and manipulate it at a high level for my younger students, and for my more advanced students we can give them kind of the messy stuff. So having that variety is good. And then there's also the codebook that Gibbs referred to, having these variable definitions. Very often I need to take those variables and turn them into something that maybe the students are more comfortable with, so having something that's very well defined and clear is good. And then, if there's the original paper, we can point to it. So we don't hide from the students that these are coming from real research problems. We point to the papers, we point to the results, we point to what other people have done with them. We want them to go and read, not necessarily understand the papers, but if they want to go learn more about astrostatistics, for example, or learn about the American Community Survey that's just coming out of the Census Bureau, or that type of thing, we want them to have some resources. But we don't expect them to become experts, because we change the topics every few weeks, and that's where the one- to two-page summary is really nice. They're able to be successful in the statistical modeling and handling of the project while being given enough information to get the context. Now, if they were doing a thesis, for example, for the whole year, we of course would expect them to have a much higher degree of understanding of the problem. But if it's just a class assignment, we give them a little bit of information. That should give them enough that they can try to make some smart decisions on modeling, and we sort of set them in the right direction, and then of course they can look for other resources or ask us if they have other questions.

Great. Anybody else want to chime in on how you bring context to the data in the classroom? I know, Jamie, you do a lot of that. I'm not putting you on the spot, but I'm putting you on the spot.
Did you roll your eyes like Liz? No, that wasn't a no. How do we bring context to the data? So I think it's like what I've been talking about, where we do triangulation-type stuff. We don't do a lot of digging into the sources of the data, because I tell them what are reputable data portals to use, what is acceptable and what is not acceptable as far as quality data, because I will get students who will just Google the neighborhood, and they'll come to me with data from a realtor's website about the crime rates in the neighborhood, or it just isn't good. It's bad data, and so the way that I check them on that is we go to the source, what is a good source, and we match it. We say, is this matching up? Why might that not be? Which one can we trust, and why? So we do some of that stuff, really checking to make sure what they're seeing and understanding is congruent with other things. That is how we really give context. I mean, we're asking our students to really become experts in a geographical area, which is kind of a different way of doing things. So they have to go to the Carnegie Library and look into the history of the place. They have to, like I said, talk to community folks, and then they have to supplement that with this open data. So that provides context.

This is a question that really excites me as a humanist, because context is what my students are really good at. And so when we start to talk about data and open data, again, the hurdle there is significant, but once we do, this is often the first question that students ask: well, this is an interesting data set, but who made it, and why? And I find that there's a sort of famous term in the humanities, the hermeneutics of suspicion. That is to say that humanists are sort of suspicious by nature. And so they sort of train this hermeneutics of suspicion onto these new objects.
And so the kind of work that they do once they start to think about a data set might be sort of data analysis, but it might just as easily be a kind of historical approach to how this data set got made, which is incredibly valuable for some of the reasons that my co-panelists have mentioned. Feel free, if you want.

Yeah, so, you know, for my students, one of the projects I require them to do is to go out there and look for appropriate data that suit the project that they are working on. So for this class project I'll say, okay, whatever data you obtain out there, first of all it has to be relevant to your research questions, but you also have to describe to me the context of those data: under what conditions or circumstances were the data collected. So they have to understand the variables that they are using, the data they are using, the sampling designs, everything; under what conditions were those data collected. So that kind of forces them to have an understanding, not just to take data and run some analysis and report the results, because you can do that with any data, but to go to the source to have a better understanding of how these variables were measured, how the sample was selected, and also the definition of those variables.

Yeah, so I want to get to a lot of the effective data management practices in a second, but a lot of you talked about finding data, and this is something that I'm really interested in. Oh, that's fine, it's just a blank. Discoverability of data. You know, I know some of you encourage students to check out some sources. How do you, you know, where do they look? Where do they get the information about where to look for data? Do you think they have the skills for that or not? What do you think the folks in this room can help fill in in terms of data discoverability? And, you know, do they make use of guides?
Do they just go on and Google everything? It's kind of interesting to see how your students approach that.

So for me, I use this, the actual name of the website is like the data set library, something like that. It's a website where they have all sorts of data sets, again collected under different contexts. And actually one of the main purposes of those data is that they're encouraging people to use them for things like class projects or assignments. So I provide them with that as a starting point, to say, okay, I want you to start here. These are all sorts of data sets that are listed; there's the name of the data sets and then a description of those data. But I tell them, this is just a starting point. You know, maybe what you're looking for is not here, it's not on this list; there are so many data sets out there that you have access to. And the one that I actually use specifically, the Educational Longitudinal Study, is one of those where, as I say, the challenge is accessing those data: unless you spend some time and have kind of some special training, not special training, but some familiarity with those data, it might not make sense to you at all. So I give them kind of the leeway to explore, but also give them some guidance as to where to start and then how to proceed from there.

Are we still talking about discoverability? Yeah, okay, it's awful. It's awful. You spend hours and hours and hours and hours trying to find the perfect data set and the perfect problem to demonstrate whatever it is you're trying to show in your class. That's the problem. It's not that there aren't data sets out there, that there isn't sort of, you know, an incredible wealth of interesting problems.
It's when you have graduate students working on research topics of their interest, or students working on a thesis project, something where they can go and search for things that are maybe a little more open-ended, then yeah, it's a little bit easier to do that search. If you are interested in trying to find a data set that shows a very specific type of issue with one kind of experimental design, or a very specific type of issue with one kind of sampling design, you know, when you're actually searching for it as a teaching mechanism, that's terrible. All the professors, we always laugh and talk about, you know, does anybody have any data that they can lend? So in our department, for example, as we've been finding interesting data sets and discovering what features they have that are interesting as teaching tools, not as research, but as teaching tools, we kind of have a repository that sort of exists on my laptop, and we all share them. So we share data sets across the department; you know, if somebody finds something good, we let other people know about it and use it. If there were some kind of repository that existed that was about, these are open access data sets that have these types of features, for example, they're really good ways to demonstrate X, Y, and Z, and it was like a database library, or something like that, where you could say, I'm really interested in looking at, so Chris does a lot of work in networks, I'm interested in looking at some network analysis, oh, these are great data sets that show those types of things, or I'm interested in doing this and that, that would be fantastic. That would help people, including myself, quite a bit. Yeah, that's something that we complain about all the time at statistics conferences, how much time people spend trying to find perfect data sets.
So we should have some kind of sharing mechanism so that it speeds all that up, yeah.

So the data sets that I find useful are even more difficult, because they're place-based, and so a data set with a geographical identifier is gold and really difficult to find, and then if we're talking about students getting into a very specific place like Pittsburgh, it's even more difficult. So a lot of the things that we're looking at are trying to get access to DHS data, or crime data, or some of these types of things, which is becoming better; thanks to Bob and folks at UCSUR, it's becoming easier to access those, but the specific challenge of local data makes that even more difficult.

I want to jump in, because it's a really, really good question, one that's really near to me as well. Researchers generally go to one of two places to find data: they either go to Google or to a domain repository with which they're already familiar, and if it's not in the latter, they'll go to Google. But Google has limited reach, because as we all know, Google only indexes HTML; it doesn't index the variables, and it doesn't index anything in a PDF, like the codebook. It just indexes what's on the HTML website. So students will often go to Google; when I ask them, where did you look for the data set that you're looking for, they usually say, I went to Google. So there are options that are out there. There are places like DataCite or SHARE where you can do some searching for additional raw data, but it takes extra work, and they're a bit hidden, a bit part of the tradecraft, as opposed to general knowledge that's out there. The Research Data Alliance has a couple of working groups that are set up to try to solve this problem and improve interoperability between data repositories, but my sense is that it will be a long time before that actually takes place.
So I don't have anything more to say than just that it is really difficult. I'm going to teach a couple of workshops in November here at CMU on the tips and tricks for finding raw data: where to look, where some of those portals are, the repositories and so forth that people don't really know are out there but that actually have good data. There was one case, for example, where a student needed a raw data set that showed daily water use, because they had an algorithm they wanted to run against this data set, and it had to be daily water use in a metropolitan area. And we found it. We found a CSV that was put together by a town in Yorkshire, England, on a site called Data Hub, which is where a lot of governments and local governments will put their data sets. But I only knew that because it was just one of a few places I decided to run the search, and it was finding a needle in a haystack. So that should be easier, but at this stage it's just not.

Thanks. Anyone else want to jump in on discoverability?

Yeah, I might just say that in the humanities, I mean, I'm increasingly teaching in a field that's known as digital humanities and paying more and more attention, but we don't have a sort of data portal for humanities data whatsoever. A student who's interested in humanities data is more or less at the mercy of two or three completely arbitrary websites that collect sort of random data sets. Having said that, the key distinction in the humanities that I would make is between unstructured data and structured data. There is just an enormous amount of unstructured data, by which I mean the text of books, the text of the Google Books corpus, which is available for text mining via HathiTrust and via other means; the number of structured data sets is very small. And so one of the challenges for us in the humanities is that there isn't a big push to sort of create structured data sets.
So the work that my students oftentimes have to do is to figure out what kind of questions they can ask of unstructured data, so just kind of raw text, digitized text. I think one last point in that regard is the kind of critical role of copyright. I mean, one of the reasons that I'm in digital humanities in the first place is because I specialize in 16th and 17th century books, which is to say books not covered by copyright. So students who are specializing in 20th century literature or 21st century culture, that sort of thing, are actually at a distinct disadvantage, because they can't do the kind of large-scale machine learning sort of text queries that we can do when we work in the 19th century, the 18th century. But again, the skill set in the profession is incommensurate with the incredible range of questions that people can ask.

I just wanted to quickly add one thing that is helping us in statistics: there's a very large push among all the research journals for reproducibility. And so anytime anyone's publishing anything, whether it's a method that they're trying on several data sets, or it's just an application-based paper where they're concentrating on one applied problem, they're required to post, along with the paper, all the code that they used and the data sets, for example, which gives you kind of a great jumping-off point in some ways, because you can see the analysis that they did and you can see how they formatted the data. I can't tell you how many times I've found a really cool data set, opened it up, and seen that the format was going to take me a couple of hours to convert, and honestly I just didn't do it, simply because I didn't have those two hours. But some of these journals and repositories are requiring people to put them in specific formats, and so that's helping as well. And that just ties to this unstructured versus structured. So the reproducibility is helping us.
The unstructured versus structured is a big problem. That's simply, again, time. It's simply time. If those problems can get solved, or if access could be easier to these more structured data sets, we could of course do much more with those data sets. I'm going to throw a question out there. I think we have a question in the audience. Oh, I'm sorry. One second, let me bring the mic out. Tell us who you are. I'm Ethan Pullman and I'm a liaison for English, philosophy, and modern languages. I also coordinate instruction. And listening to you talk about the problems, both in terms of finding data, but I think you also mentioned that the way the students think about the problems is also one of the aspects of teaching with open data, made me kind of think, and I wrote this down here so I can remember it: have you done any collaboration with librarians in terms of embeddedness with the courses that you teach, and in what way have you done that? And based on that experience, if you've had it, what kind of things are you looking for librarians like us to help with? Because one aspect of our job is to also do teaching, and just listening to everyone's experience, there are a lot of ways that we probably could contribute. We always have our class go to the library and have a librarian give a presentation on the different data sources that they can use to get the information that they need, and then also show them the Pennsylvania Room and how to discover information in that place, which is different than open data. But we do have our students go to the library for one of the classes and interact with the librarian and get all that information.
That question makes me think that's probably something I should be doing, because honestly I never thought of it that way. The only way I utilize the library, my collaboration with the library, is to help my students in my research courses in terms of utilizing the library databases, but I never thought in terms of helping with issues like data accessibility or data literacy, to have that collaboration with the library, and I think that's something I would definitely think about. So I usually have my classes go to the library every semester; we have a session there, and the skills that we focus on are mostly about how you use the library databases to do some research, say, pull out some articles from peer-reviewed journals and do some critique of those existing articles. And we've had a collaborative session about systematic review of research, not necessarily the utilization of data, but mostly about doing research. So I think that's something I would think about, definitely. We don't have something required, but I think it would be a great idea to have some kind of workshop where the students would go and, referring back to my earlier remarks about the students having difficulties discovering what counts as a reasonable, meaningful data source and what doesn't, that would be a very useful workshop, or something of that nature, to have at the libraries. We actually are getting ready to start a project, so to speak, with the Carnegie Mellon Libraries and the Department of Statistics. There's going to be a post-doc hired this year to start a two-year fellowship, working in the libraries on data visualization and curation but then also doing statistics projects with us. Lisa Linsky's working on that. So we're getting ready to start kind of a tighter connection in that area. And then to your second question about what would be helpful.
And Lisa and I have had some conversations about this too. I really believe in more posting of, instead of people just posting, like, a thesis that they've done: what else did they do? Where did they get the data? What kind of code books did they use? In some ways these repositories, these databases that exist in the library system, need to have more pieces, I guess I would say. It's not just about the written document; it's all the work that went into how that written document got created. It's just extra information, but it would be incredibly helpful to be able to use all of that thesis work in other places, like in the university, in the classroom, or for other people's research, etc. So kind of expanding the notion of what we should be archiving and saving. I'm saying keep it all. In terms of the dissertations, there's no requirement right now for students. I think it's probably very dependent on the university, is my guess, but I can't speak for what is happening in other universities. Yeah, I think it's very dependent on the university, and I think that you can just kind of say here's the PDF of the thesis, and then they go hit the bar and, like, light it on fire. And they don't think about how they need to give us more information, and I think if it were a requirement that you need to provide, like we said, the code, the sources, the data sets that you used in addition to the thesis, that kind of stuff would be helpful in moving it in that direction. Because actually a few colleagues in the room and I are working on a project that's trying to do exactly that: outreach to students who are required to deposit their dissertation, encouraging them to think about sharing the data. So that's really great to hear. I guess, because it's not a requirement, incentivizing that is a bit of a challenge, because it takes time; as we know, we're talking about how data can vary in terms of its quality.
So creating the code books and the data dictionaries. Any thoughts about, I mean, this may be an off-channel discussion, but we'd really welcome any input about working to do exactly that. This is a focus for us this year. Are you referring to how to encourage people? No, no, to encourage students to think about a requirement, for example. Yeah, right now it's not a university requirement, so we're kind of appealing to, we're suggesting, the benefits to both them and to research, and appealing to, like, the public good, you know, to encourage them to think about it. Well, it's a wider dissemination of their work, right? Anytime anyone is accessing their data sets or their code books or the work that they've done, they're being cited; the work is being, you know, advertised to others, etc. That is, if you're just going to talk about it as a purely personal benefit, I guess, to the writer, to the person who put it together. I mean, I personally think we should probably make it a requirement, but I think that's not what. Yeah, you know, from my experience, one of the things I observed, just having some informal conversations with colleagues, is that one of the issues is it costs money to collect data, for the most part. So people feel like it's a property that they have to be compensated for. So this is why, in education, a common practice is to include the instruments that were used to collect the data, but not the actual data per se, or even the code, because it's kind of, you know, protected property.
So I think that's one of the challenges, which again I'm not quite sure exactly how we can overcome. But of course some are free, you know, especially the larger institutions that sponsor these large-scale datasets. One of the requirements, I've seen some disclosures where they are supposed to say, for the data collected, how are you going to make those data public. And I've applied, you know, for a couple of grants, and under data management that's one of the requirements there: to say, for the data you are collecting, how are those data going to be shared with the public. So there is some effort, some movement in that direction, which I hope, again, if it's made a requirement in some regard, I think that will help address this issue. Yeah, I'm going to be working with your students somewhere, probably; they're probably going to want to share data with me at some point. And that was actually a question I had, Nora: do you ask your students to share data back, do you have them fill out metadata, do you have them create data dictionaries for their projects? It sounds like there's something there, and it seems like maybe that's a follow-up conversation. I lost my thought there; did you have anything, Chris, on this one? Just a very brief and embarrassed no, which is to say, those would be eminently useful things. I sort of wish that I were in more of a place to have students understand the usefulness of those things, but I think that's maybe a second-order problem. Changing gears a little bit here.
How do you bring ethics into the classroom? It could be open data or data in general. I mean, there are ethics even in terms of what data is and isn't available, and talking about why you can't have the DHS record-level data, or the request I got last week from a student about whether they could have household-level blood test data for lead. Do you talk about that stuff at all, and how do you do that? Well, in our classes, if that is going to be an issue, it's taught as a unit in the course. Students go through IRB training, just to learn about the IRB and the different levels, and we also have several experts in our department who work on privacy and confidentiality, and on at what level you are able to recover individual-level data. So we have quite a few examples that we can use to teach that concept in the classrooms, so that is definitely discussed. If we're just doing observational data in order to demonstrate something with a linear regression, we would not choose data sets that would require that extra level of training. But if it's pertinent to the problem, or if it's pertinent to their individual thesis research, for example, then they're taught a modular unit on those types of things. Usually, I think, it goes pretty well; they kind of get the idea. I don't know, a lot of these students say everything on Twitter and Facebook, and their social security numbers and their mother's maiden name, etc. But we work on it, and I think they do get the notions pretty well. Yeah, so similar to that, we have, you know, someone come from the IRB to give a talk about how to safeguard. So, okay, even if we are using public data, and I know this differs from institution to institution, but even if we are using public data, we are still required to go through the IRB process now. So one of the things that we do is to help them understand the difference between public data
and restricted data, because these national data sets, besides the public data, also have restricted data. So we help them understand that if you want to have access to the restricted data, you can still have access to it, but you have to go through these steps; it's not exempt. It might be an expedited process through the IRB, but you still need to go through the process. And also to help them understand that each time we talk of public data, it doesn't necessarily mean the data were ethically collected. And the example that we give: you know, this guy Julian Assange, who is releasing all these data. It's now public data, but the way he collected those data was not ethical, you see. So don't always assume that if you have public data, those data were ethically collected, or that you can use them whichever way you want just because they are public data. So helping them understand the importance of how the data were collected, and those other different pieces, also helps them with understanding the ethical considerations in their research. Because ethics is such a central part of social work, and discussed in almost every course that we do, not necessarily ethics in relationship to research but ethics in relationship to our clients, and thinking a lot about the people behind the data. When our students look at data, I would say that they err on the side of being overly ethical, which might be strange to think about: thinking, I can't make inferences about anything, because I need to think about the individual, I need to think about the context that it was created in, I can't make generalizations, these different types of things. So there aren't a lot of ethical issues with the data sets that I'm using, so those conversations don't come up. But I would say that our students, and you were talking about a skeptical perspective in the humanities, take a skeptical perspective of data from baseline, and so that's how ethics comes up in our
conversation. Yeah, I was going to say something similar, in the sense that my students sort of start from the perspective of thinking ethics; you know, I teach Aristotle's Nicomachean Ethics. I mean, it's a perspective that's in some sense kind of in the DNA of the humanities these days. So there's a lot of thought about what is okay to do with this data, and the politics that helped to create this data, which is slightly ironic, because most of the data that we're talking about is about dead people, and the ethical issues from the perspective of the IRB and so forth are non-existent. So it's interesting to hear you say that there's a sort of over-ethical approach; in many cases I feel like my students could do very well for themselves by just getting on with things and treating this data as data. So, do your students have a particular reaction when you use open data in the classroom versus other data sets? What's kind of the value of open data in the classroom as part of your work? A little bit of a tough question, but do you see a difference when you throw a real data set on the table, as opposed to something that's just made-up data, or data that fits an exercise well but doesn't have any relevance? Yeah, so I'm going to answer this question in terms of, because in statistics one of the options we have is to use simulation data. You know, there are so many applets online right now; for example, as I said earlier, if I want to demonstrate a sampling distribution, there are a lot of applets I can use online. But when I'm using, for example, open data, open-access data, that is, the large-scale data sets that are available, to show to students: okay, let's select a variable, for example mathematics achievement; we want to observe, if we take a sample from this big data set and try to look at the distribution of that variable, how does it change as we increase the sample size. So they'll be able to relate; of
course, we are talking of an abstract concept, but they can relate it to a variable like mathematics achievement, because it's actually there; they'll be able to understand that this variable was measured by recording the scores of students. So in that regard, in terms of using real data sets, I found that students can relate to the real-life situations and be able to understand, at least to some extent, what we are talking about when we talk of these variables. Yeah, and for me it increases our students' feeling of efficacy around being able to do this when they go out into the field. Because we are trying to teach them practice, and we are trying to teach them skills that they can apply when they become a social worker, the use of open data really shows them that it's available, that it's accessible to them, that this is what I should be doing when I'm out in the field. I think that if we used some type of simulated data, they wouldn't get that feeling of: this is a practice that I will use throughout my career as a social worker, or the skills necessary to work with that data. For us, we use open data almost exclusively across the entire program, so there's not much of a comparison. We might use some simulated data for a one-off classroom example, but by far and away the students are all working with real data sets and real problems. They like it a lot more; the students are far more attached to the real problems, and it doesn't even have to be a topic that they get super excited about, but one that they can see happening in practice. It's nice also to vary the kinds of data and the types of application, so they can see, for example, that some kind of methodology being taught in the classroom might be applicable to several different contexts, so they're not associating one type of model with one type of problem. A side benefit we've discovered is that it makes them a lot more interesting to interview, so companies will come and
interview them and ask them questions about the types of skills that they've learned, and they are much better at explaining the skills they've learned if they have a context that they can talk about: oh, I just worked on this kind of project in my class, we were trying to do this, this is the type of data set we built, I did these types of methods. They're better at talking about that than they are at just saying, I've learned this type of method and it had these issues. If they can tell a story when they're talking about it, they're far more comfortable, and I think they actually have learned more, and then they're more successful; they get jobs. That's good. So yeah, it's not something that I emphasize very much when I teach, but I do use open data quite a bit, and the reason is, in a sense, very similar. I mean, these are real-world problems in the humanities, which is to say that there are a lot of open questions, a lot of things that we don't know in the humanities, and problems to which we can devote sustained data-based attention. And so I wouldn't even know where to start to kind of simulate a humanities data set; there are far too many interesting questions to ask of the actual humanities. So, and feel free to jump in with some questions, what are some lessons that you've learned about open data as a tool for teaching and learning, whether they're things that worked or things that have been an abject failure? I'd like to hear a little bit of both from the panel, or even folks from the audience; if you've got something to share, I'd welcome you to jump in as well. I'll start with an abject failure. I have discovered that if I think some data set is amazing and the problem is amazing, but the documentation is not super clear on how they generated the data, or maybe how the data were collected, or there's some really important piece that's missing in the description, and it's not my
data set and my problem, and so when I turned it into something, I've accidentally created something that's very difficult for the students, because I didn't know that they used some interesting mechanism that forced, you know, there to be this weird pattern, for example, in the data that the students couldn't deal with. So I've had that happen, where kind of due to lack of information I've created problems that are just not super fun, or maybe not worth it; like they didn't really learn, they were just mostly frustrated, and it wasn't a learning experience. I think that the most important thing is finding that sweet spot where they can feel like they, because in my context they can't discover what I'm looking for, so sending them out in the world and asking them to find it would be too much; they would become immediately frustrated. And so I need to provide them with that information in order for them to feel successful. So finding that sweet spot between setting them up to be successful but also allowing enough space for them to do the learning is probably the most challenging thing, because you don't want to do all the work for them, you want them to do as much work as possible, but you also don't want to give them an impossible assignment. When you're talking about finding open data, you really have to be sure that they can find the thing that you're asking them to find, because if you're asking them to find something that doesn't exist, then, I think, that's where a lot of the failure could come in. Because one of the things that happened, and this is an actual case, is a student came to me asking for data from newspapers, textual data that they wanted to do rhetorical analysis on. And the biggest issue is, while the examples in the class had been driven from open data, and the methodology and so on and so forth, they weren't open to really think more broadly. So there's a critical gap here, not just in, you know, I know
what the professor's methodology and data is, but how do I go look for other data to do what I want to do, based on that, and what criteria I need to learn to use. So for me it was really important to contact the professors and say, hey, can we talk about this assignment and what are you trying to do, because sometimes it's really important for the librarian to know what they're up against. So one thing I would say about successes and failures is the kind of double-edged sword of dirty data. Dirty data from one perspective is a problem: a problem that frustrates students, an impediment to answering the questions that are driving their research, and some of my most abject failures have been with students who are just confronted with the problem of data that takes a lot of labor to get into the format for analysis. And here's the double edge of it: another word for cleaning data, for cleaning humanities data, is humanities research. Which is to say that you have to be deeply embedded in the context of the data, you have to know a lot about which data is junk and which data is meaningful, and the process of creating a clean data set is actually the bread and butter of research in the humanities. And so there's a kind of weird incentive that I've got, and I haven't really figured out quite how to navigate it, which is to put students in the context of dirty data, data that doesn't immediately answer their research questions, and make sure that they experience the felt edge of cleaning data, because experiencing the felt edge of cleaning data is synonymous with doing work in the humanities. So for me, one of the things that I observed is, specifically, this large-scale data that I work with is so complex, in terms of: we are talking of thousands of variables. And by the way, this data set is a national sample of students who were 10th graders in 2002. So it's a national, I mean, you can imagine, so not only variables on the
students, but on their parents, variables on their teachers, on their schools, so we have a multi-layered sort of data set. So one of the challenges I found is, I mean, once you figure it out it's not difficult to access, but students will need a certain level of skill to be able to access the data. So that alone is a kind of an obstacle for the students; it's not something that you just open. You have to go in there, you take the variables, you select. But one of the big benefits I discovered, as a teaching outcome, is that the students have come out with well-developed research skills, because one of the things they are getting out of this is being able to make a decision: which variables should I extract, and how do I decide? Everything has to be put in the context of your research questions. So we are using this for instructional purposes, but at the same time they are developing research skills, because again, they are graduate students and are required to write their dissertations. So by the end of the semester they now have a certain level of understanding: how do I craft my research questions, how do I select the appropriate variables to measure, and things like that. So despite it being complex in terms of accessing it, it is a major benefit in terms of those skills that they acquire at the end of the semester. This is maybe primarily for the librarians in the room, but anybody is welcome to answer: what do you wish faculty would ask you about? I'm Matt Marsteller, I'm a liaison librarian in the engineering and sciences. My question would be, you know, hoping that you would view a librarian as a cohort in crime when it comes to educating people: what skills with regard to open data, open data literacy, would be in the toolkit of your ideal liaison librarian, and how might we obtain these if we don't have them? I think similar to what Matt was saying, and my
name is Ula, I'm a librarian here, an engineering and science librarian at Carnegie Mellon, and similar to what Matt was saying: how do we make data more findable? What can we do to the data to make it easier for others to find it? Are there metadata schemas out there that may be helpful? Are there ways of choosing keywords, choosing variables, choosing something that might make it more findable, putting it in the right place? Maybe the main repository is the best, maybe it's not, or maybe there are other repositories; if there is no main repository, where is the best place to put it to make it findable? Those kinds of questions. Hi, I'm Lauren Collister, scholarly communication librarian at the University of Pittsburgh. One of the questions I wish that we would get asked more often is: what's next? So the students use the open data in the classroom; how do we build the culture of open science, so that we have more of these items in our repositories, which will lead to better data, more publications that are openly accessible that people can use when they're working with open data to learn about it? And then how can those students eventually go on to do their own education in the open? So how can we foster that kind of culture of openness beyond using the data sets in the classroom, too? I have a very specific interest, actually, and that is: I wish you would ask us how we can help you assess these skills that you need, because there are some of us who are constantly thinking about how to assess digital literacies, media literacies, data literacies, and all kinds of literacies. So actually, one of the comments that you made a little bit ago was very helpful. Again, my name is David Shear, I'm the scholarly communications and research curation consultant here at Carnegie Mellon. It was: we wish we could deposit those other things that go along with the data set, that it's not just about the publication, the peer-reviewed object, and it's not even just about the curated data set that gets created, but it's all the other
additional information and objects that go along with those things. I think a key word throughout this whole panel has been context: being able to write that additional information that helps provide context about the data set, why it's being talked about in the papers and the publications, and being able to then provide those materials through things like institutional repositories, data repositories, and disciplinary-based repositories, so that all of those materials are available, but also so that researchers, the people who are creating the data sets, get credit for all that additional work. Because, as we talked about, all those different materials, the code books, the mechanisms, there's a lot of work that goes into creating those things, beyond even creating the data set. So I think it's one of those things where, based upon you as members of the academy, what can we do as librarians to make sure that we're documenting and providing access to those very necessary objects? So thank you. So a question I have, actually, was about the different training mechanisms that students receive, and maybe what else the libraries can provide you beyond just teaching your students how to use the libraries. I think what got tossed around a lot, and Jamie, I think you mentioned it a lot, and Gibbs as well, is having the students have that real-world experience, getting their hands dirty and using the real open data sets. And I wonder, what do you do with students when they create their data sets? Do you have them go through the process of creating a data management plan, or understand what that is? Have you thought about it, and if you have, is that something that you would want to have or see in a class? My students, I'm not asking them even to collect data necessarily, let alone create code books or those different types of things. So for me, you know, our students again are not statisticians but social
scientists, you know, who are doing statistics; people who are doing research. So of course we teach them research literacy, but listening to all this discussion here, probably we should also teach them statistical literacy, I mean, data literacy, some quantitative literacy. So I don't know whether that could be maybe a possible collaboration between the library and, you know, the professors: we create these repositories and then have a session, or even maybe an introductory course, in which students are introduced. Because one of the things, like, in my first course in a semester, when I introduce them to statistics or data, I tell them we all use data in our everyday lives, I mean, all of us. So we make those connections, and somebody gets surprised: you have used statistics before, you have used data before; you know, we interpret the weather forecast, you check it to see whether it's right, and oh, that's data we collect. Those connections, sometimes I feel, are missing. So I think it could be the role of the library also to make these connections between real data as well as what we are doing, to help them with data literacy skills. You know, the issue of connecting all of these things, I mean, this is part of the conversation that's going on, particularly with the new information literacy standards and looking at frameworks; I'm sure you've heard of the competencies from Middle States and so on and so forth. So I think it's going to take a lot of collaboration to figure out where and how this is going to be assessed, and how it's taught, and where the students are. I mean, that's part of the challenge, I think. So in statistics, many of us are spoiled in the sense that we've teamed up with a scientist on some interdisciplinary problem; that scientist, or team of scientists, has already collected the data, and maybe we've participated in the experiment design
discussion earlier on, and so the data set is kind of handed to us in some ways, and so they wouldn't have had maybe as much training about how to collect the data and go through that process with you all. But the flip side of that is that we do have people working in statistics who are creating their own data sets. I just had a meeting with a student yesterday who has been scraping election Twitter data for, like, the last four months, and so we've had lots of conversations about different features that you might find, and what the ethical implications are, and what we do with different pieces of it. So I would think that having some kind of, I'm using 'workshop' loosely, but some kind of workshop or training procedure that is available for students on how they might collect data in the modern age, when it's not coming in some CSV file that someone's just handed to them: how do you actually scrape? It's not actually that hard, but it's hard to know what to scrape, maybe, or what to actually get. So thinking about what to save, what not to save; that would be useful. I mean, if that existed, then we would definitely send students, and it would be much more efficient than us trying to teach them one on one, right? If there's just a place for them to learn that material, that would be very helpful. Yeah. So this is slightly off topic, but I wanted to share it because I thought it was amusing: I'm teaching an introduction to digital humanities class right now, and I used the phrase 'dirty data,' and that inspired one of my students to create a Twitter bot of the works of the Marquis de Sade, so that now exists in the world. The question about the infrastructure, because I take it it's an infrastructure question, for disseminating and making discoverable data sets: I mean, I think of it from a disciplinary perspective of what's the work that students produce in English
classes. What's the work they produced 20 years ago, or even 5 years ago, or even today in many classes? It's actually a term paper, right? And for me that's an interesting analog, in the sense that I have never heard anyone seriously propose that we archive student term papers from my undergraduate Shakespeare class. Do we have a different standard for the data sets that they produce in my introduction to digital humanities class? The answer might be yes, but it also might be no, and it might be worth thinking about. The other thing, and this is coming from the perspective of a convert: I've recently been introduced to Jupyter notebooks, which are a really powerful way to show process. That is, you embed code, run little snippets of it, and show the results. That is the kind of thing I would love to see more of in libraries. The reason I know about them and am so enthusiastic is that Matt Burton taught a workshop on them in the Pitt libraries, and ever since he taught me how to use them I've been doing it and just loving it. It's the kind of thing I can see being super, super useful to students who, again, this problem that Rebecca brought up, don't know the background or the process and don't have access to that. Something like Jupyter notebooks is a really good way of showing the code, showing the results, and understanding the process of how a data set got made, and it would be great for libraries to keep them in repositories.

Hi, I'm Christy. I'm the outreach and communications librarian at Duquesne, and for a little bit of context, something we're working on at Duquesne is promoting and marketing to our faculty members the idea of publishing on open platforms, which is hard. So a question I have for you, and I guess for the other faculty members here who publish and have been using open data and open platforms: how did you first come into it? Did someone show
it to you? Were you always embedded in it? Convincing people who are not aware of it can be a little tricky, especially when you are trained to believe that you need to publish in traditional formats and not make your work open.

Can I ask, are you referring to just putting something up online in an archive, as an example? No, I mean traditional repositories, open access journals, yes, and also just posting things online, either the data sets or the published materials. How did you start using it in your practice?

Right, so in fields like statistics, machine learning, and CS, everyone puts stuff up as fast as we can on arXiv, under the different fields, because you're claiming it. arXiv is a repository that has a lot of weight behind it and its reputation, so it's almost like an early publication of a sort. It's tagged with the date you submitted it, so you've submitted the paper, but journals take forever, and it's your way of saying "this is mine" while you're waiting to see if it's going to be published. This may be a very discipline-specific thing; I don't know if there's an arXiv for English. Maybe there should be; there you go, you can start one. In some ways it's beneficial to the people doing the research, because you're claiming the work while it's undergoing the revision process, and people actually search arXiv because they don't want to wait for the journals to come out. People search arXiv for different tag words and for who's publishing what, and they'll look at the papers, because the convention is that you don't post to arXiv until the work is pretty much done and you're submitting it. It's not half-finished work; it's supposed to be a product you're proud of. People use it quite often; it would be odd if you didn't
in our fields. But I really think the answer to that question is going to be extremely field specific; I'm sure Chris is going to give a totally different answer than I did.

Yeah, actually, there's a journal in our field, statistics education, where what they've been trying to do is encourage, instead of only giving credit for publishing articles, a section where, if you write what I think they call "data stories," you get credit; they are peer reviewed. So there's a section just for people to develop these data sets. That's another way in which they've been trying to do this. It's actually an open access journal; it's online, so people can go there. It's a very accessible open access journal.

Sorry, just as a quick example of that: some colleagues scraped over 60,000 dating profiles from OkCupid, and they got permission from OkCupid to do it, so it was on the up and up. They created this data set, which was incredibly rich and hysterical, and then they wrote an article about this data product they had created. It was online, peer reviewed, and accepted, and now anybody who uses that data set cites the people who created that data product. So there's a push: sometimes the data product itself is a contribution, and you should be writing about the fact that you have created this data set.

Right now we actually are trialing a proprietary product that has citation information on a lot of open data sets; it had been arranged with the citation connection part of Web of Science. So while you have the chance, a couple of you from Carnegie Mellon might want to give it a spin. It was rather interesting to see where the citations were coming from and where the data was being deposited.

Yeah, it's about time now, and I just wanted to say thanks to all of our panelists for being here, and to all of you for being here in the audience. Just real
quick: any final remarks, any observations? I've got a few. The biggest thing I heard is what we heard earlier: context really matters, not only in terms of understanding your data but also in making it useful and usable in the classroom. Does anybody else want to chime in? Then we'll close it out: final remarks, observations, anything you want to take away from today.

I've been struck continually by some of the similarities between the social work context and the humanities context, and one of the things that feels to me like it would be a really fruitful collaboration is some work between people who are coming from the one side of the aisle, in which ethics are deeply embedded and context is deeply central to the work we do, and the more data-driven approaches. It feels a little bit like chocolate and peanut butter.

This is maybe not quite the closing remarks Bob was expecting, but I actually just wanted to thank all of the librarians in the room for doing what you're doing. I have multiple librarians in my extended family, and what they do is essentially toil day in and day out making sure that we have access to all of this information that in some ways we take for granted; we're able to just type something into a database and it shows up, and it's magic for us, but there's a lot of hard work behind the scenes. It's exciting to think about what new products and new connections we could be making between departments and libraries, and so I'm excited to participate in this kind of open access week event and see what comes of it.

Maybe it's a question to you: what's the next step? I guess the theme that came out of here is to encourage collaboration between the library and the instructors, and I think this would be an exciting project: through our library we build these repositories for data, and then we collaborate, and the students will go there
and get some instruction in quantitative data literacy and things like that. I was just curious whether you see this as the direction in which things are going. Short answer: yes.

Similarly, it's exciting to hear more people interested in open data, and I'm really excited for my students and the possibilities of what they may have access to in the future and how that will shape our social work practice, because the more we know, the better work we can do, and the more I can incorporate that into my teaching, the better practitioners I'm going to make. So it's great to hear that the library is interested in collecting this and making it more discoverable, and that this is of interest across disciplines.

I'd say feedback matters to all of us, and so we should talk to each other as much as we can. If anybody has ideas for local sources of data, whether it's crime data like you mentioned, Jamie, or other things, talk to me, because we want to be able to help you. We consider ourselves like mini librarians or deputy librarians, with little badges or something; I need a badge. But one of the things I'd also like to do is invite you to our event with the Carnegie Library down the street on Saturday; your students are all invited. It's Data Day, from 10:30 to 3:30; we're going to take over part of the library for tons of activities and events related to data. We're going to have people there to show tools that the city has built on top of open data. I'm going to bring a ton of old Sanborn fire insurance maps from the '30s and '40s that somebody gave us, and we're going to do something really cool and fun. We're going to have people talking about redlining and showing redlining maps. So if anybody's interested in open data, send your students down the street; we'd love to have them, and we'll keep them busy, and they'll come back having learned about all sorts of great things that librarians can do. Hope to see you there. There's
flyers on the table; tip your librarians on the way out. Thanks again.

So with that, please join me in thanking our panelists and our moderator. Again, thank you all for coming. Just so you're aware, our next open access event will be on Monday, October 24th, in the IDeATe Studio here at CMU, which is very fitting given our conversation about open data, and there I'll be talking about open publications. So our next event is on supporting open publishing and the article processing funds of CMU and the University of Pittsburgh, at 5 o'clock on Monday. Thank you all again.