Thank you so much, we're going to go ahead and get our panel started today. Again, my name is Lisa Zalinski and I am with the Carnegie Mellon University Libraries. So today we have our open data panel, and moderating our panel today is Keith Webster, the Dean of the Carnegie Mellon University Libraries. And with that, I'm actually just going to turn it right over to him. So thank you. Thank you, Lisa, good afternoon. Those of you sitting at the back will probably learn fairly quickly that my voice has a habit of going quiet. So I can either keep on shouting or you can move closer to the front; it's your choice. But if I do fade away, either wave your hand or move. Delighted to welcome our panelists, who are all leaders in the open data movement in Pittsburgh and beyond. And they represent an array of disciplines, from humanities and social sciences to engineering and the natural sciences. Properly sourced and curated, open data arising from scholarly endeavour represents a tremendous resource for the research community. And if we do our job right today, that resource can exist for many generations to come. Before I keep speaking, I'll repeat that invitation to go to the CFA building on the CMU main campus at 6:30 to learn more about it. But open data, from my simplistic perspective, is data that anyone can access, use and share, and that has a license permitting those activities. Good open data can be linked to so that it can be easily shared and talked about, and it's available in a standard and structured format so that others can rely on its traceability, right back to where it originates, can determine whether the evidence is trustworthy, and can enhance scholarship into the future. Our four speakers are going to share some of their innovative approaches to creating data, to making it openly accessible, and to the infrastructure requirements they need during their research activity and beyond.
I'm going to introduce each of the speakers one after the other and then invite them in turn to say a few words about their activity. We may deviate; it's going to be a very informal, light-hearted event. There's lots of food there, there's a bar over there, don't wait till 5:30 to tuck in. Just get up, move around. We'll do the same. Those of you who were at the peer review panel discussion last Monday will have seen some cookies provided by our hosts at the University of Pittsburgh with the open access logo. I'm not sure how they did theirs, but ours have been prepared by Lisa and Anne Marie using proper, non-proprietary open access cookie cutters which were printed on the 3D printers in the library yesterday. They are solid things, so the 3D printing is substantial. The code to make the cookie cutters is also available, so if you really want to, you can use those 3D printers yourselves. So to our panel, and I'll start with Mario Bergés, closest to me. Mario Bergés is an associate professor at Carnegie Mellon in the Department of Civil and Environmental Engineering. He's interested in making our built environment more operationally efficient and robust through the use of information and communication technologies, so that it can better deal with future resource constraints in a changing environment. He's the faculty co-director of the IBM Smart Infrastructure and Analytics Lab at CMU as well as the director of the Intelligent Infrastructure Research Lab. He's received numerous awards, including a Dean's Early Career Fellowship from CMU earlier this year. He received his BSc in his native Dominican Republic and his master's and doctoral degrees from Carnegie Mellon. Sitting next to Mario Bergés is Christopher Warren, an associate professor of English at Carnegie Mellon, where he teaches courses on Shakespeare and early modern culture.
His research interests include digital humanities, law, literature, political theory, early modern literature, global studies and the history of political thought. His first book, Literature and the Law of Nations, 1580 to 1680, was published earlier this year by Oxford University Press and is a literary history of international law in the age of Shakespeare and Milton. A digital humanities project he founded with Daniel Shore, Six Degrees of Francis Bacon, aims to be the broadest, most accessible source of who knew whom in early modern Britain. The beta site was released earlier this year. Chris will tell you much more about it. He did, in his background notes, tell me that it's been featured in a variety of erudite, scholarly journals such as the Smithsonian Magazine and Mental Floss, and, for the various Brits here (I'm looking at you for the reaction), it has also been featured in The Daily Mail. Derive from that what you wish. But we are very proud of you. Next to Chris is Geoff Hutchison, an associate professor in the chemistry department at Pitt. His research is in materials chemistry, particularly rapid screening of molecules and polymers for energy applications. Geoff is here because he was involved in starting the Pitt Quantum Repository, which can be used for teaching and research. It's the first database that easily allows students to view 3D molecules on their phones, with properties of over 100,000 molecules computed using quantum chemical methods. His group has also developed the open source chemistry program Avogadro, used by half a million people in 20 languages for a variety of do-it-yourself quantum calculations. Geoff also pointed to the CMU connection in that Professor John Pople, a former CMU faculty member, had a long-term vision of a global quantum chemistry database. Even closer to home for me is the fact that John Pople was a Nobel laureate and his Nobel medal is on display just outside my office.
And Bob Gradeck is the fourth member of our panel. He manages the Western Pennsylvania Regional Data Center at Pitt's University Center for Social and Urban Research. Bob provides overall project management at the data center and also takes the lead in building relationships with data publishers and users. He also is responsible for community engagement efforts. And we've discovered a shared passion for being brutal about the traffic in and around the part of Pittsburgh where we both live. So if you fancy speeding along Forbes Avenue, you've got us to deal with. So I am going to turn it over to our panelists and invite Mario to start with some opening remarks. So I think you did well putting together some of the background information about who I am and what I do at CMU. But I'd like to maybe start with more of an anecdote of how I got involved in open data and use that as a starting point, so that if you have questions later you can refer back to it. So my quest, I don't know if I would put it as strongly as a quest, but my work in open data started because when I was a master's student here I was faced with a question that led to my PhD and is still ongoing, and it's a question about data. Somebody who actually was an energy czar at CMU, that's the actual title, an energy czar, a person who was in charge of managing energy for all buildings on campus, came to us, to the master's students in the group that I was in, and said: we have a problem here at CMU where we only receive a single monthly bill for all our utilities, and we don't know where we're spending energy, which building is spending more, which building is spending less, who should we charge, is it the dean of engineering, is it the dean of arts, we don't know. So we took that as a question that we thought could be answered by just measuring. We thought that data could be accessed very easily.
And as master's students we went on and acquired some instruments and started deploying them around campus to measure electricity consumption at the finest resolution we could find. And that actually eventually led to a conclusion that it is pretty much impossible to measure exactly how much you're consuming and know how you're consuming it. And I'll give you one example. So right now you have a single monthly bill, but if I give you information about how much electricity you're consuming, let's say in real time, not just monthly but in real time, would you be able to make better decisions? So this particular second you're consuming a thousand watts. Next second, a thousand and fifty. Next second, a thousand one hundred. You don't know what to do with that information. So that data alone wasn't enough. So we started a whole project to figure out if we could translate those measurements into useful information. So we saw a jump of fifty watts in your house. We knew that was exactly a TV, and it couldn't be some other things, because we know that TVs jump about fifty watts. And so we started to build patterns that we could teach computers to then recognize what was happening in your house, and build fingerprints for individual appliances, and let computers deal with understanding exactly how much you're consuming in real time. And that led to my PhD thesis and everything else. But to solve that question, we needed a lot of data. We didn't know what these patterns were, what they looked like, and there was no published information about that. So we started a quest to gather all that information, and that led to publications from my team and from people around the world on measurements of energy consumption that are annotated and documented, so that people can then use all that data to train models and solve the problem. So I'll stop there, but that's kind of how we got started. Thanks.
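The fingerprint idea Mario describes can be sketched in a few lines. This is a toy illustration of event-based load disaggregation, not his actual models: the appliance wattages and the readings are invented, and real systems use far richer features than a single step size.

```python
# Toy sketch of event-based load disaggregation: detect step changes in
# whole-home power readings and match each step to a known appliance
# "fingerprint". All wattage values here are illustrative assumptions.

APPLIANCE_SIGNATURES = {
    "TV": 50,                 # assumed watts drawn when switched on
    "fridge compressor": 120,
    "microwave": 1100,
}

def detect_events(readings, threshold=30):
    """Yield (index, delta) for every step change of at least `threshold` watts."""
    for i in range(1, len(readings)):
        delta = readings[i] - readings[i - 1]
        if abs(delta) >= threshold:
            yield i, delta

def label_event(delta, tolerance=15):
    """Match a power step to the closest appliance signature, if close enough."""
    magnitude = abs(delta)
    name, watts = min(APPLIANCE_SIGNATURES.items(),
                      key=lambda kv: abs(kv[1] - magnitude))
    if abs(watts - magnitude) <= tolerance:
        return f"{name} {'on' if delta > 0 else 'off'}"
    return "unknown"

# One reading per second, in watts (made-up data).
readings = [1000, 1000, 1050, 1050, 2150, 2150, 1100]
for i, delta in detect_events(readings):
    print(i, delta, label_event(delta))
# The +50 W jump is labeled "TV on", the +1100 W jump "microwave on",
# and the -1050 W drop matches nothing closely enough, so it is "unknown".
```

Mario's point is visible even in this toy: the raw stream of watts is useless on its own, and it only becomes information once you have labeled training data describing what each appliance's step looks like.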
So as I think the only humanist on the panel, I should probably explain that data is not a word that comes naturally to me. In fact, it's only in the last few years, coincident with being at Carnegie Mellon, that I have even sort of thought of myself as being involved in data. And I also sort of want to talk about that story and how that happened. Six Degrees of Francis Bacon is a project that sort of reconstructs historical social networks, specifically those of people who were alive between 1500 and 1700. And sort of, you know, early on in the kind of genesis of the project, my collaborators and I thought this would be a good idea. But we thought we would go about it in a very traditional humanistic way, which is to say we would look at manuscripts and lists and books and write things down and sort of, you know, turn them into some kind of digital artifact, but data was never really part of that sort of conceptualization. But this is a very sort of CMU story. There was a colleague who was hired at CMU when I was, in 2010, in the statistics department, and we were sort of walking down the hall. We share a hall in Baker Hall at CMU. And he said, hey, do you want to get lunch sometime? Sure. What are you working on? I'm working on this. What are you working on? You know, I'm thinking about reconstructing historical social networks. And this colleague in statistics said, actually, there are some fascinating statistical problems there. And I was blown away. You know, I was blown away. Like, why would statisticians care about historical social networks? You know, I care about great works of art, great works of literature, where people get their ideas, their language. That this is a statistical issue was just kind of incredible to me. And so ultimately we sort of had a series of conversations in which one of his Ph.D.
students, at that time a master's student, sort of worked with us a little bit to sort of identify sources of what I soon learned to call unstructured data, which I used to call books. And so to sort of identify sources of unstructured data to infer social networks from these sources. And then over time we sort of developed working ways to visualize that data and manipulate that data and learn more and more about the way people were connected in history. And over time I have been convinced about the sort of need for open access in a couple different ways. One way is that when you think about the sources of unstructured data from which you reconstruct historical social networks, one of the kind of key sources for us has been stuff that's already in digital form, ideally HTML or text files, mostly secondary scholarship. A lot of the secondary scholarship is in the public domain, but that's everything published prior to 1923. What that means is that if you're reconstructing social networks from text published before 1923, you are effectively reconstructing the biases and omissions of the Victorian period, basically. And so there's a whole kind of century of knowledge, more or less, that has to be omitted. So, you know, we've been sort of looking at more and more ways to kind of work with works that are in copyright in order to actually have an accurate picture of who knew whom in this period. And the second sort of commitment to open data has to do with what we do with our sort of document matrices and the probability matrices that we developed. You know, if you go to our website right now, sixdegreesoffrancisbacon.com, you can download all of our data for free and manipulate it and do what you want with it. That's very much a kind of commitment to the recognition that there are a lot of things one could do with this. We do some interesting things with it, I think, but we don't have a sort of monopoly on what you might learn from this data.
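To make the "download it and do what you want with it" point concrete, here is a minimal sketch of working with relationship data of the kind the project exports. The column names, confidence values, and the specific links between people below are invented stand-ins for this example, not the project's actual schema or data; you would load the real CSV from the site instead.

```python
# Minimal sketch: load person-to-person relationship rows (with a
# confidence score) into an adjacency map and query it. The CSV layout
# and the specific links are hypothetical stand-ins for the real export.
import csv
import io
from collections import defaultdict

SAMPLE_CSV = """person_a,person_b,confidence
Francis Bacon,John Donne,72
Francis Bacon,Ben Jonson,85
Ben Jonson,John Donne,90
John Milton,Ben Jonson,40
"""

def load_network(text, min_confidence=50):
    """Build an undirected adjacency map, keeping only confident links."""
    neighbors = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        if int(row["confidence"]) >= min_confidence:
            neighbors[row["person_a"]].add(row["person_b"])
            neighbors[row["person_b"]].add(row["person_a"])
    return neighbors

net = load_network(SAMPLE_CSV)
print(sorted(net["Francis Bacon"]))
# The low-confidence Milton link is filtered out, so only the confident
# associates remain in the network.
```

Because the data is open, nothing stops another researcher from swapping in a different confidence threshold, merging in a dataset for a later period, or feeding the same rows into an entirely different analysis tool.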
You know, it's really exciting to me to think about combining this data set with data sets from earlier periods and later periods and be able to kind of say how social networks have changed over time, from 1500 to 1800 or from 1500 to 2000. That's a sort of possibility that's only possible if data is open and accessible, so that other researchers who are asking sort of parallel or cognate questions can use that data. And so we want to sort of put it out there, and very much the hope is that over time we're going to have, you know, a really rich world of new knowledge developed because historians, literary scholars, historians of science, art historians are thinking about the past in new ways and learning to see how data can help answer questions that are important to the humanities. So I'll actually start with a historical side to this, right? So modern scientific methods, scientific thought, really began in the Enlightenment. Before that, particularly in chemistry, you had alchemists, and when they came up with a discovery, they really did not tell anyone, right? They'd write things in various codes, they might not even tell their apprentices, and, you know, what transformed science was the Royal Society starting to publish scientific discoveries in the Philosophical Transactions and other scientific journals. And the transition occurred, right, from earlier science and mathematics where you'd keep these ideas secret to protect your job, to a point where you'd say, okay, I'm going to tell you all the details so that you can reproduce my discovery. That's how you trust me, and I'm going to publish it and then I'm going to get impact from that. And, right, the fact that we've got thousands of scientific journals speaks to the massive success of that. The question now, right, as we move into a digital realm, let's call it cheminformatics, right? And that's something I'm actually involved in.
How do you process chemical data in new ways to build networks and allow others to build on the raw data files? So when I was a graduate student, I went to Northwestern, and John Pople, after he left CMU, went to Northwestern and gave a presentation, and he talked about this idea of a global quantum database. It started because when Pople began in the 70s doing these quantum calculations, they were extremely time consuming. They still are, right? Moore's law is wonderful, but people advance new methods that are more computationally intensive. And so he had this idea: look, why should everyone repeat a calculation of the properties of benzene? Right? Someone should do it once and then publish it in some electronic format that others can access, right? And then, you know, maybe you change things, but you can build on it. The problem, of course, is that a lot of chemical software, a lot of the chemistry methods, are sold. And the company that John Pople created out of CMU actually saw this as a threat to their bottom line, right? Because if you are publishing the results of all the calculations, then you don't need a license to the software program to do all the calculations. So I was really inspired by that presentation that Pople gave and thought, well, look, you know, I had taken plenty of computer science classes. My father had done a lot of computer science and databases, real-time databases and so on. This just sounds like a database problem; it shouldn't be that hard a problem to solve. But as a grad student that sort of sat on the shelf, and when I came to Pitt, it was something I was interested in, and people said, well, you know, start doing that after tenure. So, okay, fine. You know, I was still interested, as Keith said, right? My group developed this package called Avogadro. It's a user-friendly way on the desktop to draw molecules and run simulations.
I designed this software so that I could take undergrads and high school students and get them to do this kind of research. But the question in Avogadro is, again, the question that John Pople had: well, okay, so I draw out this molecule. You know, if someone's already done a calculation on it, shouldn't I be able to click a button and have it just import the data? Why should I have to do the calculation again? It doesn't make much sense, you know, in 2015. And a colleague came to me and he said, hey, I've got this great idea for education. He said, look, when we teach chemistry, one of the significant problems in teaching chemistry is we draw things in two dimensions, on paper, on a blackboard, in a textbook. But molecules live as 3D objects. And it's a key skill for chemists to be able to take that 2D depiction and then think about it in 3D. And some people do better at this than others; it's a learned skill. He said, look, every student we have in chemistry class, they all have smartphones. And we're basically telling them, turn off the smartphone and don't use it. That seems silly, right? The smartphone is at least as powerful, probably thousands of times more powerful, than, you know, the computers John Pople was running calculations on in the 70s. And they've got cameras. So, hey, let's put a little QR barcode or URL on the slide. And they zap the QR code, it takes them to a web page, and there they've got the 3D view of the molecule. And I said, oh, this is perfect. You just figured out the access part. But what you need for your method to work is a database. You want hundreds of thousands, millions of the most interesting molecules, right? So if someone wants to lecture about aspirin, they don't need to go fetch the structure of aspirin, right? It's there at their fingertips. And so Dan and I wrote a proposal, the Dreyfus Foundation funded it, and that was the genesis of PQR, right?
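The slide-to-phone workflow Geoff describes is just a URL per molecule record. As a sketch, one might key records by InChIKey and build the per-molecule URL to encode in the QR code; the URL pattern below is an assumption about how a PQR-style repository might address its records, not a documented API, so check the actual site before relying on it.

```python
# Hypothetical sketch: build the per-molecule web address a lecturer
# would encode in a QR code on a slide. The base URL pattern is an
# assumption, not PQR's documented interface.

ASPIRIN_INCHIKEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # standard InChIKey for aspirin

def molecule_url(inchikey, base="https://pqr.pitt.edu/mol"):
    """Return the (assumed) record page for a molecule, keyed by InChIKey."""
    return f"{base}/{inchikey}"

print(molecule_url(ASPIRIN_INCHIKEY))
```

Encoding that URL as a QR image is then a one-liner with any QR library (for example the third-party `qrcode` package); students scan it and land directly on the interactive 3D view, no searching required.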
And so on a smartphone, or if I actually have net access on an iPad, right, I can come in and, lo and behold, right? Here's molecules. And I can move them around on my screen as three-dimensional objects. And we said, look, this is a great opportunity. So each record in the PQR has a separate, citable digital object identifier, a DOI. We worked with some people in the Pitt library who hooked us up with ways we can mint a few million DOIs without breaking the bank. And we've set them up as living data structures, right? Because we fully expect people to build on them. And we just see that as a continuation of the scientific method in the 21st century. Anybody come up with any good panelist drinking games? I was thinking that maybe I could write a word on here and not let any of the other panelists see it. Anybody want me to do that? So I come at it maybe from a little bit of a different perspective. I'm out in the community and I'm working with data, and I have been for a long time. So I've been at Pitt for six years, and before that I spent ten years at Heinz and Carnegie Mellon. And pretty much that whole time I've been helping people find and use information about the communities in southwestern Pennsylvania. So about ten years ago or so, when I was at CMU, I collaborated with folks where I work now at UCSUR to create this project called the Pittsburgh Neighborhood and Community Information System. Anybody heard of it? I don't know why I'm surprised that anybody has. But really it was our goal to try to unlock data from the government. This is pre-open data. We were working with a lot of community organizations. I have a city planning background. So we would get calls from faculty members or community nonprofits working in the middle of Homewood trying to understand: who owns this property that we're looking at, or is it vacant, because we can't really tell, or are the owners paying the property taxes?
So those were all important questions to community development. We really couldn't answer them unless we had a systematic way to actually get the data out of government and put it together and then share it out with people. And so it was a very manual project. It was like me going to meetings and saying, hey, can I have this data? This is what we want to do with it. And you can trust me because I'm from the university. So, you know, we didn't have the most robust infrastructure. And then once we started to build the relationships within government, everybody was great. But we realized that capacity wasn't there. So people would email me attachments like an Excel file or a CSV file. The data was just a giant mess, and I tried with my rudimentary data skills to clean it up. And we did that for a few years. I'd throw it on the GIS server whenever I had the time to update it, and the data set was not at all timely, but people were pretty appreciative, because it was really hard to get data, and it still is in so many ways. But, you know, through that work, we became part of this national community. The Urban Institute has a program, the National Neighborhood Indicators Partnership, and it's a network of 30-plus cities that really do what we do. We're data intermediaries in the community. We help people outside of academia, or even within academia, find and use information about their communities. And so what we're trying to do with the new project that I manage, the Western Pennsylvania Regional Data Center, is take that role of being the intermediary, helping people with data, and then connect it to an infrastructure around open data. You know, because we didn't have the legal infrastructure before. We had people sign a waiver saying they wouldn't hold the university responsible for whatever they did with the data. We didn't have any other infrastructure other than that. We didn't have a technological infrastructure other than the GIS server.
So it was really hard to really do anything at scale. So after a while, I kind of got a little frustrated with wanting to do more and not having the skills to do it. So I kind of sat down with people that we knew in the city, and talked to people in other cities, and just kind of watched what was going on with open data, because it's good to be a last mover in a lot of cases. And just some of the observations that we had, and I'll try to get into what we're doing for some of the moderated questions that Keith is going to have. But the demand for data is growing. It's more and more apparent that people want to use information to solve problems. And in a lot of cases, it's even harder for people to find and use that information. The more data that's out there, it's a real problem: what do I use? Tell me what's important. So those are the kinds of things that we try to answer. People also don't really want to go to a website and play with the interface and build the visualization to get data. On our old website, you had to use our interface to make your own maps. You couldn't really change the colors on the points and things like that. You could only download records a couple of thousand at a time. So people want to use data on their own terms. They just want it all at once: download it. An API is great if they're going to build a tool. But you've got to give it to them so they can pull it into Tableau or pull it into something else that they want to use. They have real problems with being encumbered with tools that they don't want to touch. So that's another lesson we learned. We also learned, and I think this is another lesson that's pretty easy to draw from the story I told, that data owners, especially those in the public sector, really don't know how to publish data well. They don't know how to manage it well. And I think there's an opportunity there, in what we're doing, to really encourage them to document their data and improve the quality of it through a community process and feedback.
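The "data on their own terms" point, both bulk download and API, is what open-data portal software is built for. As a sketch, assuming a CKAN-backed portal like the one the WPRDC runs, pulling records looks like this; the base URL follows CKAN's standard datastore_search action, and the resource id in the usage comment is a hypothetical placeholder, not a real dataset id.

```python
# Sketch of pulling records from a CKAN-style open data portal so a user
# can load them into Tableau, pandas, or any tool they prefer. The portal
# URL assumes a CKAN instance; the resource id below is hypothetical.
import json
import urllib.parse
import urllib.request

BASE = "https://data.wprdc.org/api/3/action/datastore_search"

def build_query_url(resource_id, limit=100, offset=0):
    """Build a datastore_search URL for one page of records."""
    params = urllib.parse.urlencode(
        {"resource_id": resource_id, "limit": limit, "offset": offset}
    )
    return f"{BASE}?{params}"

def fetch_records(resource_id, limit=100, offset=0):
    """Fetch one page of records (requires network access)."""
    with urllib.request.urlopen(build_query_url(resource_id, limit, offset)) as resp:
        return json.load(resp)["result"]["records"]

# Hypothetical usage: page through an entire resource to get it all at once.
# records, offset = [], 0
# while page := fetch_records("some-resource-id", offset=offset):
#     records.extend(page)
#     offset += len(page)
```

The design point is exactly the one Bob makes: the same endpoint serves both the app builder (who pages through it programmatically) and the analyst (who just wants every record dumped into their own tool).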
And those kinds of things. So we're really enthusiastic about the response that we're getting from the city and the county, our partners on this project. People also want fresh data, not stale. For me, it was really hard to update some data sets that were a real pain in the ass. More than once or twice a year was just too painful. I didn't have the funding to really go and hire anybody to do it. I was kind of doing it pro bono. And so, you know, you can see the importance of having fresh data. I'm like, yeah, if we could only get you this updated list of property owners every month, then you could actually track which investor is buying properties, which is a real problem in the county, and target them every month. So one of the goals we have is to actually encourage routine, automated publishing of data. We also learned people don't talk to each other about data. You know, we would learn about things, and then we'd tell people about things that other people were doing with data that were pretty cool. But those people weren't talking to each other, and there was no institutional framework that enabled that to happen. And we're hoping to do some things around that too. Two more. Infrastructure's an afterthought. People just thought about data for their own project, and they didn't think about how other people needed to use the data. And so we're trying to bring an infrastructure perspective to what we do. And it's not just about solving problems, it's about helping other people solve problems while you solve your own. And then the final piece that we learned from our last ten years of work was that problems cross borders. You've got so many issues here, whether it's wastewater issues or vacant and abandoned properties, and we've got 130 municipalities in this county, 42 school districts, one county government, how many police forces? I don't know.
And then if you go outside of that, you've got even more governments. You've got regional levels of government. You've got authorities. And if you're trying to just solve...