All right, everybody, welcome back. So in the next 45 minutes, we'll have a fireside chat with Martial Hebert — as Ritwik put it, the one and only Martial Hebert. Martial is a professor of robotics and dean of the School of Computer Science at Carnegie Mellon. His research interests include computer vision and robotics, especially recognition in images and video data, model building and object recognition from 3D data, and perception for mobile robots and for intelligent vehicles. His group has developed approaches for object recognition and scene analysis in images, 3D point clouds, and video sequences. In the area of machine perception for robotics, his group has developed techniques for people detection, tracking, and prediction, and for understanding the environment of ground vehicles from sensor data. He also currently serves as editor-in-chief of the International Journal of Computer Vision. So now he will first give us an overview of some challenges and challenging projects at the School of Computer Science that deal with complex data, and then we will open up the floor for the audience to ask questions and have an informal conversation. So, Martial, over to you.

Thank you so much. Welcome, everybody. Yes, so I thought I would share with you a couple of challenges that we are exploring in using and reusing data. I'll talk about three areas that I think are interesting. Those are not the only three areas, but I thought I'd point out those three just to get started with the conversation, and I'll illustrate them with research that we're doing here in the School of Computer Science.

So the first area has to do with the nature of one particular type of data that is becoming increasingly important and requires new tools and a new way of thinking about the data. It's what I call — I have to find maybe a better name for it — let's call it human-generated data. Those are things that involve data from human evaluations, ratings, reports, surveys, things like that. And the problem with that data is that it is typically very noisy, by definition, because we people are very noisy and disorganized in our thinking. It's certainly incomplete. It's very biased, because we all are biased in a variety of ways. So that makes it very challenging to deal with this data. A few examples of these things: satisfaction surveys, various types of evaluations, various types of human ratings. For example, Amazon Mechanical Turk is one way to collect that kind of data on a very large scale. And in fact, if you look at the potential applications, it goes across the board in all the applications you can think of, including healthcare and ratings of various products, even in decisions that could have far-reaching consequences in terms of bias and equity, like admissions and hiring, and of course various data from crowdsourcing. So again, this is a different type of data than what you typically think about when you think about laboratory data or scientific data. And the problem, again, is that to be able to really handle that data, one needs to have a deep understanding of human behavior, psychology, social sciences, et cetera. These, by the way, are examples from the work of Nihar Shah in the Machine Learning Department, who specializes in this aspect of the research. So this is an illustrative example here.
And again, this is about one particular application, which is probably important for many people here on this call: peer review of papers, right? If you look at the data from peer reviewing of papers, by definition it's going to be extremely subjective, so you have to normalize the processing of that data based on a model of that subjectivity. It's biased: we know that raters have different biases based on the origin of the papers, the topic, and so forth. Miscalibration: this is, of course, the well-known issue with human raters that we all have a different internal scale, and how to calibrate those scales is another issue. And finally, it's noisy: some reviewers may not be qualified, there's noise in the data, and so forth. So those are the kinds of things that are not as directly characterizable as in other types of scientific data, and that require, again, a deeper understanding of human behavior, human thinking, societal aspects, and so forth.

So here are some of the things that need to be done: design new algorithms for fairness, and perhaps design those new algorithms in a way that rigorous mathematical guarantees can be derived, which is very difficult. That is something that is possible for other kinds of data, where you can have strong statistical guarantees and strong tests of fairness of the data, for example, but it is very difficult in these cases. As I mentioned, it means reaching out very far from what we typically think of as data science, into psychology and economics and social science, et cetera, to be able to deal with that data. And finally, deployment at scale — the "at scale" being, of course, the hard part of doing this. So that was just a quick mention of that first kind of challenge: how to deal with that kind of human-generated data, human-generated evaluation data, at scale and with formal tools. So that's one aspect.
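As a minimal sketch of the miscalibration issue mentioned above — this is only the textbook normalization idea, not Nihar Shah's actual method, and the reviewer IDs and scores are made up — one could re-express each reviewer's raw scores relative to that reviewer's own history:

```python
from collections import defaultdict
from statistics import mean, stdev

def calibrate_scores(reviews):
    """Re-express each reviewer's raw scores as z-scores against that
    reviewer's own history, so a harsh reviewer's 6 and a lenient
    reviewer's 8 can land on a comparable scale.

    `reviews` is a list of (reviewer_id, paper_id, raw_score) tuples.
    """
    by_reviewer = defaultdict(list)
    for reviewer, _, score in reviews:
        by_reviewer[reviewer].append(score)

    calibrated = []
    for reviewer, paper, score in reviews:
        history = by_reviewer[reviewer]
        mu = mean(history)
        # A reviewer with fewer than two scores has no measurable scale.
        sigma = stdev(history) if len(history) > 1 else 1.0
        calibrated.append((reviewer, paper, (score - mu) / (sigma or 1.0)))
    return calibrated

# Reviewer r1 is lenient, r2 is harsh; after calibration their relative
# preferences line up even though their raw scores do not.
scores = [("r1", "p1", 8), ("r1", "p2", 9), ("r2", "p1", 4), ("r2", "p2", 6)]
print(calibrate_scores(scores))
```

Real systems have to go much further, modeling subjectivity, bias, and noise jointly, but even this simple step shows why raw scores from different raters are not directly comparable.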
Of course, another aspect when we talk about data is that we want to be able to process the data, or use the data, in a way that guarantees privacy and security. And when I say privacy, I don't mean just the privacy of, say, human subjects — which is of course an obvious issue — but also privacy to protect perhaps how the data was acquired, or some other knowledge about the data that needs to be protected. This is particularly important in reusing data, because reuse is often prevented by aspects of the data that need to be preserved and remain private. So how to use the data while retaining privacy of some aspects of it is, of course, another major issue. There are a few ways this can be done. One could imagine outsourcing the processing of the data to the cloud, or one could imagine instead having collaborative computation, meaning distributing that computation — that use of the data — across different agents, for different types of privacy. So there are two major techniques one can use for this. One is what are called enclaves, which means basically creating a box that is secure, within which the processing can take place. A second approach is to say, well, I'm not going to have that secure box, but I'm going to make the way the data is distributed and shared secure, basically using cryptographic protocols to share the data. And by the way, these examples are from the work of Wenting Zheng's group in the School of Computer Science.

So this is just an illustration of what we mean by an enclave, which is basically an environment that has a self-contained set of tools to do the data processing, so that the data can be isolated in that enclave. We can develop code within that enclave and run code on the data in a secure fashion, meaning completely isolated from the outside. Some of the research issues here are to support a wide enough range of functionalities so that we can do rich enough processing on the data, and also to protect against external access while doing this.

The other option, of course, is the cryptographic sharing of data. This is again an example from Wenting Zheng: a system called Helen, which uses a specialized protocol to share data securely across different agents. Now, the main issue here is to be able to do this within a reasonable amount of time. We know very well how to encode data securely and to share data securely; the problem is to do this in a way that is efficient from a computational standpoint and that can scale to a large number of samples. So that's a key research issue being addressed here. And again, the reason for looking at those research issues is to be able to use data in a variety of modes, from completely open to having aspects of the data kept private throughout the entire computation on the data — the latter being, again, particularly important for reuse, where some aspects of the data must be protected.
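To give a flavor of the cryptographic route — this is the textbook additive secret-sharing idea that underlies many such systems, not Helen's actual protocol, and the party names and counts are invented — each party splits its private value into shares that individually look random, so only the aggregate is ever reconstructed:

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split `value` into n additive shares that individually look random."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values):
    """Each party shares its value; each party locally adds the shares it
    holds, and only the total is reconstructed -- no single input leaks."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j holds one share of every input and publishes their sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                    for j in range(n)]
    return sum(partial_sums) % PRIME

hospital_counts = [120, 75, 43]      # e.g., per-site case counts kept private
print(secure_sum(hospital_counts))   # 238, computed without pooling raw data
```

Helen and similar systems combine ideas like this with much more machinery to run whole learning algorithms, which is where the efficiency-at-scale question mentioned above becomes the hard part.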
So, moving a little bit further on this idea of distributing the computation, we can look at ideas for using data in a completely distributed way. A typical centralized view of machine learning, for example, would say: send all the data to a centralized location, run some learning algorithm at this central location, and produce the output — let's call it W. An alternative view would say: well, I don't want to share that data, because some of it needs to remain private. So I'm going to do some computation locally — that's illustrated here by the nodes in this graph — and I'm going to somehow gather the outputs of those computations to generate my final output. Again, this is important for any use of data where we want to keep some aspects of that data private. There are many examples of this. You can think of medical data, for example, that needs to remain private but from which intermediate results can be learned and transmitted. Or computation at the edge, on personal devices — phones, watches, and so forth — or home devices. So the idea here is to address this issue of privacy by doing massively distributed processing of the data, and then to connect those nodes together in a way that guarantees that privacy. The examples here, by the way, are from Ameet Talwalkar and Virginia Smith in the Machine Learning Department.

So how far can we go with these ideas and still maintain privacy? This is a graph, again from Ameet and Virginia's work, that shows, for a particular task — it's not terribly important what the task is; it's a classification task — the accuracy as a function of the number of samples in a learning task. The top curve is non-private learning, meaning no attention is paid to keeping any data separate or private. Now, the state of the art, which is standard differentially private learning, is the bottom curve. So basically, as soon as we try to enforce some kind of privacy in a standard setup, we see a massive drop in accuracy — in this case, a factor of almost four. And now, if you use the kinds of approaches they're suggesting, you can get back to something that is close to the non-private learning approach. So this is a long-winded way of saying that there are a lot of opportunities in this research, in developing new ways of thinking about how to process the data and how to protect privacy to a much higher degree than was possible before. And again, privacy here does not mean just privacy of subjects, but privacy of various facets of the data, if you will.
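The structure of the distributed setup can be shown in a few lines. This is a generic federated-averaging sketch under simplifying assumptions — a shared linear model, synthetic data, and no differential-privacy noise — meant to illustrate the general idea, not the specific methods from Talwalkar and Smith's group:

```python
import numpy as np

def local_step(w, X, y, lr=0.1, epochs=5):
    """One client's update: a few steps of linear-regression gradient
    descent on data that never leaves the client."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_average(clients, w, rounds=20):
    """Each round: every client refines the global model locally, and
    only the resulting weight vectors are averaged centrally."""
    for _ in range(rounds):
        updates = [local_step(w, X, y) for X, y in clients]
        w = np.mean(updates, axis=0)   # raw data is never pooled
    return w

# Three clients with private synthetic data drawn from the same model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

print(federated_average(clients, w=np.zeros(2)))  # approaches [2, -1]
```

The point is structural: each client's (X, y) stays local, and only the weight vectors cross the network. Privacy-preserving variants then add noise or secure aggregation on top of exactly this loop, which is where the accuracy drop in the graph comes from.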
Okay. And the third thing — that's the last one, don't worry, I'm not going to talk very long — is something that we call versioning. I have to learn how to pronounce that word: versioning. The idea is this. In many applications — in fact, in most applications — you have data that evolves over time. For example, in autonomous navigation, you're typically going to record a stream of images, of distances, of velocities. And in many situations where you record data over time, the initial measurements are only provisional; they may later be replaced and refined with more accurate versions, a process that can repeat multiple times. So data in the past can actually be updated multiple times. For example, in econometrics, we rely on measurements of prices and wages and employment levels, et cetera. For those quantities, provisional values are first available within weeks, but then they get repeatedly updated for several quarters after that. So you have different versions of the data over time: you have the data you get now, and then two quarters later you have the same data, reevaluated based on what we know now. You have, basically, an evolving set of data.

The challenge is that when this kind of model is applied in real time, only the provisional estimates are available for the most recent indicators. And the problem then is that statistical and machine learning models must be very carefully trained with the right version of the data — the provisional records are part of that data. So from a data technology perspective, we must develop ways to record, organize, and serve the various versions of each quantity of interest. One way to think about it is that we need to move from "what do we know" to "what did we know, and when did we know it." This is what we call data versioning, because now we have multiple versions of the data that need to be maintained. From a reuse standpoint, that means we need to understand not only how the data was obtained, but also when it was obtained and which version of the data we are looking at. And the process of versioning must be understood as well.

This is an example I take from the work of Ryan Tibshirani and Roni Rosenfeld. They lead the Delphi project, which is a large project on building models that are predictive of COVID propagation — models at the level of individual zip codes, maybe with a long time horizon. This is an example of the interface that shows the status of COVID cases across the country.

So, of course, the way they do this is by accumulating a large amount of data — many different sources of data, from doctors' visits, self-reporting on Facebook, Google, and other platforms — very heterogeneous data, of course, that has to be acquired continuously over time. And if you look at how this data works, you have a kind of two-dimensional axis. You have the time at which the data is reported — in this little example, from time t back to time t minus four in the past — and then you have the time for which you're trying to make a prediction. So what happens is that data is reported at a certain time, say t minus four here at the bottom. Then, at the next time of reporting, the new data at t minus three is going to change the data at t minus four: perhaps we realize that we actually had more reports — more COVID cases, or more COVID-motivated doctors' visits — at t minus four than we thought, and that has been corrected by the new reports. And this happens consistently: because of delayed reports, because of error correction, for a number of reasons. And the experience here is that taking this properly into account is absolutely critical to building any kind of accurate model. So if you look at this diagram, you see that you no longer have just one linear set of data acquired over time; you have a kind of two-dimensional graph of data, with multiple versions of the data. And understanding this versioning process, and how the data has changed, is critical to being able to build those models. So how does one represent that? How does one store and manage that data? How does one represent how this versioning takes place? That's an interesting set of issues in managing data.

This is an example here — let me see if I can do this. On the horizontal axis is time; these are doctors' visits being reported. And the vertical axis is the number of doctors' visits. The interesting thing is that — I cannot get my pointer to work here — the orange bars are the visits reported up to the 8th of April, and the other colors indicate visits reported in the few days after that: the 10th, 14th, and 17th in the upper graph. The point is that the data back in time is changed based on the current data; that's what is illustrated on the right side of the graph you see at the top. So this is just a graphical visualization, on actual data from this COVID prediction and tracking project, of what it means to have those multiple versions of the data that one needs to keep. Again, this is a relatively simple example, on a single-number graph like this. The problem now is how to do this in a systematic way, with systematic tools, on heterogeneous data.

So those were a couple of things that I wanted to mention: the idea of human-generated data and all the new tools it requires; privacy and security, paving the way for completely distributed and secure data processing, so that we can maintain whatever proprietary or private aspects of the data we want to protect while doing the processing we need; and finally, versioning, which has to do with data that evolves over time, for which we need to keep multiple versions. So those are a couple of things that I wanted to mention, and I will stop here and give the floor back to you.
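The bookkeeping that versioning requires can be made concrete with a small sketch. This is a minimal "as of" store in the spirit of what is described above — the class and the numbers are illustrative, not the Delphi project's actual code or data:

```python
import bisect
from collections import defaultdict

class VersionedSeries:
    """Keeps every report of every quantity: 'what did we know,
    and when did we know it.'"""

    def __init__(self):
        # reference date -> sorted list of (report_date, value)
        self._revisions = defaultdict(list)

    def record(self, reference_date, report_date, value):
        bisect.insort(self._revisions[reference_date], (report_date, value))

    def as_of(self, reference_date, report_date):
        """The value for `reference_date` as it was known on `report_date`
        (the latest report at or before that date), or None."""
        revs = self._revisions[reference_date]
        i = bisect.bisect_right(revs, (report_date, float("inf")))
        return revs[i - 1][1] if i else None

    def latest(self, reference_date):
        revs = self._revisions[reference_date]
        return revs[-1][1] if revs else None

s = VersionedSeries()
s.record("2020-04-08", "2020-04-08", 130)   # provisional count
s.record("2020-04-08", "2020-04-14", 171)   # late reports arrive
s.record("2020-04-08", "2020-04-17", 176)   # corrected again
print(s.as_of("2020-04-08", "2020-04-10"))  # 130 -- what a real-time model saw
print(s.latest("2020-04-08"))               # 176 -- what we know now
```

Training a forecasting model honestly then means feeding it `as_of` values — what was actually known at prediction time — rather than the final revised series.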
Awesome. Thank you, Martial. Thanks for the great overview of the activities happening at the School of Computer Science and all these relevant questions that relate to data reuse. So now I would like to open the floor for the audience to ask questions. Since we have a manageable size, you can either raise your hands or send your questions to me in the chat. So, Amit, you can go ahead and unmute yourself.

Hey, this is Amit from the Materials Science Department. Hi, Martial. That was an excellent talk. I have a very specific question — maybe it falls into what you showed, maybe not, but I will just go ahead and ask. Within the context of human-generated data, you showed that the peer review process can at times have multiple ways of representing similar kinds of information. So, given the ontology discussion we've been having since morning, is it possible to extract some sense of what ontology exists by training some sort of NLP algorithm on the peer reviews themselves? So I'm looking, in a way, not at the top-down approach of designing the ontology and then extracting information from the papers, but at a bottom-up approach of developing an ontology schema by scanning thousands of papers. Any thoughts on this would be great.

Yeah, that's an interesting idea. And actually, some attempts like this are being done — not in the context of reviewing, but in the context of summarizing, and in the context of identifying trends as well, which in those cases also requires a bottom-up aspect. So I realize this is not the same as what you suggest in terms of ontology definition, but the basic bottom-up tools you would use are kind of the same, right? So that would be an interesting thing to look at. Okay, thank you.

So, any other questions from the audience? This is meant to be an informal conversation between all of us, so please don't be shy. I guess I can ask one question that I always wonder about. There is a lot of secondary data out there, and, Martial, you also mentioned there's a lot of human-generated data, and it's messy, heterogeneous, and hard to use. So from your experience, when you see a dataset, what makes you trust that it is usable, or at least manageable, before you actually download the whole thing?

Yeah, so I think the key thing is not the data, it's the metadata, okay? That will say how the data was acquired, right? That is the most important thing. As you mentioned in the introduction, I come originally from robotics, and datasets are extremely important in robotics, as in any field, actually. But one thing that is particularly difficult is to document exactly the conditions under which the data was acquired, especially when we talk about physical data — machines moving or interacting with the environment, right? So that's the main point, I think: how was the data collected? And by that I don't mean just statistical characteristics of the data, in terms of bias or things like this; it's more how the data was acquired, how it was transformed, and so forth. One of the issues, I think, is having formal ways of describing that — formal ways that can be shared across domains. I think a lot of that is still a bit of an art, right? It's still very much data- and domain-dependent, so it's difficult to have common best practices or, even better, common tools for describing that metadata. So, yeah, metadata is the key.
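As one small illustration of what a formal, shareable acquisition record might look like — purely a hypothetical sketch, not a schema from the talk or from any existing standard — the kind of "how was this collected" facts discussed above can at least be made machine-readable:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class AcquisitionRecord:
    """Machine-readable 'how was this data collected' metadata.

    Every field name here is illustrative; a real schema would be
    negotiated per community, and ideally harmonized across them.
    """
    dataset_id: str
    instrument: str            # sensor, survey platform, robot, ...
    collected_from: str        # ISO dates bounding the collection window
    collected_to: str
    environment: str           # physical or lab conditions
    transformations: list = field(default_factory=list)  # processing applied
    known_biases: list = field(default_factory=list)     # documented caveats

record = AcquisitionRecord(
    dataset_id="campus-nav-runs-03",
    instrument="stereo camera + wheel odometry",
    collected_from="2021-03-01",
    collected_to="2021-03-12",
    environment="outdoor sidewalks, mixed weather",
    transformations=["lens undistortion", "10 Hz downsampling"],
    known_biases=["daytime only", "single campus"],
)
print(json.dumps(asdict(record), indent=2))
```

The hard part, as the exchange above suggests, is not writing such a record but agreeing on its vocabulary across domains.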
I guess many of the speakers in the morning also mentioned implementing different metadata schemas. For those in the audience, I kind of want to pose a question: we are dealing with many different kinds of data schemas, so when you're facing all these schemas, how do you choose which one to use and how to move forward? Is there any effort toward harmonizing all these standards? Okay, I don't see any response — that means the answer is no, right? I suppose, yeah. My sense is that this is a tough question for most people to answer. I saw in the morning, when Lily was talking about the Dimensions data, that I think Melissa suggested adding this ontology to it. So, for the audience, if you'd like to join the conversation, please go to the Slack and contribute to that thread.

So I guess, Martial, the next question from here: at universities we generate a lot of data from the sciences, and from robotics as well, I guess, but also from domains and disciplines that are not traditionally data-heavy, like music, the arts, and the humanities. They're also generating a lot of data, everybody wants to use data, and AI is becoming more and more of a standard application in these domains. So as a higher education institution, how do you see our role in helping people use all this data?

Yeah, so one major direction that we need to pursue — that we are pursuing to some extent — is taking all those ideas and developing tools that can be used with minimal knowledge of those issues; with minimal expertise, to use the term I think you used. And that involves work and research not just in data science and AI and related fields, but also work in HCI, and in injecting into the conversation models of how one uses the data. This is similar to some work in software engineering, for example, that involves a lot of human modeling: how people think about code, how people design code, how people track down errors, do refinement, and things like this. We need similar models of how people deal with data generally. So it's kind of a strange thing: to address the problem you're describing, you need the technical work of building tools that are well engineered, so we can lower the barrier to entry, but you also need a lot of work that comes more from human studies, psychology, and human behavior. And I'm a great believer that a lot of the progress we're going to make in computer-science-related tools will come not just from technical progress in computing-related things — AI, et cetera — but from better understanding and better modeling of human interaction; in this case, how one goes about using and interacting with data.
So we have a question from Keith. Keith, do you want to unmute yourself?

Just done that, thanks. Yeah, Martial, thank you — fascinating presentation, as I expected. You make me wonder about incentives for data sharing. We all know that, apart from shared complaints about car parking, most faculty in a university are united by their discipline rather than by their institution. And therefore I can imagine, for example, roboticists around the world sharing data with each other so that they can broaden the analysis and testing of models. But in many instances I've seen or heard of effective reuse of data by people across disciplinary boundaries. My first and favorite example of that was when I worked in Australia, and we had a zoologist tracking kangaroo movements whose data turned out to be hugely important for climate scientists. So I wonder about incentives at the institutional level. As you and I gear up for our promotion and tenure committees in a few weeks' time, should we be thinking about how we recognize faculty members' sharing of their data, and the reuse of that data, as part of their career trajectory? Or are there other incentives we should be thinking about to encourage people to take on the creation of metadata and the proper curation and sharing of their materials?

Yeah, that's a great question — and thank you for mentioning the parking, too; I was wondering when that would come up. So, you know, I would think in terms of the evolution of how research is evaluated. It used to be that research was evaluated on, say, journal papers, right? That was kind of the major measure of productivity. Then we went to other instruments of publication, which are not necessarily journal papers, but they are still publications; they look like articles and things like this. And then the next step — and again, I'm talking a little more about my own world, but I think it applies more broadly — was software and things like this that can be released to the public and used and reused by a large number of people, in some cases, as with deep learning, in a large number of applications across the board. And what we see now — since you used that metaphor of reappointment and promotion — is that in the section which is no longer called "publications" but "products," we see more and more data and datasets. And the measure of success for those datasets is precisely what you said: how widely is it used, how much of a reference dataset is it, and, more and more, how broadly is it used? So I think that's the evolution, and we see it already: a dataset that is recognized as enabling new work and new research is a metric as powerful as publications or code or other artifacts, and I think we're going to see more and more of that in the future.

So there's a link shared in the chat — sorry, I'm not sure I'm reading your name correctly — about implementations for repositories and different stakeholders. Thank you very much.

So, I don't see further questions from — oh, there's one hand. Ali, go ahead and unmute yourself.

Yeah, you didn't see it because I just put it up there, so thanks for catching it. This is a little more of a high-level question, and it's just something that I wonder about AI and machine learning in general — I'm sort of adjacent to the field rather than really working in it. My question is about the idea of error propagation. If we think about these big-data methods, and how they're probabilistic and statistical —
— you know, they're really for making decisions with a ton of data and not much time. We can imagine that works very well in a business context, or maybe even a policy context. But I'm really thinking about it when it comes to research, particularly research in the sciences, where I have some concerns about accuracy, and about particular values being discovered or represented rather than a more exact picture. And of course a lot of science is probabilistic too, but I guess my question is: how can we avoid the danger — if it even is a danger — of error propagation? For instance, if you Xerox something, and then Xerox the copy, and then Xerox that copy, with the corresponding loss of quality each time — the more we aggregate, and the more we develop tools that are based on another aggregation, and another aggregation, I worry that we're losing accuracy even while maintaining the appearance of precision. Can you comment on that?

It's very interesting, actually. It depends on the perspective, because from a purely statistical perspective — depending on how the data is used, and exactly what estimation is done from the data — you could kind of argue the opposite, right? That's the position taken in some areas in AI and robotics and so forth: if you accumulate enough data, the local errors in the data are going to, you know, kind of average out, right? So in that case you actually get a better and better model. Now, that makes a couple of assumptions. It assumes there's no systematic error or bias in the data, and it also assumes that the processing — a deep learning algorithm or something — actually does what I just said: averages the data instead of locking on to some artifacts of the data.

So I think the key ingredient that is missing here — and in fact across the board in AI and ML — is to have tools, models, or theories that understand more deeply: if I have this dataset here, and I now perturb that dataset in a certain way, how is my model perturbed, right? That's basically what you would call an error propagation model. And this is really important not just for the error propagation you described; it's also really important for understanding how the model I have learned on this particular dataset — I'm referring more to purely machine learning approaches — is going to transfer to my task or my domain, which is not exactly the same as this one. So there is this whole issue of understanding the effect of disturbances, if you will, in the data, which is still to be worked out. We have tools from statistics, of course, but they make very strict assumptions on the data and on how it's used. Within those assumptions, they are good statistical tools for this kind of analysis. But once we get into black-box machine learning things — deep learning and all that — those assumptions disappear, and we no longer understand these kinds of data perturbation effects. And that was a long-winded answer, I'm sorry.
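In the absence of the theory called for above, one crude empirical stand-in is to perturb the data yourself and watch the model move. A minimal sketch, with an ordinary least-squares model and synthetic data standing in for the "black box" — all names and numbers are illustrative:

```python
import numpy as np

def fit(X, y):
    """Ordinary least squares as a stand-in 'black box' learner."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def perturbation_spread(X, y, x_query, trials=200, seed=0):
    """Refit the model on bootstrap-perturbed copies of the data and
    report how much the prediction at `x_query` moves -- a crude,
    empirical error-propagation estimate."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(trials):
        idx = rng.integers(0, len(y), size=len(y))  # resample rows
        w = fit(X[idx], y[idx])
        preds.append(x_query @ w)
    return np.mean(preds), np.std(preds)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + 0.3 * rng.normal(size=100)
mean_pred, spread = perturbation_spread(X, y, x_query=np.ones(3))
print(f"prediction {mean_pred:.2f} +/- {spread:.2f}")
```

A tight spread here is no guarantee — systematic bias, the case flagged above, survives resampling untouched — but a wide spread is a cheap warning that downstream conclusions rest on unstable ground.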
Any other questions? Oh, sorry, I realized I didn't unmute myself. Thank you — thank you both, that's in fact a very interesting topic, and we do have a speaker tomorrow who will have time to talk about error propagation and dataset decay. So, any other questions?

I guess this question is related to all of us in one way or another — related to the pandemic. COVID has changed applications in many areas of research and technology, and all areas of life, I would say. So, Martial, how do you see COVID having changed our way of using data, interacting with data, or thinking about data? And do you see — I know there are a lot of challenges here — but do you see any opportunities moving forward?

Yeah, well, so I think it really emphasizes — if we did not think about it enough before, it certainly emphasizes — the importance of data sharing and data reuse. And one thing it emphasizes maybe more than other applications is the fast pace that is necessary. There is not the luxury of taking the time to create data and look in more detail at things like error propagation; there is a kind of urgency that motivates looking at different ways of using the data, of extracting information from the data, even though we don't have, shall I say, all the elements all the time to extract the information we want. The Delphi project is an example of that, right? They're trying to do continuous updates and prediction, which is very different from other applications, where you can basically get the data, do your processing, get your results, and so forth on a much larger time constant. So how to deal with this more immediate loop of processing and prediction is something that is probably interesting. It's not unique to COVID, of course, but it's happening at large scale for a critical application, and that's motivating new thinking about how to use data.

Thank you, Martial. I think we're running out of time, but just one more question from the audience: do you have any thoughts on how to prepare today's PhD students for tomorrow's data handling? Thanks, Alicia, for asking that question.

Yeah, so I think at a high level — a non-technical level — it goes back to Keith's question about interdisciplinary aspects: it is to really expose students to a wide range of data. Data does not mean just scientific data; it does not mean just human-generated data. There are a lot of different types of data, and they lead to very, very different technical challenges. And actually, this is not unique to this topic, but it's always the most important thing to come in with a sense of the range of the field. The biggest mistake we can make is to narrow data down to one field. I'll give you an example from my own field, computer vision: we've had for a long time a very restricted view of what data means, which is basically matrices and vectors of numbers, and that's it. So having a much deeper understanding of the different styles of data — an appreciation of that — is critical very early on.

Thank you. Thank you, Martial, for sharing your thoughts with us, and thanks to the audience for asking all the questions. So now, I guess, we are having a break. Let's come back at three ten — sorry, 3:10. And in the meantime, if you want to follow up on some of the questions, feel free to get together now. See you later. Thank you.
Thank you.