At IBM we had a program in which researchers would propose topics, so that IBM researchers could collaborate with people in academia. At that moment I was doing mostly policy work related to network management, but I wanted to do more things related to privacy and security, and it occurred to me that it was a good opportunity to approach Elisa to work on that project, and it was actually quite successful. Today I was looking at Google Scholar, and according to Google Scholar we have 22 papers that we co-authored together. After that I started looking more into Elisa's record. There is this publication that the database community does with interviews of prominent researchers in the area; it comes out in SIGMOD Record, I think once every year. She was interviewed, I think in 2009 or so, and it appeared in 2010, and I would like to read the first question and the answer that she gave in this interview. The question is: Elisa, you have written a lot of papers and you are listed very high in CiteSeer. For example, you have 25 papers listed in DBLP just in 2007, and you have 301 co-authors in DBLP. How do you collaborate with so many people? The answer was the following: I really like to work with other people a lot. I try very much to understand what other people are working on, because I feel that I learn a lot from them, more than they learn from me. This goes back to my very early experience when I was a very young researcher, 23 years old. At that time I had a lot of ideas for research that I wanted to do, but I was always running into very senior people in the field who dismissed all my ideas. Later I discovered that some of these ideas of mine had not been wrong at all. Because of this experience, I really try to listen to the ideas of other people and encourage them to pursue their ideas. Today I looked again at Google Scholar: she now has 597 co-authors.
In addition to that, she is director of the Center for Education and Research in Information Assurance and Security (CERIAS) at Purdue University. She also directs the Cyber Center, which works on computational methods for discovery and learning. She received an IEEE Technical Achievement Award for her work on database security, and the Tsutomu Kanai Award, also from IEEE, for research on security in distributed systems. She is an ACM Fellow and an IEEE Fellow. Welcome, Elisa. Okay, well, thank you for a very nice presentation. You always have to be careful what you say, because now, with the web, everything can be known; once you say something, it is on the web and you cannot take it back. But thank you. I would like to say, actually, that Jorge said he was reading my papers, but I was reading his papers before he was reading mine, because very early in my research we were working on deductive databases, which was a very rich area from about 1985 to 1990, and the papers by Jorge always come up as among the most cited papers in that area. So for me it was really a dream to be able to work with him, and I was happy that later on we really were able to work together. Today I'm going to focus on big data, mainly security and privacy. I listened to the presentations this morning and I changed my presentation a little bit in real time: I had decided I wanted to explain some cryptographic protocols, but then I decided to skip that and focus on something else which we have been doing in our cyber center. But let me start by saying a few obvious things about big data. Today we have a lot of technology for acquiring data from the environment. Typically we have a lot of sensors, drones, pervasive computing capabilities, so we can acquire a lot of data. Not only can we acquire all this data, we can also process and store this data because of computing capabilities, cloud technology and so forth. And we have a lot of work on machine learning and analytics; a lot of good work is going on here at UPF.
So we have powerful tools to extract a lot of knowledge out of this data. Big data is really going to change a lot, and I guess you are all convinced about that, given that this workshop focuses on big data and extracting knowledge from big data. I will go very quickly over some definitions of big data because I'm sure you are all very familiar with them. One definition which I like a lot is the one given several years ago by IBM, which characterizes big data by volume, variety and velocity. Today we have a lot of technology for dealing with velocity and with variety, and I've seen a lot of nice work this morning on how to associate and extract meaning from multimedia data, for example images, music, sound and so forth. However, another very important characteristic that we have realized is that the power of big data also comes from its being multi-source. The idea, and we have seen this in practice in our work at the cyber center at Purdue, is that in a lot of cases you end up having big data by aggregating a lot of different small data sets, and sometimes the data sets may represent different aspects of a certain phenomenon or event of interest. This is what really brings a lot of power to big data: by correlating and combining heterogeneous data you extract much more knowledge out of the data. Now, a couple of years ago the database research community in North America came up with a white paper on big data; I was one of the authors. This white paper describes what is meant by data science, what exactly the meaning is, and of course the challenges. One major challenge is still to be able to automatically integrate heterogeneous data sets, because integrating data is still more an art than a real science. And if you want to integrate thousands of data sets, this is going to be a major problem. But again, there is a lot of nice research going on, so I assume that we will soon be able to integrate a lot of data sets.
And this gives a lot of power. Now, data can be used in many different domains: for science, first of all, but also for security. And big data can be used for many different types of security, not only for, let's say, cyber security, which is the area where I work. It can be useful for health care in the medical domain. Big data will be very important for homeland protection. And of course, for science, big data is an important asset. So here I have some very simple examples of the application of big data for security. Let's start with cyber security. If you look at current systems called SIEM, security information and event management systems, these are really tools used by corporations; there are a lot of products by companies. What these SIEM systems really do is collect a lot of data. For example, they collect a lot of metadata concerning any packet crossing the corporate network, including provenance and the time when the packet entered the system. They collect this huge amount of data, correlate all this data, combine it in various ways, and visualize the data in order to detect whether there are potential attacks or anomalies which may be signs of attacks. There, big data is a critical issue. Those tools are quite powerful, but a typical problem they have, and I have talked to many companies using them, is that they have a lot of false positives. That is always a problem that they mention: their security analysts look at all these alarms, and many of them turn out not to be true. But in any case, in the end, this is really the problem of having a lot of data, integrating it, and trying to come up with anomalies. Authentication is another area where more and more we see a lot of use of personal information, including physical information like biometrics, being used for authenticating users.
Now, not only do we see biometrics being used, but we also see ideas like continuous user authentication. The idea is that the system learns your typical behavior while you are connected to the computer, and therefore you don't need to perform explicit actions to authenticate: the system, by comparing your typical behavior against your actual behavior, can see whether there are differences. Of course, these are not yet very strong systems, but that is one area. And even biometrics is becoming much wider. Before, when we thought of biometrics, the idea was always fingerprints, retina scans, and face recognition. But now people look at your gait, the way you walk, especially if you have a mobile device. They may look at the veins on your hands; we had a bank come and talk to us that wanted to explore this type of biometric, and so forth. The point is that even for authentication, to make sure that users are strongly authenticated, we see a lot of data collection about the users. Access control is another area where I did a lot of work, and again the trend there is to collect more and more data. A typical example is location-based access control, where we did a lot of work some years ago. Then I thought, okay, let's do something else, and recently we have done some new work, and for some reason this paper is getting a lot of downloads; apparently people have not had enough of location-based access control. The implication there is that a user, in order to access some sensitive data or resource, needs to be at a certain location: they can only access it from that location. This requires the access control system to know the location of the user, which means that the access control system will start acquiring a lot of sensitive data. So you may ask me, okay, but shouldn't we assume an access control system to be trusted?
Yes, but it is still a software system that can be compromised. And the more data a system has to acquire and use, the more expensive securing all this data will be. Organizations tend to get a lot of data, and acquiring a lot of data is great, but if the data is sensitive, then the cost of keeping all this data, because you have to secure it, is going to be higher. One has to really understand that. Finally, insider threat is another area, where we had a big project funded by the Department of Homeland Security. Again, it involves monitoring what users do to detect whether they are misusing the data: they access data for which they have permission, but they are using the data for something else, or copying the data somewhere else. The way we address this problem is by looking at typical uses of the data, creating profiles of data users, and then using machine learning to do anomaly detection. This shows you an example of how a lot of these techniques for security are based on collecting a lot of data and somehow analyzing it. Homeland protection is another area people have talked about a lot, especially prediction of attacks. It's not clear whether those systems really help a lot in predicting attacks, but for sure big data can be very useful for the management of emergencies and disasters, because if you know where users are at a certain moment from social networks, you can easily reach those users and help with managing the emergency. Another area which is very important is, for sure, health care and medicine in general, where there is the trend of personalized health care, which again requires a lot of data. So far, data seems to be very useful in many different domains. Unfortunately, when you try to use data, not only for security, including cybersecurity, but also for other things, you run a lot of privacy risks.
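The profile-based anomaly detection described for insider threats can be sketched very simply. This is a hypothetical illustration, not the project's actual system: build a per-user baseline from past daily record-access counts, then flag any day that deviates from the baseline by more than a few standard deviations.

```python
# Hypothetical sketch of profile-based insider-threat detection:
# a per-user baseline (mean, stdev) of daily access counts, and a
# flag for days far outside the user's typical behavior.
from statistics import mean, stdev

def build_profile(daily_counts):
    """Baseline from past daily access counts (invented example data)."""
    return mean(daily_counts), stdev(daily_counts)

def is_anomalous(count, profile, k=3.0):
    """Flag a day whose count is more than k standard deviations off."""
    mu, sigma = profile
    return abs(count - mu) > k * sigma

history = [102, 98, 110, 95, 105, 99, 101]   # typical days for one user
profile = build_profile(history)

print(is_anomalous(104, profile))   # ordinary day -> False
print(is_anomalous(900, profile))   # bulk copying -> True
```

A real system would of course profile many features (tables touched, time of day, query shapes) and use proper machine learning rather than a z-score, but the structure is the same: profile, then detect deviation.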
Those risks arise because, on one side, as I said before, in many cases you need to exchange and integrate data across multiple sources. The problem is that the data then becomes available to multiple parties, and chances are that one of those parties doesn't do a good job of protecting the data. For example, at Purdue we have a specialized system called Fortress: whenever faculty have to do research on sensitive data, they store the data in this special server, which is isolated from the rest, and everything is encrypted there. Okay, but do other organizations do the same? If one of the parties is a bit weak, data can be compromised. But this is not the only problem. There is a major problem in that even when you anonymize the data, which means you take your data and remove identifying information, for example the name of the user or the social security number or passport number, you can still re-identify the data. Even if those portions of the record are removed, you can still link this data to some specific individual. I'll give you a very simple example which is very well known. And then again, security tasks require getting so much information about the users that you really have all those privacy risks. So before going to the major question that I'm really trying to address here, let's look at this very simple example. This is very famous, and it led to a lot of nice research on anonymization, which is today a very rich area; there are very strong researchers even here in Catalonia, like Professor Domingo-Ferrer, a very strong researcher, one of the best I would say, who has done really fundamental work. But if you are not aware of it, this is just a simple example which shows you that just removing identifiable information from a record is not enough. This is the famous example introduced by Latanya Sweeney.
She was a PhD student at MIT working on privacy of medical records, and she came up with a real example. A Massachusetts hospital released a data set concerning the medical treatment of patients. Of course, these data sets were anonymized: as you can see, there was no social security number, no name. However, there was certain information that people call quasi-identifiers, which are typically the date of birth, the sex and the zip code. Even though those by themselves cannot identify an individual, they can be correlated with other data sets which have the actual name. So what she did was buy a voter list, which apparently she could buy by paying a few dollars. Now, this voter list, as you can see, had certain columns in common with the anonymized data set: basically the zip code, date of birth and sex. And by correlating the records on these three common columns, she was able to identify, for example, that the last record belongs to a specific person with a name. This is a very well-known problem. It means that even if you anonymize the data, if this data has common fields with other data sets which are in the clear and contain the actual name of an individual, then anonymizing your initial data set is not sufficient. And remember, today we have so many information sources around; you never know, once you release a data set, with which other data sets it is going to be combined. Again, the field is very rich and there are a lot of much more sophisticated techniques. People started with the notion of k-anonymity, then this was extended a lot, then people came up with differential privacy, and there is a lot of excellent work done here. But this is one example of what happens when you start combining all those data sets. So the real question that we have is the following: can security and privacy be reconciled?
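The linkage attack just described can be reproduced in a few lines. All records below are invented; the point is only that a join on the quasi-identifier columns (zip code, date of birth, sex) re-identifies rows that had the name removed.

```python
# Toy reconstruction of Sweeney's linkage attack: join an "anonymized"
# medical table with a public voter list on the quasi-identifiers.
# Every record here is made up for illustration.
medical = [  # name and SSN removed, diagnosis kept
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "cardiac"},
    {"zip": "02139", "dob": "1962-01-02", "sex": "M", "diagnosis": "flu"},
]
voters = [  # publicly purchasable, names in the clear
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "name": "A. Example"},
    {"zip": "02140", "dob": "1970-03-15", "sex": "M", "name": "B. Sample"},
]

def link(medical, voters):
    """Re-identify 'anonymized' rows by matching on the quasi-identifiers."""
    qid = lambda r: (r["zip"], r["dob"], r["sex"])
    by_qid = {qid(v): v["name"] for v in voters}
    return [(by_qid[qid(m)], m["diagnosis"])
            for m in medical if qid(m) in by_qid]

print(link(medical, voters))   # → [('A. Example', 'cardiac')]
```

The first medical record shares all three quasi-identifier values with a voter record, so its diagnosis is now attached to a name, exactly the failure the talk describes.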
Which means: okay, we understood that data is very important for security, for research, for many purposes. Privacy is also important. So here we have a fundamental issue. Do we have to give up our privacy, just forget about privacy for the sake of security, or not, or what should we be doing? Now, if you talk to researchers in the security community, you see different views. Most of the senior people are very skeptical. They tell me: oh, just forget about privacy. We lost our privacy, we have to live with that; in the future we won't have any privacy, so just forget it. Sometimes I go back and forth; sometimes I say, yeah, perhaps they're right. But once in a while I look around and ask: is the world worrying about this problem? What are people thinking? And I discovered a lot of interesting things. Actually, there are a lot of initiatives which really try to address this problem in one way or another. Keep in mind that some of those issues cannot be solved on a technical basis alone: they need to involve policy makers and researchers from other disciplines. A nice example is the Internet Rights and Principles Dynamic Coalition, where they are trying to come up with a charter of human rights and principles for the Internet. They have a long list of principles, but two important ones were, on one side, freedom of expression and association, which is to say that everyone has the right to seek and receive information on the Internet without censorship. This by itself is a major problem because, of course, you don't want to have censorship, but sometimes there is certain content that you don't want to be sent around. The more important one, though, is the principle on privacy and data protection, which really says that people have a certain right to privacy online.
People have the right to use encryption, the right to use online anonymity tools, and the right to data protection, including over the collection of personal data. So this is an initiative which of course is moving; there is a website. There was something even by companies, called the Global Network Initiative, with some participant companies. Their goal was really to help companies, because one problem that companies have is that they have to deal with many different privacy regulations in many different parts of the world. They do not recommend that a company not comply with the law, but they have certain nice guidelines that say: this is what you should do, because again they wanted to make sure that privacy is respected as much as possible. So there are some nice implementation guidelines that they suggest. For example, many of these companies probably got requests from governments to release some of their data. So they give guidelines saying: okay, you need to give the data, because if the law of the country requires it you have to comply, but you need to interpret the law as narrowly as possible, check the law, and make sure that this is really allowed by the law. And they have many interesting examples. They also had this nice idea of transparency reporting, so that a company should provide a report of which data were requested and for what. Now, this is interesting: as you can see, there is a set of guidelines, and it would be nice to understand, from our point of view, what could be the technical means that we need to provide to be able to really support this type of approach. By looking at this, I become a little bit more optimistic, because in the end it seems there are people who worry about this, not only activists, but also companies. But from the research side, what can we do?
Because, okay, it's fine to see these organizations, but in the end we are interested in shaping and doing our own research. What can we do? Here I have, in a way, good news and bad news. The good news is that I feel you can reconcile security and privacy in certain ways, because today we see a lot of nice work, for example in applied cryptography, with many techniques with which you can work on data and still have a certain amount of privacy. But on the other hand, there is no single technique. So if you ask me which privacy technique you should use, my reply would be that it really depends on what you are trying to do with the data: if you are just trying to do some analytics, you can use certain techniques; if you really need the microdata, that's a very different problem; if you're using data for biometric authentication, you use yet another technique. This means that we have a lot of nice research to be done, which is good news for us. The field has been progressing a lot in areas like data anonymization, data privacy, and applied cryptography. But a lot of those techniques, especially the ones based on crypto, need a lot of engineering: they need to be much more efficient, and they need to scale to huge data sets, which they don't yet do. So now I'm going to give a few very quick examples. Then I'll move on to discuss a research agenda that was the result of an NSF-funded workshop convened to debate research directions for privacy and security for big data. And then I'll talk briefly about a novel system for scientific research that we have been developing, which has certain nice support for security of big data. A typical example is this one, which we have been working on for a while. It is a very simple problem, and it shows that you need to be creative sometimes. We have two parties.
Each party has a file, a set of data, and they want to compute the intersection of those two files: they want to find the common records. Actually, our protocol works more on distances, so on a certain similarity, but let's suppose they want to find the common records. However, each party cannot disclose its own file to the other party. This is a very well-known problem. In cryptography there has been a fundamental protocol called secure set intersection, with protocols defined, I don't know, perhaps 20 years ago, addressing the same problem: we have two parties, each party has a set, they need to compute the intersection, but no party can disclose its own input to the other party; they will only share the result. And this is in turn a specific case of a class of general protocols called secure multi-party computation techniques, where multiple parties each provide a part of the input and they want to compute a joint function without any party revealing its own input to the other parties. The problem we ran into when trying to implement this protocol is that conventional cryptographic protocols are very slow: we did some experiments, and it would take four days to compute the intersection. So we said, no, we cannot really do that. I came up with an idea that came from early work on multimedia retrieval. In multimedia retrieval, what we were trying to do in the early 80s, when we had a first European project in my group, was searches on text. At that time, it was really the beginning of the 80s, so there were not a lot of techniques. But the idea was to implement a two-step approach: first you do an initial search using summary information, which really was some kind of bitmap of each text.
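To make the problem concrete, here is a deliberately naive sketch of the set-intersection interface: both parties hash their records under a salt agreed out of band and compare only the digests. This is not a secure construction (small input spaces can be brute-forced); real secure set intersection protocols use much stronger cryptography, as the talk notes. The salt name and record values are invented.

```python
# Naive salted-hash sketch of the private set intersection *interface*.
# Illustrative only: real SSI/SMC protocols are cryptographically stronger.
import hashlib

SHARED_SALT = b"agreed-out-of-band"   # assumption: negotiated beforehand

def digests(records):
    """Map each record to its salted SHA-256 digest."""
    return {hashlib.sha256(SHARED_SALT + r.encode()).hexdigest(): r
            for r in records}

def psi(mine, their_digests):
    """Return my records whose digest also appears on the other side."""
    return sorted(r for d, r in digests(mine).items() if d in their_digests)

alice = ["ann", "bob", "carol"]
bob   = ["bob", "dave", "carol"]
print(psi(alice, set(digests(bob))))   # → ['bob', 'carol']
```

Each side only ever sees the other's digests, never the raw records; the weakness is that anyone who can guess candidate records can test them against the digests, which is why the serious protocols are so much more expensive.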
Then the texts which passed the first phase are analyzed in detail to see if they really contain the actual words you were looking for. We applied the same idea here. I said: why don't we try to come up with a two-step approach in which we first do a preliminary comparison, which has to be very efficient but not very precise? In this preliminary comparison we may keep some records that will then not match when we run the cryptographic protocol. The preliminary step was done using some privacy techniques. We started with k-anonymization when we began this work: we would compare the anonymized versions of the records, transformed according to k-anonymity. If the anonymized versions of two records match, the actual records may still not match, so you then have to compare the actual records. However, if two records are in the situation that their k-anonymized versions do not match, then the actual records will not match for sure. Later we moved to using differential privacy, which was a better approach. With this two-step approach we got much better results: we were able to do this matching going from that huge number of hours down to one hour, so it was a major enhancement in scale. But this actually introduced a lot of issues that we did not anticipate, and we realized that the security community uses different security models. In our protocol we combine two steps. The first one uses anonymization or differential privacy, which has its own theoretical, let's say, privacy definition. The second step uses SMC, conventional cryptographic secure multi-party computation, which has its own security definitions. When we put the two together, we realized that even though each one is secure by itself, when you combine them there is a security issue: we found that there was a vulnerability.
Because, again, our approach was very different: we were not just applying differential privacy; we were taking the result from differential privacy and using it in some other cryptographic protocol, and therefore there was a security issue. We then fixed it, in some complicated way, but we realized that there is really a need to come up with a good security model, especially when you try to combine completely different security techniques. The scalability is very good, but we need to enhance it even further. We also want semantic matching, where you do matching not just with what current systems support, like string matching with edit distance or distance between numbers, but semantically: for example, if I have a record which has a field "job", I would like to say that the value "professor" is much closer to the value "associate professor" than it is to "student". That's called semantic matching, and we would like to do that. But this is one example where, in order to come up with a solution, you need to be a little creative, because we had to get this idea, and be a good engineer, because I had a student really go down and understand how to implement SMC in a very efficient way, using some parallel computing as well. In the end we got something that we think is reasonable: it can be used to do those matches in a more efficient way. Some other interesting examples, which are quite old, come from work on privacy-preserving collaborative data mining. This is work done by Clifton and Murat Kantarcioglu. They were addressing the same kind of problem: I have N parties, each party has a data set, and they want to do data mining on the union of those data sets, but they cannot combine those data sets in some central server for privacy reasons, so no party can share its input data set.
They came up with protocols in which you can do this data mining, including classification and clustering, without sharing the data. The main problem here is scalability: those protocols are still very, very inefficient. We have also done a lot of work on privacy-preserving biometrics, where with our technique we don't need to store the biometric template in the clear. This is actually a little bit tricky, because with biometrics, each time the sensor takes a reading, the result may change a little bit. So we had to use a somewhat complicated approach: extract invariant information from the biometric, then use a support vector machine to do classification, and use error-correcting codes. Still, our results are not very good, in the sense that for biometrics you expect a very low rejection rate: if a user really is who they claim to be, you don't want the system to say no, you are not authentic. We have an accuracy here of about 80%, but for biometrics you must have at least 98%, so we are trying to come up with a completely different technique, because that one doesn't work well enough. But this, again, is important, because with the idea of continuous authentication, if we can come up with a system that doesn't even require storing the biometric information in the clear, then it will be very nice to have both security, because you can authenticate your users, and privacy at the same time. There was also some nice work called CryptDB, which is very famous, from MIT. This was nice engineering work. What they did was take an entire DBMS, for example an Oracle system, with the idea of storing the entire DBMS on a cloud while encrypting all the data. What do they do? They encrypt columns with different encryption schemes depending on the type of queries that you need to run. So if you need queries in which you order on a field, they have to use a weaker encryption technique.
It is clever engineering work, because what they basically do is exploit a very important feature of relational databases, which is user-defined functions. If a user issues a query like "age greater than 10", they rewrite the query, replacing the predicate with a call to a user-defined function which will decrypt the data according to the proper encryption scheme. We did similar work; ours is a bit different, but it is based on a similar principle. This system, CryptDB, has become widely known; Forbes mentioned it as one key discovery for privacy. And they have been working on also supporting machine learning on top of encrypted data. So if this is an area you are interested in, this is really some good work to look at. I like this idea a lot: let's take an entire DBMS, encrypt everything, and transform the queries. They have a proxy which takes the query and rewrites it. Again, we built a similar system; in the end it didn't take a lot of work, they had a PhD student working on it for one year, but our system has more fine-grained access control. But let's move on to the research agenda. As I said before, we have a lot of nice research, and as you can see, depending on what you need, you may apply completely different techniques: you may use techniques like CryptDB if you just need to support queries, or you may use privacy-preserving distributed machine learning in case you need to do machine learning. But of course, there is a lot of research that still needs to be done, and some issues are not really solved by technical research alone: we need to be a little more open-minded and work with other people who work on policy, sociology, and so forth. A typical question is, first of all: for which domains are both security and privacy critical? For sure, health care is one of them, because people are very sensitive about their medical privacy.
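The query-rewriting idea can be illustrated with SQLite, whose Python driver lets you register user-defined functions. This is not CryptDB itself: the "encryption" below is a toy XOR so the example stays self-contained, whereas CryptDB uses real schemes (for example order-preserving encryption) chosen per query type; the table and key are invented.

```python
# Illustrative sketch of the CryptDB-style rewriting idea: store a column
# encrypted, register a UDF with the engine, and rewrite the user's
# predicate `age > 40` into `dec(age_enc) > 40`.
import sqlite3

KEY = 0x5A
enc = lambda v: v ^ KEY        # toy cipher, NOT real encryption
dec = lambda c: c ^ KEY

conn = sqlite3.connect(":memory:")
conn.create_function("dec", 1, dec)          # UDF now visible to SQL
conn.execute("CREATE TABLE people (name TEXT, age_enc INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("ann", enc(35)), ("bob", enc(52)), ("eve", enc(47))])

# The proxy rewrites `WHERE age > 40` against the encrypted column:
rows = conn.execute(
    "SELECT name FROM people WHERE dec(age_enc) > 40 ORDER BY name"
).fetchall()
print(rows)   # → [('bob',), ('eve',)]
```

The server never stores ages in the clear; the rewriting layer is what maps the user's plaintext predicate onto the encrypted representation. In CryptDB proper, the choice of scheme per column is what lets many predicates run without decrypting on the server at all.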
But political privacy can be another important aspect. Which are the policies related to the use of data for security? One thing is the ethical use of data. For example, the ACM has a policy committee, and this committee focuses a lot on the ethics of our profession. They have a lot of principles saying which ethical principles a computer professional should follow, but there is nothing about the ethical use of data. So I have some colleagues from the social sciences who are looking into this problem. The ownership of data is another critical issue; I see this all the time. In computer science we have this idea that if I create a file, I am the owner, and I can control access to this data. Unfortunately, today this is a really bad model, because what happens is that a piece of data, even though I created the file and stored the data, may belong to multiple parties. A typical case: if I put a picture on Facebook and this picture has many people in it, even though I created the file and took the picture, many people are in it and they perhaps have something to say about the use of this data. So in this workshop we debated, and we said that perhaps we should move to the notion of data stakeholder: the idea is that a piece of data has multiple parties which have an interest in it, and each one should have a way to express some preferences about the use of this data. I have talked to many people about this. Recently I was talking to a major company in the area of healthcare informatics, and they were telling me that they are involved in a lot of standards, because of course for them the standardization of medical records is one key area; there are a lot of standards, but they are moving on. The work they are doing now is not really about standardizing medical records: they are looking at how they can do preventive care.
So they also collect data; they have hired 50 machine learning people to try to see how they can use all this data to prevent diseases and help customers, and they plan to monitor things like what time the users go to work or walk around. I mean, they were giving me a lot of scenarios, but they told me: we have a major problem with the ownership of this data. They asked me, do you have a solution? I told them no, I just have the idea that we should perhaps forget about the notion of ownership and recognize that a piece of data may belong to multiple users, and they must have a way to work together in deciding what happens to the data. Again, another important thing: people in privacy, especially the people who are very paranoid about privacy, think that the users should control everything about their own data. This is a good idea, but it's very difficult to do in practice, because once your data goes to a company, you know, the company may not be willing to let you know how the data is used, for business reasons. So this is a question: is that always possible, in all domains? Even technically, will that always be possible? Which research advances do we need to make it possible to reconcile security with privacy, which means we want to have security but we also want to have privacy? Okay, so as I said before, among the techniques I mentioned are efficient techniques for performing computation on encrypted data. This may not solve a lot of problems by itself, but it is an important building block; if we don't have that, we cannot do much. Privacy-preserving data mining techniques are really important, because in the end, a lot of the time data is used for learning, for extracting knowledge. Privacy-aware software engineering is very important, and this was motivated by something which happened to us. We were working on making some tool to let users be anonymous when working from a mobile phone.
Now, there is Tor; Tor is a network anonymizer which will hide the IP address of the phone. Unfortunately, we realized that a lot of applications running on your phone will send a lot of data from your phone to their own cloud or wherever. We found that even game applications were transferring the contact list. So even if your IP address is anonymous, they still transfer this data. So we came up with the idea that we didn't want that: we wanted the user to be able to say, I want to be anonymous, avoiding this data transfer. A very simple approach, we thought, was: okay, let's change Android to be able to remove permissions at runtime. At that time, basically, whenever you would install an application, you needed to give it all the permissions upfront. So we modified the operating system, allowing permissions to be removed at runtime. And then we tested, and 80% of the applications would crash, because they were written in a way that did not anticipate having permissions removed at runtime. For me, coming from the database area, this was a big surprise, because you can remove a permission on a table from a user at any time. I think now they have fixed this problem. We also noticed that applications written for multiple platforms were much more robust; they were able to deal with this. But software needs at least to be able to support this; in many situations it needs to work with data which sometimes can be anonymized and can be less precise, and it has to deal with sensitive pieces of data. So this is very important. Then there is the key issue, which is how to balance, because in the end privacy is a personal matter: how do you balance personal privacy with collective security? That's a key issue, because personally, for example, I feel that medical data are key for researchers doing their research in medicine. So I'm very willing to say: here, take my data.
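The crash behavior described above, where applications assumed install-time permissions could never disappear, can be sketched in a small simulation. The class and permission names here are hypothetical stand-ins, not real Android APIs; the point is only the contrast between assuming a grant and checking it on every access.

```python
# A fragile app assumes a permission granted at install time is permanent;
# a robust app handles revocation on every sensitive access and degrades
# gracefully instead of crashing.

class PermissionRevoked(Exception):
    pass

class Device:
    def __init__(self):
        self.granted = {"READ_CONTACTS"}  # granted upfront at install time

    def revoke(self, perm: str):
        self.granted.discard(perm)        # permission removed at runtime

    def read_contacts(self):
        if "READ_CONTACTS" not in self.granted:
            raise PermissionRevoked("READ_CONTACTS")
        return ["alice", "bob"]

def fragile_app(dev: Device):
    return dev.read_contacts()            # crashes if revoked mid-run

def robust_app(dev: Device):
    try:
        return dev.read_contacts()
    except PermissionRevoked:
        return []                         # degrade: work without contacts

dev = Device()
dev.revoke("READ_CONTACTS")
print(robust_app(dev))  # [] -- the robust version keeps running
```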
I want, you know, society to benefit, to get better treatments. But on the other hand, I want to be sure that my data is well protected. Okay, I'll give my data to this research organization, but how do I know that they will do a good job in protecting my data? This is very important, but in the end it is a personal decision. So can we have a recommender system which can help the users with this decision? Because in the end, especially when you deal with the general public, if you tell them, oh, your data is encrypted, we use machine learning on encrypted data, this may not tell much to most people. So that is a problem. We also need access control for big data. In general, we have a lot of access control policies, so we need to manage those policies, and when data come with their own policies, you need to merge and integrate a lot of different policies. Data protection from misuse is very important. This deals with the fact that a user may have permission to access a piece of data, but what if the user uses this data for something else? For example, the user may copy the data somewhere else, may ship this data. This is tricky. This, again, will require machine learning: characterizing the profile of use of the data, that is, what does this user do with the data, and then detecting anomalies: oh, now this user is printing this data, is copying this data. Sometimes understanding, or even representing, the purpose of use of the data may not be so easy. Finally, and I want to conclude this part here, there are the privacy implications on data quality. We notice that sometimes, especially in social networks, people may not post real information. They put some fake or wrong information, which then impacts the quality of the data. Now you get data which do not have good quality. So those are some of the research directions that we had identified. As you can see, some will require not only technical solutions.
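The misuse-detection idea above, profiling how a user normally acts on data and flagging deviations, can be sketched minimally as follows. This is an illustrative toy: real systems would use statistical or machine-learning models rather than an exact-match profile, and the class and method names are invented for this example.

```python
# Build a per-user profile of the actions observed during normal operation,
# then flag any action that falls outside that profile as an anomaly.

from collections import defaultdict

class UsageMonitor:
    def __init__(self):
        self.profile = defaultdict(set)  # user -> set of actions seen

    def train(self, user: str, action: str):
        self.profile[user].add(action)

    def is_anomalous(self, user: str, action: str) -> bool:
        return action not in self.profile[user]

mon = UsageMonitor()
for action in ["read", "query", "read"]:
    mon.train("alice", action)

print(mon.is_anomalous("alice", "read"))   # False: normal behavior
print(mon.is_anomalous("alice", "print"))  # True: never seen before
```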
Now I'll try to go quickly through something else. As I said before, I'm also doing a lot of work on something, okay, let me put my glasses on, which has to do with cyberinfrastructure for the sciences. Since I saw the presentations this morning, I decided to say a few words about our actual system, which uses some of these security techniques. This system is called CRIS, Computational Research Infrastructure for Sciences. It is part of our cyber center, and this is the vision that was set up for the center ten years ago, when it was started, which was again to support research through data. And here is what we called our pyramid, which would need to evolve a little bit. As you can see, at the bottom we have a lot of multimedia data coming from different sources. So you need to represent, create, and acquire data. On top of that, you must have a lot of services, like dealing with non-traditional multimedia data; I saw a lot of presentations this morning, and of course you deal with traditional data sources as well. On top of that, you need to have services for discovery, which means, for example, running simulations, analytics, modeling, visualization, predicting trends. And finally, you need to support the cyber communities: people want to collaborate in research, so you need to share data, to share the process you follow for your science. So with this vision in mind, when I became the director of this center, I decided to do something very different. I didn't want to write yet another paper; we wanted to do something very useful for our colleagues. So the goal of our center was not to do just theoretical research on big data, but to really work with the professors in other colleges, in other departments. And we realized that we couldn't give these people just a prototype not working very well. We needed to give them an industry-strength system.
Because the first thing is, if you give them something which doesn't work, you are out of the game. So we set out to develop a real system, which has since been widely used. You see some of the users; one of them is the Indiana State Chemist. What these people do, basically, is that whenever there is, for example, some chemical spill in some river, they collect samples of water and then analyze those samples to see whether certain pesticides are above a certain threshold. And they were looking for a system to manage this workflow; they use a workflow where the sample goes through several steps. They asked companies for a system, and the prices were about 400,000 for the license, plus a certain amount every year. And then they heard about our system from Purdue, and they got our system. But we work with a lot of people. Purdue is a big agricultural school, so it has a big College of Agriculture. You see, we work with the new phenotyping facility, a big facility that Purdue is setting up. We work with people in the water community; they have sensors in the water. So, basically, we developed the system, CRIS. Again, we had two professional software engineers; you must have the right combination of people. So we had the two professional software engineers, plus the key architect was Sambo Diire. He had been working in a company, so he knows how to deal with customers. We sent him to talk to the various professors in whatever field, and he would spend hours with them to understand what they really needed. We realized that a lot of these people, in agriculture, biochemistry, whatever, like to use data, but they don't have any idea of what managing data means. It takes a while, and you must have the right person, and I had the right person. So basically, we came up with a system which has a lot of nice features. One of them is complete provenance tracking; we are very interested in research repeatability.
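The sample-analysis workflow described above, where each water sample moves through an ordered sequence of steps with full provenance, can be sketched like this. The step names and class are hypothetical illustrations of the pattern, not the actual CRIS data model.

```python
# A sample moves through an ordered list of workflow steps, and every
# transition is recorded (step, who, when) so the full provenance of a
# result can be reconstructed later.

from datetime import datetime, timezone

STEPS = ["collect", "log_in_lab", "extract", "analyze", "review"]

class Sample:
    def __init__(self, sample_id: str):
        self.sample_id = sample_id
        self.step = 0
        self.provenance = []  # list of (step_name, who, utc_timestamp)

    def advance(self, who: str):
        step_name = STEPS[self.step]
        self.provenance.append((step_name, who, datetime.now(timezone.utc)))
        self.step += 1

s = Sample("river-042")
s.advance("field_tech")
s.advance("lab_tech")
print([p[0] for p in s.provenance])  # ['collect', 'log_in_lab']
```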
So we wanted to make sure that the process followed by a certain experiment or analysis is well formalized as a workflow. We provide security: each researcher has his or her own workspace with access control, isolated from the other workspaces. It's very flexible in terms of sharing data. And basically, the main idea is to support defining the scientific process or experiment as a workflow. In addition, the system is very good for merging and annotating data, and for integrating heterogeneous data. Now, this system was not designed around our own ideas; it was designed by listening to what the people really needed. So this time we didn't say "we know." We listened to a lot of them for a while, and we make extensions only when we see that they are really needed. It's not because I want to add a fancy feature to my system; no, if it is really needed by researchers, we go and do the work. However, we do a lot of research. Recently, we have really been trying to understand how CRIS can be extended to be able to reproduce experiments, or portions of experiments, and to compare experiments. Now, this is the actual architecture; we can go through it very quickly. As you can see, there are a lot of data services. The key one is the workflow management, which makes it very easy for users to configure workflows. We have a lot of different storage systems. Mainly, the user data is stored in a NoSQL database; we decided to use this because it's more flexible. Relational databases are used to store the system data. We also use Hadoop when we need to deal with big data, and we have integrated Globus Connect to transfer huge data sets. The key notion in the system is the workspace, which again is unique for each project. And then we have a lot of different services, including data quality, because we realized that the users have a lot of problems with the quality of the data, with mistakes in the data. So you need at least to be able to verify that the values are within certain bounds.
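The bounds check just mentioned is straightforward to sketch. The field names and ranges below are hypothetical, chosen to echo the pesticide-testing scenario; a real data-quality service would load its bounds from configuration per data source.

```python
# Verify that incoming values fall within configured bounds before they
# enter a workspace, and report the offending (field, value) pairs.

BOUNDS = {"ph": (0.0, 14.0), "pesticide_ppm": (0.0, 500.0)}

def validate(record: dict) -> list:
    """Return a list of (field, value) pairs that violate their bounds."""
    errors = []
    for field, (lo, hi) in BOUNDS.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            errors.append((field, value))
    return errors

sample = {"ph": 7.2, "pesticide_ppm": 812.0}
print(validate(sample))  # [('pesticide_ppm', 812.0)]
```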
We have a lot of wrapping techniques, because people want to use their analytic tools or simulations. Okay, now I'll just skip ahead. This is an example of a workflow. Our idea is that people can configure a workflow using drag and drop, so they can click to add a new step between two steps. We noticed, though, that usually the professors will send their PhD students to learn our system. That's a good way to start, because the PhD students, even in chemistry, are very young and flexible; they learn quickly. Sometimes they take a research credit just for coming and working with us. They learn how to configure the workflow, they go back to the lab, and they do everything. Sorry, this slide is hard to read, but we have different ways to search through the data: we have a Google-style search, or a conventional search like you have in relational databases. We actually have a nice interface, so from the data you can quickly create a web page. We have done a lot of work on being able to display data on mobile devices. This seems trivial, but it required a lot of work from one of the two software engineers to make sure that the data displays properly. And finally, okay, this is an example of the interface that is used by the Indiana State Chemist. As you can see, they have a lot of steps, a lot of tasks; each task has a user who has to do that part of the experiment. We collect all this metadata, which means that we keep versions of the software used. So if you do an experiment or an analysis using a certain version of a software and then you change it, we keep the previous version. So we do a good job of keeping this provenance, providing a lot of details. Finally, the last thing that we added was GIS. Now, about this we had been arguing a lot, because for three years I told the managing director we needed to add GIS. He told me, no, so far I haven't seen any users or customers requiring GIS. But then, in the last two years, we worked with people doing remote sensing.
We became convinced that we need at least to be able to show a map, showing, for example, where the various sensors are in the field. So what are the next steps? First of all, we are going to release CRIS as open source. We are now working on preparing the documentation. Our goal by the end of August is to put it in the public domain so people can download it and use it, because we would like the system to be used by many people, not only by Purdue. So this is what we will do. And it is not trivial; Red Hat has been talking to us, insisting that we need to release this as open source. It's easy to say, but doing it still requires work: you need to engineer the software and prepare all this documentation. We want to extend the system much more. In particular, we are looking into focusing much more on data quality and on research result reproducibility, so as to be able to say: can this be used to really reproduce the research? Because we keep all the workflows. We want to focus much more on sensors. And finally, analytics is the other thing. We realized that everyone at Purdue talks about analytics, but this means different things to different users, because some people by analytics just mean visualizing some data. Not many of them require real machine learning at the moment, from what we have seen. But we really plan to do those extensions, and then perhaps in three years we'll have a new version of the system. So I'll stop here, because I think I'm a bit over time. Time for a few questions. Oh, hi. Thank you, Elisa, for such a broad talk. I was wondering, you mentioned at the beginning that there is also a more difficult type of big data: unstructured data. Do you have any hints on whether any privacy-preserving methods are being developed for this kind of data? Yeah, that's an excellent point. So first of all, one area where we did some work some years ago was content-based access control for multimedia data.
Content-based means that you look at the content of the data. So, for example, you say that an image which contains the picture of Elisa Bertino should not be visible, or can only be accessed, by Elisa Bertino. It seems trivial, but you need to understand the content of the image. At that time, we had some very preliminary approaches. But that is very important, even for video surveillance. You may want, for example, to say: okay, I need to look at this person and see what he is doing, but I want to hide the other people, so not show the rest of the people. So you need to manipulate the image. Of course, there will be a problem here. People are doing a lot of work on privacy-preserving retrieval over encrypted data, doing searches on encrypted data, but I don't know how this can be done when you deal with multimedia data, where you need to use more signal-processing types of techniques, where, you know, the distance measures may be very different. That will be a major problem. So a system like CryptDB, you know, wouldn't be able to support that type of search. Okay. Thank you. Oh, can I come in?