Something really, really cool, I think, and I'm really looking forward to getting a glimpse of what it's all about. The talk is about a research project which lets the user see and control what's done with their personal data. At least, that's what I read in the description of the talk, and I'm really looking forward to hearing some more details about this from mort, who is presenting this talk. He's going to be talking about the platform design, the implementation, and the current status of this Databox thing. Please give a really warm round of applause to mort.

Thank you, and thank you for having me. Before I start, I shall begin by apologising: I have small kids, so it's permanently flu season in my house. If I start coughing uncontrollably, just bear with me.

What I'm going to do is talk a bit about the Databox project. This is a project that was funded by the UK research council EPSRC. It's a collaboration between the University of Cambridge, Imperial College London and the University of Nottingham, with a number of industrial partners, one of whom I'll mention in the talk: the BBC.

To set the scene a little bit, and I probably don't need to say this very much at this particular venue, so you may just wish to go to those Tumblr sites, which I thought were quite funny: the Big Data Pix Tumblr and the We Put A Chip In It Tumblr. We're now in a big data world: data is collected all around us, in the environment, from what we do, our retail habits, sensing, IoT things in our homes. All around us, data is being collected. There are a lot of opportunities and challenges presented by this. You can imagine a great deal of personalisation, personal optimisation, things you can do to make your house more energy efficient, for example. There are lots of things you can do that are beneficial with this sort of data, but there are also a lot of challenges, particularly around privacy, around the rights of the individual to control and see
what's happening about them. (I did warn you. Sorry.) The nature of this sort of collection is that it's building up large collections of very rich, often quite intimate, data in large silos. Some of the sensors you can see on the top left there: you've got the sort of things you might expect, social networks, Nest thermostats, but nowadays also more intrusive things, medical devices, things that are monitoring insulin levels, heart rate and so forth. It's very rich, very intimate data that it's now possible to collect.

So the challenge that we posed ourselves in this research project was really: what can we do to allow data subjects to control the collection and exploitation of data? Particularly data that is what you might think of as their data, so data that's yours, that you somehow own, and also data that's collected about you, which you might not have such direct control over. That's the context: how to enable data subjects to control collection and exploitation of their data and data about them.

This is taking place in an existing ecosystem which is very much focused around the idea that we want to move data around, and typically we want to move data into the cloud. Data tends to get pushed out there even when it starts out somewhere else. There's some data where you might expect it to start in the cloud: you post something to Facebook, it's on Facebook's computers.
That's not a surprise. On the other hand, there are a lot of IoT devices that you might think could very well keep the data more local to where they're deployed. You might think that data about your house could stay in your house, if that's what you wanted to happen, and yet by default a lot of them will push that data out to the cloud. Even if they subsequently give it back to you in some way, it will end up out there on somebody else's computer. This seems, to my mind anyway, to be a structural problem about the way that we build systems nowadays. The internet has become very fragmented. It's difficult to build effective, robust, efficient distributed systems across the modern internet, and it's much easier just to centralise things, and the cloud allows us to centralise things. We can just stick it all out there in some system that somebody else runs: as the sticker says, "on somebody else's computer". So we're defaulting to moving data into the cloud in order to process it, and if that data is centralised it makes the processing much easier as well.

The starting point for thinking about this was when I rejoined academia in about 2009 and joined a research institute at Nottingham called Horizon, Horizon Digital Economy Research, which was focused very much around this notion of the digital footprint, and what we could do with these digital footprints that we're creating, or were starting to create at that point. It was quite an interdisciplinary centre, so there were people there from sociology, mathematics, engineering, computer science, from all over the piece. A lot of my colleagues essentially said: if you could build us a magic context service, we could do great things with it. If we just knew the context of the user, then we'd be able to do all sorts of fun and interesting interactions. And we had a number of discussions about this, where my response would often be: well, yes, but what is that? I don't know what a context is.
I don't really know what the context of the user is. What do you mean when you say you want to know the context of the user? It eventually became clear that it wasn't terribly well defined what that was, but it definitely involved using personal data: it was definitely going to be possible to construct this from the personal data that could now be collected from sensors, from social networks, from interactions. The end point I came to, being a lazy computer scientist, was to punt on the hard problem. I wasn't going to try to define what the context was, because that seemed difficult. What I did say was: well, if you give me some piece of code that encodes what you think the context is, then I'll try and create a platform that will execute that for you, and so return to you what you've defined the context to be.

So I punted on the problem, and that gave rise to a thing that we called Dataware, which was essentially a service-oriented architecture for trying to do personal data processing. The idea was that the data processor would write some piece of code that would process the data subject's data; the subject would provide the platform on which they could execute that code; and the processor would receive the result. The point here was that we were now moving the code to where the data was, rather than moving the data to where the code wished to execute on it. We're not pushing the data into the cloud any more; we're trying to take the code and push that to where the data starts. This was the sort of picture we had at the time of Dataware. So you've got a, well, overly complex, certainly fairly complex, request and permission process here. The data processor requests permission through some mechanism, gets granted permission to do some piece of processing, and is then able to push the piece of code
they want to execute onto some platform where the data is made available, and then results go back to the data processor. So that was, excuse me, Dataware v1.

However, when we started to try and build this, and to think about how it might be used, it became clear that there was a lot of complexity in terms of the interactions you might wish to support on such a system. There are lots of ways you can construct interaction around this. One obvious way that's received some interest is the idea that people might pay you to use your data. But there are lots of other things that you might wish to happen. There may be many situations where you want data to be processed but it's not appropriate for someone to pay you: if it's another member of your family, it may not seem sensible for them to pay you to use your data. And there was little in the way that Dataware was constructed that actually said anything about how this was going to happen. In the case of being paid to use data, exactly what were you being paid for? What sort of use was going to be made of your data? What was going to happen then?

So Dataware was a proposal that would support some forms of interaction. It basically gave you a kind of transactional model, where you had a transaction between parties in terms of this request, granted permission, and then possibly some ability to see what happened afterwards. But there are a lot more things that we could consider. And so we abstracted up from the problem a little bit, stepped away from Dataware, and started to think more generally about: what is it that's going on in this sort of system?
And we coined this idea of human-data interaction, by analogy with human-computer interaction. I'm neither a historian nor a proper HCI person, but my understanding is that HCI has essentially moved, as a field of study, away from where it started, which was the idea of a single individual using a single computer. It's moved towards collaboration between individuals using computers, and it's now in the sort of world where you're thinking about ubiquitous computing, where it's not necessarily obvious which computer you're using. Human-data interaction tries to take that a step further and say: well, in fact, it's now about the data. It's not really about the interaction with the computer any more; it's about how you're represented in the data, and what the data is used to do to you and for you.

The very high-level model that we have here is that you have some personal data that is collected; analytics are performed on that data, processing it in some way; that allows you to draw some inferences, to work out something about what that data says; and as a result of that inference process, some actions are taken. Actions might feed back into further analytics, feeding back the inferences you've made, or actions might be nudges, things that might change your behaviour and thus change the data that gets generated in the future. So even in this very simple model there are a couple of feedback loops that can take place, and it's in this kind of space that data processing systems and data processing computations operate. And we felt that the systems that we were seeing, and the systems that at that point we were trying to build, were lacking in three key aspects which underpin this idea of human-data interaction, of HDI. The first was legibility.
It's clear that most people, I think, most of the time, are generally unaware of what the sources of data that might be collected about them are: where can the data come from? They're generally unaware of the analyses that might be performed on those data, and generally unaware of what the implications of those analyses are. So understanding what's going to happen to you in the future, on the basis of actions you've taken, now or in the past, that are now represented in some data sets somewhere, possibly with some degree of inaccuracy, is not necessarily clear. It's not legible. It's not easy to see and understand what's happening in these systems.

The second thing that seemed to be missing was agency. Agency is the capacity to act in a system. We are often unaware, certainly I think I am unaware, I can't speak for anybody else, of the means that I have to affect the data that's being collected about me. There are some things I think I can do to try and control what data is collected. I can, you know, block cookies in my browser.
I can use Brave; I can turn on all the other privacy things. But that only controls the data that's collected about me to some extent. It might be much less clear to me how I can affect this as I move around a smart city or a smart environment, for example. It's not always obvious to me what I can do to affect the analyses that are being performed on the data that have been collected about me. And in both of these cases, that's even if I know that these means exist at all, and can be bothered to employ them, because it may well be complex or difficult to employ these things effectively. So we lack agency; we lack the capacity to act.

And then the third thing seemed to be, what a rather ugly word, negotiability. This is essentially trying to capture the notion of supporting the dynamics of interaction: the idea that when you make a decision, it doesn't necessarily remain your decision in this system for ever more. You might wish to change your view on things. You might wish to change the way that you interact with the system, either as you learn more about it, or as your behaviour changes, or as your environment changes, for whatever reason. Current systems still tend to trap us in this kind of binary terms of service: you click the box to say yes, and then you're done, and you don't really get a chance to go back and revisit that. Maybe nowadays you're starting to see more and more the idea that you can at least completely withdraw from a system, so you can be in the system or out of the system, but it's often not really possible to control what's going on in terms of your interaction with the system over time.

So that gave rise to this idea of Databox, which you can think of in some ways as Dataware version two. This is still taking the idea that you want to move the code to the data. This allows you to minimise data release, and it allows you to retain more control over what's done with the data, because
it's running on a device that is under your control. At the end of the day, if you really want to, you can just turn it off, and then you know that the data is not being processed any more. We tried to pay a bit more attention to how access to data, local or remote, was going to be mediated. We went to some effort to try and make sure that we could control all the internal and external communication, and that we could log all the I/O that takes place, following the idea that I don't really care what computation you do on data about me so long as you never see any result from it. The computation just runs on a device somewhere and then gets thrown away; has anything really happened? The computation runs in the wood, a tree falls on the computer, did anything take place? If I can log everything that goes on in terms of what's communicated from that device to the outside world, then, in some sense, even if things go wrong, I might be able to go back after the fact and figure out what happened: what leaked, why it leaked, and when it leaked.

The sort of model we have with Databox is this kind of application; this is a fraud detection application. Some person called Henry downloads a bank's app onto his Databox. Later on, a large transaction is made in some foreign country against his credit card. The banking application is able to check Henry's location by asking: are you located in the country where this transaction took place? The Databox is able to say no, and then the bank can deny the transaction, and so the fraud is prevented. This hasn't revealed to the bank where this individual called Henry is; there's been no release of that information. It's simply been able to say: no, he's not where that transaction claims he is. So this is trying to minimise the data release that takes place. So how is Databox implemented?
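Before getting into the implementation: the key property of the fraud example is that only a boolean ever leaves the box. A minimal sketch of that idea in Python follows; all the names here are hypothetical, invented for illustration, and not the actual Databox API.

```python
# Hypothetical sketch: the app may ask a yes/no question about
# location, but can never read the location itself.

class LocationStore:
    """Holds the subject's current location privately on the box."""
    def __init__(self, country):
        self._country = country  # never exposed directly

    def is_in(self, country):
        # Only a boolean crosses the trust boundary, not the location.
        return self._country == country

def check_transaction(store, transaction_country):
    """Bank-app logic running on the box: approve only if the subject
    appears to be where the transaction happened."""
    return store.is_in(transaction_country)

store = LocationStore("UK")
print(check_transaction(store, "UK"))  # True: plausible, approve
print(check_transaction(store, "BR"))  # False: deny, likely fraud
```

The bank learns whether the transaction is plausible, but never where Henry actually is.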
The model here is that we're essentially installing apps that process data locally, so we're following the app metaphor from smartphones: apps process data. We also have the notion of a driver, which is something which either ingests or releases data. There are manifests associated with each app, and they describe the data that's going to be accessed by that app. That will be turned into a concrete, what we've called an SLA, some of the terminology is a bit horrible, I apologise for that too, a concrete SLA when you install the app. The sort of thing that might happen there is that you have an app that wants access to your smart light bulb data. That's in the manifest: access to smart light bulb data. When you install it, you're able to control which light bulbs it gets access to: it can have all the downstairs light bulbs, but not the upstairs light bulbs, in my house. So that's the ability for the user to exercise some control over what's actually being revealed about them, and what they're happy to share, in that moment, for that application.

For all the components in Databox we were using containerisation as a lightweight virtualisation technique. This gave us a degree of platform independence, a degree of isolation between running components, and the ability to make the management of this kind of system easier, because there are quite a lot of moving parts here, and being able to manage things in a fairly homogeneous manner seemed useful. When I say platform independence, that bit us slightly for a couple of weeks. It turned out we were getting bug reports from a user who was finding that things weren't working, and it took us some time to figure out that the reason was that they were running it on Windows, using the Docker for Windows tool that had come out recently. We didn't realise that that's why the shell scripts weren't working: because
they were not in a Unix environment. The containers were running, and they could get the containers running when they did it by hand, but all the startup scripts did not work.

There are four core components to the platform: a thing called the container manager, a thing called the arbiter, a thing called the core network, and then many things called data stores. The container manager is the thing that manages the containers, unsurprisingly. It manages container lifecycle in particular; it's one of the things that starts up first, and after that it controls which apps are running, which drivers are running, how things are connected, and basically kicks everything off. The arbiter is the container that produces the tokens that we use for access control, and the format of those tokens is a thing called a macaroon. Who's heard of macaroons? Not the biscuits. One or two. Macaroons are, to reuse the pun that the authors used, "better cookies". They're essentially access control tokens that you can delegate, so you can attach constraints to them when you delegate them to other parties. The data stores provide a persistent storage facility, so we can monitor everything that's being recorded and used by each application. They also provide a middleware layer.
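As an aside on the arbiter's tokens: macaroons work by chaining HMACs over caveats, so anyone holding one can restrict it further, but nobody can remove a restriction without the root key. A toy sketch of that construction follows; this is not the real libmacaroons API, and the caveat language is made up.

```python
import hmac, hashlib

def _chain(key, msg):
    # Each step keys a fresh HMAC with the previous signature.
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def mint(root_key, identifier):
    """Mint a macaroon: an identifier plus a root-keyed signature."""
    return {"id": identifier, "caveats": [], "sig": _chain(root_key, identifier)}

def attenuate(m, caveat):
    """Any holder can add a restricting caveat; the signature is
    re-chained, so caveats cannot later be stripped off."""
    return {"id": m["id"],
            "caveats": m["caveats"] + [caveat],
            "sig": _chain(m["sig"], caveat)}

def verify(root_key, m, holds):
    """The arbiter-side check: re-derive the chain from the root key
    and confirm every caveat holds in the current context."""
    sig = _chain(root_key, m["id"])
    for caveat in m["caveats"]:
        if not holds(caveat):
            return False
        sig = _chain(sig, caveat)
    return hmac.compare_digest(sig, m["sig"])

root = b"arbiter-secret"
m = attenuate(mint(root, "app:light-viz"), "source = lights/downstairs")
print(verify(root, m, lambda c: True))  # True: chain intact, caveats hold
```

Changing or dropping a caveat breaks the chain, so a delegated, narrowed token cannot be widened again by the app that received it.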
So communication happens via these data stores, and that's a ZeroMQ-based middleware layer. Each store that's created gets registered in a Hypercat catalogue that exists on the Databox, and the idea is that that provides a degree of discoverability: an application is able to find out what this Databox has, and therefore whether it's going to be able to support what that application needs. And then finally, right at the centre, a thing called the core network essentially tries to manage network connectivity for each application. We sort of hacked that together in the Docker world by providing a unique virtual network interface for each application, which is connected only to that application's container, the data store for that application, and the core network. So we can intercept all of the communication that takes place for any application; we can make sure that we log everything, and we can make sure we prevent anything happening that we don't want to happen.

As I mentioned, apps, and in fact drivers too, come with a manifest. This basically describes origination metadata; it says what the application is going to need in terms of data access, what its storage requirements are, and whether it's going to need to do any remote accesses, that is, whether it's going to need to talk to anything else on the box or anything off the box. The distinction between apps and drivers is essentially that drivers can talk to things that are not only on the Databox: they can talk to things off the Databox.
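To make the manifest-versus-SLA idea concrete, here is a hypothetical sketch of a manifest and the narrower SLA a user might grant at install time. All field names are invented for illustration; the real manifest format is not shown here.

```python
# Hypothetical manifest shipped with an app, loosely following the
# description above: what data it wants and whether it talks off-box.
manifest = {
    "name": "light-viz",
    "data-sources": ["lightbulb"],  # kinds of data the app requests
    "export": False,                # no off-box communication needed
    "storage-mb": 16,
}

# At install time the user narrows the request into a concrete "SLA":
# here, downstairs bulbs are granted and upstairs bulbs are withheld.
sla = {
    "name": "light-viz",
    "granted-sources": ["lightbulb/downstairs"],
    "export": False,
}

def sla_within_manifest(sla, manifest):
    """The platform should only honour grants the app actually asked
    for, and never widen them (e.g. no export if none was declared)."""
    sources_ok = all(src.split("/")[0] in manifest["data-sources"]
                     for src in sla["granted-sources"])
    export_ok = manifest["export"] or not sla["export"]
    return sources_ok and export_ok

print(sla_within_manifest(sla, manifest))  # True
```

The point of the check is direction: an SLA may always be narrower than the manifest, never broader.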
That's how you get data in and out of the system. And the installation process, as I've hinted, is essentially: the user tries to install the application; they say yes, you can have access to these data sources; that causes particular tokens to be generated and given to that application; the application is then connected up to the right network devices, and the containers are all started. The tokens that the application has been given allow it to present itself to the different data stores in the system, and each data store can then verify that this application has indeed been permitted by the user to access that data. So that's the mechanism for access control.

I'll move fairly quickly through this in the interest of time, but this is a description of the middleware layer that we have, which is based on some standardised protocols: CoAP running on top of ZeroMQ. We have a git-like back end to this, so it records everything, and it supports JSON, text and binary data. There's a degree of security that we attempt to provide, with the intent that at some point in the future we might like to distribute this across multiple devices, so you'll want to be able to secure the communication between data stores. The main reason for doing this is that the first version we started out with was hacked together very quickly, using a straightforward HTTP REST-style API with Node.js, and that was not suitable in terms of supporting relatively high-frequency sensor data, or the limited memory footprint that we had on things like Raspberry Pis. This is much more effective in that sense. So what can you do with the Databox? What could you do with the Databox?
Among the interactions that we can support, and that we think we should be able to support better: you can do things with a physical device that you can't do so easily with things in the cloud. Physical devices are often easier to reason about, because you can see them; you can simply glance at the device and see what the configuration is. You can imagine situations here where, for example, we might set things up so that access to smart metering data is only going to be permitted if the green tag has been inserted into my Databox and my partner's blue tag has been inserted into my Databox, so we've both agreed that that data can be shared; or where the green tag is in the Databox and we're both located in the house, so we're both proximate to it. You can set up much richer ways to control access to data, and this maps quite nicely to notions of physical access control, of which most people have a pretty reasonable understanding, because we're used to doing things like locking windows and locking doors and so forth.

One of the members of the team built a thing using what's essentially a hacked-up version of IBM's Node-RED.
This allowed you to assemble Databox applications by dragging and dropping data sources and computational units, linking them together, and then you could essentially click a button, somewhere off the bottom of the screen, and that would take what you'd produced, build it into a container, and publish it to the app store. So building applications is fairly straightforward with this sort of environment.

We also did some work looking at richer visualisations of data. You can take an SVG image, for example, break it up into its component parts, and then describe transformations, so that as the data comes in it animates the SVG according to the transformations you've described. I think one of the earlier demos of this had an SVG with a cartoon picture of a particular American president, and when tweets came in, that would cause parts of the face to animate according to some simple sentiment analysis of the tweets. So you can perhaps make data more legible by doing richer visualisations, making it more obvious, more explicit, what's happening, what's represented in the data.
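The SVG-animation idea is essentially a data binding: each incoming reading is mapped onto an attribute of an SVG fragment, which is then re-rendered. A deliberately tiny illustration follows; the template and the scaling rule are invented, not the demo's actual code.

```python
# Illustrative sketch: bind an incoming sentiment score in [-1, 1]
# to the height of one component ("the mouth") of an SVG image.

SVG_TEMPLATE = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<rect id="mouth" width="40" height="{h}" />'
    '</svg>'
)

def render(sentiment):
    """Clamp the score and scale it into a height of 0..20 units."""
    clamped = max(-1.0, min(1.0, sentiment))
    height = round(10 + 10 * clamped)
    return SVG_TEMPLATE.format(h=height)

print(render(0.0))  # neutral tweet: height 10
print(render(1.0))  # very positive tweet: height 20
```

Each new data point produces a new attribute value, so streaming data continuously redraws the image.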
What's represented in the data This is a piece of work, which unfortunately stalled But with the PhD student I was looking at what you could do in terms of generic measures of risk So the idea that a lot of the data sources you might see in such a device are going to be time series time series of Floating point numbers essentially temperature readings humidity readings their quality readings, whatever they might be Is it possible then the question we were exploring exploring was is it possible to take a time series like that and Treating it simply as a time series without any semantic information about what those numbers represent just look at various measures of entropy Statistical measures auto correlations and so forth to see whether or not There's in some sense risk associated with giving access to that data So how much information is contained in that time series in a statistical sense? Is it possible to say for this application? It's asking for access to that data at too high a frequency It's going to be able to find out too much Whereas this other application only wants to see an average over every three months and therefore that's fine I don't really care what that says And then if that could be if you could construct things like that The initial results was somewhat promising in this sense And perhaps you could then start to try and put those results together and say well this Application is okay application a is okay and application B is okay, but they come from the same publisher So if you install both of those applications together, you may be revealing a lot to that particular data processor Another thing which sort of pops straight out of this idea that we want to kind of atomize data and push it out to all these different Data boxes is the idea that it's difficult now to do big data analytics in the traditional way you might do you might expect Where you want to put all the data into the cloud So we were starting to think about and we have been 
looking a little bit at how to do small data analytics. The idea is that you might do some of the computation first, while the data is still private, and only subsequently try and aggregate the results. You don't need to build up these vast data lakes of data about everything and data about everyone; instead you try, again, to minimise data release, do as much of the processing as you can while the data is kept private, and only later on start to aggregate results together. We had a couple of goes at this, one of which was essentially looking at pre-training models using a small, hopefully statistically representative, sample of users' data, and then taking those pre-trained models and pushing them out to lots of different locations. Then, in those individual Databoxes, you can refine those models and specialise their training to the particular individuals whose data is now being used. This gets you essentially further, faster, in terms of the accuracy of those models. The long-term goal was to think about how you would actually do machine learning, for example, or other forms of statistical analysis of data, at scale. Suppose you've got a Databox for every house in the country; in the UK I think that's about 30 million households. How are you going to run a computation across such a large-scale setup as that?
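The pre-train-then-specialise idea can be illustrated with a toy linear model: fit it on a small population sample, then refine the same model inside one household's Databox on data that never leaves the box. This is only a sketch of the concept under invented data, not the project's actual pipeline.

```python
def sgd(w, b, data, epochs=200, lr=0.05):
    """Fit y = w*x + b by per-sample gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(w, b, data):
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# "Population" sample used to pre-train (with consent, or in the clear):
population = [(x, 2.0 * x + 1.0) for x in range(5)]
w, b = sgd(0.0, 0.0, population)

# One household's private data follows a slightly different relation;
# refinement happens inside that household's box, the data never leaves.
local = [(x, 2.5 * x + 0.5) for x in range(5)]
before = mse(w, b, local)
w, b = sgd(w, b, local, epochs=50)
after = mse(w, b, local)
print(after < before)  # True: local refinement improves the local fit
```

Starting from the pre-trained weights means each box needs only a few local passes to specialise, which is the "further, faster" point above.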
Perhaps the most complete set of applications that got built was actually built in collaboration with the BBC. This was a collaboration that was talked about publicly; I think there's a blog post on their website from a few months ago that describes a thing they called the BBC Box. The idea here was to take data from data sources that they would not wish to have direct access to themselves. In this case, I think it was your iPlayer viewing habits, iPlayer being the BBC's content delivery system, one of them, at the time, but also data from your Spotify account and your Instagram account. So you take data from those three sources. They obviously don't want to have the data from your Spotify account; they don't want the data from your Instagram account; there's no reason they would want to hold that, and it would only be a risk for them. The data from those systems goes into your Databox, and then they have a BBC application running on your Databox which is able to process it to produce a profile, which can then be sent to their content recommendation system. And so appropriate content can be recommended to you, based on quite rich data about your activities online, but without them having direct access to that data.

There were a couple of other applications. We ran a hack day a couple of years ago using an earlier version of this, and there were a couple of applications I thought were pretty cool. One of them was exploring the idea that you could do actuation through this as well. You could imagine, and in fact the couple of people who were involved in producing this demonstration did actually produce this, a video editing suite where you assembled,
let's say, a horror film from snippets of footage. You could tag events in that horror film, and then playback is controlled by an application sitting on your Databox. When the appropriate points in the horror film come up, the playback application flickers the lights in the living room where you're sitting watching the film, for example. And you can have that without the publisher of the data, without the BBC or whoever it might be that's broadcasting this film, needing direct access to control the lights in your living room, which obviously they wouldn't want to have, and you probably wouldn't want them to have. So this is the idea of devolving the control to a device that's under your control, so it can then interact with your environment, monitor your environment, control your environment, but under your control.

So that's Databox, which, as I mentioned, you could think of as Dataware version two. Where's the interaction that's going on here? How is this better supporting the "I" in HDI, the interaction in human-data interaction? It does better, perhaps, than Dataware did, but it became clear as we were going through this project that it's still not enough. It's still the case that the request and processing tend to occur in a black box. An app is a contained environment: you can't see where it's got up to, you can't see what it's doing, and it's not clear what the status of each of these applications is as they're executing in this system. We have got this audit logging support in there; it's possible that, using that, you could come up with some kind of notion of where the processing has got to, what the status of the application is, but what we can do at the moment, just with I/O, is probably not rich enough. And we have a number of mechanisms, such as audit logging and permission requests, that allow us to coordinate to some extent, within the Databox, what's going on.
But they don't support what the HCI folks call articulation work, which I'll talk about on the next slide. And then the third thing is that real-world data sharing tends to be recipient-designed. I will share data, share information rather, with people based on the context that we're in. I might talk about something in the pub with a colleague that I wouldn't talk about with my wife; I might talk about something with my wife at home that I wouldn't talk about with a colleague in the office; and so forth. Where I am controls, to some extent, and who I'm speaking with controls, to some extent, what I'm willing to reveal to them. The ways that we support this in Databox are a little bit too slow-moving: you tend to make the decisions at the point of installation of an application, and it's not necessarily straightforward to go back and change those later. It's perhaps not easy to be sufficiently dynamic in how those permissions are being granted and controlled.

I mentioned articulation work. There are some quotes from a paper by Schmidt that defines this, but the way it was explained to me, as somebody who is not subtle enough to really understand these kinds of concepts, was the example of walking down a busy street. If you're walking down a busy street, you're probably walking to get somewhere, so the work you're doing is walking to get to your destination. But in the course of doing that, you have to do a lot of articulation work. You've got to make sure you don't bump into other people on the street; you've got to make sure you don't bump into signage on the street; you've got to make sure you don't walk into the road and get hit by a bus. All of this kind of coordination work, that you and everybody else on that busy street are carrying out, is articulation work.
It's the work that needs to be done in order that the work you want to do gets done. In the Databox, the data subject is engaged in this kind of cooperative work with the data processor, and there may be multiple data subjects involved. We don't really do enough in the architecture that we have to support this kind of articulation work, where everybody tries to work out what's going on with everybody else so that we can all come to the right conclusion and get the right things done.

The other thing about this kind of recipient design was observed by a sociology colleague: data is essentially acting as a boundary object. It's a thing that is used in a relational fashion. You use data in multiple ways, and it describes a relationship you have with something else. An example of a boundary object is a credit card receipt, in the sense that it's something which is used in multiple ways simultaneously: it's the customer's proof of payment, it's the bank's proof that a valid transaction took place, and it may be a supermarket's proof that the bank is supposed to pay them some money for the goods that you've taken away. All of these things are inherently relational; it's about the relationship between these parties. And it became clear when we started looking at these sorts of data that almost all personal data is in fact relational. There's very little personal data which is so private that nobody else is included in it or affected by it. This is particularly true when you look at sensing data. Most households either have multiple parties living in the house, or at least occasionally have visitors coming to the house, and so the sensing data that you might start to see being collected there is going to implicate multiple people. It's not just the homeowner.
It's not just one party in the house that should have control of that data, even though multiple people are represented in it. Even if you take something like email, which most people think of as private: presumably it came from or went to somebody else in most cases, so even there, this is data which involves other people. So what we tried to do with Databox is in many ways flawed from the start, because we focused on the idea of an individual having control of data, and actually data is inherently social in some sense, and so it needs to be controlled in a more social way.

So, moving towards wrapping up the presentation part of this: there are a number of interactional challenges that this poses for HDI which Databox doesn't fully resolve. It hopefully takes steps towards surfacing some of them, and perhaps resolving some of them, but it doesn't fix them. One is really a set of challenges around user-driven discovery. How do you discover, as a data processor, who out there has the data you might want to use? It's easy when you're collecting it and putting it in the cloud somewhere, because you've got it and you know what you've got. But how can you find out which of the households, which of the individuals in the population, has got a Databox and has the data that you wish to process, that would be useful to you? How do users discover what applications they might wish to install, the applications that might do things for them? How could they be empowered to make sure that they install the right applications, that they're happy with the applications they've installed? And how do we control that discovery process? There are a number of more standard mechanisms
I guess that can be tried out here. So along with permissions you can imagine social rating systems: you know, 14 of your friends have installed this app and they're all very happy with it, everybody's giving it five stars. These are ways of communicating to other users that these are good applications, helping them discover the right things.

Legibility: there are mechanisms here that can support legibility, but legibility remains a problem. You should be able to visualize your own data; you do, after all, now have it in your Databox. But it might be much more difficult to visualize the impact that other people's data has on your data, or might have on the processing of your data. What is going to be revealed as your data is processed, given what has already been revealed by other people? This is true both for data that exists now and for data that might become available in the future. There's a question here that again comes back to discoverability as well: what can a processor discover about what you have, and, in the same way, what can you discover about what processors want?
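One way the audit logging mentioned earlier could support legibility is a per-processor summary of what has already left the box. The log format below is invented for illustration; Databox's actual audit records are shaped differently.

```python
# Sketch: summarising an audit log so a data subject can see, per
# processor, which sources have already been exported out of the box.
# The entry format here is hypothetical, not the real Databox schema.

from collections import defaultdict

audit_log = [
    {"processor": "energy-app", "source": "smart-meter", "action": "read"},
    {"processor": "energy-app", "source": "smart-meter", "action": "export"},
    {"processor": "film-app",   "source": "hue-lights",  "action": "write"},
]

def revealed_by_processor(log):
    """Which sources has each processor actually exported?"""
    out = defaultdict(set)
    for entry in log:
        if entry["action"] == "export":
            out[entry["processor"]].add(entry["source"])
    return dict(out)

print(revealed_by_processor(audit_log))
```

Even a summary this crude distinguishes "the app read my data locally" from "the app sent my data away", which is the distinction a data subject usually cares about.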
There may also be the need to edit data. So if you detect some data has been recorded which is wrong, you want to be able to go in and change it.

This is, in some sense, another flaw in the way that we framed this. It was deliberate, but it's not complete. We focused very much on the data subject, on allowing the data subject to have control over data and data processing. But of course there are more stakeholders than just the data subject. The data processor might legitimately want to know that you've not tampered with the data that you're revealing to them, that you haven't faked out your propensity to risk, for example, so that they give you lower insurance premiums. I understand that when some insurance companies started saying, well, if you wear a Fitbit and we can see how active you are, we'll give you reductions on your health insurance, there were people who were putting their Fitbits on their dogs, or on a metronome, and other mechanisms to try and fake out the data that was being recorded in order to reduce their insurance premiums. There's clearly a need here to try and support some degree of legitimate interest on both sides.

As I've hinted at, data is a social thing; most data is a social thing. So you want to be able to delegate control, delegate access to data, but you also want to be able to revoke it. You want to be able to see what's been happening with your data: whether it's being edited, who's been viewing it, with whom it's being shared. You want to be able to revoke those permissions. You also need to be able to negotiate. If you have multiple Databoxes in the household, for example, it might be legitimate for my Databox to have access to some of the IoT sensing data, and for the other adults in the house to have access to the same IoT sensing data in my home, the energy consumption, smart metering and so forth. But any one of us could then reveal that data to a data processor, and
that might not be what we wish to happen. It might be the case that it would be better if there was some way of negotiating so that we all agree that we are happy for this data to be revealed to this particular data processor. We have no mechanism to support that kind of social action at present. There's a need to think about who data is getting passed to, what you can do to try and work out, when you've revealed some data to somebody else, what they're going to do with it, and what's happening after you've made that revelation.

From a technology perspective, the two of these that I find most interesting come down to the sharing of data and what to do about shared data. Sharing data: we want to be able to support offline data collection, and we want to be able to support data collection from devices that are not necessarily co-located with the Databox. This means we want some kind of rendezvous and identity service, and this needs to be reliable and not infringe on the privacy of the people participating in it. Shared data is another interesting thing. There was a long-running argument throughout the Databox project between myself and one of the collaborators around how to support this idea of shared data. What could we do, given that data is inherently shared, is inherently social? Their stance was very much that what we needed to do was introduce the idea of a user account onto the Databox. We'd be able to manage access to the data by having user accounts on the Databox, so that we could say, well, you can see this, and this other account is allowed to see that data, and so on. It turned out, well, it was certainly my opinion, that that was going to be inordinately complicated to implement, because it boiled down to the problem of who gets to manage the user accounts. Who gets to create accounts, and who gets to control the accounts?
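The missing negotiation step described above, that shared household data should only be released once every implicated subject agrees, could look something like this. It is a thought experiment, not a Databox feature; the talk is explicit that no such mechanism exists today.

```python
# Sketch of a unanimous-consent release rule for shared household data:
# a source is only revealed to a processor once every household member
# has approved that specific (source, processor) pairing.
# Hypothetical mechanism; Databox has nothing like this at present.

def release_shared(source, processor, approvals, household):
    """True only if all household members have said yes to this release."""
    approved = {m for m in household if approvals.get((m, source, processor))}
    return approved == set(household)

household = {"alice", "bob"}
approvals = {("alice", "smart-meter", "energy-app"): True}

print(release_shared("smart-meter", "energy-app", approvals, household))  # False: bob hasn't agreed
approvals[("bob", "smart-meter", "energy-app")] = True
print(release_shared("smart-meter", "energy-app", approvals, household))  # True: unanimous
```

Unanimity is only one possible rule; a real design would also need quorums, time-outs, and some account of visitors who are implicated in the data but absent from the household set.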
And with current consumer systems that I'm aware of, there's no real way to do away with the idea of a root user, somebody who can see everything. And it's definitely the case, in other projects we were doing, that when you're looking at personal data, people are often actually less concerned about complete strangers seeing their personal data than they are about other people in the house seeing it. The idea that your parents might be able to see what your internet viewing habits are is something that many people find quite upsetting, but the idea that the ISP can see what you're doing on the internet, they're not really too bothered about. So there was a difficulty there: if you introduce accounts, how are you going to manage the fact that there's going to be one account that has access to everything, that can see absolutely everything that's going on? So I fought quite hard to try and keep things so that we had one Databox for one person. Unfortunately, that really doesn't solve this problem of social data. The closest we get is when we start thinking about the idea that we could replicate data within a household across a set of Databoxes, perhaps. And then, well, what I did again was punt on the hard problem: you end up in a world where you're devolving the challenge of managing access to social data to a social matter. You say that you'll discuss it with other people in the house before you start revealing this data, because you know it's sensitive, because you're aware of other people's views. It's not as if you're sitting in your house, in your social situation, unaware of anybody else in the house and what they think about this. So I think those are two interesting challenges from a technical perspective: how to support these kinds of
interactions and these kinds of needs in this system. And with that, I'll finish. Any questions?

So, thank you very much for the talk. If you have questions, please line up at the microphones. Microphone two, I think. Thank you.

Thanks for that. I wonder how you see this moving beyond academia into broad adoption, and whether you have any thoughts on something like the Estonian e-citizenship model for how this could potentially scale. And, I guess, also your thoughts on just what needs to happen for this to be adopted at scale.

So, I'm not familiar enough with the Estonian e-citizen model to comment on that. Frankly, I think that for this to be adopted at scale, we probably need to reinvent everything so it's a little bit less of a research prototype. That would be a good start. I think one of the big challenges in terms of adoption is actually around what these applications might be. There was a strong interest from other parties in this project around IoT data particularly, and one of the things that seems to me to be the case around IoT data is that we've got all these opportunities to collect lots of it, but nobody's quite sure what to do with it in terms of really compelling applications that make a great deal of sense. So in that sense, it may be that it's all a dead duck, and there's no need for anything like this, because in fact none of it is ever going to take off, because it's never going to be compelling enough to be really, really useful. So I think having some killer applications, some real use cases that are valuable here, would be good. Some of the ones that I mentioned, the couple from the BBC and from other collaborators, I think with the University of York at the hackathon we ran, started to become more interesting. You can start to see some use cases arriving there, but they're quite slow to find, and
quite slow to build. The other thing that we would definitely need for broader adoption of this sort of platform, that we really need to fix as part of that rewrite, is to make the development process much, much easier. It turns out that there were essentially professional developers hired to build that BBC demo, and they did it, and they did a great job, and it worked. But I think everybody involved found the development process much harder than they expected. The idea that you can't simply access a cloud service when you want to in your code, that you have to request permission for that and actually think about that whole process, is quite alien to modern development practices, I think. So I think the development process is something we really need to work on to actually give us a hope of being adopted.

Thank you very much. More questions? Yeah, go ahead.

Yeah, thanks a lot for that great talk. Would you say that it was basically IoT applications you had in mind when you developed this?

Well, it sort of changed over time. When we first started, with Dataware, we were thinking about social media, social networks, email, IRC logs, chat logs and so forth as personal data. We certainly wanted to do things like getting banking data out; financial data was an obvious sort of thing that people find sensitive but would like to do interesting things with. As time passed, IoT became more of a thing; one of my collaborators in this project was essentially funded to look at IoT data specifically.
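The point just made, that an app cannot simply call out to a cloud service but must have had an export approved first, can be sketched as a gate every outbound transfer passes through. The class and method names are invented; the real platform enforces this differently, but the developer-facing friction is the same.

```python
# Sketch of why Databox-style development feels alien: apps cannot open
# arbitrary network connections. Every outbound transfer goes through a
# mediator that checks a previously granted export permission.
# (Invented API for illustration, not the real Databox interface.)

class ExportDenied(Exception):
    """Raised when an app attempts an export the user never approved."""

class ExportGate:
    def __init__(self, granted):
        self.granted = set(granted)   # (app, destination) pairs the user approved

    def send(self, app, destination, payload):
        if (app, destination) not in self.granted:
            raise ExportDenied(f"{app} may not export to {destination}")
        return len(payload)           # stand-in for the actual transfer

gate = ExportGate(granted={("film-app", "bbc.example")})
print(gate.send("film-app", "bbc.example", b"events"))  # permitted
try:
    gate.send("film-app", "ads.example", b"tracking")
except ExportDenied as err:
    print("blocked:", err)
```

From the developer's side, the difference from ordinary practice is that the failure path is normal: code has to be written expecting that any outbound call may be refused.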
So that was where the interest in that came from. I think that, given the domestic context we were targeting, IoT data is an obvious thing to look at there, along with other sorts of household data. Finance data is another one that still seems kind of obvious, and personal health data, now with wearables and things monitoring you, is sort of there too. But in some sense, I don't think it matters too much in terms of the challenges that we were coming across as we were doing this; they're fairly endemic across the space, whether it's IoT data or other forms of data. You start to realize quite soon, when you try to actually build these things, that you've got these problems: that data is inherently social, multiple people are implicated, and so on and so forth.

Great. Any more questions? Oh, yeah, microphone one, I think.

I would like you to elaborate, if you could, on the different levels. There's this level of a household containing a family or someone, and there's this level of a community. I think if you have something to control the temperature in the living room, you could have this box tell you that two other people in your household, not named, would like it to be somewhat warmer, but you're paying the bill.
So you say this could not be the case. And the whole neighborhood, street or block or something, that's a different level, where you have different questions, where you could elaborate and where you could have good use of this box.

Yes, so that's an interesting challenge, and that's part of the reason we were thinking about this kind of very widespread, sort of federated data processing. The idea being that, as I understand it, one of the ways you can nudge people to reduce energy consumption, for example, is to tell them what the average is for people in their demographic. If that average is lower than their own consumption, it acts as a prompt for them to think about bringing their consumption down. Whereas if you do it across too large a scale, you know, everybody in the country, it becomes less meaningful. But if you know that households with more or less the same configuration as yours are on average using a lot less energy, you might start to think about it. So the aim was to drive those sorts of applications, where you want to look at data across multiple Databoxes simultaneously, where those Databoxes may be spread across the wide area. We started to investigate some technical means to do that. There's a system that a postdoc of mine built called Owl, which is a data processing system for the OCaml programming language, which was trying to embody some of those ideas; if you go to ocaml.xyz, that's the website for that particular thing. He spent, I think, 18 months and wrote 180,000 lines of OCaml code to implement it, which was fairly impressive. We haven't got to the point where we can deploy any of those yet, or actually test any of them out, and certainly not at that sort of scale. That's something I'm hoping to do in the next year or two with some other developments that we've got in Cambridge around the digital built environment, where I might be able to start to deploy
some of these ideas and see how they work in terms of data which is being collected and managed at a scale larger than a single household, so you don't have the same kind of domestic framing for it.

Okay, do we have any more questions? Microphone two.

Could you elaborate a little bit further on the trust of applications, and especially if they start doing unintentional things, such as requesting data that, given or combined with other data, reveals information without you intending it to?

So I think that's essentially one of the challenges that we haven't really addressed. If an application that you install asks for access to data that you have not given it permission to, it can't have that; those requests will simply be denied. But if an application that you've installed has been given access to some data, and it does some processing of that data that you're happy with, and the results of that processing go back to the data processor's home base, and they're then able to join that with some other source of data that you had no idea about, that we can't do anything about at this point. That's one of the challenges here: what to do when the data that you thought was okay turns out not to be okay to reveal, because somebody's found something else that lets them attack it in some way. I don't know what to do about it.

Thank you. Thanks. So we have one more question. Yes.

So when multiple apps are trying to access the same data, is there a standard that you're using, like semantic web standards, to understand the meaning of what certain rows or tables mean between applications?

Not really, no. We didn't go down the route of trying to taxonomize everything and put everything into an ontology. At the moment the application writer just has to know the data source that they're accessing. So they're accessing the Philips Hue light bulb data; they happen to know what that format is.
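The default-deny behaviour described in that answer, where a request for a store the app was never granted is simply refused, reduces to a few lines. The grant table and store shapes below are invented for illustration; the real platform's mechanism differs.

```python
# Sketch of the default-deny rule described above: an app's request for
# a data store it was never granted at install time is simply refused.
# Hypothetical shapes, not the real Databox access-control machinery.

GRANTS = {"energy-app": {"smart-meter"}}   # fixed at install time

def read_store(app, store, stores):
    """Return the store's contents, or None if the app lacks a grant."""
    if store not in GRANTS.get(app, set()):
        return None                        # request denied, nothing leaks
    return stores[store]

stores = {"smart-meter": [1.2, 1.4], "hue-lights": ["on", "off"]}
print(read_store("energy-app", "smart-meter", stores))  # granted: readings
print(read_store("energy-app", "hue-lights", stores))   # not granted: None
```

Note this only controls the first hop; as the answer says, it cannot prevent a processor joining legitimately exported results with outside data later.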
So each application is talking to its own data store within the Databox?

No, the author of each application needs to know the format of the data that that application is going to process. So somebody has to go and look at some specs before they write their app.

And how do you see this project, like, parallels between, let's say, the Solid project that's being led by Tim Berners-Lee, or the ownCloud project, where again, I feel there are some parallels?

I think that, based on my understanding of those projects, we are focused on the platform and the control in the platform, and less about trying to control what the application tries to compute out of it. Solid, I think, has maybe moved on since I last looked at it, but I think initially it was quite focused around the browser, for example, and we're not trying to be in relationship to the web at all. It's about having a device. I think that's the other thing that, when I last looked, still seemed to be fairly unique, something we were doing differently: having a physical device that users could control directly, and trying to provide the affordances that you get from a physical device, rather than something that's just abstract software in the cloud somewhere that you can't really control in the same way.

Good. Microphone one.

Hi, I'm curious if you have any data on the problem awareness in the UK, like, sorry, data on the problem awareness among the population, like whether they're already aware of the implications of these things.

Off the top of my head, I don't. From a previous project,
we did do a review of a lot of the privacy literature, papers that have been published about people's attitudes towards privacy and their understanding of the problems of privacy as represented in data. But I don't actually have any sort of statistical data about how aware the population generally is of these kinds of issues. When we have looked a little bit at that kind of thing, when I was at Nottingham, for example, we did some work with one of the standard surveys that I think the city council executed every year, or frequently anyway. And if I recall correctly, there were some questions in it the answers to which did not make sense from a technical perspective. One of the questions asked was, do you use the internet? And a lot of the respondents said, no, I don't use the internet, why would I use the internet? Another question asked was, how do you arrange to meet up with friends? And a lot of those same respondents who don't use the internet use Facebook to meet up with friends. So I think it can be quite difficult, from survey data, to tease apart what is really going on in terms of people's understandings and concerns, because some of these concepts are quite abstract, and also, as I said, a lot of it is very dynamic; it's a sort of recipient design. I can give you one answer to the question, am I concerned about the privacy of my data, and if you frame the question slightly differently I will give you a different answer, because you've triggered something else. So I think it's quite difficult to gather really robust data where you can be satisfied with the inferences you draw.

Thank you. So there are no more questions, I think. Another round of applause, please, for mort.