Oh, this works. Oh my goodness, this is so nice. Have you ever given a presentation to an empty room? You can't hear me, even with the mic? I asked whether you have ever given a presentation to an empty room. Yeah. I mean, I don't know whether this is being recorded or not. Should I even bother? There's no one here. I think I heard some clapping. Yeah, but everybody uses their soft phone, so I don't know if anyone's recording. But if no one shows up in 15 minutes, I'm just going to go.
Were you at the keynote? Okay. I saw that you were at game night. Yeah, I was very tired and he was very hyped up. Game night was like, oh my goodness, there are so many things to do and I want to do all of them. There were so many kids that he just went and played with all of them. But I was very tired; I didn't have any patience for it. How come you're here for this talk? I completely missed it, what was your name again? You don't have a name?
Okay, this is actually going to be very quick, because I need to rush and catch a plane. I'm just going to go ahead.
So this talk is about poking into container images for license information. There's some information about what a container image consists of and what the license implications are for distributing container images.
There are some disclaimers here. This talk doesn't cover anything about FISMA, HIPAA, or PCI DSS. It's about open source software legal compliance, but the same concepts should apply to meeting all of these other compliance obligations. I'm not a lawyer, and this presentation doesn't contain any legal advice, so don't ask me for legal advice. The examples I give are mostly about containers built using Docker, because Docker is the most ubiquitous tool for building container images, so this talk will be focused on that.
Okay, so what is open source software compliance? Let's imagine that I made this delicious dish and you like it so much that you say, hey, give me the recipe. For the sake of argument, let's say that I give you a very detailed recipe, including where I got all the ingredients from, and I've also told you: if you happen to share it with anyone else, just let them know that it was my recipe. And you say, okay, I'll take the recipe. Then you decide you are going to start a business packaging up this dish, and you'll base it on the recipe. It won't be exactly the same, maybe you add some extra sauce, and you say that this is your family recipe. Which is incorrect. It's not your family recipe, it's my recipe. Maybe I thought it up, maybe it's been passed down for a long time. The point is that it's something that I put a lot of work into, and it's something I gave you in good faith.
So that's what open source software compliance is all about. It's about giving credit where credit is due and following the wishes of the creators of the work. For software, the wishes are described in a license file. The license file is very useful in communicating how the creators would like you, as a consumer, to treat their work. Not giving credit or not following the wishes of the creators will get you some bad karma points with the community that you're involved in, and it will put you at legal risk. I'm not going to talk about the legal risk today.
So how do you comply with OSS licenses? First, you find whatever your software's build-time or runtime dependencies are. Second, you find the licenses for those dependencies. Third, you do what the licenses tell you to do. Again, this talk is not about number three. It's about number one and number two.
There are some tools to help you through the Linux Foundation. There are some other tools that I'll talk about later. There are lots of enterprise tools that help you with this. It's a problem that a lot of folks are trying to solve now.
It's a hard problem even from a single app's perspective, because you have to track the app's dependencies all the way down to the operating system, and then you have to find licenses for all of those dependencies. And if any of those licenses are copyleft licenses, like one of the GPLs, then you have to find the sources that created the binaries that went into those dependencies. So it's a hard problem. But it gets harder when you containerize your app.
In order to understand that, you have to understand how container images specifically are built. So, a really quick overview on containers. They are running processes. There's no magic box that's being deployed in the cloud; it's just Linux processes. They act on file systems, and that's what a container image is: it's just files. Docker, a tool that builds, runs, and distributes containers, makes doing this very easy from a consumer's perspective. And because it's so easy, consumers use it to deploy their app with all of the app's runtime dependencies. This is a plus for consumers, because they feel they don't have to manage any of these dependencies. And the community at large encourages reuse of containers, that is, building new containers on top of existing ones. There are some references over here where you can go and find more information about how containers are built and run.
They're very good, and I learned a lot from them. What I am going to focus on is that part, which is that container images are files. So we're going to look at what those files are, and I'm going to focus on that one, which has the license implications that we're going to talk about soon.
When you build container images, what you're basically doing is using the Linux kernel storage drivers. They are called graph drivers, or union file systems. You start off with a minimized operating system, a very small miniature operating system. So you basically operate on a tarball that has all of the files that you would expect to find in an operating system. Then, using the union file system storage drivers, you create a copy-on-write layer, and in that read-write layer you run scripts. The copy-on-write layer basically means that whatever file you're modifying gets copied into the new layer from the bottom layer. So even though the bottom layer is a read-only layer, the file that you're modifying gets repeated in the top layer. When you finish that operation, you get a new layer that has all of the files that you modified or changed.
It's a good time to mention that if you were deleting files, the files don't actually get deleted. They get tombstoned: the file still shows up in the top layer, but marked with a .wh. prefix, and that's called a whiteout. There are implications about using whiteout files when you're distributing, and I won't get into that right now; it's just an aside.
So container images are built like this, layer by layer, and finally what you get is a collection of directory trees, one for each layer. When you distribute an image, this is what you're distributing. The implication here is that when Docker and other tools that containerize your app are used, they actually start from that
beginning image. So if you're using a Dockerfile, you'll start from a FROM image and you'll download that original image, with its collection of directory trees. You'll copy your app in, and then you'll run a script on top and create more layers. So you've got your changes and then you've got somebody else's changes, and when you do a docker push, you're not only distributing your changes, you're distributing all of the layers underneath. What that means is that now you're legally obligated to follow all of the license obligations for all of the files that you've distributed. Regardless of whether your app actually depends on that library or not, you downloaded it and you've distributed it. So you are legally obligated to follow all of the licenses in all of the layers. This is the big side effect, the biggest compliance implication, of using that FROM in your Dockerfile.
I want to talk about multi-stage Docker builds, because this is something that the community encourages you to use. You start from a beginning image that has your app's build dependencies. So this is a container that you use to build your binaries. You copy your app in, which has all of the source code, and then you do a build. Then you copy all of those binaries into a container that only has the binaries' runtime dependencies. The community encourages this because it is one way of minimizing the size of your container before distributing it.
But what happens here is that, for golang in particular, it's a statically compiled language. So your binaries will still contain pieces of the libraries that you depended on, and you are still legally obligated to follow the licenses of all of those libraries. So if you happen to depend on a GPL library, then you have to go and find the source code that created the binary that you're distributing. Docker deletes the build container.
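To make the multi-stage point concrete, here is a minimal Python sketch that reads a Dockerfile and reports which base image ends up in the shipped image and which build-only bases contributed files through COPY --from. The function names and the example Dockerfile (including its image tags) are hypothetical, made up for illustration, and the parser is deliberately naive: it ignores line continuations, build ARGs in image names, and stage-name case folding.

```python
import re

# Naive patterns for the two instructions we care about (illustrative only).
FROM_RE = re.compile(
    r"^FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?\s*$", re.IGNORECASE
)
COPY_FROM_RE = re.compile(r"^COPY\s+.*--from=(\S+)", re.IGNORECASE)

# Hypothetical multi-stage Dockerfile, not taken from the talk.
EXAMPLE = """\
FROM golang:1.11 AS builder
COPY . /src
RUN cd /src && go build -o /app
FROM debian:stretch-slim
COPY --from=builder /app /usr/local/bin/app
"""


def parse_stages(dockerfile_text):
    """Return one dict per build stage: base image, alias, COPY --from sources."""
    stages = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = FROM_RE.match(line)
        if match:
            stages.append(
                {"base": match.group(1), "alias": match.group(2), "copies_from": []}
            )
            continue
        match = COPY_FROM_RE.match(line)
        if match and stages:
            stages[-1]["copies_from"].append(match.group(1))
    return stages


def summarize(stages):
    """Report the base image that ships and the build-only bases behind it."""
    final = stages[-1]
    by_ref = {}
    for index, stage in enumerate(stages):
        by_ref[str(index)] = stage  # stages can be referenced by index
        if stage["alias"]:
            by_ref[stage["alias"]] = stage  # or by their AS name
    build_only = [
        by_ref[ref]["base"] for ref in final["copies_from"] if ref in by_ref
    ]
    return {"shipped_base": final["base"], "build_only_bases": build_only}


if __name__ == "__main__":
    print(summarize(parse_stages(EXAMPLE)))
```

The build-only bases are exactly the ones whose layers never reach the registry, which is why the license trail for a statically linked binary has to be captured at build time rather than recovered from the final image.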
So how are you going to do that when your build container is gone? It's impossible to track the provenance of an artifact in a broken pipeline, and when you delete your build container, your pipeline is broken. This has other implications as well, which I'm going to talk about later, and that's build reproducibility. But actually going back and following the licenses of all of your dependencies becomes much, much harder.
So yeah, at this point everything's on fire. I want to stress that this is an industry-wide problem that people are trying to tackle. Usually when I talk about all of these implications that Docker builds in general, and multi-stage Docker builds in particular, create, I get a lot of grief from engineers, because that's all they have.
Okay, so that is the reality that we live in right now, but there is still some hope. We can still go and figure out what packages are installed. Let's do that with an example. Let's look at this golang container. You can run docker history to find out all of those layers. It gives you an idea of how big your FROM image is. Let's look at this. You can see there's a bunch of "missing" entries in this. What this basically means is that you have no configuration information for all of these layers. This is what happens when you do a docker push: whatever methods you used to build that container are no longer recorded. It's gone. So all you're left with is just this information.
You can actually find out what container images were used to create this container image, but you have to go and look at Docker Hub, find the Dockerfile that created it, and trace back all the FROMs. So I did this for this image.
This is the only golang 1.11 part specific to this image. This is built on top of this image, which is built on top of this image, which is built on top of this image. There is no other way of finding this information other than clicking through Docker Hub and going to the Dockerfiles that created the image, if they happen to be posted on Docker Hub. So it's very iffy already. Right off the bat, you can't track the provenance of the containers.
But there are some clues that you can look for. Here's one: you see this ADD file, and that should be an indication that it's some kind of operating system that you started with. What kind of operating system? Well, over there they use apt-get, so it's probably Debian or Ubuntu. You'll notice that only a small subset of layers actually occupy space. These are the real files. The rest of them are all configuration, so setting up environment variables and things like that.
You will find that a good share of the size is created using the package manager, which is good, because you can use the package manager to go back and list out all of the packages and their dependencies. Excuse me. But you'll also notice that a pretty large share of the size is created using shell scripts. There's one over there that's 341 megabytes. That's actually the golang package; you'll find that out soon enough. These shell scripts can get very large, and it takes a little time to go through them. But if you go through them, you might find some clues like this. Okay, here it looks like you're downloading the golang package.
You're checking the SHA sum of it and then you untar it into /usr/local. My phone is on, sorry. Okay. So you don't know what that golang version is, and you will never know what that golang version is unless you saw the build and release pipeline that created it, because this is one of those ARGs, external arguments that you pass to your docker build. So if you just have the container, you won't know what that is. You might be able to find it in the log files, because they have a step that prints out the go version, but you won't find it here. So just keep a note of that: it's some indication, but not all of the things that we need.
Okay, so what do we have? We have a possible base OS and a package manager, and then we have some install scripts. But we're missing what packages were installed using the package manager and what packages were installed using scripts, and we don't have versions and licenses, which is what we want to look for here.
You can actually get the raw image by using docker save, and that will get you all of the files that are in the image. Here's an example. We did that here, and there's a bunch of directories, and there's a file, and there's something called manifest.json. That's the entry point to the image. If you look at manifest.json, it points to a config file. The config file has all of the container image's metadata, including the history that you saw when you typed docker history. All of that information is already embedded in here. So you don't actually have to do docker history.
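As a rough illustration of reading that entry point, the Python sketch below opens a docker save tarball, loads manifest.json, follows its Config pointer, and pairs each layer tarball with the history entry that created it. It assumes the layout that docker save has commonly produced (a manifest.json list whose entries carry Config and Layers keys, and a config with a history list in which metadata-only steps are flagged empty_layer); the function names are made up here, and the pairing assumes the non-empty history entries line up one-to-one with the layers, which is worth verifying on a real image.

```python
import json
import tarfile


def read_image_metadata(tar_path_or_file):
    """Load manifest.json and the config file it points to from a saved image."""
    if isinstance(tar_path_or_file, str):
        tar = tarfile.open(tar_path_or_file)
    else:
        tar = tarfile.open(fileobj=tar_path_or_file)
    with tar:
        manifest = json.load(tar.extractfile("manifest.json"))
        entry = manifest[0]  # one entry per saved image; take the first
        config = json.load(tar.extractfile(entry["Config"]))
    return entry, config


def layers_with_commands(entry, config):
    """Pair layer tarball paths (bottom-most first) with their creating command."""
    non_empty = [
        step for step in config.get("history", [])
        if not step.get("empty_layer", False)  # ENV, CMD, etc. create no tarball
    ]
    return [
        (layer, step.get("created_by", ""))
        for layer, step in zip(entry["Layers"], non_empty)
    ]


def print_report(tar_path):
    """Print one line per layer: where its files live and what created them."""
    entry, config = read_image_metadata(tar_path)
    for layer, command in layers_with_commands(entry, config):
        print(f"{layer}\n    {command}")
```

Calling print_report on the output of docker save (for example a file named image.tar, a placeholder name) would list the layer tarballs in order, each of which can then be untarred and inspected.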
You just do docker save and then you get all of the files. The SHA sum in the config file's name is the SHA sum that you get when you run a SHA sum on the config file. I know that's confusing, but I think that's what they call smart pointers. The SHA-256 sum is also the image ID. The file is in JSON format, so you can parse it using any language that supports JSON, Python or Ruby for example.
And then there are these layers, and these are actually the ones that contain the files. If you remember, docker history had some indication of the sizes of each of the layers; these are those layers. They're non-empty, and they are ordered from the bottom-most to the top-most. How do I know this? From testing. These are all paths to the layer tarballs, so you can untar the tarballs and inspect them. I did that for the first layer, and sure enough, that looks like an OS. So that's cool. We have files.
You can also do some scripting to map the non-empty layers to the commands that created them. I did that really quickly over here, and you will see that I have one, two, three, four, five, six, the long shell script, and then a seventh one. There's a seventh layer, but back in docker history I counted only six. Where did that seventh one come from? Well, that one also lists out empty directories, and that occupies space as well. That doesn't get reflected in docker history; it says zero bytes over here, and I'm not really sure why it isn't reflected. But anyway, it's an indication that docker history is probably not the best place to go looking for the information that you want. It's better to just get the raw container image and look at exactly what files are in your image.
Okay, so let's take stock. What do we have now? We have a possible base OS with the package manager. We have install scripts.
We have directory trees, and we have the actual files that make up the container image. What are we missing? We still don't have the software packages, the ones that you can get either using the package manager or from the scripts, and we don't have any of the version and license metadata. But there are ways that we can get it.
One very common way that people use right now is file system scanning. There are lots of open source and proprietary tools that help you do this. A lot of people use security scanners to do this, and Clair is one that's very commonly used. This is a reasonable option. The trouble with this option is that it doesn't really give you context on where the files came from. It'll tell you that you have a CVE on this file. It might even tell you that the file belongs to this package. But it won't tell you how it got there, because you don't have any context on exactly what you did to get the file in there. This has implications when your base OS already has a copy of a package, OpenSSL for example, and you've installed a different one. It has implications if a base container that you're using added a special PPA that's stale. You'd never know that from here, but the security scanner will pick it up and say, hey, you have this vulnerability, and you'll have no idea how it got there. So it's an option. It's just a little limited.
There's another way that you can do this: you can actually step through your build. So go layer by layer and check out what you did. And you can do that fairly easily by using... I'm having a hard time.
Oh, fun. Okay, let me take my jacket off, maybe. Okay. So you could actually do that by using just basic Linux commands. You can mount your first file system, run a chroot command, and find out what packages are listed. You can also do that with the diff layers that go on top, using overlay. These commands can be used to mount an overlay file system and run the same command again. Here's an example. I ran dpkg --get-selections, because I'm using apt-get, and there are all those packages.
The trouble with this is that images get very complicated very fast. So it would be nice if all of this was automated. That's where Tern comes into play. Tern automates this process, layer by layer, for your container image. The way that it does that is very straightforward. It mounts the first layer, runs the package manager, and gets you the list of packages, and then for every subsequent layer it runs the same command. What you get is a layer-by-layer list of all the packages that are installed in the image.
A little bit about architecture. It's an extensible architecture in the sense that you can enable it for any kind of container image. Docker is the one that's supported right now, again because it's the most ubiquitous one, but you can enable this for any container image.
You just need to find a way of getting the raw image. Once you get all of the layers, the analyzer goes through each of the layers and figures out what packages are installed. It can do that because it's got a knowledge base of system package managers and a knowledge base to handle script snippets. Say, for example, you want to enable Python. Then it looks at the system-level binary: if there's a listing for pip, then it'll look for pip and use pip to list out all the packages. If you happen to have a GitHub repository that you copied in, then it looks for anything that has to do with git and identifies it that way.
After it collects all of that information, it sends it through a formatter, which you can also enable for any format. The default format is a human-readable format, but you can also get structured data like YAML and JSON. SPDX is coming out in the next release.
So here's a recorded demo. Let's see. Oh, there it goes. Okay, so here it's doing hashes of all of the layers, and then you can see it mounting and running scripts for each of the layers. It does that for three layers, and then it finishes the report. The report has what image you used, what layer we're looking at, and all of the packages that were installed in each of the layers. So, pretty straightforward.
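The layer-by-layer procedure described above (stack the extracted layers with overlay, chroot in, ask the package manager) can be sketched in Python as follows. This is not Tern's code; it is a minimal illustration with made-up function names and paths. It only builds the command lines rather than running them, since mounting needs root, and it assumes each layer tarball has already been untarred into its own directory.

```python
def overlay_mount_cmd(layer_dirs, upper, work, merged):
    """Build the mount argv that stacks extracted layer directories.

    layer_dirs is ordered bottom-most first, as in the image manifest;
    overlayfs expects lowerdir entries top-most first, so reverse them.
    """
    lower = ":".join(reversed(layer_dirs))
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged]


def chroot_cmd(merged, command):
    """Build the argv that runs a command inside the merged root."""
    return ["chroot", merged] + list(command)


def parse_get_selections(text):
    """Turn `dpkg --get-selections` output into a set of installed packages."""
    packages = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "install":
            packages.add(parts[0])
    return packages


def packages_added_per_layer(cumulative):
    """Given the package set seen after each layer, list what each layer added."""
    added, seen = [], set()
    for packages in cumulative:
        added.append(sorted(packages - seen))
        seen |= packages
    return added


if __name__ == "__main__":
    # Hypothetical paths; run the printed commands as root to try them.
    layers = ["/tmp/img/layer0", "/tmp/img/layer1"]
    print(overlay_mount_cmd(layers, "/tmp/img/upper", "/tmp/img/work", "/mnt/img"))
    print(chroot_cmd("/mnt/img", ["dpkg", "--get-selections"]))
```

Attributing packages to layers by set difference is one simple choice made for this sketch; it does not notice packages that a later layer removes or upgrades.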
I want to talk about the results that it produces, because it's an indication of how granular the results are. So let's look at the report. You can see that it tells you what image, and for what layer: what created the layer, and what scripts are being run on top of that layer to get all of the package information that you see. So these are all the packages of the first layer. This is the second layer. It tells you the same thing, what scripts it's running to get that information, and those are all the packages and their dependencies that were installed in layer 2. And then it tells you if there's anything it doesn't recognize, or if anything's copied over that it doesn't understand. So it's not ignoring anything that it doesn't know about; it reports that it doesn't know about some files. This is because there are so many ways that you can put files into container images, and because there's such a big universe of them, sometimes it doesn't know. And it wants to be very transparent about that.
So here are some features that it has right now. It supports the Debian, Ubuntu, Photon, and Alpine package managers, so it supports any images that are based on these OSes, and it lists all of the packages and the dependencies that came in with them for each of the layers. So for every layer the image has, if you've used any of these package managers, it'll find the whole list of dependencies. It's got an extensible architecture, so you can add your own method to find the licenses of anything that you're working on. There's caching: it caches by container image layer, so if you happen to have used the same layer other times, it will just pull it out from the cache. That helps make the process go faster.
Oh, and it can be used as a standalone tool to help container developers, as part of a build and release pipeline, or on your desktop if you just want to look at images. It supports structured JSON data and YAML. And there's an active community. It's a small project, but it's growing.
And here's some future work. SPDX document support is coming up in the next release, which is in May. There's work to enable language package managers like pip, npm, RubyGems, et cetera. Then we'll be working on enabling external repositories, so one-off repositories that you copy in. And then we'll start on proprietary support, which is: if you know about the project that you're working on, or it's internal to your org, then you can hard-code all of that license information.
Okay, I'm going to stop right now, because I think I'm done. It says... no, it's 12:11. Oh, I can go on. Okay, good. Sorry about that.
All right. So what do we have? We have software packages installed using the package manager. We can tackle files of known origin, and we can find their versions and licenses in this way. But we still don't really have a way of tackling software packages installed with long scripts. This is simply because people tend to put their whole build and release shell script into one line in the Dockerfile. This is a human problem. You can't really use automation to tackle this. What are you going to do, write a parser to parse entire shell scripts and figure out what was going on in your head when you made that?
No. Okay. It can't really tackle files of unknown origin. If you were to copy a random binary from your desktop into the container when you're building it, it doesn't know anything about that random binary. Only you know where it came from, and that may or may not be the case. But, you know, people do silly things.
It can't tackle post-processed images. It took me a little while to find the name for this, because there are a lot of tools out there that will take an image and strip out information from that image in order to reduce its size. This includes microcontainers. What they do is take the image, find all of the LD files, that's the runtime dependencies, and strip out everything else. Or the multi-stage Docker builds, where they copy just the binaries into your small runtime container, and then there's no way for you to go and trace it back. So it can't really do that.
So we cannot substitute automation for software best practices. A lot of people say that they can automate DevOps out, but really you need some best practices, and some humans to enforce the best practices and think about them.
I'd like to spend a little time talking about reproducible builds. The idea of reproducible builds is that, given the same environment and artifacts, you should be able to rebuild from source byte for byte, and it should do this every time you run a build on this pipeline. This helps with debugging builds. It helps QA find regressions. It helps check for security flaws, and it helps check for license compliance. So reproducible builds are very helpful, and this is a standard software deployment practice; a lot of build and release engineers will tell you that. And most container builds do not follow any of these standards.
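One simple way to check the byte-for-byte property is to build the same artifact twice and compare digests. The Python sketch below does only that comparison; the function names are made up for illustration, and it says nothing about how the two builds are produced or how to make a build reproducible in the first place.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproducible(artifact_a, artifact_b):
    """True when two build outputs are byte-for-byte identical."""
    return sha256_of(artifact_a) == sha256_of(artifact_b)
```

For container layers, things like file timestamps recorded in the layer tarballs can make this check fail even when the contents match, which is part of why it is hard for container builds.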
I'm not sure why. Maybe there is a feeling that, because containers are ephemeral, if a container doesn't work you get rid of it and build a new one. This idea doesn't really give you any information about what exactly went wrong. So it's very hard for you to debug what issues there are with your app, whether any of those issues were preventable, or whether there were any issues with your dependencies, so that you needed to update your container. It's a very blind, non-systematic way of dealing with issues like that.
So this is a shout-out to get involved with things like making container builds more reproducible and working on license compliance. You can contribute to Tern. You can follow me on Twitter. What I usually do is publish some good first issues on Twitter that folks can handle. Good first issues are issues that are easy to handle. I try very hard to break down the big issues into tiny issues so folks can easily contribute. Provide your input if you feel strongly about reproducible container builds and license compliance. Get involved in the Open Container Initiative mailing list. There are some links that I have over there. I hope they let me post the slides. I gave a KubeCon BoF on container build manifests, and I have a live document going that has a spec for container build manifests. And here are some projects that are working on reproducible containers: Singularity, which tackles just containers, and then in-toto, which tackles the security of the whole build and release pipeline.
All right, and thank you very much. I have time for questions. Yeah, I have, like, 20. Yes?
So there's no support for RPM-based distros, but they're very easy to enable. Follow me on Twitter and I will talk to you, or, you know, post an issue.
This is easier, actually: post an issue on the GitHub page and ask for RPM. Sorry, what? No, there's no issue that says, hey, enable RPM support, but you can totally go in and say, enable RPM support. It's actually reasonably straightforward, because Photon is based on RPM, so you can just copy all the scripts over from there. So it's fairly easy to do.
Any other questions? Yes. I am not familiar with... is it Puppet? Yeah. So I'm not familiar with any tools that Puppet is using right now. This is a very license-oriented tool, and I understand that there are a lot of security scanning tools that also report with that kind of layer-by-layer granularity. Having said that, most tools are unable to handle the whole universe of container images that are out there. So I'm not going to say that this is better than what they have. I would think that they're kind of similar and they just use different approaches. There's a tool called container-diff that kind of does the same thing, except not layer by layer. I think layer-by-layer reporting is one of those things that they put in commercial tools right now, but I don't know of any open source tools that do the same thing.
No, no, it's open source, BSD-2 licensed. It's one of the projects under the Linux Foundation. They have a project called Automated Compliance Tooling, and this is one of the tools in it. Yes, sir. Right.
Yeah, so the question was: how do you report any compliance violations, or any follow-up work that needs to be done, after you get the report? Tern doesn't tackle anything like that, although that's a request that was sent in: have a rules file, compare the results to the rules file, and then see if you're in violation. It's definitely possible to do that, but it's not necessarily in scope for this tool. You could totally build another tool that will do that: take this report, have your own rules files, do a comparison, and then alert. There's also a way that you can put it in your build and release pipeline: take two reports, do a diff of the reports, and for anything different you get an alert. So that's definitely good downstream future work that can be added, but my feeling is that it's not necessarily in scope for this. Different formats of the report, to enable those external tools, are definitely a good in-scope issue to add. But adoption needs to increase in order for us to get a good idea of what kind of reports people are asking for.
All right. Okay, well, if there are no more questions, I'm going to stop. Thank you very much.
When I met you for the first time yesterday, I said, "Afghanistan or Iraq?" You looked surprised. How did you know? I didn't know, I saw. Your face is tanned, but no tan above the wrists. You've been abroad, but not sunbathing. Then there's your brother. Your phone: it's expensive, email enabled, MP3 player. You're looking for a flatshare; you wouldn't waste money on this. It's a gift, then. Not your father; this is a young man's gadget. Could be a cousin, but you're a war hero who can't find a place to live. It says you've got problems with him. Maybe you liked his wife. Maybe you don't like his drinking. How can you possibly know about the drinking? Every night he goes to plug it in to charge, but his hands are shaking.
You never see those marks on a sober man's phone; never see a drunk's without them. That was amazing. Do you think so? Of course it was. It was extraordinary, quite extraordinary. That's not what people normally say. What do people normally say? "Piss off."
Oh, yeah. Testing. Is this microphone working? Thank you. Okay.
So it's half past, so I should get started. Hello. Are you all enjoying SCALE, the last day? My name is Stuart Langridge, and today I'm going to talk a little bit about privacy: about how it could be the next big thing, what we do about it, and how you can be ahead of your competition.
In 2012, Target, the big American discount store, put together a list of 25 products that, when purchased together, indicated that the purchasing person was likely a pregnant woman. And then they mailed out coupons for baby products to people who bought these products. The idea was that they'd steal a march and have the people come and do their shopping at Target. And one of them's father stormed into his local Target store and demanded to see the manager. He said, "My daughter got these in the mail. She's still in high school, and you're sending her coupons for baby products. Are you trying to encourage her to get pregnant?" And then, a couple of days later...
He apologized profusely when it turned out that she was pregnant. Women are less likely to be shown ads for high-paying jobs. If your social media friends have bad credit ratings, it can be harder for you to get a loan. Uber have tracked Uber drivers who were attending taxi protests and then fired them. They've got a thing called God View, which tracks you after you leave one of their cars. They wrote a blog post about 'rides of glory', which were people taking Ubers home after one-night stands. They retracted that one, because even they're aware that this sort of thing is really, really creepy. Isn't it great to live in the 21st century, where deleting history has become more important than making it? There used to be a saying: if you're not paying for the product, then you are the product. And this has always been various levels of untrue. Sometimes you're paying and you are the product. The fact that you're getting something for free does not mean that you've signed up to be exploited in every way possible. There is no correlation between how much money users pay and how well they are treated. And sometimes I'm okay with being the product. If I'm not paying for a thing and you've decided to monetize it by showing me adverts, they might as well be adverts for stuff that I like, stuff that I might be interested in. So sometimes it's okay. That was always the problem with television ads, back when people actually watched television: they weren't targeted at me particularly, or at people like me. Most of the time they were adverts for things that I didn't care about, had no intention of buying, no intention of looking at. What's different about targeted stuff is a word I've used already a few times. It's a word you hear a lot from people, colleagues at work and people on the train. Creepy. What does it mean? 
The issue here is about aggregation About emergent phenomena the idea of data science is to get a big pile of facts and then deduce from that Extra-additional facts that you weren't explicitly told that's what target did When they did this and when they did their list of 25 products you could deduce from the fact that you buy unscented lotion that you're pregnant and That's the kind of thing that data science is for it's what target did It's what Sherlock Holmes did, you know take some data draw new and surprising conclusions It's actually quite fun to watch if it's happening to someone else Well, I met you for the first time yesterday. I said Afghanistan or Iraq you look surprised. How did you know? I didn't know I saw for no time about the risks. You've been abroad, but not some baby And there's your brother the phone expensive email enabled MP3 player. You're looking for a flat show you wouldn't waste money on this It's a gift then not your father. This is a young man's gadget could be a cousin But you're a war hero who can't find a place to live says you've got problems with him Maybe you liked his wife. Maybe you don't like his drinking How can you possibly know about the drinking every night goes to plug it into charge But his hands are shaking you never see those marks on the sober man's phone never see a drunks without them that Was amazing Extraordinary it's quite extraordinary. It's not what people normally say what are people normally said piss off That was from Sherlock BBC program Watson being Amazed by Sherlock's deductions. This is not the face of someone who was pleased and delighted by their user experience People do not like it when you do this It's creepy. 
It's strange What I want companies to do is to learn this that your data collection is creepy when you use it to deduce things that you weren't told and that you shouldn't know and To a large extent doing precisely this is what data science is for So there's something of a mismatch here, you know, and this is not a new phenomenon that I'm talking about Supermarkets are laid out in an incredibly precise way There's a huge amount of literature and knowledge within the supermarket industry about how to lay out your store You've got vegetables at the beginning because they communicate Freshness and the bakery is often near the beginning because it smells nice and the smell of the bread attracts people to come into the store Stuff you actually want to buy is at the back So you've got to walk past everything else to get there and you'll see other things that make impulse purchases It's the same reason that the only exit from the airport is through the duty-free shop Despite the fact that you probably don't want to watch or Some cologne you've still got to walk on an incredibly windy path through the duty-free shop, and that's exactly the same principle Every aspect of a store's layout is designed to stimulate shopping serendipity And people find this whole thing weird and unpleasant and the worst thing is that they're trapped because There's nowhere else to go. It's not you can say it Well, I won't use any supermarkets or I won't use services that collect my data the whole time make me feel creepy about it You've got nowhere else to go that there were a bunch of stock answers that people tend to give or people tend to get For what to do about this and none of them are right you can't Opt out you can't just say I'm not going to use any of these services that collect my data. 
It's just not realistic It's not impossible in the same way that if you wanted to you could go and Live in a cave in the desert if you want to it's doable But anyone who's seriously advocating that as a solution to your problems. No, come off It's we are now all part person and part machine Everyone anyone not got a mobile phone In a talk on privacy We're all part person all part machine and that's honestly that's okay It's a good thing. We now live in a world where anyone growing up now will never know what it is to be lost The never have the experience of being lost every horror film Now has a bit at the beginning where they have to come up with some bullshit excuse about why no one's phones work It's like, oh, no, we're trapped in the woods. We're being chased by possibly the Blair witch Oh, I'll just ring the police then it doesn't you know I say someone growing up never gonna know the experience of being lost and I didn't know what what it means You can listen to any music you want from all of history You can video call people on the other side of the world at a moment's notice Louis the 14th couldn't do any of this This stuff's amazing. There's superpowers We shouldn't have to give them up or trade them away I'm saying just opt out of using stuff because you want to protect your personal data Means giving up the superpowers and you shouldn't have to if you leave your phone behind It's like missing limb syndrome Ignoring up to out you can't Regulate the problem away the European Union have done some work on this a body of which I am a part for the next week and a half So do you are doing quite a bit of work on this you've seen things like GDPR I mean, I know a lot of people think oh, that's just the annoying thing that pops up stuff on newspaper websites But there's a bunch of stuff in there about personal data protection. 
It's good things in India The Indian Supreme Court have declared that privacy is a fundamental human right government regulation is an important part of this It is needed, but it's too slow. It's too easy to stay ahead of it and Frankly big businesses who want to be able to collect all of your data and violate your privacy have got more lobbying dollars than we've got So it's not a way to do it. Um one of things that John Stuart mill said he wrote lots and lots and lots about free speech But one of the big things he said is forgotten Which is the laws passed by the government so about the 90th most important restriction on your freedom of speech Americans right Tend to make a correctly tend to make a very big deal about the First Amendment Non-Americans also tend to make quite a big deal about the First Amendment After they get dragged into prison till they find out they don't have one There are lots and lots and lots of cases of people say people going into courts in other countries are saying These are my first amendment rights because they learned it from television and the courts going We don't have a first amendment Doesn't work like that government regulation is a part of the answer here But it's a best apart and it can't be the lead part The other thing you can't do is shower everyone about it. Don't have a go at people Because they're not using your sick your choice of secure messenger. It just annoys your friends What we need to do is move the overton window. We need to shift the public discourse Giving people, you know a kicking because they're on the wrong messenger Doesn't help right and this is something so am I said that Mozilla knows that privacy has never been an effective selling point. 
I Will come back to this But she's making the point if you if you say to people you should be using these services because they do a really good job of protecting your privacy at the Moment you use those but no one else does so you can feel Morally superior but have no friends No, I deal Right now people don't know how to care about this and You can't get a new public who do care either The children of the revolution were faced with the age-old problem It wasn't you had the wrong kind of government which was obvious, but that you had the wrong kind of people This is from Terry Pratchett book and Pratchett's point was that this is wrong thinking you can't think like this You can't go you people don't care, but you ought to care and therefore it's your fault This is not the way forward More than 70% of people currently it's a survey done by the BBC More than 70% people would reveal their password in exchange for a bar of chocolate. I have a bar of chocolate I Don't mind I can get a second bar of chocolate, but you see my point Somebody always says admin And then normally I'll give you a piece of kale or something as a punishment, but I'm in California where it would be a prize so But here's the bad news for this technology not the facts The tech is not the hard bit. 
There's loads of tech already working on this problem There's signal and there's matrix and there's purism and there's privacy budget And you've got VPNs by the dozen and you've got password managers by the score and you've got Tor and the problem that we've got is a chilling effect people are Frightened of what might happen and they don't know now most people in this room relatively technical You've got some sense of if a company's got a big sequel database full of information about everybody You've got some sense of what they can do with that the realm of the possible and to some extent that's comforting But real people Don't know this they don't understand what can be done And they imagine all sorts of terrible strange things But because a lot of the data science predictions and things that AI and whatever could do look like magic and Therefore if you assume that it's magic you assume that they can find out all sorts of things I know plenty of people who if I'm I'm going through a website the purchase process and I stop halfway through and I haven't filled in my card number I'm perfectly happy in my head that I'm not going to get billed for this There's no way they can bill me for it because I haven't told them my card number My parents wouldn't feel anything like that secure the below. Well, I've filled it I've filled in half of it. What if it continues to place the order and Because I have in my head. I know I didn't give you my card number. Therefore, you haven't got my card number Therefore, you can't bill me for it Or people don't Real people normal people don't know this and they fear what they don't understand and That's the definition of a chilling effect Something where you haven't made rules about it, but people have discouraged any way And that's the problem Our freedoms aren't being taken away We're just afraid to use them if you feel like you're being watched You change your behavior. 
Ideally People really would dance like nobody's watching but hardly anyone does But everyone's still involved as I say everyone still uses all these services because they've got no choice But what if there were a choice and people knew it whoever gets this right whoever builds services which Reinforced people's sense of security rather than trading on it whoever does this Will define the next ten years of computing Mobile the mobile came around changed everything It changed the world put the power out in people's hands and it made billionaires and it made industries and everything old was new again And we got to look at everything through a whole new lens a whole new way of doing everything with the device in my pocket Social media came along Changed everything it changed the world They put the power out in individual people's hands it made billionaires and it made industries and it made everything old new again And we looked at everything through a whole new lens to major shifts in technology Which changed and colored everything that came after them now. They're not exciting whizzy new things They're just a normal part of the landscape. It's just expected and that's what can happen with Privacy protection as well go back in time and tell Morpheus on his cool Nokia banana phone The Less than ten years from that I've said 20 years ago, so 20 years from that But even less than ten years from then Everything will be mobile Absolutely everything and he won't believe you go back in time to about the same sort of time until people on six degrees Anyone remember six degrees other than me? 
Yeah, a couple of you world's first social network Go back and tell them on the first social network that Ten years from now everything will be social media in some way or another it will change everything it elect presidents and They go forward to an imagined world a world where your data is yours and everything still works and Tell them that there was a time when we felt like we had to give that up that was the price we had to pay and They'll laugh at you and they'll ask you where your penny-farthing is It's people want this fixed 82% of people are not are Not comfortable with the sale of their data to third parties in exchange for speed or convenience or product range That's from a report 82% of people half of all people have avoided doing some basic stuff online Because they have concerns about how their data will be used people want it fixed Here finally is an industry that actually needs disrupting And that's how you disrupt it There's a pervasive myth especially in the tech world that if you build a better mouse trap the world will be to pass through your door They won't stop believing this It's not the truth people will not build about a path to your door just because you've built a better mouse trap What you need to do is overcome the incumbents on a field that they cannot Compete on not that they won't although they don't but that they can't Apple spend 20 30 years competing with Microsoft on the desktop never managed to do it invent a whole new platform Mobile where they have first-mover advantage where they win Microsoft didn't have compete but Microsoft know this because they did the same thing It's a mainframes a computer on every desktop Rather than a terminal to the big iron in the basement They changed everything by inventing a new field if you are you know a huge mainframe company the idea of selling some company 20,000 PCs or the software to go on 20,000 PCs rather than one huge iron box You're just not geared up to do that and that's how you do 
it: find a new place where the existing incumbents can't compete. And for data collection, this is a privacy thing, right? If a company has built its business model entirely around invasively collecting everything about everybody, and then you make not doing that be a thing that everyone wants, they cannot compete. 'Facebook, but privacy-focused' can't exist. I don't care what Zuckerberg said in a letter, right? Because it's rubbish. Facebook but with privacy can't exist, and that's how you win. And the advantage with this is that it's a weapon that only hurts bad people. If your business model is not reliant on data collection, if you're doing it but you're not reliant on it, then you'll just stop doing it, and that's good. If your business model is critically reliant on doing this, and we switch the world so people don't like that, then your business dies. But honestly, I'm okay with that. And as I say, real people want this, not just us, not just techies. Everybody finds this stuff unnerving. People talk about it on the train and in the pub. It's a standard topic of conversation. Mainstream newspapers talk about this kind of thing. 'Facebook do weird things with your data' is a mainstream opinion. Tinfoil hats are now a fashion item. The world's ready to be convinced. They are eager to be convinced. So how do we do it? This is where it gets harder. And this is the bit where I'm not entirely sure that I've got answers, but I've got a couple of suggestions, and what I'm trying to do here is start a conversation. There are people out there with cleverer ideas than me, and they'll come up with them as long as we start talking about this. A good way to do it is differential privacy. This is something some of you may have heard of. Apple started talking about this a couple of years ago. It came from a paper by Dwork and team, and it's basically a way of getting aggregated information. 
So you ask people questions, and you are able to derive from your big pile of data all the aggregate info you need, but any one individual person's answer can't be identified. You can't even know whether they participated. Differential privacy is a good thing, but it's some pretty heavy maths. So if you want to get into this, read the paper and start implementing it. Good stuff. I have no intention of explaining it, first of all because my explanation would be incomprehensible, and secondly because I'm not wholly sure I understand it. But what I am going to talk about is a simpler but similar method. It's something that those of you who are developers can start implementing now. It's called the randomized response method. And let's think about it this way, right? What data do we collect about people, the sorts of things that make them feel uneasy? What kind of data is it? There are two kinds of data collection: overt and covert. Covert data collection is stuff where you haven't asked the user. It's stuff you can and do collect without asking them explicitly, and it's the sorts of things that Google Analytics collects for a website: what screen size they have, what kind of device they're on. Machine-derivable information. Some of this stuff is perfectly sensible and perfectly easy to collect, and nobody much minds. And some of it is terribly invasive: you hear all these stories about Android apps which steal your address book and email it off to the company, or the list of other apps you've got installed, and so on. I'm mostly not going to talk about this. If you are collecting the standard stuff, like what screen size someone has got, carry on. If you are secretly using undocumented APIs to mine someone's address book, pack it in, and we're fine. What I want to talk about is overt data collection: stuff you ask the user about rather than stuff you deduce, and it's about them rather than about their device or about what they do. 
It's demographics the age and gender where they live whether they're married the stuff that Advertisers like the stuff that they want in order to segment their audience and This is the stuff that people are uneasy about having collected because it's about them The goal of it is to put your audience into different categories In order to be able to look and say okay, most of our users are between 25 and 34 So that's where we should direct our our user experience That's going to be or to think ah, we aren't doing particularly well in the over 60 demographics So let's start changing how we advertise or start changing how we do things or start doing more user testing with that group It's it's about segmentation. It's about segmenting people into buckets Yeah, and there's loads and loads and loads of literature about this and how to do it and why to do it the whole advertising industry is founded on it Given that the advertising industry is not going to change overnight They're still gonna want to be able to do this. You're still gonna want to be able to say Okay, most of our audience is in between 25 and 35 or they can say most of our audiences in the US or in Idaho or on this street and As I say, that's the stuff that people feel uneasy about because it's about them They feel like you're finding too much about them that you wouldn't necessarily otherwise know So let's take an example, right? age buckets you want to You want to know what age your audience is So you can know where to direct your advertising dollars what who to test with where you are Overserved where you are underserved so on and so forth. 
So you want to break it up into those ranges. You want to know what percentage of your audience is between 35 and 44, or what percentage is between 44 and 60, and so on. So the randomized response method was invented in the 60s by sociologists, and the idea was that they wanted to be able to ask people difficult questions and still get sensible answers. But these were questions that people wouldn't necessarily want to answer: things like questions about illegal activities. Have you smoked weed? Which, in the 60s, no one wanted to write on a survey: 'My name is Stuart Langridge, this is my address, yes, I have smoked weed.' Just in case the Old Bill come in and pick up the bit of paper and then go, 'You've confessed to a crime,' and then put you in prison. But equally, if you want to make sensible drug policy or do surveys on this, you need to have some sense of how much of your audience have done this. If you want to ask questions about sexual behavior, again, some people don't necessarily want their answers recorded. But you want to be able to make aggregate decisions. That's the point of doing the survey. So you want people to give the answers, and you want to get accurate answers, but you don't want people to incriminate themselves. How do you do it? 
You have people tell lies, is how you do it. Basically, some of them just lie about their answers. You say to someone, 'Okay, have you smoked weed?', and then, unknown to you, they flip a coin. If the coin comes up heads, they tell you the truth, and if the coin comes up tails, they lie about it. So if they have, they say they haven't; if they haven't, they say they have. And what actually happens in aggregate is the lies all cancel one another out. So if 40% of people have smoked marijuana and you go through this test, the answer that you get from it will be about 40%. But any one individual person's answer is not reliable. The police come in and show me the bit of paper which says, 'This is you confessing to a crime.' I can just say, 'No, I flipped tails on the thing and I lied,' and they can't prove otherwise. It's the same reason that voting in elections, the secret ballot, is secret: so that you can't be intimidated into voting for someone else. Al Capone shows up and says, 'You have to vote for me in the election.' You go into the box and vote for someone else, and then you come back out again and say, 'I voted for you,' and he can't prove otherwise. This is the same thing. Any one individual person's answer is unreliable, so it can't be used against them. But you, as the collector of all the answers, are still getting approximately accurate answers. So, to show this at work, a brief demo. At the top is the actual age distribution of our audience. This is what we'd like to know. We're the company, and we'd like to know the details in that graph: that most of our people are between 25 and 34, a lot fewer under 18, a lot fewer over 60. We do not know this, and we would like to know it. So if we ask people what their ages are, then we get exactly that graph, no problem. But then I know exactly how old all the people are, and so now I know something about them. They're like, 'Why should I tell you?', and people are concerned. So instead, if I can find my mouse pointer... well, this is way harder than it looks. Let's imagine we 
have 20% of people lie. So we ask the question, 'How old are you?', and then 80% of people tell the truth and 20% of them lie about how old they are. This bottom graph is the data we actually get back from our survey: the green bits of the bars are the people who told the truth, and the red people told lies. Now, we do not know whether someone told the truth or lied. All we see is the overall shape of that graph. And you can see that the overall shape of the graph is basically the same as the truth, but 20% of people lied. So we've collected their data, but we don't know for an individual person whether it's reliable or not. And you can increase the amount of people who lie. So if we go up to 50%, half of all the people in the survey lied about their answer, and the shape of the graph is still basically the same. As you lie more, you start to decrease the reliability of your answers, but we're still getting basically the same thing: still most people in our audience are 25 to 34, and it still drops off at the outer edges. We've still got the same shape. If you go all the way up, we asked people their ages and every single one of them lied to us: still basically okay. It's not perfect, and you can see as you increase the percentage of lying, the graph tends to flatten out. Peaks become shallower, troughs become taller, until eventually you've got just random data and you can't use anything from it. But you can see that if we take that back down again: completely accurate graph, and then it flattens out as we move up. So I'll come back to here again. I'm going to find the correct slide now. 
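As a rough illustration of the demo just described, and not the speaker's actual code, here is a minimal Python sketch. It assumes six made-up age buckets, a made-up audience that peaks at 25 to 34, and that a 'lie' means reporting a neighbouring bucket (the talk later says to change the answer 'slightly'). The names `perturb_bucket` and `simulate` are hypothetical.

```python
import random
from collections import Counter

# Age buckets loosely following the talk's example; the boundaries are an assumption.
BUCKETS = ["under 18", "18-24", "25-34", "35-44", "45-60", "over 60"]


def perturb_bucket(index, lie_probability, rng):
    """Return the true bucket index, or a neighbouring one with the given probability."""
    if rng.random() >= lie_probability:
        return index  # tell the truth
    # "Change your answer slightly": move one bucket up or down, staying in range.
    neighbours = [i for i in (index - 1, index + 1) if 0 <= i < len(BUCKETS)]
    return rng.choice(neighbours)


def simulate(true_weights, lie_probability, population=100_000, seed=0):
    """Draw a made-up audience, perturb every answer, return the reported share per bucket."""
    rng = random.Random(seed)
    true_answers = rng.choices(range(len(BUCKETS)), weights=true_weights, k=population)
    reported = Counter(perturb_bucket(i, lie_probability, rng) for i in true_answers)
    return [reported[i] / population for i in range(len(BUCKETS))]


if __name__ == "__main__":
    # Hypothetical audience, peaking at 25-34 as in the demo.
    weights = [5, 20, 35, 20, 15, 5]
    for p in (0.0, 0.2, 0.5):
        print(f"lie probability {p:.0%}")
        for name, share in zip(BUCKETS, simulate(weights, p)):
            print(f"  {name:>8} {'#' * round(share * 60)} {share:.1%}")
```

Whether a liar was a liar is never stored, so only the overall shape survives. For a yes/no question like the survey example, the lie probability has to be something other than one half, since at exactly one half the tally comes out 50/50 whatever the truth is. With a known lie probability p and an observed yes-rate λ, the true rate can be recovered as (λ − p) / (1 − 2p).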
I knew that was a mistake. Da-da-da-da-da, excuse me. Covert data collection... yes. So the point about that is, as you saw, you can essentially tune the lying percentage to find the best balance between protection of my users' personal data and getting the results we need, the data we need to make deductions. And exactly where you draw that line depends on you as a business. It depends on what you need from the data. You can run tests to say, 'Okay, we need to be able to draw these conclusions, therefore we need the data to be at least this accurate, and therefore we're going to have 30% of people lie to us.' The point about this is you don't actually ask people to lie. You collect their data accurately in your app or whatever, and then you just have the computer change it. I mean, this is technically trivial. It's literally one line of code. Right? You could sit here and do this for whatever app or service your company's building. You could do this now, in this talk, right? Just after you collect the data, but before you bang it into JSON and send it up to the server, pick a random number; with a 10% chance, change the answer slightly. And that's it. At that point, all the data on the wire is unreliable. Any individual person's data 
can't be used to pin them down on anything, and you're still getting the aggregate stats that you need. You can still make deductions, draw conclusions about your audience, about the demographics, and this works for all kinds of data that we collect. You don't know whether someone lied, and provably so. So you can actually demonstrate to people: 'Okay, we've got all this data, but it's not reliable, and we can prove that.' You don't store whether they lied. You're never told whether they lied. What's interesting here is that this sort of thing proves that you can do data science without being creepy about it. This is known technology. The methods exist. As I say, sociologists have been using the randomized response method for 50 years. This is not whizzy new stuff. The differential privacy stuff is quite whizzy and new, and there's quite a lot of complicated maths in there, and that's even better than this. But this is understandable and simple; you can implement it tomorrow. And what's important about this is, as a company, this is something you can trade on. You can lead the charge. This is competitive advantage for you, because your competitors can't or won't or don't do this. And I'm not for a moment suggesting that this should be the be-all and end-all of what you do. There are, as I mentioned earlier, quite a few privacy-focused services out there. But if you're building a privacy-focused messenger, the only people you get to talk to on it are other privacy headcases like yourself, like me. You can't talk to your actual friends with it. But just adding a bit of this stuff, being able to say, 'Yes, users, we protect your data': your competitors aren't doing that. They can't do it or they won't do it. You can talk about how you protect your users, and they can't do the same. That's competitive advantage. And honestly, maybe they'll start going, 'Oh, well, we can compete on that, and we'll do it too,' at which point, great. Everyone does it and we've made the world better. 
I'm cool with that too It's either competitive advantage for you or it's not competitive on because everyone's done it And so the world's cooler two thumbs up for that plan perfectly happy with that What we need to do is we need to come up with ways of Explaining this we need to help people understand that there are Ways to do this stuff you can never be lost again and listen to any music you want and video chat to people on the other Side of the world and you don't have to feel Uncomfortable about it because it's not okay that you're made to feel uncomfortable about it It's not okay that people feel weirded out by this stuff. It is possible for there to be alternatives Someone rooting around in your life is not a price that you have to pay And what we've been presented with is a kind of false dilemma as as users as people in the world We're told you can either opt out and cut off all of your internet superpowers Or you can give up all of your personal data to pay for them and those are the two choices You have to pick one of them as a company You tend to get the same false dilemma either you Inhale everyone's data with a vacuum cleaner in order that you can make assumptions about your audience or You don't do any of it your privacy focused and then your competitors who have way more information about their audience than you do Can make better business decisions than you can and again, it's a false dilemma. 
There is a way somewhere in between. And it's not even about the technology of how we do it. I've put together an explanation of how we do it technically, but that's not the important thing. The important thing is making people aware that the concept even exists, that there is something other than the binary this-one-or-this-one choice. And that's what we need to help people to understand: that this middle way exists, and that they should ask for it, that they should demand it, that they should expect it. And these new ideas, these alternatives, they're going to come from us. Right? People in this room, and rooms like it. Who's building the next big company? We are. So when you build it, talk about how we change the story. People are frightened, and they shouldn't have to be. When you're hacking on stuff or making companies or chatting about things, talk about how we change this, how we help people to understand. Lead the charge on this. And then one day the world changes. Everyone starts looking for this stuff as a matter of course. Everyone starts assuming that it will exist. A company that says, 'Oh, no, of course we collect all this stuff about you': they become the weird ones. The seesaw's tipped over, and then everything's loads better. Thank you very much. I have a bunch of time for questions, if people have any. So the question was, do I feel like the seesaw's started to tip already? Two different things. Yes, in terms of awareness. This is what I was talking about: people are increasingly aware that this stuff is a problem. Like I say, 'Facebook do weird stuff with your data' is a mainstream opinion. My parents ring me up and ask me questions about this. That's a known thing. What's not changing yet is awareness that there's an alternative. At the moment, it's almost worse. Before, you lived in a state of divine ignorance about it. Now everyone knows about it, but they don't know what can be done about it. 
So you feel helpless, which is worse, because people are not presented with an option. If someone non-technical comes to you and says, "Hey, I really don't like it that Facebook steal all my data and do weird stuff with it. What can I do about it?", our answer is almost always "Stop using Facebook." And they say, "But I want to talk to my friends," and then the answer is, "Convince all of your friends to use Signal." Not gonna happen. What we need is awareness that alternatives can and do exist, have people start building those alternatives, and then start telling people: here is a third way. It's not just opt out or opt in and that's all there is. So yeah, awareness: mainstream now. Every day that goes past, there's another news report of a data breach or something like that. You know, you get emails saying your password has been compromised in this and this and this and this. So people know about this. But honestly, the people I've spoken to about this (and, you know, I'm not going around doing surveys or anything, just conversations with people at bus stops, right?) are concerned in a kind of a light way. Don't get me wrong, not that many people are running around with their hair on fire about this stuff, but there is this kind of low-level background radiation of concern, which is why 50% of all people have avoided doing something simple online because they're concerned. And I'd like to fix that, right?
We're trying to build this brilliant new technological world. There's no point in building a brilliant new technological world that everyone's scared to use. It used to be that people didn't use it because it was all too hard, and then we learned about the importance of design, and that helped. But now people are not using this stuff, and it's not that they don't understand how to; they're frightened of what it knows. And that's what I'd like to change. Yes, question there. So the question was: at the moment people don't just collect data because they're evil. They're not twirling their moustaches; they're collecting it because it's worth money, and that money subsidizes a bunch of the products that we use and that we have collectively decided as a society that we're not prepared to pay for. So how do we resolve that? Part of the point about talking about things like randomized response is that my answer is not "stop collecting data". It's "collect data in such a way that it's still useful for most of the stuff that you want to do, but it's not personally compromising". Now, there is a line to be drawn there, and there's quite a gray area. When you say, you know, "we're collecting data from your smart TV, and the reason that data is valuable is because it individually targets you specifically", then I don't believe there is a way around that. If the thing that is monetizable about you is very specifically you, then yeah, I would like us to stop relying on that. I don't have a good answer.
I don't believe it's possible for there to be a good answer about how we avoid that. But I think most of the people doing data collection aren't doing it for that reason. They're doing it for, yes, advertising purposes, but it doesn't have to be as individually targeted and specific and detailed as it currently is. But at the moment, if you're gonna go to the effort of finding out something about a person, you might as well find out everything, because the alternative is not doing it at all, and that's all there is. And there is a whole bunch of middle ground which is basically unexplored. No one's even trying to say, "How can we collect just the data that we need?", because you might as well collect everything and then someone might pay you more for it. So you may find I do this talk again in ten years and I go, "You know what, check it out, turns out there is no middle ground. It's either you know everything about someone or you don't know anything, and that's it." I might be wrong here, hand on heart. But I would like people to explore the concept of there being something in the middle. Let's say a lot of people who are doing data collection at the moment are doing it for aggregate purposes, to find out about demographics. They're not doing it to sell you specifically to an advertiser, and they don't need to do it. For advertisers in general, the idea of collecting your data and using it to target you, but making it be a bit fuzzy, is not necessarily a problem. So maybe instead of knowing exactly who you are and exactly where you live and exactly where you've walked, maybe what we should do is make all those things a bit blurry. Or you make it so everyone's data is sort of right and sort of not right. And there are a bunch of ideas, ways of going about this, which I haven't thought of, which nobody's thought of, but that's because no one's trying. And I think if people do try, it'll help. So, not a perfect answer to your question, but I'm not sure there is a perfect answer. Question over there.
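The randomized response idea mentioned in this talk can be sketched in a few lines of Python. This is only an illustration of the classic two-coin form of the technique, not the speaker's own implementation: the function names, the fair-coin probabilities, and the 30% example population are all assumptions made for the sketch. Each person's reported answer is deniable, yet the aggregate rate can still be estimated.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the answer a respondent reports, which may or may not be true."""
    if rng.random() < 0.5:
        # First coin came up heads: answer honestly.
        return truth
    # First coin came up tails: ignore the truth and let a second coin answer.
    return rng.random() < 0.5


def estimate_true_rate(reports: list[bool]) -> float:
    """Recover the aggregate "yes" rate from the noisy reports.

    Expected observed rate = 0.5 * true_rate + 0.25, so invert that.
    """
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical population: about 30% would truthfully answer "yes".
    truths = [rng.random() < 0.3 for _ in range(100_000)]
    reports = [randomized_response(t, rng) for t in truths]
    print(f"estimated yes rate: {estimate_true_rate(reports):.3f}")
```

Any single reported "yes" may be the coin talking, which is the plausible deniability; the estimate only becomes accurate across many respondents.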
Ooh, that's a good question. That is a good question. Yeah. So the question was: when Netflix released some of their anonymized data, is doing the randomized response stuff going to anonymize it enough? And the answer is, you want to go talk to a real data scientist about that, because I'm honestly not sure. I would have thought it would. But the idea behind randomized response is plausible deniability. Basically, you may still be fairly uncomfortable with someone seeing your record, even if you can say "No, that's a lie", because it's still a little bit uncomfortable. The differential privacy stuff is better at this, because you've got all the aggregate data, but not only can you not tell what someone's answers were, you can't even tell if a specific person participated. It's brilliant. But it's hardcore math. So that may be something that people would want to look into. The reason I talked about randomized response is because you can implement it in the car on the way home. Does that answer your question? Cool. Any other questions? Oh. So, say that again? Why Torrance? Because it was Sunnydale High School in Buffy the Vampire Slayer. I needed a picture of an American high school to illustrate the thing, and I thought, Buffy had one of those, didn't it? So I googled for the pic. I was amazed to find out that Sunnydale High School was a real high school. It's cool. Yeah, it's called Torrance, I think. Excellent. There you go, a little bit of local color. Okay, anyone else? Oh, yes. Oh, yeah, sure, I'm happy to. Yes, keep an eye on Twitter and I'll post them somewhere. Yes, thank you for the reminder. Yes, so the question is: currently marketing companies are used to, and will demand, that they get all the data about your customers in order that they can market to them effectively. How do we move them to a world in which we go:
"You don't get that much. You get this much, because that's all we are able to collect, all we are as a company prepared to collect." How do we get them to do that? And the answer is, you go: "I'm paying the invoices. This is the data you get. We consider it more important to protect our users," or, "We consider it as important to protect our users as we do to monetize them." You know, I'm not suggesting we go protect them or monetize them; the whole point of this is that you can do both. But you go to your marketing team, or you go to your ad agency, and you say, "This is the information you get," and if you are slightly less successful in marketing because of that, then that's one of the things you figure into your bottom line. You say, "How much is it worth to us to protect our users?" And hopefully the answer will be "the amount we're prepared to lose on marketing because of it". It's possible that your answer will be that you're not prepared to pay to protect your users, at which point, I get that that's your business decision, but you are part of the problem and you are making people nervous. So that's the decision you need to make. But I also think, I suspect at least, that if this becomes the norm, if more and more companies start saying "this is all the data you get", people will develop excellent new marketing techniques to take advantage of that. But I don't know that; I'm not very good at marketing. There's a reason people don't ask me to do it. Question there. Yes, so the idea here is: not only are people nervous about sharing their data, but stuff that they should share, that would help things, which is genuinely not problematic to share, people don't want to share anyway, because they've soured on the whole concept of giving data up. And I completely agree.
It's a massive negative externality to the whole thing. Increasingly, you know, the idea that the apps on your phone are stealing things about you makes you nervous about all data collection and all of technology, which means the people who genuinely want to do good things and are not evil moustache-twirling villains get bitten by this as well. Because you say, "Hey, it'd be really important to us if you gave us this; this is why; this is why it's not a problem," and they just instinctively don't trust you. Same sort of thing as, you know, every time a politician does something that you don't like, you trust all politicians a bit less. And it works the same over technology, where the whole industry gets tarred with the same brush, and it's a problem for people who really are doing the right thing. Yes, yes. Yeah, so the question is, it boils down to who do you trust. Well, almost, yes. Yes: if I tell you something about me, and you have a financial incentive to do evil things with that data, do I trust you to not do it anyway? I mean, capitalism, check it out. But... No, exactly, precisely, yes. And that's exactly the point. You know, one of the most illustrative things I ever saw is that, I believe, the way the US Department of Defense define trust is: someone you trust is someone who can break your security. That's what trust means. For all the nice feelings you think trusting someone involves, it means they can screw you. So you're hoping they're not going to, and all you can do is hope, because they are in a position to do so. And you kind of think, huh, yeah. Concerning. Question at the back, I think. Are you okay? Yes, so, doing things like third-party audits on the data.
There's a whole bunch of data cleanliness standards which ought to exist and don't: the idea that once you've collected this stuff, maybe you should kill it all after a year, or after six months, or after ten years, or something. I mean, we think this is a problem now; imagine what a problem it's gonna be in 20 years, especially for people who are currently children, who've grown up entirely within this world. A reasonable proportion of my life as a 21 year old... no, a reasonable proportion of my 40-odd years happened before a lot of this happened, so I don't have records about the first half or so of my life. That's not the case for my daughter, for example. But there's a reasonable amount of discussion going on about that. It's all part of the same story, certainly. And again, I think if people start valuing people's privacy, if people start seeing services arise which are responding to their already well-founded nervousness, then that sort of thing happens almost automatically. Once companies start thinking this is a competitive advantage, they'll find all the ways to find that competitive advantage, and that's great. The point is, I don't want to say "this is the way to do it". I want to say, "This is what you should be doing; you work out how to do it. You're better than me. That's why you're running a billion-dollar company and I'm not." Question there, and then, yeah, we're getting close to the end. Right, so the question is: it's possible currently to monetize your users by collecting data about them. You are under a fiduciary duty as an American corporation to screw everyone for every penny you can, or you can be sued. What can they do about it? My answer to that is smash the state, right?
Yes, if you are given a massively perverse incentive by the legal and financial system to screw people, you will screw people. I don't believe there is a sensible way of fixing that without just finding whoever wrote that rule and hitting them with a mallet until they cross it out. Jeremy, what's the name of the corporation? B Corps. Thank you. Yes, we talked about this on Bad Voltage a bit, and I said there ought to be a thing where that's not a rule, and you're allowed to write into your company documents, "Big corporation: we're gonna do nice things," and Jeremy went, "Yeah, it exists. It's called a B Corp." So now my official policy on this is every corporation in America ought to be a B Corp. Honestly, if the law says you have to screw people and you want people not to be screwed, then the answer is to change the law. I don't know a better way of fixing that. Yes, the EU is more sensible about this sort of thing. Well, honestly, and I'm pretty sure I'm standing in America while saying this, everywhere is more sensible than America about this particular point. And I'd like to see that fixed. You know, call your congressman, call your congressperson. I don't know a better way of fixing that. Some people might get sued in the interim. You would hope that if you went to court and said, "Okay, I'm being sued by the shareholders because I'm not prepared to hammer my users into the ground," you might get a sympathetic judge. I honestly don't know. All right, flip a coin and then tell a lie. Hopefully that wouldn't happen, but yeah, at that point, if you're built into a structure where you are told by law that you have to hammer your users, then you're gonna hammer your users. The way to fix that is to change the structure. Okay. Then I think we're done. I've been Stuart Langridge. Thank you very much.
I get to eat my chocolate bar now.
Check, check, microphone two. Checking microphone two, ballroom G. Okay, checking the lab. Check, check, microphone one. This is the handheld. Test, test. There we go.
Hello. I'll give it one more minute before I start. I'd like to thank everyone for coming. Even if you're just late, I'm happy you could show up and attend my talk. And let's start. Okay. Thanks for coming. This talk is Building an Application Market for the Linux Platform. I am Sriram Ramkrishna. A little bit about who I am: I've been a contributor to free software for over 20 years, primarily working with the GNOME desktop. I've spent a lot of time in user space, and this is a particular passion of mine. I really feel excited about building something like an application market. I've been thinking about all of this for the last five years. And now that we have the technologies to make this happen, I've been evangelizing that, yes, there can be an application market. Also, for work, I am the director of marketing for Purism. And if you want to find out more, talk to me later. All right. So let's first just make some definitions. What is a market? I think it's probably good to know. So for me, a market is two things. One, it's an ecosystem that delivers value to its customers and users. Two, it's an ecosystem that creates financial incentives for its creators and investors. In my opinion, in order to have a vibrant market, we need to have these two. Now, we've done pretty well with the first one. I think there's no doubt that we derive value from the applications we have today.
But creating the financial incentives for people to want to create more apps... meaning that we got to a stage, but if we want to grow, we're going to need to have financial incentives for creators and investors. So what's the situation now? Today, OSVs (operating system vendors) supply applications through packaging, like RPM or Debian packages or various other means. You can get applications through GitHub; that's another method. And software vendors distribute tarballs from their websites. If anybody's done that, getting your software through a tarball is probably one of the worst user experiences. But it's good to understand how we got here. So long ago, in the 90s, before there were distros, there was just raw code. Everybody pulled code, and it was up to you to figure it out and put together an operating system. So then a middleman showed up: a distro. A distro is defined as an entity that will package everything together and make it a singular operating system for you. They developed the packaging software and maintained the operating system, all of that. So naturally, as they started working on the operating system, well, it's an ecosystem, and apps are naturally the next thing, right? Because they control the toolchain. So if they control the toolchain, then of course it's easier for them to bring in these other things, because it's harder for others to figure out what toolchain they use, what GCC versions they use, what libraries they use; all these things are there. But the packagers in these distros understood what was in that ecosystem and were able to naturally package applications. So the distro ecosystem: once we understood how to build operating systems and how easy it is, the number of operating systems started to grow. So I went and looked at DistroWatch a couple of days ago, and there were 303 distributions, and about 100 of them are active. That's a lot of...
And if you look at them there, you have absurd kinds of distributions like Hannah Montana Linux or various other kinds of things. Others are single-purpose one-shots, I mean, whatever it is, right? It's the nature of free software that when the source is out there, people are going to come up with whatever idea it is and build an operating system on it. But what's wrong with this picture? So let's start from a developer perspective. The apps that are being distributed to you are not necessarily the same application that comes out of the developer, right? How the developer compiles their code, what toolchain they use, is different than what's in the distribution. So that also means that these applications exhibit different bugs, different behaviors. Sometimes you can't even tell. Like, a developer gets a bug report: they only see this behavior on this distro, but not on that distro. That's because of toolchains. Maybe GCC uses a different set of parameters, all these kinds of things. But it's still the same app. And it gets even more complex because each one of these distributions has its own bug database, and not all of those bugs get upstreamed. So that's sort of a big mess. And then there's the idea that packagers believe that they are the last line of defense for the users. A lot of times, if you talk to them, they believe that developers are not to be trusted, which I don't agree with. But they believe that they understand the user better and they know how to maintain the app. And there are some good things from that, because they can keep the thing secure: if there is a security problem with an application, they can update it, all those kinds of things. So there's definitely a strong culture of that middleman mentality. So now let's take a look from a business perspective. Each transaction that involves an install.
So if you're on a shell or on a store, or not a store but something like GNOME Software, and you do apt-get install or yum install: each one of those is a lost opportunity, a lost engagement for your project, because you're completely bypassed. So that's one issue. Secondly, 303 distros, 100 active, and each one of them has a set of users. So how do you reach every one of them? You end up having to pick one or two major distros, but you didn't capture all of them, right? So you can't present the whole GNU/Linux thing as one market, because it's all fragmented. All fragmented over 303 distros, and that's not exactly the best thing. So if you are a new-entry business and you're looking to put time and energy into an application, you start doing a lot of head scratching. How do I get all this? Oh, so you want me to build an RPM and a Debian package, a deb? So it's not streamlined at all, because you can't please everybody. You can't target one market because they're all split across. And then there's the get-out-of-jail-free option, which is just basically "here's a tarball" or "just get it from your distro". And that's just terrible. Giving a tarball is just a band-aid. So the key takeaways are: every time software is installed through a distro, you lose an opportunity to have a conversation. The app you actually get is not the same app as the developer's. And if you're a serious developer, from a business perspective, it's about your branding: if that app is behaving differently because somebody else modified the code, then that's also a problem. Because is it really the same app? Is it really from you, if it's modified by a packager? I don't believe so. So it's time to change the paradigm. And so what I described before was a distro-centric model, and I would like to move to an app-centric model.
And I've put a lot of work in the last couple of years precisely to do this thing and to bring people together to understand that what we had before was good and it was the right thing to do. But to have success from here on in, we need to do more. We need to change things, because every model has a limit, a term limit, an expiration date, after which you need to find something else in order to fund growth. And I think a lot of this model is going to be fueled by ubiquitous apps. And what is a ubiquitous app? That's essentially an app that will run anywhere. Regardless of the distro, the app acts exactly how the developer intended it to. So today, there are two major technologies available, and we have some of their developers here, actually. One is Snapcraft by Canonical, which was originally designed for IoT, but of course now they've expanded their scope to desktops and various other things. And then Flatpak, which I will boldly say is by my own developers; this came out of the same frustration, that applications don't behave the way they should, and that we can provide a much better experience. So what's the difference between the two? The first one is that Flatpak runs as a container that duplicates the developer environment. It's not quite true that it duplicates the developer environment; essentially, there is this idea of a runtime. And there are runtimes that are specific to a particular desktop, so there's a GNOME runtime and then there is a KDE runtime and so forth. And the apps run within those things, because they have all the toolchains, they have all the libraries, everything to actually create a desktop app. The other thing is it runs in user space, versus something like Docker, which runs as root, if you want to consider container technologies. Snapcraft, I know, also runs as root. And it uses standard Linux security utilities like AppArmor, cgroups, that kind of thing.
So they're completely different technologies, but they achieve the same thing. The other great thing about the app-centric paradigm is the developer controls the distribution of their apps. So the flow I envision is: you pull up an IDE, you build an app, you target a runtime. And I'm going to use Flatpak primarily, because that's the technology I'm used to, versus snaps. But as the developer, you build it there and you can put it on your website. You can put it on an app store, and it's just going to work with just a simple download and run. Now the great thing about that is once you do that, you can now present the entire platform as a single market, because everything just works without having to puzzle out what distro to support, what packaging technology to support; it doesn't matter. And I think that's a really powerful thing. Because when you do that, the key thing here is we can measure the market, whether that's through the app stores or whether that's through your web server. You have data, so that you can decide how much of an investment this kind of thing is worth. So if you have analytics, you can truly decide, well, how powerful is this market? So I took some metrics from the Flathub website, and I kind of looked at, well, how popular is GIMP? And this is over the month of December, how many downloads. I didn't have any cumulative figures, but you can see how popular it is on a day-to-day basis, how many downloads you get a day. Now this is something you could never get before on a distro. Because, as I said, every time you do an apt-get or a yum or something, that's a lost conversation. You don't know anything about how well your app is doing with people, right? How popular is it? And I think that's one of the more powerful things about being able to present as a single market. So once you know you can measure demand, how many downloads, then you can justify more investment.
Because once you know that there is an ever-increasing amount of demand, well, it seems fairly obvious that you want to meet that demand, just like in any market. So now the next step is, okay, how do we build a relationship with the people who are doing the downloads, right? One thing I would like to see: at least on the Flathub website, we don't see the number of downloads. I had to pull this out manually. But on the page it could say, okay, we have this many downloads, and then people could see how popular these applications are versus others, especially if there are choices between many applications of the same genre. Like I said, before we had talked about what is the mark of a good market. And one, of course, is meeting the needs. But the other one is financial compensation. If you can't have financial compensation, then your market is not really going to work. It's going to fail. And that's the one thing that we're missing. Of the app stores I've looked at, only one of them really provides a way for compensation. And I think this is one of the missing parts: being able to compensate developers. Today, the cultural thing is users expect applications to be free. And being a free software developer is an enormous amount of work. A lot of time and effort is spent maintaining source code, going through bugs, dealing with feature requests. There are so many things that go in there, like maintaining the community. All of that is done without any compensation. So having the ability to ask for compensation, and it doesn't have to be a huge amount, is something that's missing. Because once you are able to get compensation, then others will join in and you can grow this market. So as I said, there are a couple of stores out there. One is the elementary app store. And the process for that is you develop on GitHub, you submit it to the store, and then they have this idea of pay what you want.
So the user decides how much they want to pay. And because elementary is the broker, they get a small percentage of that and the rest goes to the developer. The other thing that elementary does is that the apps are curated. So that means they can assure you that there's no spyware, there's nothing in it that you would be alarmed by, rather than the chaos of a full-blown market. If you're in an Android store, you don't know what's safe and what's not. So this is the idea of a curated store. So the next one is Flathub, for which we have several developers here, if you have any questions. It's based on Flatpak. It does have a button to donate. It has a help button for, you know, where to go to ask for help if there's any kind of problem with it, and it has some amount of metrics. Then finally, we have the Snap Store. On that page, you get a link to the website and a contact. And I really love the metrics on here, because it shows on the map where all the people have downloaded this application from. So you've got a picture of the map and you can see how many downloads came from each part of the world; I found that to be really, really interesting. If there was ever any blog post that talked about the application, it would show up here as well. So I'm going to get to the end of my talk. My idea here is that we're not quite there with compensation. And that's one of the things I feel is missing, at least from Flathub: the ability to compensate developers. I think that is the one key thing that we're missing. But I think in general, we're heading in the right direction. One of the things I'm doing personally is I'm trying to develop a conference around the idea of an application ecosystem, which means bringing the stakeholders together, working together with distros, because there are still a lot of questions on where the line is between applications and the operating system. What should be part of the operating system?
What should be an application that you can just download? So there are a lot of technical questions on deciding what that line is. So, of course, the ability to have metrics. Neither of the two app stores I had talked about had metrics, except for elementary. So I think that's something we need to work on as well. So that's pretty much my talk. You can follow me on Twitter or Mastodon, and there's my email if you have any questions. So thank you very much. Competition is good, but also you can figure out what works and what doesn't. So just having one is good. Competition is good. Yes, Matt. Right, right. I think I saw a link to that. Or I saw one, but it wasn't quite giving me what I want. But when it's on a page, I mean, both a developer and a project want to see what it is right next to it. And they can gauge how enthusiastic people are about that application, things like that. I don't know if you want comments, because that could always go wrong. But certainly a place to link to issues, if there are things like that. But yeah, yeah. Right, that's a good question. Right, right. I have not looked at the AppData spec in a while, so I don't remember if there was a profile set. But I think that's something that's definitely worth looking at, to see if there's a way to tell, like, yes, this works on mobile. Right, now that we have a platform like that, absolutely. I think that's something we should definitely look at. Yeah, that's a great point. Being able to filter those things as well, especially with the idea of convergence, that an app could be designed for both mobile platforms and desktop platforms. Any other questions? Where does F-Droid fit? I'm not sure where it fits at the moment, at least from a mobile perspective. I don't know enough about F-Droid to say. Frankly, I can't answer, because I've never even been exposed to F-Droid to even think about it. But if it's Linux apps, then it's sort of a solution already in itself.
It's its own ecosystem, whereas I'm looking at desktop applications as a whole. Right, because the problem is desktop applications, as opposed to mobile applications, which naturally find their way into an app store, because that's the default experience. So that would be kind of my first shot at an answer. You're saying, if one of the components of an application, like a library, has a security problem, how does it get fixed? I think, Matthias, you can probably answer better than I can, but essentially it's the runtime that gets fixed. So you have a runtime and you have extensions. So for an app developer, I think they maintain the extensions part, correct? Right, it falls on the application developer to fix them, but not the runtime itself. Yeah, that's community supported. Right, so it's just like a distro, actually. Like, the runtime you could kind of treat like a distro, and in that space, the security and everything else is taken care of for you. But if you're adding more prerequisites for your app, then it falls upon you to fix those, essentially. So where that fits is, if you decide to add more things to that runtime, then it's on you to maintain the security of them. I can't remember, I think in 3.3.2 that gets automatic. So once you decide you want an app, the runtime gets updated for you, and so does the app. So it's almost like a subscription: you're subscribing to the app and it gets its updates automatically. So, okay, any other questions? Great, thank you.
Are you guys actually here for this? Oh, cool. Wow, there's three. I thought there was going to be two. So this is great. Thanks. Okay. All right, I'm going to start like 30 seconds early, because the sooner I finish, the sooner we can go home. Right? So I'm not going to drag this out. Anyways. Yeah, thanks for coming. I honestly thought there would just be two people. So there are many more than that. Thank you.
So my name is Renee Lung, and today I'm going to be talking about decomposing the PagerDuty monolith in production. And not just that, but also how doing that let us rethink our data models. And we did all of this without any downtime, or at least without any more downtime than what kind of happened normally. So as I mentioned, my name is Renee. I'm a software engineer at PagerDuty. I've been there for about three years. Has everyone here heard of or used PagerDuty? Yeah? Sweet. Okay, there's just one person that didn't put up their hand. So quickly I'm going to go over PD. We integrate with monitoring tools like Datadog and Sumo Logic. And these create incidents. PagerDuty then contacts someone who's scheduled to be on call, through communication channels that each user specifies. And if that person doesn't answer, then we page the next person, and so on and so forth. And so if it's the middle of the night, you get notified when your stuff breaks. And maybe you're tired, but you know, things are good. So yeah, let's just jump right into it. So where did we start? Well, PagerDuty started in some nasty ass basement in Kitchener-Waterloo, which is like the computer science mecca of Canada. And literally last month, PagerDuty celebrated its 10th birthday. So not surprisingly, we still have some pretty old code kicking around. We've always been largely a Rails application. Over time, we developed additional services, but it wasn't really what you'd call a microservice architecture per se. We added a few more distributed data stores. We split off some services to handle our events API specifically and a couple of services to handle talking to telecom providers to send you those really irritating text messages. But everything was still overwhelmingly centralized within the Rails monolith.
And it was doing a ton of stuff, from handling all the front-end requests from the web app to handling REST API requests, actually composing the text of all those emails and those text messages, and authentication and God knows what else. So if it's not obvious, why would we want to break down our monolith? Well, as you would expect, over time this started to have more and more obvious impacts on our performance. So for example, looking up all those incidents that we created on your account could take a really long time, depending on the configuration and the extra data that got sent in with those incidents. So maybe you're asking for all of the resolved incidents that have ever been assigned to me. Well, if you're on a large account or you've been a long-time user, that's going to be a lot. Or all of the triggered incidents assigned to me, and I'm on teams X, Y and Z. So as you're adding those parameters, that also makes things complicated. So asking our main database to do that kind of sorting was putting a lot of memory pressure on it, and MySQL was very not happy. And at the same time, our user base is growing, the number of customers is growing, like accounts, and the speed at which everything was growing was also growing. So basically it looked like a runaway trainwreck, which is a really weird metaphor. So yeah, and also sometimes because of locking and all of these complex queries, there would be some really, really awful wait times for people on the client side. So as I mentioned, slow and killed queries would also lead to request timeouts, and these timeout errors kind of propagated throughout our few systems outside of the Rails monolith. And so, one, it's kind of difficult to debug, because some service is waiting for a response and never gets it. Sometimes it logs, sometimes it doesn't.
And some customers felt this pain more than others, specifically accounts with a large number of users and also teams, which I will get into later. So you're probably saying, oh, why don't you just add another index or whatever, but after a while you see diminishing returns from doing that, because every time you write to a table, you've got to write to the index. So yeah. From more of a developer point of view, maintaining the code was super painful. Some parts were changing more quickly than others, and often different parts of the code would be a lot older than others. And it got to the point where you didn't want to touch something, like one particular module, because all of the people that understood the magic incantations that kind of made it work had already exercised their stock options and went to a clothing subscription service or something. Also, we would see our packages and dependencies go really out of date, because it's either hard to know what you can get rid of or updates are risky because, you know, everything's so old. Maybe if I update this one tiny package, everything will break and the world will end. So yeah, adding features became really painful because, you know, we were a growing organization and the number of engineers is growing. The number of engineering teams is growing. So all of us working on this one giant Rails repo was kind of like having six different contractors in one house, and even though they're all working on different things, they're all getting in each other's way. So there we are. We're in kind of a crappy place. So what should we do about it? And as you can imagine, breaking down a monolith takes time and discipline. So what my team did, we distributed a so-called manifesto that made it clear to other teams what we were trying to do and why we were doing it.
So: breaking down this monolith, because it simply wasn't going to be sustainable to keep throwing stuff on it and, you know, duct taping pieces and new features to it, because it was literally just crumbling under its own weight. And on kind of the process side, we also had to strike a balance between product needs and engineering needs. So, you know, you have your product managers that are still like, we need features because we need to sell stuff. And then you have the engineering side where people are like, but we need to do all these other things because our databases are on fire. So, yeah, I mean it was not an easy journey and there were many late nights, but it was pretty fun. So I'm going to go over three of our projects, starting with the Incidents Dashboard. So the Incidents Dashboard is where most users see a list of their incidents. This is my test account. And among a bunch of other things, users can filter and sort by urgency. They can filter and sort by priority. They can filter and sort by status. And so, as you can imagine, we start combining all these things, like triggered with a high priority but a low urgency that's assigned to so and so. That's when you start to pile up those joins for the database. So our main MySQL database was having a lot of trouble with these queries. So we were thinking, why don't we just do this work somewhere else, with a tool that does this kind of thing really easily? Like, how could we easily search through a lot of denormalized data? And slides aren't keeping up, there we go. And because of the different kinds of searches we would be performing on these incidents, and also because there had been other teams, you know, that had experience rolling out an Elasticsearch cluster, we decided to go with ES.
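To make the "denormalized data" idea concrete, here is a minimal sketch and not PagerDuty's actual schema: the table shapes, field names, and the `denormalize` and `matches` helpers are all hypothetical. It flattens an incident and its related rows into one document, so that a combined filter becomes a single pass over documents instead of a multi-table join.

```python
# Hypothetical normalized rows, standing in for separate MySQL tables.
INCIDENTS = [
    {"id": 1, "status": "triggered", "urgency": "low", "priority": "P1"},
    {"id": 2, "status": "resolved", "urgency": "high", "priority": "P3"},
]
ASSIGNMENTS = [
    {"incident_id": 1, "user_id": 10},
    {"incident_id": 2, "user_id": 11},
]
TEAM_MEMBERS = [
    {"user_id": 10, "team": "X"},
    {"user_id": 10, "team": "Y"},
    {"user_id": 11, "team": "Z"},
]


def denormalize(incident):
    """Flatten one incident plus its assignees and their teams into one document."""
    assignees = sorted(
        a["user_id"] for a in ASSIGNMENTS if a["incident_id"] == incident["id"]
    )
    teams = sorted({m["team"] for m in TEAM_MEMBERS if m["user_id"] in assignees})
    return {**incident, "assignee_ids": assignees, "teams": teams}


def matches(doc, filters):
    """True if the document satisfies every filter; list fields match on membership."""
    for key, wanted in filters.items():
        value = doc[key]
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


# The join work happens once, at indexing time, rather than on every query.
DOCS = [denormalize(incident) for incident in INCIDENTS]
hits = [
    d["id"]
    for d in DOCS
    if matches(d, {"status": "triggered", "urgency": "low", "teams": "X"})
]
```

A real search engine would index these documents rather than scan them, but the shape of the data is the point: the cost of the joins is paid when the document is written, not each time someone filters the dashboard.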
So basically, let's say we had a GET request for that incidents dashboard. The request would come into our Rails application and it would get duplicated and sent, with those search params, to the Elasticsearch cluster and also to the MySQL database. Both of the databases would do their filtering and sorting stuff, the Elasticsearch one faster than MySQL. And Elasticsearch would return a set of IDs and MySQL would return the actual records. And this is where we would compare and log what either service was sending back, so that we could measure our progress and make sure that things that we expected to happen were actually happening. And then, because we were still testing things out, we'd use the MySQL results to populate the dashboard. So there were no performance improvements yet, and this was totally invisible to users, because we were taking a very incremental approach. Like, this is basically our bread and butter, we can't have it breaking or lying or just disappearing. So after we were satisfied that Elasticsearch was consistently returning the right stuff, we would send those search parameters just to Elasticsearch. It would filter and sort and return that set of IDs. And then Rails would use that set of IDs to fetch the records from MySQL, which is much less expensive than joining six different tables and then filtering and sorting. And you got your 200. Okay, I had a lot of fun with the Keynote anime.
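The two rollout phases described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Rails code: `ShadowReader`, `mysql_search`, `es_search`, and `mysql_fetch` are hypothetical names, and the in-memory list stands in for both data stores.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ShadowReader:
    """Dual-read rollout sketch; all three callables are hypothetical stand-ins."""
    primary: Callable       # params -> list of records (the MySQL search path)
    shadow: Callable        # params -> list of IDs (the Elasticsearch path)
    fetch_by_ids: Callable  # ids -> list of records (cheap primary-key lookup)
    mismatches: list = field(default_factory=list)

    def read_compare(self, params):
        """Phase 1: query both, log any difference, serve the primary results."""
        records = self.primary(params)
        try:
            shadow_ids = self.shadow(params)
        except Exception as exc:  # the shadow path must never break the request
            self.mismatches.append((params, f"shadow error: {exc}"))
            return records
        primary_ids = [r["id"] for r in records]
        if primary_ids != shadow_ids:
            self.mismatches.append((params, primary_ids, shadow_ids))
        return records

    def read_shadow_first(self, params):
        """Phase 2: the search engine picks and orders IDs; the database only fetches rows."""
        ids = self.shadow(params)
        by_id = {r["id"]: r for r in self.fetch_by_ids(ids)}
        return [by_id[i] for i in ids if i in by_id]


# In-memory stand-ins for the two data stores (illustrative data only).
INCIDENTS = [
    {"id": 1, "status": "triggered", "urgency": "high"},
    {"id": 2, "status": "resolved", "urgency": "low"},
    {"id": 3, "status": "triggered", "urgency": "low"},
]


def mysql_search(params):
    return [r for r in INCIDENTS if all(r[k] == v for k, v in params.items())]


def es_search(params):
    return [r["id"] for r in mysql_search(params)]


def mysql_fetch(ids):
    wanted = set(ids)
    return [r for r in INCIDENTS if r["id"] in wanted]


reader = ShadowReader(mysql_search, es_search, mysql_fetch)
```

In phase 1 a caller would use `reader.read_compare(params)` and watch `reader.mismatches`; once that list stays empty for long enough, switching the call to `reader.read_shadow_first(params)` moves the filtering and sorting off the primary database while it still serves the rows themselves.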