Good morning, everyone. So I'm very happy to be here. I'm really happy to see another Python conference growing up in the UK. We've got a 10-year history with Python events growing up. How many of you have been to PyCon UKs over the years? So that's more than half the room. OK, that's great. So I'm one of the founders of PyData London. I run that in London. We've had five annual conferences. And I love to attend things like EuroPython and other PyData events and EuroSciPy as they move around Europe. So this morning, I'll be talking on citizen science. So I'm a data scientist. I self-identify as a data scientist. I've been doing that for a long time. Can I just check with the audio? It sounds like there's an echo to me, but does it sound OK to you? Sounds OK? And Andrew, you're happy? OK, so everyone's happy, apart from me. Right, fine. There's a tiny echo. I'll just ignore that. Brilliant. And can everyone hear me still? No, there's an echo. Oh, god. Some weird loopy thing going on. That's fine. OK, so I'm a data scientist. I've been doing what is known as data science from back before data science existed. So thinking of timing, I'll just set my timer going. So yes, I've been doing data science for a very long time, about 15 years. The term data science that we know now has been around for five or six years. So I used to call myself an AI engineer in industry, working on academic problems. And it was a long convoluted title that nobody really understood. But now we have "data scientist", which tells you I'm the kind of person who works on scientific applications of ideas in a kind of engineering world, melding the two sides together. So I run my own consultancy, as Alex mentioned. I coach teams. So I help teams as an interim senior. I do training on site and off site. So I really like teaching data science. I've been teaching at EuroPythons and EuroSciPys over the years, and PyCon US over in the States. And I create intellectual property working with companies. 
And the kind of companies I work with, they're larger now. So Mitsubishi Finance Bank, Channel 4; QBE Insurance is my current client. I've worked with lots of small startups over the years. So I've been doing this, as I mentioned, 15 years. So I've got a pretty good view on things that do work and things that don't work. If you're interested in finding out whether a problem you might have is likely to work out or not, feel free to come and talk to me. I'm really opinionated about this; over the years I've had so much visibility of things where everyone says, that's going to work, it's going to be a great idea. And it's a rubbish idea. And the data doesn't support it. And it's a complete waste of time. So I'm very opinionated about this. If you want to de-risk something that you've got in mind, come and grab me afterwards. I'm happy to give you my thoughts. How many of you come along to the PyData London events? Some of you, about a third? Brilliant. OK, the other two thirds of you: if you're interested in this talk, you'd be interested in the meetup that we run every month. And I would welcome you to come along. I'm one of the founders. We've been running for five years, as I mentioned. The conference has gone from 200 people in the first year to a 500-person conference we ran about seven weeks ago by Tower Bridge. And the monthly meetup is 200 people every month. We run it in a hedge fund, AHL, just down the river. Free beer, free pizza in the evening. We go to the pub afterwards. So there's a couple of great talks and then lots of conversation in the pub afterwards. So if you like talking about data science at all, please come along to the meetup. It's over-subscribed, so we have a lottery system: you put your name on our list and then some subset get randomly chosen. But please do come along. So I consult at ModelInsight. And I wrote a book a few years ago on high-performance Python. And that was a slightly unconventional route to becoming known for a topic. 
I wrote the book so I could stop talking about it. But of course, once you've published a book (and I'd been teaching for years before, and writing lots of blog articles), I wrote the book, decided that was it, I'm going to wrap up that chapter of my life and get further into the data science world. And then of course, I became a bit more visible around the high-performance computing side. So I like talking about that. And if you're interested in that side, you're very welcome to come and talk to me. So I've got a couple of goals today. So I'm going to tell you a set of five short stories around citizen science. And I quite like this end of the scale of scientific work. So rather than a huge application like the search for the Higgs boson, which gets loads of media attention and loads of support, I like talking about smaller projects, the ones that are typically unfunded or lightly funded, run by one or a few people with passion behind a project, trying to figure out how to, typically, improve the world. So I've got five stories here that cover those areas. And I've tried to introduce a set of lessons after each of them, so that if you're at all interested in this kind of thing, and I can inspire perhaps a couple of you to go and play with your own data, I'm going to give you a couple of lessons that will hopefully make that journey a bit easier for you all. At the end, I'm going to have a demo with JupyterLab, which I believe was spoken about yesterday. That demo depends upon some interaction from all of you. So at two points during the talk, I'm going to survey you with a question: an uninformed question, and then a better informed question. I need each of you, and that means all of you, to go onto the web. So check your badges for your Wi-Fi login, it's at the top. Go online. There's a Google Form. I'll put the link up. It's a Bitly link. It's a Google Form with one question. You type in one number. 
It's all you have to do, and you can share that device with people around you. You just hit submit another response, the form comes up again, and they put in their answer. There are no log-ins required. You don't need a Gmail account or anything. It's just a Google Form. It works in a private browsing session. Super, I've made it as simple as possible. But what I want is sixty or so submissions from the room, and then on the second round I'll have another sixty submissions, and then there'll be a graphing demo at the end, and we'll look to see how well you did at the question I set. And then, because I've got five speakers' worth of material here, I've got links to all of their talks and their videos in the appendix, and I'll put the slides up online afterwards. So first of all, has anyone been in a city with that kind of air quality? Yeah, okay. Which city was it? You've been to Skopje? Okay, all right. Who else? Which city was it? Beijing, okay. Say again? Moscow. So this is a talk by a chap called Gorjan. I met him just two weeks ago at PyData Amsterdam, when I was speaking out there. He's a local out there, but he comes from Macedonia. So he was talking about the smelly fog in Skopje. And so this is an aerial photo of the city. This is a winter-time photograph looking down onto the city. And this is something that I completely recognize, because I lived for six months in Santiago in Chile. In the summer: beautiful weather, lovely people, great conditions. In the winter, you can't see anything, because it goes like that. And I didn't realize that the geography was the same. It turns out in Skopje and in Santiago, you have the same thing. You have a large mountain range. You have a city in the middle. You have lots of industrial activity inside the city, and then a temperature inversion pressing down onto the city so the pollution can't get away. So the pollution just builds up and builds up and just sits there. 
And then the whole city sinks inside this sea of pollution until the next summer, when the weather system opens up and the pollution goes away again. So this is the winter in Skopje. So Gorjan tells the story that about four years ago, when he was 21, when he was a younger programmer, he was looking for projects to work on, and he found that the government had some data on their websites, and so he thought he'd have a go at graphing this data, and the data was about air quality. And he knew all about the air quality. He knew what everyone knew, and that is that every winter you get this, the smelly fog, and then every summer it goes away again, and you don't worry about it, because whenever it rains the streets run brown with mud, and then the air is clear, and then a few days later you can't see any more, but that's normal, right? That's everyday activity for years and years. That's just the way the city works. That's the way the whole world works. And it turns out, of course, it isn't. This is just incredibly bad pollution, and a lack of knowledge about that. So Gorjan takes this data and he works to try to visualize it, to improve his programming skills, and he's confused by the fact that he seems to have these errors in the data. From what he can see, he's got four times more pollution than what he can read about Beijing, and 20 times the EU's maximum daily pollution limits. But as far as he's concerned, he's got an error. I mean, the data must be wrong, or the code must be wrong. His understanding must be wrong. Something must be wrong. He digs further into it and then realizes: no, actually, this is true. The air the city's inhabitants are breathing daily breaches EU regulations consistently by a factor of 20. So not 2% over, but 20 times those limits. That's a crazy amount of pollution. That's properly lethal pollution. So he tells this story about how he took this data and then visualized it and then built an app around it. 
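Gorjan's first step, parsing the government dump and comparing it against the EU limits, can be sketched in a few lines. The station names, readings and field names below are invented for illustration (the real feed and its schema aren't shown in the talk); the 50 ug/m3 figure is the EU's daily-mean PM10 limit.

```python
import json
import statistics

EU_PM10_DAILY_LIMIT = 50  # micrograms per cubic metre, EU daily-mean PM10 limit

# Stand-in for the undocumented government JSON dump (invented values)
raw = """[
  {"station": "Centar", "pm10": 980},
  {"station": "Karpos", "pm10": 1100},
  {"station": "Lisice", "pm10": 870}
]"""

readings = json.loads(raw)
for r in readings:
    factor = r["pm10"] / EU_PM10_DAILY_LIMIT
    print(f"{r['station']}: {r['pm10']} ug/m3, {factor:.0f}x the EU daily limit")

# Average exceedance across stations, around 20x with these sample numbers
mean_factor = statistics.mean(r["pm10"] for r in readings) / EU_PM10_DAILY_LIMIT
print(f"mean exceedance: {mean_factor:.0f}x")
```

Even this much, a loop and a division, is enough to turn an opaque feed into the "20 times the limit" headline that drove the whole story.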
This is the air-quality app he built, and he then worked with others to popularize this discussion around air quality. And then this discussion blossomed. So he takes this JSON dump. He draws his graphs. He writes some kind of web app with a mobile front end. It turns out he's pretty good at publicity. So within a month, he had a million people looking at this data set, and that's a huge explosion. I mean, this is not a large city, so to have a million people interacting with this data set and learning about it and talking about it, that's an amazing engagement. And there's an easy story to sell there, right? You can't see down the street because of the pollution. And it turns out we're breaching EU regs and we're all dying as a result. I don't know if you've ever read about the health data on this. But it gets to the point, that picture there, that's parliament, that's people holding up pictures printed out from the app in parliament. And as Gorjan tells the story, here you've got a story that unites everyone, because it's not about race or religion or sex or age or socio-political status. It's just about the air that every single person is breathing, young and old. Doesn't matter their background, their wealth. None of that matters. All that matters is you're all breathing the same air, and slowly everyone is dying at a faster rate than they should be. And this is awful. So this is a story that galvanizes the public. One of the stories he tells is that he's popularizing this data set. The media's picking it up; they've got millions of people interacting, protests being organized. He's got a nice group at this point. So there's an ecology group building up around the use of this data and other data. And then he's challenged, apparently, by the minister for ecology. The minister talks about how this is a big conspiracy, there's external funding influences trying to disrupt the government, the data is a lie, this data is a lie, the situation is not that bad. 
And so Gorjan publicly challenges back to the minister: look, minister for ecology, invite me in, we'll compare my data, which I'm visualizing from the government website, with the officially recorded data. If I'm wrong, I'll withdraw the app, just kill the app, and quietly go away. But if you're wrong, you resign. At which point, apparently, the challenge disappears, and the minister for ecology says no more about the subject. And I thought, this is kind of crazy, right? So the government's being challenged by the citizens, on a subject that the government won't talk about, based on public data. So why the heck did they publicize this data? Why did they release it? And it turns out Macedonia would like to be part of the EU. The EU provided air quality sensors. And one of the requirements was that the data has to be published. This is one of those instances of the EU doing something really, really sensible. So the data has to be published. The data goes out there in public form, in an undocumented JSON format. No one knows what to do with it. So tick boxes are ticked, but no story comes out. And it takes a person like Gorjan to take this data, visualize it, and do something with it. So then the app gets updated very frequently. And it got to the point that air quality became a subject of the government debate for the upcoming election. And then the previous government lost the election. The new government said, we're going to make great change, and then got in, in part, on a new air-quality policy. Apparently nothing changed after that from a government point of view. And this is clearly going to be a long, slow process. There was one really nice thing that came out of it. Using handheld devices, they went around recording air quality around industrial facilities. They found a highly polluting incinerator, and they got it fixed. It turns out, if you go back to 2001, the Guardian was reporting that that incinerator was one of ours. We wanted to get rid of it. 
We sold it to the Macedonians. It turns out it was going to be a better technology than what was available there already. So it was a step in the right direction. But it ran 24-7 rather than for limited hours. It didn't have filters on it. It was probably awful. It breached all the EU limits, which is why we sold it on. And so they could prove that. One of their concerns was that by using official government data or these handheld devices, they wouldn't get enough buy-in. So they've escalated this to working with the European Space Agency and the Copernicus Programme, and they're looking at satellite data to find more sources of pollution. So that's a really, really nice story. They're driving government-led change and educating the public about the world they live in. And it all starts with one person taking a data set that no one had seen and graphing it, just drawing some graphs. So if you want to take away one lesson from all this, and I've got this as the big lesson in my talk here: graph things. Draw graphs of some of your data and see what you can do with that. If you can tell a story, even better; that'll engage people. And even better, if you can recruit other people to your cause, you can make a bigger project and then push that message further afield. But as a bare minimum, take some data that no one has looked at and graph it. It's really easy to do. External public data or your own company data. OK, so here's another project. This is a personal one. I spent, or my wife and I spent, some time trying to diagnose her sneezes. So it turns out, I mean, everyone in this room will have sneezed at times. Most of you don't sneeze up to 30 times a day consistently. So she has some kind of bad reaction to something. Well, at least we thought it was a bad reaction to something. But we didn't know what it was. 
And I thought, hey, with the power of machine learning and data acquisition, we can probably find some environmental factors or something related to it. I mean, she had thoughts about things that made her sneeze. She uses antihistamines. She lives an otherwise perfectly normal life. She just sneezes a lot. Well, that's something there to study. And on the consistent use of antihistamines: some antihistamines have been linked with early-onset Alzheimer's. And I thought, well, she's not on those ones. But maybe taking a drug for the entirety of your life is not a good thing. We don't know what the long-term outcomes of these things are. If we can find a thing that influences the rate of sneezing, so whatever condition is happening in the body, maybe there's a nice story here for her to change her behavior, and to help other people modify their behavior as well. And hey, great data subject, right? My wife. She can't escape. I can study away. So that's one of the graphs that we gathered. That's a histogram showing the number of times that Emily would sneeze in a day. So the left-hand bar is the number of days when she sneezed zero times. And so it's a pretty big bar. Then the next bar is the number of days she would sneeze once, and then twice. And then the right-hand bar is 30 times. And you can see that the peak is several bars in, and that's around four sneezes per day. So on the majority of the days of the year she's sneezing four times a day. It's not the hugest number, but that's every day. And then it goes up to 5, 6, 7, 10, 20, 30, with or without taking an antihistamine. So, for background: she's a senior engineer, she's another techie, and at the time she was working on iOS development. So she wrote an iOS-based app to record this. Open-sourced, so that other people could log their own symptoms as well. There were some other symptom loggers out there, but not many. 
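The histogram described above takes only a few lines to build. The daily counts below are an invented sample (the real logs aren't reproduced here), chosen so the shape matches the description: a big bar at zero, a peak around four, and a long tail out to 30.

```python
import collections

# Invented sample of sneezes-per-day, shaped like the histogram in the talk
daily_sneezes = [0, 0, 0, 1, 1, 2, 3, 4, 4, 4, 4, 4, 4, 5, 6, 7, 10, 20, 30]

# Count how many days had each sneeze total and print a text histogram
hist = collections.Counter(daily_sneezes)
for count in sorted(hist):
    print(f"{count:>2} sneezes/day: {'#' * hist[count]} ({hist[count]} days)")

mode = hist.most_common(1)[0][0]
print(f"most common day: {mode} sneezes")  # the peak several bars in
```

A `Counter` plus a print loop is often all the "visualization" you need before reaching for matplotlib.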
And I thought, you know, we can contribute to the world if only we provide an open-sourced tool that lets people log their own symptoms and then analyze them, get an SQL dump out, and then put it into Pandas, say, or some other analysis tool, and do something with that data. So she wrote an app for dealing with any kind of allergy-based condition. So we had a nice icon set, an interface designed by a colleague, and so we had this really nice app where she could just tap very easily to say: I've sneezed. My eyes are running. I've interacted with the dog. All the things that might help connect or record some of these bad conditions. In the background, it had GPS traces. The freakiest thing ever is analyzing this data and knowing that my wife has flown off to Canada on a work trip, and then I can look at the GPS data and see exactly where she goes on her daily walks and where she is in the office. And I think, well, Google's got this data. Apple's got this data. But it turns out I've got this data too. And it's really weird looking into someone's private life and then seeing their minute-by-minute trace. That was a little bit of an eye-opener. So we've got GPS traces and all these button events recording what's going on. And you can infer things from this. If you know where somebody is, you know what kind of weather they're being influenced by. If you know that they get to a bit of London and they disappear, then they reappear elsewhere, and simultaneously their Oyster card says they used the Northern line, you can infer they're on the Northern line for 30 minutes. So I wrote processing tools to pull the Oyster data in and do things with that. So I did an awful lot of analysis on this and gave a couple of conference talks around it. You get some kind of nice personal data interactions coming out. 
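That tube-journey inference can be sketched quite simply: find holes in the GPS trace and look for an Oyster tap-in/tap-out pair inside them. This is my own minimal reconstruction of the idea, not Ian's actual processing tools; all timestamps, station names and the 15-minute threshold are invented for illustration.

```python
from datetime import datetime, timedelta

# Invented GPS fixes with a 35-minute hole in the middle of the trace
gps_fixes = [
    datetime(2017, 6, 1, 9, 0),
    datetime(2017, 6, 1, 9, 5),
    datetime(2017, 6, 1, 9, 40),
]
# Invented Oyster taps falling inside that hole
oyster_taps = [
    (datetime(2017, 6, 1, 9, 7), "tap in, Angel"),
    (datetime(2017, 6, 1, 9, 37), "tap out, London Bridge"),
]

def infer_tube_journeys(fixes, taps, min_gap=timedelta(minutes=15)):
    """Pair GPS blackouts with Oyster taps that fall inside them."""
    journeys = []
    for start, end in zip(fixes, fixes[1:]):
        if end - start >= min_gap:  # GPS went dark for a while
            inside = [t for t in taps if start <= t[0] <= end]
            if len(inside) >= 2:    # an entry and an exit tap in the gap
                journeys.append((inside[0][1], inside[-1][1],
                                 inside[-1][0] - inside[0][0]))
    return journeys

print(infer_tube_journeys(gps_fixes, oyster_taps))
```

Fusing two weak signals (a missing one and a sparse one) into a confident inference is the general trick here, and it applies well beyond travel data.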
So on the left-hand side, after taking an antihistamine, each of those lines is a roughly 12-hour trace counting the number of sneezes that Emily had. So on the left-hand side, at the 0th hour, she takes an antihistamine and she's recording her sneezes. By the middle, at the 6th hour, she's still recording her sneezes. And you can see that the graphs kind of spike high at the beginning, they drop lower, and then they get higher again by the 12th hour. On the right-hand side, the blue line is a summary. It sums up the number of sneezes in the 0th hour, 1st hour, 2nd hour, 3rd hour. At the 0th hour, close to 50 sneezes, same for the 1st hour. Then by the 2nd hour, it drops down significantly. And it's down for a number of hours. And then it increases again by about the 12th hour. And it turns out the antihistamine takes two hours to take effect, and then its effect declines over about eight hours after that, and you get an exponential drop-off. So this is not modeling the believed effect of the antihistamine. This is the actual recorded effect, set against a modeled antihistamine distribution in the bloodstream. So it's quite nice to marry this up to reality. So you can imagine a person modeling their own behavior on a certain drug and then seeing how they react, so they can anticipate how they might react in other situations. Antihistamine's easy, but you can imagine doing this for more complex interactions with other drugs. Out of an awful lot of work, we found a result. And it was incredibly inspiring to find a result. And the result was this: if you tag the GPS, so we know when my wife was in the country, and tally it against weather sources (and this is from April through to the following February, so the better part of a year), all the time she's in the UK, if I take all the weather sources and I look at humidity: when humidity is higher, so when it's damper, the propensity to use an antihistamine goes down. 
So she uses fewer antihistamines, because she's sneezing less, when the air is wetter. And the right-hand chart is a scatter chart of the number of antihistamines taken in a day versus the humidity. And as the humidity increases, antihistamine usage comes down. And we get this negative relationship in the data. Turns out the mucous membrane in the nose, if it's drier, becomes more irritable, and you've got a greater propensity to sneeze, at which point there's a better chance of taking the drug. And if it's damper, that relationship decreases. We took this data to King's College. My co-founder in PyData did his PhD at King's College, and went to see one of the leading professors on the subject, a globally leading professor. We sat down and showed him the data. He went, oh, well, there you go. It's kind of obvious, right? It's not obvious to us. What's going on? He goes, well, it's not an allergic reaction. This is a non-allergic reaction. Oh, OK. You mean everything that we've gathered just shows you that there is no allergic reaction? There's something else, but yeah. OK, brilliant. All right. So it's chronic and persistent, so this thing's never going away. And it's just a reaction to the air. It doesn't matter which country Emily's in, what kind of public transport she uses, what kind of food she eats. This thing just carries on all the time. It's just vaguely influenced by the dryness of the air. He suggested a new treatment: switching off the standard antihistamine onto a cromoglicate nasal spray. We gathered data for a few months after that. And there was a change in behavior, but nothing interesting. So Emily switched back to using a regular antihistamine. We were hoping to see a dramatic effect if this particular drug was going to work. And then it kind of stops there, because it turns out, having exhausted all these possibilities and done an awful lot of work, we couldn't really do much more. 
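The negative relationship in that scatter chart boils down to a Pearson correlation between daily humidity and doses taken. The numbers below are invented purely to show the shape of the relationship (they are not Emily's real data), and the correlation is computed by hand to keep the sketch dependency-free.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

humidity = [40, 45, 50, 60, 70, 80, 85, 90]  # % relative humidity, per day (invented)
doses    = [2,  2,  1,  1,  1,  0,  0,  0]   # antihistamines taken that day (invented)

r = pearson(humidity, doses)
print(f"Pearson r = {r:.2f}")  # strongly negative: damper air, fewer doses
```

A single correlation coefficient plus the scatter plot was enough for the professor to read the diagnosis straight off the data.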
There was just nowhere further to take this particular bit of work, except to kind of shrug our shoulders and say, all right, well, we know what this thing is, we can't cure it, it's just going to be a lifelong condition. One of the challenges here, of course, is having an n of 1, so a population size of one person that you're studying. I did recruit other people who had summer-based allergies, but then you'd only get data for a couple of months. And it turns out other people, because their condition wasn't so bad, were less motivated to fill in the form every day, and so the data quality got spottier. So that was a pain. So somewhere in there, we kind of lost a bit of energy for this project. And there was a realization from me as well that all this machine learning work that I was doing was really complete overkill. It helped to validate the outcome that we discovered, but really, having really good graphs (that message again: draw your data), really good graphs taken to an expert so they can interpret them, was more than enough to understand what was going on in this data set. We did open-source the app. We open-sourced some of the research, and the talks are out there, so other people can recreate this work, or at least not make the same mistakes that we made. There's lots of discussion about all the intricacies of getting date-time data out of an iOS device and all the weird ways that it records date-times. And GPS: GPS on an Apple device often puts you in the sea off the coast of Africa, at location zero-zero, if it's not quite sure where you are (like when you're on the underground, for example), and there are ways you can go about fixing that. So I mentioned earlier on that we're gonna have two little quizzes. So these quizzes require interaction from you, not from some of you, but from all of you, please. I'm gonna have this URL, that's bit.ly slash keynote-ada-one. I'm gonna have that on the next couple of slides, so you can just type it in. 
All you need is access to the Wi-Fi, and with that you go to bit.ly slash keynote-ada-one, and it has a question for you. It says: please guess the weight of my dog. And there's no other information, and that's deliberate at this stage. Absolutely no other information. I wanna see what data you put in. You're gonna guess the weight once, and if a colleague next to you doesn't have a device, just hit the submit-another-response link, pass it over and they can type it in. Danny? Sorry? You weren't meant to ask that so soon. I won't tell you. No, okay, all right, so that's, damn it, Danny, you engineer. Right, so the units that you'll be asked about in the second question are kilograms, so you can use kilograms here too. I've got a conclusion based on that. So, yeah. So, we've got two questions. On the second one, you will get more information, and then the intention is to see how the answers change when you're given more information, and to see if we can recreate a classic result from about a hundred years ago. So, does anyone who's tried it have any problems going to that URL? Is it working? Yeah, it's working, right, okay. So that URL is gonna be linked on the next couple of pages: bit.ly slash keynote-ada and the number one. Don't go to what might follow later on with the second question; we'll get there in a few slides' time. And yes, we're gonna have a live demo of this later on, and I hope that my pre-processing code will work and take whatever data you've been typing in, process it, not blow up, and give us some nice graphs. Okay, so now I'm gonna switch on to updating outdated medical results. This was a lightning talk given by Anna (I'm not gonna try to pronounce her surname) at PyData Warsaw late last year. So this is another Python success story analyzing data, where the analysis is not very complicated but the effects are very large. 
So, okay, I'm not a birthing expert, so I'm gonna talk lightly about this, and I hope I'm not gonna get any of this wrong. I did check this with her, so the story I'm gonna tell you should be correct, but you can't ask me any questions. So, in 1955, Friedman published some results. Friedman measured cervical dilation in preparation for giving birth. So at four centimeters, that's the beginning of labor, and then by 10 centimeters of cervical dilation, the baby is ready to come out. So this is measuring birthing progress, using current data and data from 1955. This data set of 500 first-time mothers from '55, this graph, has basically become the first thing taught to people involved in birth around the planet, so all the doctors involved in birthing will be using this graph, and then making medical decisions based upon it. If everything has stayed constant over the 60-year gap in between, then this graph might not be wrong. If the age of first-time mothers and second-time mothers has changed since the 50s, this graph might be wrong. If technology has changed, if the use of forceps for delivery has gone out of fashion, for example, or if we no longer sedate the majority of mothers during birth, which we don't do anymore, then this graph is wrong. And yet this graph is used to drive medical decisions. If the cervical dilation is not progressing at the right speed, so, I think from memory it's one centimeter of extra dilation every hour, if the mother is progressing slower than that speed and is not following this graph, then the doctor can say: right, there's a failure to progress. And a failure to progress means that the baby might not be born correctly, so we need to intervene; we can either use drugs or we can use surgery, like a cesarean operation. And these things have a big effect on the mother and potentially on the unborn child. Surgery is bad. 
The use of extra drugs when unnecessary is probably bad. And in particular, using data gathered 60 years ago, under very different conditions to what exists now, is probably bad. So she tells this story, and then, inside their birthing unit (and it's not a conventional hospital, this is a birthing unit where they don't have to follow the conventional advice), they don't have to perform a C-section if the mother is behind, or apparently behind, in the progress of cervical dilation over the 10 or so hours (I think 14 hours might be the maximum before they intervene). All of the mothers involved in the study, I think it was 400 overall between first-time and second-time-plus mothers, all of them gave birth successfully without interventions. And of those that gave birth without interventions, the red curve is the Friedman curve, and all the dots below the red curve, by rights, shouldn't have been there: intervention should have occurred according to the conventional advice. But they didn't intervene, and all the babies were born successfully. And this is not the first study of its kind; this has been questioned for the last 15 years. There's plenty of research out there from different units around the planet, a lot of work in the States, saying that maybe this advice is wrong; there's been plenty of evidence gathered to say that you don't need to follow this advice. So what we can see on the left-hand side, in the leftmost box plot there, is that the majority of mothers, when first measured, had a cervical dilation of between zero and three centimeters. And as that box moves up and the cervix is dilating to an increasing size, then by six hours you're having mothers up at a 10-centimeter dilation, which is exactly the point where the baby should be coming out, and by about eight hours, almost all mothers have reached that point and have given birth. 
So critically, the majority of the mothers are giving birth ahead of what the Friedman curve suggests, so there's clearly wide variance around the advice given by this singular curve, and a number of mothers are giving birth at a later stage, but they're giving birth successfully. The Friedman curve was designed around first-time mothers only, and under those different medical conditions. So what happens if you gather a richer data set? Well, you can do something like produce a flowchart. This is a scikit-learn decision tree, so machine learning, but really it's a flowchart. At the top is X5 <= 1.5; X5 means how many births, including the current birth, has this mother had. So less than or equal to 1.5 is a first-time birth: go down the left-hand side of the tree; more than one, go down the right-hand side. If you go down the left-hand side: what is the weight of the mother before delivery? If it's less than 58.5, go to the left-hand side. And if her height is less than 164.5, again the left-hand side, and the value 344, that's the expected number of minutes for this first-time mother, who is shorter than some critical value, to give birth. And then you can see the next value is nearly 400 minutes, then 440. And it varies: first-time mothers give birth at different rates to second-time mothers. So you get this actionable outcome. This is explained to staff in the birthing unit; they agree with this, they like the output of this, they like the nuance they get behind this, they like the idea that they can see when things are off track, and particularly that they can see when things are on track, because they know the other evidence they've got isn't really working for them. And so this has escalated to increased studies and collaboration with other units, and I'm really looking forward to seeing where that story goes later this year, and hoping that we get more of that story told at PyData events. 
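The tree walked through above really is just nested if/else, which is why the staff could read it as a flowchart. Here it is transcribed as plain Python. The first-birth split, the 58.5 kg and 164.5 cm thresholds, and the 344-minute leaf come from the talk; which branches hold the roughly-400 and 440-minute leaves, and the whole right-hand side, are my assumptions, added only to make the sketch runnable.

```python
def expected_labour_minutes(n_births, weight_kg, height_cm):
    """The decision tree from the talk, as a readable flowchart."""
    if n_births <= 1.5:            # X5: first-time mother
        if weight_kg < 58.5:       # mother's weight before delivery
            if height_cm < 164.5:
                return 344         # leaf quoted in the talk
            return 398             # "nearly 400 minutes" (branch placement assumed)
        return 440                 # branch placement assumed
    return 280                     # placeholder for second-time-plus mothers

# A shorter, lighter first-time mother follows the leftmost path
print(expected_labour_minutes(1, 55, 160))
```

Flattening a fitted `DecisionTreeRegressor` into code like this (scikit-learn's `export_text` does something similar) is a useful way to hand a model to domain experts who will never run Python.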
So a little lesson for you there: check outdated assumptions, find data from 60 years ago and update it, do something new and interesting with it. Always draw your graphs, that's the message I'm giving you, always draw your graphs, and produce interpretable advice; that's the important thing to get your results actioned. Before I move on a slide or so... 10 minutes, oh dear, I'm talking too much. Okay, in that case I'm going to skip over trying to understand my cat, because that's the most lightweight story; very short story. I couldn't tell if my cat was going out or not at night, or whether she was trolling me in the morning to get me to carry her into the garden, so I built a MicroPython and Raspberry Pi based sensing unit, stuck it on the cat flap to measure her going in and out of the house. I've got a link there to Robin and Oliver's talk from yesterday, where they talked about Home Assistant and Raspberry Pi based sensor monitoring; I suggest you follow up on that if you're interested. I gathered a bunch of data, and the short story is it turns out she did go out of the house a little bit sometimes during the night. She wasn't really trolling me; she really is a very, very scaredy cat who was afraid of the garden and afraid of the enemy cats in the garden. Which is a problem, because we thought we'd solve this problem by getting a dog, but it turns out the dog and the cat don't get on so well, and now the cat really has to be carried outside once I've put the dog away, which is a whole separate set of behavioural experiments that I've been running for a year, trying to teach my dog not to chase my cat. You can see the wires hanging off the sensor unit there; that's when my dog ate my science experiment. So there are some simple lessons there.
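As a sketch of the logging side of that cat-flap unit (on the real device it would be MicroPython wired to the flap sensor; the function and file names here are hypothetical, my own illustration):

```python
# Hypothetical sketch of the cat-flap logger: append one timestamped row
# per flap event, so the night's comings and goings can be analyzed later.
import csv
from datetime import datetime, timezone

EVENTS_FILE = "catflap_events.csv"   # assumed filename, illustration only

def log_event(direction, path=EVENTS_FILE):
    """Append one flap event ('in' or 'out') with a UTC timestamp."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), direction])

# On the real unit this would fire from a sensor interrupt; simulate here:
log_event("out")
log_event("in")
```

Even a log this simple is enough to plot counts of overnight exits, which was the whole question being asked.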
Robust hardware can work, and simple hardware can work, but making it robust is quite hard, and even the simple visualizations that I generated from logging this data were enough to tell me that my cat did go out a couple of times overnight, but not very often. Right, guess-the-weight number two. I'm going to give you pictures in just a moment. So as you might have guessed, the link you want to go to is the bit.ly link, keynote ADA number two. Danny preempted my note about having to use the correct units, which of course is the right question to ask; damn you for asking it so soon. So I have a one-and-a-half-year-old now. The picture you saw before was the younger Ada. She's one and a half years old now, an English Springer Spaniel, a very active female dog. I'll show you pictures in a moment. I'd like you to estimate her weight in kilograms on keynote ADA two. So these are pictures of my dog now. You can see that she's quite photogenic. Bottom left, you can see that she's got... what's the brand of the camera that all the sports people use? GoPro, right. It's a GoPro mounted on her back, and you can record her running around in the forest. A very lovely, very active dog. How heavy is she in kilograms? Go to keynote ADA two, and I'll keep that link up. Okay, I think this is the final story before the live demo. So a colleague of mine, Dirk Gorissen, runs the London Machine Learning Meetup. He's been working on a project: how do you track orangutans in the jungle to assist aid work? Orangutans are endangered, some species critically endangered. It turns out they're very smart and they're very cute, so they are kept as pets and then abandoned, or they're at risk from, say, logging in the jungle.
So aid workers can go in and rescue them and take them somewhere safe, but then you need to monitor them to make sure that they're kept safe and have a healthy life, and to adjust the interventions, the release site, and how they're looked after. To do that you have to monitor them; you have to see where these intelligent, private creatures are going. And the way that you track them is with a radio tracker: those little pills on the right-hand side are little radio discs that go under the skin at the back of the neck. You can't put a tracking collar on, and you can't make the animal wear a bracelet; they break them off, they rip them off, the animals vary in size, and the weather conditions are quite harsh. But these things sit under the skin. The way you track this is that two people follow the ape 24/7 for the first weeks after its release, which is very labour-intensive, and over time they track that one ape less and less and track other apes instead. The hope is that they can keep finding the ape, and the way you find the ape tomorrow is to go back to the place you left it tonight and hope it's still there, with a radio receiver that has a 400-metre maximum range in dense foliage. You walk around waiting for it to make a noise; if it makes a noise, that's a ping, and you try to get closer so it makes a louder noise, and then you're homing in. But you've got 2,000 square kilometres of jungle and these animals move around, so it's very labour-intensive and hard to track them. Can we use a drone to do this tracking? So Dirk was working with a commercial organization; that didn't work out, so he took the technology to the aid work. One of the aid organizations he works with funds the technology, and he donates his time; he's been doing this for a couple of years. This talk is based on the work he did last year, which he spoke about at PyData Amsterdam.
He's just flown back from the jungle again with an update for me, and I'll share that in a moment. So if you take a drone flying a fixed search pattern, a fixed sweep pattern, with a software-defined radio driven by Python and Python analysis tools, you get radio pings recording signal strength. Can you determine from that where these orangutans are? Then you can send a human in to track them more closely. He gave this really nice talk at PyData Amsterdam last year; there's a blog post that I link in the appendix, and I've got a video to show in just a minute. This is one of the test sequences. All of these dots are GPS locations, so you can see the dots follow a track: that's the drone flying backwards and forwards across the jungle, gathering signal strength measurements as it goes. All this blue means no signal being recorded; then there's a region where they know the test subject, Susie, is walking away towards a release site, and they're measuring the radio strength coming from her. With the GPS data you can then localize that to the spot where the animal is most likely to be, and then, in a far less manual process, send somebody over there. And because this device is flying around with a generic receiver, you can receive updates from multiple animals at the same time. So I'm just going to give you a quick video; there are two quick videos. The first one is just a very pretty video. There's a second drone, a professional camera drone, up in the sky already, and that's Dirk's drone that's taking off; we'll see it pan up in just a moment. So that pans up, that's the drone with the tracking equipment on it, and it's hovering there, and then it'll just take off and sort of fly off into the distance, and that's the beginning of one of the test runs.
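The localization step can be sketched as a weighted centroid of the strongest pings along the sweep. This is my own simplification of the idea, not Dirk's actual pipeline, and all the data below is simulated.

```python
# Sketch: estimate a transmitter's position from (lat, lon, strength)
# samples recorded along a drone sweep, using the k strongest pings.
import numpy as np

def weighted_fix(lats, lons, strengths, top_k=10):
    """Weighted centroid of the top_k strongest signal samples."""
    strengths = np.asarray(strengths, dtype=float)
    idx = np.argsort(strengths)[-top_k:]          # strongest pings
    w = strengths[idx] - strengths[idx].min() + 1e-9  # positive weights
    return (float(np.average(np.asarray(lats)[idx], weights=w)),
            float(np.average(np.asarray(lons)[idx], weights=w)))

# Simulated sweep over a unit square; signal peaks near (0.5, 0.5)
rng = np.random.default_rng(1)
lats = rng.uniform(0, 1, 200)
lons = rng.uniform(0, 1, 200)
strengths = -((lats - 0.5) ** 2 + (lons - 0.5) ** 2)

print(weighted_fix(lats, lons, strengths))
```

A real system would have to cope with multipath, foliage attenuation and a moving animal, but this captures the core step of turning sweep data into a spot to send a human to.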
And at that point it's out of human control: it runs off on its test run, comes back and lands automatically, you download the data and do your processing. So once it's flown off, it's kind of out of your control, and if something goes wrong, then there's a problem. Here's the second video, and this is the video from the actual drone itself. This is taking off from one of the release sites in the jungle, and you get an idea of the scope of the problem. That's the hole in the canopy, and if you look at the canopy all around it, there are no more holes. So there's a release site where this drone can go up, and when it flies back it has to come back to exactly the right spot and come down; otherwise it will crash, and that would be really bad. So this is the test flight they did, just flying around the local area. They thought the animal was there, still present, but it turns out the animal had moved away. So the drone flies away, comes back, lands, and they get no data. At which point they decided: right, we're going for the expanded search pattern. So they send it off on the expanded search pattern. I don't have a video of that, because the device goes up, flies away, they're tracking it, the signal strength drops off, and an hour later the signal strength is returning. Brilliant, the device is coming back. And then nothing; it all just stops. Disaster. They're waiting and waiting and waiting, and nothing happens. In the end they give up, and it turns out that when they constructed their flight plan, on the return track, they hadn't realized there was a knoll at some point, so the trees there were higher, and the device flew into the trees and crashed. At the time they didn't know what had happened; many months later they got a bag of bits sent back from the aid agency, the remains of the drone, discovered out in the forest.
So you get an idea of the complexity of that problem, and that's one person volunteering their time, working on this very tricky subject. Dirk is back from his second journey out there. It didn't go perfectly, but the results are much better this time, so they're closer to having a working drone-based orangutan tracking system in the wild, working in a really dense, complex jungle. So again: hardware is hard, freeing up humans is valuable, and graph everything. So we're going to have a quick live demo in JupyterLab, and, I hope, enough of you have put your results in; we'll see how this goes. So who's used JupyterLab before? Jupyter Notebooks? Oh, lots of you, brilliant. Way more than I thought; in that case, great. Jupyter Notebooks are the standard way for data scientists in the Python world to analyze their data. JupyterLab is the new system, an update on Notebooks: it's notebooks with an IDE-like interface embedded in a web browser. So I'm just going to run my code in here. We'll pull in the data, hopefully live and working. Brilliant. This line here: typically I'd just do a read_csv. That might be a bit too small to see; it says read_csv, reading a CSV file. I'm doing some extra processing to be defensive against whatever you might have typed in, but typically that's a one-liner to read in a CSV file. So I'm going to load that data in. Maybe I want to make that bigger; can you read that okay? At the top here I've got a DataFrame: timestamps of the entries that you made, and the guessed weights in kilograms. So I can get a visual idea of the numbers that were put in, and this is stripping out any text or other weirdness that might have existed. So I've got a bunch of results there. Really I want to see... great. So I've got a count of 91 results from that first, uninformed experiment. So 91, that's brilliant, thank you very much. The mean estimate is about 17, so 17 kilograms, with a standard deviation of 14.
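That "defensive read_csv" step might look like the following: coerce the guesses column to numbers and drop anything unparseable. The column names and sample rows are my own illustration, not the actual spreadsheet.

```python
# Defensive CSV load: pd.to_numeric(errors="coerce") turns free text into
# NaN, and dropna removes those rows, leaving clean numeric guesses.
import io
import pandas as pd

raw = io.StringIO(
    "timestamp,guess_kg\n"
    "2018-01-01T09:00,17\n"
    "2018-01-01T09:01,about twelve\n"   # free text sneaks in
    "2018-01-01T09:02,14.5\n"
)
df = pd.read_csv(raw)
df["guess_kg"] = pd.to_numeric(df["guess_kg"], errors="coerce")
df = df.dropna(subset=["guess_kg"])
print(df["guess_kg"].describe())   # count, mean, std, min, max, ...
```

`describe()` is the one-liner that produced the count of 91, the mean of 17 and the standard deviation of 14 mentioned above.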
So that's 17 plus or minus 14 for about 68% of the answers; that's a very wide variation. And the minimum is zero. Okay, she doesn't weigh zero. And the maximum is 100. Now, I cleaned the spreadsheet this morning, so that's not me. Okay, so those numbers are wrong, but that's fine. Now, always visualize your data when you start a new analysis. If we just visualize the raw data, we can see that the distribution is clumped to the left-hand side; that's because we've got this one outlier at 100 dragging the graph out. So this is not great for analysis, but it's good just to get an eyeball on the underlying data and see what we've got. Now in this cell I'm going to mask off data outside a range that I think is unhelpful, and come up with a slightly better graph. So this is the estimate we're coming up with now: the mean estimate is 13.5 kilograms, I'll tell you the truth in a moment, with a standard deviation of six, pretty wide still. I mean, there are some guesses down at one or two kilograms, and up at 25 kilograms, but you didn't know how big she was; she could have been a tiny dog or a huge dog, so that's fair enough. Let's run through the second DataFrame; this is the better-informed guess. Oh, interesting. The distribution is pretty similar. The mean is 12.2, from 13.5 before. The standard deviation was six and is now five. Okay, so you actually haven't changed your guesses all that much, which is kind of interesting. Her true weight is 14 kilograms, so actually the mean estimate here is very close. So you can give yourselves a round of applause; you've done a good job there, well done. To combine these things, you might make a DataFrame: you put the two columns into a DataFrame together, and then you can draw them and overlay them. Actually, yeah, these results look pretty similar. Am I wrapping up? Right, I'm wrapping up.
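The masking and overlay steps might be sketched like this, with simulated guesses standing in for the audience data; the cut-off range is an illustrative choice, not the one used in the demo.

```python
# Simulated audience guesses; the 100 kg entry plays the joke outlier
import matplotlib
matplotlib.use("Agg")              # headless backend, safe anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
uninformed = pd.Series(np.append(rng.normal(13.5, 6, 90), 100.0))

# Mask off values outside a range we judge unhelpful (illustrative cut-offs)
mask = (uninformed > 0) & (uninformed < 40)
cleaned = uninformed[mask].reset_index(drop=True)

# Second, better-informed round of guesses
informed = pd.Series(rng.normal(12.2, 5, len(cleaned)))

# Put both columns in one DataFrame and overlay their histograms
both = pd.DataFrame({"uninformed": cleaned, "informed": informed})
ax = both.plot.hist(alpha=0.5, bins=20)
ax.set_xlabel("guessed weight (kg)")
plt.savefig("guesses.png")
print(both.describe())
```

With both columns in one DataFrame, `plot.hist` overlays the two distributions for direct comparison, which is exactly the combined view described above.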
And then if you wanted a prettier plot that you might put onto the web, you might use something like Plotly, in which case you get a D3-based, interactive, fully rendered graph that you can upload to a website and embed, so you can share your results inside an organization. And on that note, I'm just going to say: collect and visualize your data. Get open data; go to the Kaggle machine learning website, you can get open data there. Practice, play with it, visualize it; visualize your internal company data. I'll have a write-up of this on my blog. Please remember to thank all of the volunteers and speakers at the conference; these things are all volunteer-driven. I'm a volunteer here, and all the other speakers are volunteers. If you do one nice thing this weekend, go and thank a speaker for the time they put in, and thank the organizers for the conference. And if you've learned anything at all today: the thing I've been doing in the last year is saying, please send me a postcard. I want a little thing back. Send me an email, I'll give you my address; send me a postcard from somewhere nice in the world. I've got a nice collection of postcards building up on the wall, and it reminds me that I've actually helped some people out with a new bit of thinking. So if you want to send me a postcard, send me an email; I'll be very happy. Thank you very much. Awesome, thank you very much, Ian. You'll certainly be getting a postcard from me. Brilliant. I think we've got two minutes for questions. Any questions in the room? Any questions? Okay, I'll ask one. In your first story, you said: visualize unseen data. Have you got any tips and tricks for finding unseen data, good sources of it? Yeah, the government open data website is a great place to go to search for government open data, and there's a whole pile of data there.
I remember when I visualized the UK house price and home ownership data set, I was surprised to discover that if you look at the time-based completions of home ownership, the majority of completions occur on a Friday, not on a Monday. It turns out that conveyancing typically completes at the end of the week, so you get a spike every Friday. And there are certain points in the year where you get particular spikes; there's an annual pattern behind that behaviour too: around Christmas, fewer completions, but before big holidays there's a big bunch of them, as people want to move, or complete, before the summer holidays and the like. So go and grab any of that data and just visualize it, and weird things drop out that make sense in hindsight, but that you wouldn't necessarily know to look for before you went there. Awesome. Thank you very much. Thanks very much, Ian. The story from Skopje was, for me, the most illuminating and interesting, because of the implications of it. How easy is it for someone to stumble across that information and actually do something with it, unless they're already quite expert and know what they're looking for? So in Gorjan's case, he just picked the data set that was on the government website and visualized it, and he didn't know what he was doing; he was learning to program. So he just stumbled upon that, which is really nice. The fact that he then managed to engage a million people speaks to the fact that he's not bad at publicity and not bad at organizing, and he found other people who also cared about this health condition. If he'd found a data set on something far more mundane, he wouldn't have got a million people engaged. So I think there's an element of luck in that story, but I think that's kind of the case. The Guardian goes out digging for stories, and The Guardian has a data science and visualization department, but The Guardian doesn't know all of the stories that exist out there.
If you stumbled across an interesting data set and thought, hey, maybe the wider public in the UK or the Eurozone cares about this, you could contact someone like The Guardian or one of the other newspapers and try to publicize the result, and you'd probably find collaboration. But what it needs is many eyes on many data sets, discovering the interesting things that exist in there. And I think it is a random discovery process, to some degree. Awesome. That's all we've got time for, so thanks again, Ian. Thank you very much.