So, welcome, and I'm really happy to be here. It's a huge honour to be given the chance to deliver a keynote. I'm always pushing data science on everyone, so my goal this morning is to try to educate you and convert you into the field of data science, and then bring you into my meetup and into PyData, and the general data science world. So, I'm an engineering data scientist. If you're outside of the data science world, you'll have heard a lot about data science; it's one of the sexiest things at the moment and it gets all of the mind share. Many data scientists come into this world via PhDs and postdocs. That's not me; I came in via a theoretical computer science background 15 years ago, so I've taken the other route into data science. And I'm talking to you today more as engineers coming into this world rather than academics coming into the data science world. I've been running my own company for nearly 15 years: I coach, I train, I act as an interim senior data scientist in teams that are lacking a senior. One of the reasons I do that is that I love to learn new things, so I keep challenging myself to go off and work in new domains. A couple of the talks today will be medically focused, but I've worked in a number of domains just because I really enjoy learning new things. And I love sharing those new things, and where possible I run experiments, including upon my wife, as you've already heard, and see what I can learn. Along the way I've written a book on high performance Python with my co-author Micha, who was at Bitly and has moved on to Cloudera, I think. I co-founded the PyData London Meetup, which Alexander has mentioned. I'm really proud of PyData: we've got over 100 PyData events around the world plus a set of conferences, and of the 100, my London one, built with Marco, who's here, and a number of other colleagues, is the largest in the world. I'm super happy about that. 
We've got 7,500 members, and I would love to invite some of you to come and join us at PyData London. We've also got a PyData Edinburgh and a PyData Cardiff in the UK, and there's a number of PyDatas throughout Europe; there's a PyData Frankfurt, I'm reminded. If you're interested in us at all, it's a very friendly and welcoming community: all the nice things about the Python community, with people who like talking about data. So if you're at all into that, go and join. I consult via my company, ModelInsight, and I work with companies like Mitsubishi Bank and Channel 4. I've been working with QBE, a very large insurer, helping them figure out how to apply machine learning in their insurance. So I work with large companies where there are large projects; they can be very slow, they can have a big impact, but they're the big corporate things. That's not what I'm talking about today. Instead I'm going to give you some stories on citizen science. These are small, either individual or lightweight projects inside various organisations, that are public projects. I'll also be giving a crowd-led demo; I mean, I'm sacrificing chickens to the demo gods here. It's a live demo with JupyterLab, so that those of you who haven't seen the JupyterLab environment before can see how a data scientist might work, and you might be inspired to go and try this as well. I need you to participate with me. I'm going to be sharing a link for a Google form twice. No login required: you just visit the link, there's a single question, you type in an answer, you hit submit. You can do that on your phone. I know the Wi-Fi in here might be a bit tricky, but the form is very lightweight and works over your 4G connection. So for anyone who's got a mobile phone, I would very much like you to take part in this little experiment. You just go to this form twice, type in a number, hit submit. And then, having done that, you can submit another answer. 
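As an aside, the kind of cleaning a crowd-sourced form like this needs, keeping only plausible numeric answers and discarding stray text, negative numbers and deliberate "NaN" entries, can be sketched in a few lines of Python. This is my own illustration of the idea, not the actual code used in the demo:

```python
import math

def clean_guesses(raw_answers):
    """Keep only plausible weight guesses (positive, finite numbers)."""
    cleaned = []
    for answer in raw_answers:
        # Strip any stray text such as units ("9.5 kg" -> "9.5")
        digits = "".join(ch for ch in answer if ch.isdigit() or ch in ".-")
        try:
            value = float(digits)
        except ValueError:
            continue  # not a number at all, drop it
        if math.isnan(value) or value <= 0:
            continue  # guard against "NaN" jokers and negative weights
        cleaned.append(value)
    return cleaned

print(clean_guesses(["12", "9.5 kg", "NaN", "-3", "about 15"]))
# → [12.0, 9.5, 15.0]
```

Doing this defensively matters precisely because, as you'll hear shortly, someone always tests the parsing.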
So if your neighbour doesn't have a device, you could pass your device over and they could submit their answer as well. It's all in the moment: no logins, no complexity there at all. There are two appendix slides as well that I'll show; they've got all the links for the stories I'm using in this talk, so you can follow up and learn more about them. So first of all, I'm going to talk about Macedonian air quality. Has anyone been to a city with really bad air quality before? Okay, so a bunch of us have. When I was at PyData Amsterdam about six weeks ago, I met a chap called Gorjan Jovanovski, and he told the story of the Macedonian smelly fog. As he tells it, every year the smelly fog descends. So this is not an unusual photograph of a strange cloud layer in the city: this is the smog in the city, taken from above looking down, and this is what the populace lives in. You can see some of the skyscrapers just peeking through at the top there. This bad weather descends for many months of the year, every year. It's a known thing. Everyone just says, it's the smelly fog; there'll be rains later, it'll clear, it'll be okay. In between, the government issues warnings: anyone with breathing difficulties, or anyone who's a baby, maybe shouldn't leave the house today because the air is particularly lethal. And then they change the limits. My wife and I, when we lived in Chile, saw a similar thing, where the government would change the red levels and increase the limits for what it meant for a day to be a red day, when you shouldn't leave the house. But it's kind of terrifying when you can't see down the street because the pollution is so thick. Gorjan was learning programming at the time, and he took some government open data about pollution measurements. He wasn't very confident in what he was doing; this was about five years ago, I think. He takes the data and he draws some graphs. 
And he knows he's made mistakes, because when he draws the graphs, the numbers make no sense. It's an undocumented data set, so he hasn't got any guidance there. But every time he draws these numbers, it's crazy: the numbers are significantly higher than anything he sees around him. He does some reading online. These numbers are consistently four times higher than the bad pollution he's read about in Beijing, and 20 times the numbers expected in the EU in the worst possible case. And these are the daily readings he's experiencing. After a while he realises: oh, they're true. These numbers are actually correct. And he's the first person he's found who's playing with this data set. So this is really awful, right? There's killer air, 20 times EU pollution limits, and no one in the country is talking about it. So he writes a web page graphing these results. And then he finds some other people who care about this topic as well; there are a lot of people who care about the fact that they're being poisoned on a daily basis, and together they popularise these results. Within a month they've got a million people consuming these results, first off a web page and then off a mobile app. It gets to the point where, in that diagram on the right, a member of parliament is holding up printouts from the website in parliament, discussing the fact that there's this issue that transcends any nationality, sex, education or wealth bracket. This is the air that everybody, every politician, every child is breathing, and it's killing them all, and maybe this needs to be discussed. The incumbent government doesn't want to discuss this, because this is a bad topic, while the party that wants change is talking about it and using it to generate some action. There was an interesting part in Gorjan's story where he talks about how the Minister for Ecology goes on to the radio, I think, and says: this is all lies. The data is wrong. 
It's all a lie, it's a conspiracy. So Gorjan comes back and says: look, I'll come into the government with my data, compare it to your official data, or compare it to your paper records. If I'm wrong, I will apologise, delete the app and remove everything. And if you're wrong, you resign. And that was the last they heard from the Minister of Ecology. I asked Gorjan, how did you get the data? If the data is this bad, why would the government release it? He said: ah, well, Macedonia wants to be in the EU. The EU provided these sensors, and a requirement of having the sensors is that the data has to be published. The data was published: not documented, but published. So the data was made available, but then just not pushed, not documented, not investigated in any way, and it took someone like Gorjan to go and do something with it. They're improving upon this now: they've gone from a single dump of data to frequent updates. It drove government policy change during the election; the incoming government was promising rapid evolution in air quality standards. They won the election, and nothing changed after that, so this is clearly going to be an ongoing, slow process. But some things did change. Using the mobilised population, they tracked down a highly polluting incinerator. It turns out to be a British-supplied, highly polluting incinerator, something that we in the UK got rid of over a decade ago because we couldn't meet the EU limits with it; the Guardian wrote about this in 2001. So this was gifted out, and, as I read it, in fairness it was better than what was available out there at the time. It was an improvement; it just should have been better. But as a result of highlighting the problems with this unit, the fact that it was running 24-7 rather than within strict timelines and didn't have certain filters on it, they got it fixed. So it became far less polluting. 
The big step they're taking now is a collaboration with the European Space Agency and the Copernicus project, looking at real-time satellite data. That doesn't depend upon where a government places sensors, which may or may not be in a sensitive location; the satellite sees everything. And they're beginning to analyse that to drive further change. So what can we learn from this? Well, the simple lesson here is: draw a graph of unseen data. You can find a data set that no one's looked at, and there's a lot of open data; I've got links to it in the appendix here. Go and find some data that no one's drawn before and draw it. You can draw it in Excel if you want. A lot of this stuff is really easy; it's CSV files, typically. Draw the data and tell a story. Find some people if you want to try and get some change around this, and then see where that can take you. This is a really easy entry point into data science: just getting some data and drawing it. And there's an additional slide here that I've just put in. A month ago at PyData London, my colleagues Robert and Olivia talked about a personal air quality project they're working on: a Raspberry Pi device with a low-cost sensor. That kit up there cost about £60, and they can use it for monitoring in the house. There's the infamous dirty sausage story, which you can see told in that presentation if you go and watch it. What they're doing is mounting these sensors on the backs of pedestrians and cyclists, so as they go around they're monitoring their own personal air quality as they travel the streets, and can make better choices about the streets they take and the pollution they may or may not consume. And there's a talk this afternoon by Douglas Finch on air quality in Python that I'd suggest you attend if you're interested. So, here's the first audience participation moment: I would like you to guess the weight of my dog, and you know nothing about my dog. 
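To make the "draw a graph of unseen data" lesson concrete: the sort of first pass Gorjan did is a few lines of pandas. The file name, column names and numbers below are made up for illustration; only the 50 µg/m³ EU daily PM10 limit is a real figure:

```python
import pandas as pd

# In practice you'd load the open data set, typically a CSV, e.g.:
#   df = pd.read_csv("air_quality.csv", parse_dates=["timestamp"])
# Here a tiny synthetic sample stands in for it.
df = pd.DataFrame({
    "timestamp": pd.date_range("2018-01-01", periods=48, freq="h"),
    "pm10": [30 + (i % 24) * 5 for i in range(48)],  # µg/m³, made up
})

# Resample hourly readings to daily means: one number per day
daily = df.set_index("timestamp")["pm10"].resample("D").mean()
print(daily)

# In a notebook, drawing it is one more line:
#   ax = daily.plot(); ax.axhline(50)  # 50 µg/m³ EU daily PM10 limit
```

That's the whole entry point: load, resample, plot, and see whether the line sits above the limit.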
So, this is a wide-open, simple survey. The link is bit.ly, keynote-ada-1, and you can guess that the second one will have a 2 on the end, but don't go there yet. There's no sign-in; you can go there on your mobile phone. That link will appear on the next set of slides so you can go in there in the next couple of minutes. Please guess the weight of my dog in kilograms. Only put in a number; if you put in any text, it gets stripped out. I'm going to give you no information right now. Later on, I'll give you some more information, you'll make a second guess, and then we'll compare those two sets of results in the notebook. So just a number, no negative numbers, kilograms, so nice round numbers or low numbers, nothing too crazy. Please don't be the clever person who, when I gave this last time, typed in NaN to see if my parsing routines work. They do, but there's no need to test this. Kilograms, kilograms please. Yes, when I ran this last time I left the units deliberately blank, and immediately an engineer I know dived in with a request for units, and I love that. But yeah, kilograms only. So, we've mentioned my wife sneezing. She's here in the audience, and I love the fact that she supports me in running these experiments upon members of my family, including her. My wife sneezes a lot. And when I say a lot: there's a histogram in the bottom right-hand corner. We wrote an app where she can record every time she sneezes. The left-hand bar is the days when she sneezed zero times; this is over the course of about a year, so there are about 35 days when Emily didn't sneeze at all. The next bar is days when she sneezed one time, about 20 days. Then there are days with two and three sneezes, and she sneezed about four times a day on about 40 days over the course of the year. The far right side is 28 sneezes in a day. That was a particularly bad day. 
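That histogram is just a count of sneezes per day, including the zero-sneeze days. A hedged sketch of the counting step, with made-up dates rather than the real log, looks like this:

```python
import datetime as dt
from collections import Counter

# Hypothetical event log: one date string per recorded sneeze tap
sneeze_dates = [
    "2018-03-01", "2018-03-01", "2018-03-01",  # three sneezes
    "2018-03-02",                              # one sneeze
    "2018-03-04", "2018-03-04",                # two sneezes (none on 03-03)
]

per_day = Counter(sneeze_dates)

# Zero-sneeze days matter (they're the tall left-hand bar), so walk
# every day in the range rather than only the days present in the log.
start, end = dt.date(2018, 3, 1), dt.date(2018, 3, 4)
days = [(start + dt.timedelta(d)).isoformat()
        for d in range((end - start).days + 1)]
counts = [per_day.get(day, 0) for day in days]
print(counts)  # → [3, 1, 0, 2]

# matplotlib's plt.hist(counts) then gives the sneezes-per-day histogram
```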
And then the question was: Emily was a mobile developer at the time, so could she write an open source app that would benefit other people suffering from different conditions, a generalised app for personal medical healthcare? And could I analyse the data to see if we could find possibly correlated, possibly causal connections between events, to see what might drive the sneezing? So we had a hypothesis: there are environmental factors that drive sneezing, and if we record all of these factors, can we do something about it? The app that Emily built is an iOS-based app, open source. It has event logs with a simple button interface, so you can just tap when something has happened: you've got a runny nose, your eyes are itchy, and particularly, I've sneezed. We talked about whether you could use the device to automatically record sneezes, since you get a physical jerk and a loud noise, but that would be quite a lot of work and I didn't want to go quite that far; tapping a button was easy enough for the first version of the experiment. But I can see lots of ways you could automate elements of your personal reaction collection over time, if you suffer in that way, which we might imagine seeing in future devices. Is that a hand up? Oh yes, for the survey, please just one answer per person; there will be a second survey where you put in a second answer later on. Thank you. It's an open source app with editable history, and it records GPS traces. I will say one thing there: with the GPS traces, I take periodic updates from Emily and then I do the analysis. It was really weird realising that I had the same kind of view that Google and Apple have, watching a person's movements over time. It feels incredibly intrusive, and of course it is incredibly intrusive. 
I only got it lagged, but nonetheless you get this view, and it's a view that Apple and Google and any other controllers of our data, or any mobile phone company, have all the time. If we aren't looking at it, we never think about it; we kind of just take it for granted, but when I actually had it in my hands it was really weird to have that. So one of the reasons I encourage people to run these kinds of experiments is that it makes you think a little bit outside of what's normal in your everyday life, and about how you're interacting with the world of data that's available. So we're gathering all this data, and there are a number of things we've got out of it. I've given a couple of talks on this; I'm just going to show one little result here. Here we're looking at a single-patient anti-histamine effect. Emily sneezes a lot, and she takes anti-histamines roughly every other day. On a day when an anti-histamine is taken, what is the effect? Well, clearly Emily thinks she needs the anti-histamine: she's sneezing, she already feels like she's sneezing, it's a day with a high propensity to sneeze. So what effect does the drug have? On the left-hand side we've got all of the traces for when individual sneezes have occurred. This is a period of 12 hours after the one anti-histamine of the day has been taken. So whenever Emily sneezes she's tapping away, but she's already recorded that an anti-histamine has been taken. If I take those days, set the zeroth hour to when the anti-histamine was taken, and count all of the sneezes, we get a single count: that's the blue line on the right-hand side. In hours zero and one after an anti-histamine was taken, the sneezes are high, close to 50. Two hours after the anti-histamine was taken we see a marked drop: the total number of sneezes over all of these anti-histamine days is markedly lower, and it stays low for about eight hours, and then it increases again. 
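The alignment step described there, shifting each day so that hour zero is when the anti-histamine was taken and then totalling sneezes per offset hour across days, can be sketched like this. The data and the 12-hour window are illustrative, not the real logs:

```python
def sneezes_by_hour_after_dose(days, window_hours=12):
    """days: list of (dose_hour, [sneeze_hours]) tuples, as hours-of-day.

    Returns total sneezes at each whole-hour offset after the dose.
    """
    totals = [0] * window_hours
    for dose_hour, sneeze_hours in days:
        for s in sneeze_hours:
            offset = s - dose_hour  # align dose time to hour zero
            if 0 <= offset < window_hours:
                totals[offset] += 1
    return totals

# Two hypothetical days: dose taken at 8am and at 9am respectively
days = [(8, [8, 9, 10, 15]), (9, [9, 9, 11, 20])]
print(sneezes_by_hour_after_dose(days))
# → [3, 1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```

Plotting that returned list is the blue line on the right-hand side of the slide.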
We might ask, well, what's driving that? The dotted line behind is just an extrapolated line that I put together. I know that the anti-histamine Emily was taking at the time takes about two hours to have an effect, to enter the bloodstream, and then it has an exponential decay curve, so it drops off with a certain half-life. So I can plot that extrapolated line based on the simulation, and we see that the two-hour point is when the sneezes drop down, and then it decays to a certain point. This is a general result, but it applies to everyone in different ways based on personal biology: based on the kind of medication you're taking, you might have a different reaction, a different effect; it might last for days, or it might last for only hours. Other drugs might work in different, better or worse, ways. So here's a nice simple way to record the data and see how a drug works for you, to improve your own personal healthcare. Now, I had the strong hypothesis that there were causal factors in the environment that drove the sneezing, and I worked awfully hard, really hard, with a couple of colleagues trying to find any evidence of this causal connection. We couldn't find it. We found one result: there was a weak relationship with humidity. As the air got drier the propensity to sneeze increased, and as the air got damper the propensity to sneeze decreased. It turns out that your nasal lining, the mucous membrane in the nose, is more irritable when it's drier, and so sneezing is more likely. We can't control humidity, but it's interesting at least to find a proper result in there. 
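That extrapolated dotted line can be modelled very simply: no effect until the onset time, then exponential decay with a half-life. The two-hour onset matches what the talk describes; the eight-hour half-life here is an illustrative guess, not the real pharmacology:

```python
import math

def drug_effect(t_hours, onset=2.0, half_life=8.0):
    """Relative drug level: zero before onset, exponential decay after.

    onset and half_life are illustrative values, not real pharmacology.
    """
    if t_hours < onset:
        return 0.0
    return math.exp(-math.log(2) * (t_hours - onset) / half_life)

# Sample the curve every two hours across the 12-hour window
levels = [round(drug_effect(t), 2) for t in range(0, 13, 2)]
print(levels)  # → [0.0, 1.0, 0.84, 0.71, 0.59, 0.5, 0.42]
```

Overlaying a curve like this on the sneeze counts is what makes the two-hour drop-off and the roughly-eight-hour protected window visually obvious.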
Now, we escalated this: we took it to a King's College professor, one of the top professors in the world, connected via our PyData London community. He said, this is an amazing result; clearly there's a non-allergic reaction going on, it's just chronic and persistent rhinitis. So Emily is primed to sneeze simply because of the way her body works, and there are no environmental factors. We had data for different countries, different seasons, different allergen types in the air, what kind of travel we were doing at the time, London Underground, buses, all sorts: no connection at all with any of it. He did suggest a new treatment, which we tried but didn't get any improved result from, so it had some benefit in that it ruled out another treatment method. I mean, the anti-histamine works just fine, but we were looking to see if there was a better solution. The important takeaway here is that graphing was enough to get a diagnosis. The machine learning did give us something new, but you don't need to go all the way through to machine learning in a data analysis project. Typically, getting good enough data, drawing graphs, and having someone who can interpret them is what you need, and that's the key takeaway. I'm going to repeat that lesson a little bit more, and if you're interested in this and want a little step forwards, you might want to see Marco Bonzanini's Lies, Damned Lies and Statistics talk here in a couple of hours' time, where he digs into some of the issues around data analysis. Okay, the second guess-the-weight exercise. I've got an English Springer Spaniel; you get some sizing evidence from those photographs, and more appear in some of the subsequent slides. The second link is bit.ly, keynote-ada-2: just go there and give a second guess for her weight in kilograms, a number only. I'll let you look at those lovely photos for just a minute. I'm going to move on, but you'll see a few more photos, so you can make a guess in a minute or two, when you've seen a few more. Oh, and that's my dog, who clearly I upgrade with sensors as well: that's a video camera on her back, from one of the experiments we run on her. So: updating outdated medical results. This was a talk given at PyData Warsaw last year by Anna, and it was a really nice lightning talk. I didn't realise this, but it turns out that in birthing centres, maternity units, when a woman is coming up to giving birth, there is a critical curve, developed about 60 years ago by Friedman, which is used to judge whether the woman is on track to give birth, based on time and cervical dilation. At 10 centimetres the baby is ready to come out, and so you want to track that the cervix is dilating appropriately over a period of time. If the woman is progressing too slowly, a "failure to progress" I think is the technical term, then you need to intervene to make sure that the baby comes out successfully. Hospitals around the world typically use the Friedman result from 60 years ago. But that result was developed when we had different technology: women gave birth at different ages, women had different levels of health, drug and mechanical intervention were very different, and our understanding of bodies was very different. Yet 60 years later we still use the same guidelines, and increasingly around the world there is discussion about whether this is actually wrong. Anna was part of a team looking into how it might be wrong and how it might be fixed. The important point is that when a doctor chooses to intervene because of a failure to progress, that nice phrase covers either drug intervention or perhaps a caesarean operation, which could have significant negative impacts on the patient and on the baby. And then the question is: do you need to worry at this stage, or are we actually intervening too soon? So she and colleagues
conducted recordings on a couple of hundred, I think, first-time and second-time mothers; the link is in the appendix and you can go and watch for the details. They recorded the cervical dilation of all of these mothers over a number of hours, about 12 hours I think. In those box plots, the boxes represent the majority of mothers' readings at each hourly bar. So at the one-hour point, cervical dilation was between zero and three centimetres; by the four-hour point it's between three and eight centimetres; and typically by, say, six hours, at least some of the mothers have reached the 10-centimetre dilation, the baby has been born, and they're finished, while other mothers are still progressing in their birthing. This centre doesn't practise caesarean operations or drug intervention, so they typically see all of their mothers through to successful delivery without intervening, though there are medical facilities nearby if an intervention were required. Lots of other hospitals follow the Friedman curve and intervene early if they believe it's necessary. The red line is the Friedman curve: if a mother is above this curve, she's progressing either on track or faster than expected, and that's fine. If she's below it, and on the right-hand side those black dots below the red line might each be one or more mothers, then she's not progressing fast enough, and that's when a doctor has to intervene according to the classical results. But all of these mothers had no intervention and gave birth successfully. So this is part of a growing body of evidence around the red line showing that this intervention strategy is, or could be, inappropriate, and that some refinement is required to improve the quality of healthcare for these mothers. So what did they do, having graphed this and shown it? They took an extra step: can they give an interpretable result that staff in the healthcare unit can use? They used some machine learning to develop a decision tree. From a machine learning perspective that's an incredibly trivial result: a really simple, old-fashioned, single decision tree. It's not deep learning, it's not big data, it's none of the buzzy things, but it is an incredibly useful result, because it's interpretable by the staff in the birthing centre. It's effectively a flowchart saying: help me make a better decision than what's available in the textbook. You can see, if you're a first-time mother, go to the left side, then based on your weight go left or right, and then based on your height go left or right; and then, if you're not progressing within the indicated time, that's a secondary bit of evidence suggesting that maybe an intervention is required, or, actually, you're under the time and everything still looks sensible. This has been introduced to the staff there; they like the idea, they want to do something with it, and they're doing further experiments. So what are our lessons here? Well, check for outdated assumptions. Many of you work in organisations that are large and old, organisations with historic baggage. Maybe some of that baggage is outdated; lots of it probably is, and some of it could be fixed just by reviewing the data you've got available. Maybe you could make better decisions, and maybe that saves time or money or improves interventions, or whatever the metric is you want to use. People forget to go and check on these outdated assumptions; they just become a matter of fact. But if you've got access to the data, because you've got access to a database or an Excel spreadsheet or whatever it is, you can interpret that evidence in a way that helps make better decisions. And one of the important outputs there is to make interpretable advice: don't make a really
complicated system just because you could; instead, go and make something that is interpretable by your colleagues. One of the big challenges I've been talking about in my public talks over the last couple of years is interpreting machine learning output, so that you can go to a non-machine-learning colleague and explain why the system is saying a certain thing, and that flowchart is exactly the kind of output you want. If you want to make a guess for Ada's weight, having seen some more photos, now is about the time to do it: keynote-ada-2. I think we've run out of pictures as I go on to the last little story. So this is the last of the stories before we do the little demo: where are the orangutans? My colleague Dirk Gorissen runs the London Machine Learning Meetup, which is a rival to my PyData meetup, but it focuses far more specifically on machine learning and advances in machine learning. It's a similarly large meetup, very, very popular, hosted in the same hedge fund, AHL, who host my meetup. We're both super grateful to that company for hosting us: at the meetups that Dirk and I run there are about 200 attendees every month, fully hosted, which is the size of a small conference, for free, every month. That's a lovely example of community contribution helping us progress our own goals. So Dirk runs this machine learning meetup, and he's got a personal project. Some years ago he was involved in a commercial organisation looking to track animals in the wild, to see if you could intervene and monitor to provide better care for animals in the wild. That company didn't work out, but he managed to acquire some rights to carry on working with the underlying technology, and he found a charity who wants to work with it, specifically around orangutans. It turns out orangutans, which are very bright primates, are kept as pets; then they get bigger and less cute, and people just get rid of the pet. They live in areas that suffer deforestation and farming, and they can be mistreated. So you have aid agencies, that's the picture in the middle there, going out to re-home the animals that have been found. One of the problems with re-homing is: once you've re-homed, how do you know you've done it successfully? How do you know that the animal is happy, has integrated into the new environment, and that your strategy for re-homing is a good one? If you can demonstrate success, you're likely to raise more funding, and if you can't demonstrate success, you've got a problem placing these animals into a nice environment. The way you do this is you take a little radio transmitter, the device on the right, and embed it in the body, under the skin. You can't put on a big tracker: these are very bright creatures that don't want a big bracelet strapped onto them, and they don't have necks, they've got these big, thick, stubby bits, and whatever you try to adorn them with would have to survive years in a rugged environment with an animal who's not afraid to be a bit heavy-handed. So they put these subcutaneous trackers in. One of the problems is limited range: you've got a radio tracker that gives out a weak signal, and the way you track it is that a human turns up at the point where they saw the animal yesterday, that being their best guess as to where the animal is today, and walks around with a radio tracker. If it starts beeping, brilliant: that means there's a signal within 200 metres in dense jungle, and they walk back and forth trying to make the signal stronger. If they get the signal stronger, hey, they've found the ape, brilliant; they'll do it again tomorrow. At the beginning, when they release an animal, a team of two tracks it 24-7 for several weeks, and then it becomes more intermittent; and coincidentally they discover other animals that were released and can start tracking them too. But it's kind of bitty and
it's really time-intensive. So can we automate this? That's Dirk's project: can we use drones to automate it? A really sensible idea. Can you send a drone back and forth across the sky with a radio receiver, picking up the radio signal, processing it, and then providing some kind of GPS locations? Really sensible. It turns out that doing this on your own, when you don't have a background in, for example, radio signal processing, drone dynamics, automated flight systems and the like, means that it takes some time to build up. Now, Dirk's a very smart guy; he also works on autonomous self-driving vehicles at a large, funded company, so he does have a good, strong background in engineering and robotics. But nonetheless, building a drone to fly over a jungle autonomously is a non-trivial operation. If you watch his keynote talk, he talks about the Python-powered software-defined radio behind this, because they have to pick up the raw radio signals over quite a wide spectrum and then do post-processing. Things like the humidity in the jungle affect the signal propagation and the wavelength being used, so they have to process the data to find these pings; there is no simple API that just finds the pinging device, they have to do the raw processing themselves when the drone comes back. That means you send this drone off, it flies a flight path, it comes back, hopefully, and then you can process what it has recorded. It's not a real-time system, which can lead to some problems. So here are the results from one of their test runs. They were releasing an orangutan called Susie; they knew where she was, being led away by a keeper to be released. They took the drone, and I'll show you some videos of this drone in just a minute, set it off, and it starts flying. In the middle top diagram, the green one, you can see those black dots: it's basically flying traces, up and down like flying up and down a field, but over an area of jungle. Then it gets to the bottom and flies straight back up: it's returning home. When it gets home, you can process the data. In the jungle, when they were there, they discovered they had to fly the device lower because the signal quality was worse, but nonetheless you can see areas of poor signal and then bright, strong signal, and that's where this orangutan was. They had a successful test flight. So then the question was: how do we take this further and do some more work? I'll show you two videos. There's no audio; there should be audio, but it didn't want to work, so you're going to pretend that there's a buzzing sound, because that's all the audio really is. So that's the drone that Dirk is using, and what's recording it is a professional drone with a camera rig, so it's incredibly stable. You can see the other drone in the background; it's now going off on an autonomous flight, just a calibration run. It flies off: this is out in the jungle, but on the edge of the jungle, in a very safe area where they're developing. So this thing flies out, all very sensibly, and then, I think we see it somewhere in here, you can just see the shadow coming down in the middle: that's the drone going down to land. So it flies autonomously, brilliant, in a nice, wide-open area. Then you get to the release site. This next video is from the drone itself, Dirk's unit, flying up; this is one of the test runs, and it's a test run because it came back and they got the video. So this one flies up, and you notice the hole in the canopy, and then you notice the canopy around it: no other holes. So when this thing flies off, it has to fly back to exactly the right spot to come down and land, and it's going to fly off on quite a large route, tens of kilometres, with no radio link. So this thing flies up, and I think it flies around a little bit. You get the idea: dense
jungle. You can't see the orangutan from the ground; you can't see the sky, so you can't see the drone. But then this one comes all the way back, and it comes in and lands again. So this is nice: it comes in, it successfully lands, everybody's happy. They're ready for this. They know there's an orangutan out there. (Dirk was out a couple of weeks ago for a second run, but this was a year ago.) They knew roughly where they wanted to send this device, so they say go. The device takes off; it's got a little signal tracker, so he knows the device is in the nearby area. It flies off, they hear it go, and it's going to be gone for some period of time. Fair enough. So they wait, and then the signal tracker shows a bit of signal, so the device is coming back. And then there's nothing. And they wait, and there's still nothing. So they decide: right, we've lost the drone. It's quite an expensive, big bit of kit, and it's disappeared in the jungle somewhere. It turns out, looking back at the maps, they thought they had a flat elevation as they were flying the criss-cross pattern, but when the unit flew back there was some kind of knoll in there, so the trees were higher, and the device flew through and crashed, and that was that. Some months later, it turns out, the aid agency found the remains of the drone crashed into a tree, which was exactly what had happened, and they sent it home. So that was a disappointing first result, but it did prove that this thing works. And if you follow the keynote, you'll see that Dirk had lots of problems even getting lithium-ion batteries out of the Eurozone: he lost some of them along the way when they were captured by customs, and once you're out in the middle of nowhere, you've only got so much kit you can carry, and then something else breaks, and you have to jerry-rig parts that might just about keep it going. So it's quite difficult to keep this kind of thing working. But the aid agency funded another device, and they went out and did a test run again. I was hoping to get some video of that second run; Dirk tells me it worked better this time and he got the device back, but he doesn't have successful results yet. They're going to continue with this project, and there are links in the appendix where you can read about it and follow where it might go next.

So: hardware is hard. Hardware really is hard. But freeing up human time is valuable. If you could free up those tracking humans, who wander around with a radio device just listening to it, and let them intervene more successfully and track more animals more consistently, they can only make better decisions with that kind of result. So if you tackle any hardware kind of problem, always expect to iterate a lot, and always break it down, like that handheld air-quality monitor project, into tiny stages that are achievable.

So now we're going to do the live demo; we'll see if this works. I'm a little bit nervous, because now I've got to fetch the data from your surveys. If you remember, I asked you how heavy Ada was, first without showing you any evidence of what kind of dog she was, and then after showing you that evidence. So we should see two different distributions of data, and maybe we can learn something from that. As a data scientist I use Jupyter notebooks; this is the new JupyterLab interface. It's a web-enabled, interactive Python environment where you can do charting and graphing and 3D plots and JavaScript, and you can query SQL and big-data systems and CSV files and anything that you need, and you can develop in a way that makes for easy demonstrations. If you've never used it before, I recommend you give it a go. You're going to recognise some of the code, but I'm not really going to go into the code that's here. I'm just going to load in... let's see if over 4G I can...
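The notebook code itself flashes past in the demo, so here is a rough sketch of what that load-and-clean step might look like in pandas. The column name `guess_kg`, the inline sample answers, and the exact bounds are my stand-ins for illustration; in the talk the data came down from a Google Form over 4G:

```python
# A minimal sketch of the demo pipeline: load free-text survey answers,
# coerce them to numbers, clip to a plausible range, and compare medians.
# The inline CSV samples below stand in for the downloaded form responses.
import io
import pandas as pd

before_csv = io.StringIO("guess_kg\n15\n20\nheavy\n12.5\n200\n8\n")
after_csv = io.StringIO("guess_kg\n10\n13\n9\n17\n11\n")

def load_guesses(src):
    """Robustly parse answers: junk like 'heavy' becomes NaN and is
    dropped, then values are clipped to 1-60 kg (a large Rottweiler)."""
    raw = pd.read_csv(src)
    nums = pd.to_numeric(raw["guess_kg"], errors="coerce").dropna()
    return nums[(nums >= 1) & (nums <= 60)]

before = load_guesses(before_csv)
after = load_guesses(after_csv)
print(before.median(), after.median())

# Two unequal-length sequences can still share one DataFrame: the
# shorter column is padded with NaN, which plotting happily ignores.
df = pd.DataFrame({"before": before.reset_index(drop=True),
                   "after": after.reset_index(drop=True)})
# df.plot.hist(alpha=0.5)  # overlaid before/after histograms, as in the demo
```

The key defensive move is `errors="coerce"`, which turns any unparseable answer into NaN instead of raising, so one creative response can't give you an infinite mean.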
So we've loaded the data; we've got the data files down fine. These are some examples, the last rows from the last time I ran this, but the rest of the data will be the answers that you've put in. Oh, good grief: a mean of infinity and a standard deviation of NaN. This is having put in the most robust parsing process possible, in hope, and last time it ran just fine. That might be annoying; if so, I've got the pre-rendered demo on the other machine and I'll have to improvise the slides. Let's... no. Oh, good grief, who put in a range? Skip that one. Can... there we go. So what I would have shown you, and I will show you (no, don't debug it, don't debug it), is my pre-rendered one. One of the first things you always do is load in the raw data and look at it, and then you process the data to get rid of your outliers and the weirdnesses, so you can look at a version that's hopefully a bit more sane, getting rid of any mistakes that might have crept through. I'm very curious to see what mistakes actually crept through, but I'll debug that offline.

So, having taken out some of the unusual guesses that generated infinite results (thank you, whoever did that), we've got 448 responses in the clipped region, which is pretty sensible. For this clipped region I take any number from 1 kilogram up to 60 kilograms. 60 kilograms is the weight of a large Rottweiler, which is a pretty hefty dog; dogs do go over 100 kilograms, but they're pretty rare, they're bigger than humans, and they're fairly terrifying beasts. My springer spaniel was much smaller, as you saw. So we've got nearly 500 responses, and I'm really happy with that.

An interesting distribution: we get a lopsided, skewed distribution, with lots of guesses on the left-hand side. Many of you are guessing at, what's that, between 5 and 15 kilograms, and there's a spike at around 15 kilograms and a spike at around 20 kilograms. Now, I expected this. If you don't have any evidence to work from, you'll probably pick a round number that's not obviously wrong: 15, 20, 25, 30, at every round point. So we see spikes and this kind of artificial result. And some of you are taking punts much further out on the larger weights; if we looked at the raw data, I'm guessing we would have seen guesses going up to 100-plus, because it's not unreasonable to have a dog that heavy, it's just unlikely. So we've got this skewed distribution with a median guess of 12.8 kilograms: if we sort all of the guesses in numeric sequence, the median is at the 50th percentile along that sequence, and that's 12.8 kilograms, which is a reasonable guess not knowing anything in advance.

What happens when we introduce some evidence? How do your guesses change? Let's load in the second one. So: 412 guesses in the second case, with a median of 12. It turns out you're not all dog fanciers, because you're all wrong, or at least a lot of you are. Here we see the distribution, and this is what I wanted to see: given more evidence, the distribution has closed down. Those of you who would have guessed higher have probably come lower; those of you who guessed very low might have guessed higher. So the distribution has narrowed a bit. It's still skewed, with a lot of weight on the left-hand side and a longer tail to the right, and the median hasn't changed very much, which is kind of interesting. But the spikes we saw at 15 have disappeared, and there are spikes just under and just over 10 kilograms, which means you're guessing 9 kilograms, 13 kilograms, which isn't crazy at all. She is a smaller dog, and if you're not a dog owner it might be hard to guess her weight. It turns out she's actually 17 kilograms, but you're not too far out there. Now, one thing we could do, if we want to start comparing your results, is to take these two individual sequences of numbers and put them
together into a DataFrame: multiple sequences of numbers, a bit like an Excel spreadsheet with multiple columns. So I combine these two, and then we can look at them. Because there are fewer results in one than in the other, one of them has some missing numbers, and that's fine from a graphing perspective. If we just draw these and overlay them, we can see a simple visual comparison of your before and after guesses. The blues are the before, and the orangey-red one is the after. What we see in the blues is that, as we go to the right-hand side, more of you are making larger guesses, particularly at those round-number points: we see 15, 20, 30, 40 and around 55 jumping out. Then, once you've got some more information, your guesses have come towards the left-hand side, towards the lower numbers, and we see a greater volume of guesses around the 10-kilogram point, all bunched up in there. So the wisdom of the crowd is kind of working here: you've made sensible guesses. You're not dog fanciers and this is not some kind of dog show, so you don't have great information about what the weight of a dog might be, and you're not spot on; the correct answer is somewhere in here, which is a low point in the result, which is interesting. But the purpose of this was to see the variance of the result shrink, rather than to see exactly where the median or the mean guess might land. So I'm really happy with that demo. Hopefully it's shown you the process, and inevitably it's raised some new questions, which drives you back to the beginning to get more data and draw more graphs, so you can go round in a circle.

Okay, so it's time to wrap up. Closing thoughts: it's all about collecting data, visualising it, and then sharing your results. There's an awful lot of hype about big data, deep learning, and the cleverest, smartest next thing coming, but almost all of my work with clients involves finding the data they thought they had, fixing it up into a form that's useful, drawing graphs, interrogating people about what the graphs actually mean, providing some results, and then iterating, making things slightly more complex, and iterating and iterating. It's all about getting the data and visualising it, and you've all got access to data. There are data sets in the appendix; you're very welcome to go and follow those when the slides go online and find some data sets if you don't have access to your own. But working on data you understand is the right way to go: that domain knowledge is incredibly important, and only you have the domain knowledge about your own data.

I have a request. If I've made you think about something new, if you're interested in this topic, and if you want to go and make some change in your own environment, I'd love to get a postcard. I've been collecting postcards for the last year; they remind me that these talks actually work, that they make people think about what they're doing. So I've got a lovely collection of postcards at home. If you would like to send me a postcard, ask me for my address. I don't care when you send it or where you send it from; I just like getting postcards with nice messages saying, hey, you made me think. So if that's a thing you would like to do, please get in contact.

And more importantly, if you haven't yet thanked an organiser and a speaker here, please go and do so. Many people forget that these are volunteer-run events: the speakers put a lot of time in, the organisers put a lot of time in, and people forget to go and say thank you. We shouldn't consume from the ecosystem without contributing back, even just to say thank you. We're a lovely group here, so please go and thank the people around you for the work they've put into this. There'll be a write-up on my blog. Thank you very much.