Good afternoon. My name's Dr. John Graves, and today's presentation is See New Things and Act. It's a talk about big data. I'm substituting for Ned Letcher, who was going to talk about building front ends in Dash, and I really wish he had been here, because I would have loved to see that talk. Who's used Dash? If you've heard of R Shiny, it's the Python version of that: it lets you surface an interactive dashboard on a website, which can be very useful for sharing data analytics insights. Since today's talk is a substitute talk, I have to give you a little bit of background. It's also the sequel to an earlier talk. I've been working at this company Curious for a little over a year now, and I have a PhD from this institution, AUT, as well as an MBA in finance. Curious is a Spark venture. We ingest about three billion event records from roughly 2 million unique mobile phone and tablet devices that are in use in New Zealand on any given day. So that's quite a torrent of data that we have to deal with, and what we need to deal with it are algorithms. My earlier talk had a bumper-sticker takeaway message: better algorithms, better performance. I find that algorithms are something that the business people we often deal with as clients don't have a good grasp of. I'm sure you all understand them, but I'd like to give you this hand-computing tool to help you explain algorithms and the performance of algorithms to those less fortunate. So first of all, let's just start off with: what is this number? Four? No, that's wrong. The correct answer, I think, really should be "it depends", because the algorithm you've used to determine that these are four fingers is that each finger counts for one. Now, wait a minute: that's a one, that's a one, that's a one. If I had the number 1111, what's that number? 1,111.
As soon as we take the same number and put it on a piece of paper, we use a different algorithm, which involves this notion of place value. Place value is a truly revolutionary method of dealing with numbers, as is the concept of zero. What is zero in the scheme of place value? It's a placeholder, right? Now, we all looked at this immediately and said those are four fingers. Well, it didn't always used to be that way. You could also count the joints of your fingers. So what would this number be using that algorithm? 12, exactly. Now, have you ever given any thought to this odd situation we have in the world where the day is divided into how many hours? 24. Do you think the 12 on the front of your hand and the 12 on the back of your hand might have contributed to that? And look, we've got five fingers on the other side. Five times 12 is? 60. There are 60 seconds in a minute and 60 minutes in an hour. The fact that we use these numbers to keep track of the minutes and days of our lives relates literally to our anatomy, the way our hands are shaped and divided into 12 joints. 50 years ago in this country, people used a kind of coin called the shilling. Any idea how many shillings there were in a pound? 20. There you go. Now, when you start working with numbers that have place value, as computers do with binary, you can use your hand to count to something other than four or 12. You can count all the way up to 15. All right, let me take you through this: place value in the binary system. Each finger is worth twice the value of the one before, so we have one, two, four, and eight, and each finger is either down or up, zero or one. So right away, if you add those numbers together, eight plus four plus two plus one is 15. Count with me to 15 in binary with your fingers. Ready? One, two, three, four, five, six, seven. So if you've gotten up to holding three fingers, that's seven. Your pinky: that's eight, nine, 10, 11, 12, 13, 14, 15. Yeah, it takes a little bit of practice, but you can win a bar bet with this.
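If you'd like to check the hand arithmetic in code, here's a minimal Python sketch (my own illustration, not part of the talk) where each raised finger contributes its binary place value:

```python
# Binary hand counting: index, middle, ring and pinky carry the
# place values 1, 2, 4 and 8; each raised finger adds its value.
FINGER_VALUES = (1, 2, 4, 8)

def fingers_to_number(raised):
    """raised is a sequence of four 0/1 flags, index finger first."""
    return sum(value for value, up in zip(FINGER_VALUES, raised) if up)

print(fingers_to_number((1, 1, 1, 0)))  # three fingers up -> 7
print(fingers_to_number((1, 1, 1, 1)))  # all four fingers -> 15
```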
But you can do even better. Does anybody work with the colors for web design, where you're always seeing pairs like FF and AB? We've got two algorithms now, one using place value and one counting with joints. If we combine them, we've actually got base-four numbers to work with: each finger now counts 0, 1, 2, or 3 using its joints. Now, this is even more complicated, but you're going to do this: we're going to count to 15 on just two fingers. The place values are now powers of four, which are 1, 4, 16, and 64, and each finger holds 0, 1, 2, or 3 of them. So let's think about this for a minute. The ones finger gives 1, 2, or 3; the fours finger gives 4, 8, or 12; the sixteens finger gives 16, 32, or 48; and the sixty-fours finger gives 64, 128, or 192. So 192 plus 48 is 240, plus 12 is 252, plus 3 is 255. Now, 255, I'm sure you all agree, is a very important number that everyone should know: it's the maximum value of one byte of information. How many of you knew that you could use your four fingers to count up to one byte of information, eight bits? We'll do the four bits on one finger here. Ready? 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. I'm sorry, I forgot to count it off for you in hexadecimal; you should have stopped me. 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. FF. What color do I get if I do FF for red, FF for green and FF for blue? White. Yes. You can do it with your own hands. So it turns out that over the course of the last, well, gosh, it's going on 60 years, the world of computer science has expanded to cover every letter of the English alphabet. By the way, in that one byte of information, 65 is the capital letter A. And you know that the lowercase letter a is 32 higher, not 26 higher, so 97 is your lowercase a. And you can keep counting. There's a number for every letter. There's a number for each of the colors, as we've seen. There's a number, in just 16 bits, for every level of a sound sample. And then it wasn't until 1996 that we wound up coming up with numeric codes for every symbol of every language.
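All of the number facts above are easy to verify in Python; this sketch simply restates them as code:

```python
# One hex digit is four bits, two hex digits are one byte.
assert int("F", 16) == 15
assert int("FF", 16) == 255  # the maximum value of one byte

# Base-4 counting on four fingers: three of each power of four.
assert 3 * 64 + 3 * 16 + 3 * 4 + 3 * 1 == 255

# ASCII: capital A is 65, and lowercase a is 32 higher, not 26.
assert ord("A") == 65
assert ord("a") == ord("A") + 32 == 97

# '#FFFFFF' is full red, full green and full blue: white.
print(int("FFFFFF", 16))  # -> 16777215
```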
So Chinese characters, Egyptian hieroglyphics... but this is getting to be old news, right? What all of these capabilities added together have done, along with Moore's law, is make our mobile phones capable of doing much more than the original analog cellular phones. We can now read anything. We can listen to anything. We can take pictures of anything or make movies of anything. Your iPhone and the 4G networks, and what's coming with 5G, give us a ubiquitous kind of computing where everything is big data. And that's why we need skills and an understanding of how this data can be manipulated. That was my first talk, the better algorithms, better performance one. And now I'm ready to give the talk that we're here for today, which is See New Things and Act. See New Things and Act is really a digest version of this book called Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. So, who's used Google Search? Who's used Google Search for something that you might not want other people to know about? The premise of this book and this author's research is that Google Search is actually the best anthropological tool that's ever been invented, because unlike a survey, where people asked what they use the internet for say news and shopping, when you actually look, it's porn. So everybody lies. He maintains that there are four superpowers of big data, and I've got a little shorthand way of remembering them, which is See New Things and Act. I'm just going to take you through each one of those. Seeing. What big data enables us to do is to look, as Google does, into your Google location history. Who uses Google Maps? And who has enabled this feature and gone and looked at their timeline? For those who haven't, this example is my travels around New Zealand in the recent past, and each one of those red dots is somewhere I've been. The other talk I gave was down in Wellington.
The big data that Google has collected about my location history... Grant, are you in the audience? My colleague. I gave him my location history recently, and I think it was about 12 megabytes: my location every five minutes for the last five years. A fairly significant amount of data, and that's just me. So big data gives us the ability to see this big picture. This is everywhere I went. But it also gives us the ability to drill down. You can pick a particular day, and here you can see that on that Friday evening I went to that particular restaurant. Now, I think it's really quite a remarkable feature of Google's algorithms that they're able to see not only where I live and where I work, which have actually been labeled on the map, but where I ate dinner on Friday. And if it knows that fine a level of detail about me as an individual, and it has this big picture of my overall activities, and it can do this for the entire population, then it has the ability to create insights about the big things, the big patterns of movement and who's where when, but also this long tail of very unique kinds of observations: where a particular person went to eat on a particular day, or the set of people who ate at that restaurant. So this is a superpower of big data, unlike a traditional survey, which has to consider just a sample and might not pick up some of the patterns in that long tail. New. What's new about big data, and the fact that in the case of Google we're able to get down into the behavior at a particular place, is that we can create a new kind of insight. Have you googled up a restaurant or destination and had this part of the search results tell you when that place is going to be busy and how long people typically spend there? This kind of insight wouldn't have been possible without big data, unless you had somebody sitting there with a stopwatch keeping records for all of the visitors to every single restaurant.
But Google can do this because of their repository of specific individuals at specific restaurants, aggregated and then analyzed from this time perspective. New insights are possible with big data. Things. As the author maintains, when you look at what people actually do, what are they searching for? His conclusions in the book were kind of startling. He said that Americans have the desire to be politically correct, and if you ask them if they're racist, of course they'll deny it. But if you then look at the search histories, you find that there are actually people asking: where is my local KKK branch? What can you tell me, Google, about white supremacy? And then he was able to pin down which parts of the country are more likely to have groups of people making those kinds of searches, and darned if that map of the United States didn't overlap almost precisely with the parts of the country which voted for Donald Trump. So Things is the ability of big data to get at what people actually do. Which web page do they click on? Which button on which page do they click on? Or, in the case of Google and their location data again, if I wanted to go from Eden Park to that particular noodle shop, Google would tell me how busy the road was going to be along that route. Again, it's a behavioral insight that's only made possible by their continual polling of your device's location, even as you're driving along in your car. So I maintain this is the biggest advantage of having big data: you can use this tool. Google Analytics is by far the most used analytical tool on the planet at the moment, because you can get at what drives people's behaviors, and you can optimize for those behaviors. And I think what's going to happen in the very near future is that the kinds of optimizations Google has traditionally been able to offer for online interactions are going to spill over into the real world.
And we're going to find that the time it takes for a driverless car to arrive at our address when we want to use it is going to be significantly decreased by algorithms which are able to process this flow of information about when we want to move, and predictively put vehicles where we want them to be, and so on. Everything about supply chain logistics has long been mediated by algorithms, but with the internet of things this is going to expand into even more parts of our lives. Despite what I would say are compelling reasons for there to be big data in virtually every business, so that you can see the new things and act upon them, businesses by and large haven't adopted big data technology. But I think there's something new coming along that could dramatically change this picture. Everyone's familiar with AlphaGo? All right, so this has been alluded to. Back in March, DeepMind's program took its place in a long tradition: IBM originally created a checkers-playing program which beat the best human checkers player, then came Deep Blue beating Kasparov in the game of chess, and then IBM ventured into natural language processing with Watson and won in the game of Jeopardy. Then it was Google's turn to come up against what many thought was going to be a very difficult challenge: winning in the game of Go. I mean, how many of you, looking at that board, would know where to place your next stone? That's a little complicated. But they didn't stop with beating the best human; they took it one step further. This paper was just published in October, and the blue line that you see working its way up and beyond the purple line is the performance of a second Go-playing system called AlphaGo Zero.
And I think some of the commentators seeing this result have suggested that this is another milestone in the progress of artificial intelligence, and an indicator of where we may be going in the future with regard to machine learning. The system which beat Lee Sedol, the human contestant, was trained on the games that master human Go players had played, as well as additional training. AlphaGo Zero, and the reason it has such a terribly low score, this yellow rating here, is that it began with random moves. It never looked at a single human player's game, but began playing against itself until it mastered all of the strategies. With reinforcement learning, for those who aren't familiar with the concept, you don't actually know how to update the weights in your model, or how to improve it, until you've actually reached the end of the game, and then the reinforcement gets propagated back through all of the strategies and techniques that were used to win. Within three days, AlphaGo Zero had taught itself a game that has existed in humankind for thousands of years. Millions of people have played it and tried to become masters of it, and here this computer has mastered the game from scratch in three days. It did, of course, play nearly five million games against itself in that time. So practice makes perfect, right? The bumper sticker for this talk is See New Things and Act. The other talk's was better algorithms, better performance. But I think the real takeaway message here is that, for all the superpowers of big data, the future lies in learning from data. Big data is said to be the new oil; I believe it's learning from big data that is the new oil. And with that, I'd really like to open this up to questions, since I'm a senior data scientist at Curious and I realize we have some people that are curious about Curious, or about data science. We are hiring big data engineers at the moment, I could say as well. So, anybody have a question about data science? Yes.
Okay, so the question is where do you get data from. In the case of the founding of Curious, we're a Spark venture, and the premise was that we would get data from cell phones. Each cell phone connects up to a tower every time it's communicating: any SMS message, data request, or phone call, though of course we don't get many of those anymore. That data all gets fed over to Curious in an anonymized way, and then we aggregate it into insights. So we've had a product in the market, particularly around tourism. Tourism is New Zealand's largest export business at the moment; the biggest part of the economy relates to the visitors that we get in New Zealand. And I was very surprised. I thought, okay, we're looking at individual unique devices that come onto the network on any given day, and New Zealand has a population of, call it 5 million. How many unique devices do you think are in our database? We've been collecting data for about two years. 20 million is actually a good guess. Right at the moment it's 14 and a half, and the reason is we get plane loads of people who, as soon as they step off the plane, the first thing they do is look at their cell phone, and that means it connects up to the Spark network and we get a record in our data. Then two weeks later they've gotten back on the plane and taken off again, and another 10,000 have arrived, and another 10,000, and that happens every single day in this country. Consequently, over the course of years, we get way more foreign visitors than we actually have domestic residents. How many of you have multiple SIM cards? Right, so there's also the phenomenon of churn in the subscriber base, and some people actually have multiple phones and things like that.
The question about where we get our data from, though: it ultimately comes from our other clients, and so Curious has served government, finance, healthcare... we're really across the board. Then we look to open data sources to augment that, so we pull in things like the census and weather data, and different kinds of analyses always involve different kinds of data, which is part of the fun of it. But we do have a significant ETL kind of process, along with an analytics process, and then a data delivery, insight process, which is why I'm interested in Zappa. I can make a chart or a graph now and Zappa it into AWS Lambda, and then I can just send somebody a URL and they can look at that or interact with that. In the back. Hi, I was just curious to know what you take away from the AlphaGo Zero experience. I look at that and I think to myself, Go is a game that I would struggle to play with any competency, and here's a machine that in the space of four days is beating the best humans on the planet, so that's enormously impressive. But at the same time, Go, it seems to me, is a very closed domain. It's a very difficult thing for a human to do well, and yet at the same time it's a very limited sort of thing, really. I'm just curious to know, do you draw a line from that to, I don't know, an android walking down the street doing everything that a human does as they walk down the street? Because it seems to me that is a massively more dimensional task than playing Go. Yeah, the question gets at how general we can be with artificial general intelligence, and the reality is that we start small.
We start with solving problems that are seemingly intractable, like playing checkers, and then playing chess, and then playing Go, and then recognizing objects within images, and then having the recognition of those objects drive the controls of a motor vehicle which used to employ somebody called a truck driver that we now don't need anymore. So when you say you're concerned about how you get from solving Go to something that matters to us in the real world, that's the way I make the connection: it's an accumulation of skills. One analysis that I read just yesterday suggests humans may be like horses. There was a point at which machines simply had much better capabilities for generating power, providing transport and so forth, so that the horse population, at least in the United States, which is what the study used, dropped by 90%. And it may be that the work that people are currently doing could disappear at a similar rate at some point in the future, as the skills of the computers multiply. I really think it's a policy question and something that we're gonna have to face at some point. The thing that excites me is that we're all here. We all are able to think about this and gain the skills that will enable us to build some of these systems, which will, yeah, maybe put a few people out of work, but we're also building systems that are gonna help people to learn, and we're gonna help people use 3D printing to create the goods and material wealth that we can benefit from at a far, far lower cost. There are ways in which the world is gonna be transformed into a world of abundance, and you just really may not need to have a job if pretty much everything you need can be created with solar power driving automated equipment at very, very low cost. Yes. Hey, my question's maybe coming back a little bit to the data science. You mentioned you get your information from Spark. Yes.
So given there are two other carriers, do you just kind of extrapolate, or do you assume that maybe customers of other networks behave a little bit differently, based on maybe they're cheaper or more expensive? How do you kind of fill in those holes, I guess? Yep, that's an excellent question. We typically do just apply an extrapolation factor to cover off on the fact that Spark and Skinny only have a limited market share within New Zealand. It varies between about a third of the Auckland market and very close to half of most of the South Island. Interestingly, we actually have some insight into the amount of traffic on the other networks, because Spark is the offspring of Telecom: we actually own the whole backbone. So when somebody needs to move data from one place to another, they still wind up coming in through our wholesale network, and so we have some evidence about how often our subscribers are making calls to numbers that aren't ours. That also gives us a perspective on precisely what the relative market share positions are. So we can make some fairly reasonable extrapolations. The thing that gets really tricky, though, is all the people that get off the plane. Remember, we're talking about these international visitors as being the sort of key to the puzzle here in New Zealand, and some of them just use their own phone and roam. Some of them come in and their first stop is the Spark store, we hope, in Auckland Airport, where they pick up a SIM. Then how do we tell that that is actually an international visitor? That has been a very interesting problem for me: trying to come up with what we call the international local SIM segment of the market. It turns out that, for the most part, I can look at those devices where the SIM identifies the country of origin.
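As a hedged sketch of what such an extrapolation factor might look like (the regional shares are just the rough figures mentioned above, and the function is my own invention, not Curious's code):

```python
# Scale devices observed on the Spark/Skinny network up to an
# estimate for all carriers, using assumed regional market shares.
MARKET_SHARE = {"Auckland": 1 / 3, "South Island": 0.48}

def extrapolate(observed_devices, region):
    """Estimated total devices in the region across every carrier."""
    return observed_devices / MARKET_SHARE[region]

print(round(extrapolate(100_000, "Auckland")))  # a third scales up to 300000
```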
I know that they are a Chinese phone, and then I can say, all right, the Chinese phones go to the casino, and then I can say, okay, here's another local SIM that went from the airport to the casino: it must be Chinese. It's an assumption; there could be some locals who do the same thing every time they come to Auckland to gamble. Next. John, just going back to the gentleman at the back, following on from his question: don't you see a crude correlation involved in what you're doing, and policy ramifications in terms of non-rational discrimination, in terms of the context-dropping of things you know would have an impact, but you're not including them, because either you don't have the data, or because you've got to keep your data set simple enough, or something of that nature? You gave an example of Trump earlier, drawing a correlation. Oh, right, yeah. So, not to imply that every Republican is racist, no. The fact of the matter is that big data has been fingered already for this. It's the white Anglo-Saxon bias: we just don't have that much representation of minority voices in the building of artificial intelligence at the moment. It's just a fact, and consequently the systems that are being designed just sort of make assumptions, almost subconsciously. It's like we haven't gone to that extra effort to protect against our own prejudices. And I think one of the things that has always impressed me about PyCon and the Python community is our emphasis on diversity, and PyLadies, and bringing additional voices into this discussion. But I think it needs to happen even more, and if we don't do it, there are gonna be people who get hurt, kind of unintentionally, just as a result of systems that are meant to score mortgages and help a bank optimize their loan decisions marginalizing a group of people simply because they have the wrong skin color.
I just wanna ask how you deal with noisy data and what sort of processes you use for data munging. All right, noisy data is a fact of life. Every data source that we get has noise in it, has missing values in it, and oftentimes has this distinctly challenging problem: it's data that some person put in there that they thought was right at the time, so it looks like it could be good data, but it oftentimes really isn't. And so you begin to realize that computers have a preference for things that are systematically, reliably, and repeatedly true, and this is why the internet of things could be such a big deal: that sensor telling you the temperature in this room, and how much electricity the air conditioner used to make it that temperature, is something you can measure as many times an hour as you like, and as a result the decisions being made based on that data can be modeled; it can be something predictable. So when we get junk, one of the simplest things we do is try to abstract away from it, so we don't look at things at the granularity of the data as it's delivered. We just make histograms of the values that have been supplied and bin them, and then you look at the stuff that's in the outliers and maybe decide those values aren't even appropriate for your analysis. But you get an idea of where the signal is and what is likely to be noise, and then you come up with, effectively, a compromise solution that addresses that question. What is your software stack for storage and compute? All right, we have a variety of methods for going from raw data to insight, but Curious is really built, in terms of its product delivery pipeline, on Hive and Hadoop. So we do have our own on-prem Hadoop cluster, and there's another part of our business that uses the Google traffic data that we looked at earlier; we can generate insights out of making calls into that API, and all that's now on AWS.
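The histogram-and-outliers approach described above can be sketched in pandas roughly like this (made-up numbers, and a simple interquartile-range rule standing in for whatever the real pipeline does):

```python
import pandas as pd

# Made-up readings; 42.0 is the kind of junk a person might have
# entered thinking it was right at the time.
values = pd.Series([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 42.0])

# Flag outliers with a simple interquartile-range rule...
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
inliers = values[values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# ...then bin what's left to see where the signal is.
histogram = pd.cut(inliers, bins=4).value_counts().sort_index()
print(histogram)
```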
So we've worked with NoSQL databases, and we have stood up, next to the Hadoop cluster, a Postgres-based Greenplum instance, so we can also do distributed SQL queries into that. I personally really like the combination of Jupyter notebooks with the HDF5 format. Anybody familiar with this one? It's a scientific computing format; it's virtually a memory image, so it's very quick to get it down on disk and back into your pandas data frames. A simple pd.HDFStore sets up the relationship, and then you can store multiple tables in one of the files; it's like a dictionary. So store['df'] = df means that there will be a data frame with that data content in your file. store.close() and you're done. It's a really simple way to get access to your data without having to do a whole bunch of SQLAlchemy kind of magic. My question is around privacy of the users. You're using the data of users who use your network. How do you ensure that their privacy is not violated, or are they aware of the fact that you're using their data for these kinds of analytics? There is small print in the Spark and Skinny terms and conditions. To get to the data, I have a swipe card that gets me into my office, I have a key that gets me into my locker, I have a password that gets me onto my laptop, I have an RSA key that gets me onto the VPN, and then I can ask a query. And if that's not enough security to keep some hacker from coming in from the outside, we have a pen testing service that we use to make sure that our VPN is secure, and we have a guy named Guy Kloss on our staff (remember, Guy is with the NZPUG), and he's one of our security experts. So we take security very seriously. All of the data we get from Spark and Skinny has been anonymized; it's all been hashed. And so while I can see a pattern of movement, I don't know who it is. I can't tie it back to any customer record or even a phone number.
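Going back to the HDF5 pattern for a moment, the pandas usage being described looks roughly like this (a sketch with a throwaway file name and made-up data; it needs the PyTables package installed):

```python
import pandas as pd

df = pd.DataFrame({"tower_id": [101, 102], "connections": [4300, 5100]})

# An HDF5 store behaves like a dictionary of DataFrames on disk.
store = pd.HDFStore("insights.h5")
store["df"] = df           # the frame goes into the file under the key 'df'
round_trip = store["df"]   # and comes straight back as a DataFrame
store.close()

print(round_trip.equals(df))  # -> True
```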
All I know is this hash ID and the series of cell towers that device has connected up to, and again, our goal isn't to track people, it's to look at aggregated insights. We're really concerned with things like how many visitors came into Auckland in the last month, and then what does that look like as a time series? Yes. Okay, Python is my go-to tool for data analytics, and that's not just me. Google recently acquired Kaggle, and Kaggle did a survey with 16,000 responses: Python is the number one data science tool in the world. Yay! And I think Jupyter notebooks are really one of the most significant advances in the sharing of data analytics insights based on Python, although you can now use Julia and R and other kernels within the Jupyter environment. I use pandas, and I use a number of related geospatial libraries, and I think there was a talk about geopandas. You're giving it? Excellent, all right. So if you wanna learn more, go to that talk. And if you wanna learn more about how you actually get your hands on some BigQuery data with Python, that's the following talk. I think that's time for this talk. So thank you.