I was a little surprised, when I mentioned to some people that I was doing a talk at SCALE on Kaggle and R, that quite a few of the people I spoke to who actually work with data weren't too familiar with Kaggle. Can I get a show of hands, who's heard of Kaggle before? Okay, pretty good. And I'm assuming everyone's heard of R. Yeah, okay, didn't need to ask that one. Give me one sec, let me see if I can get rid of this bar at the top. Is that better? Do you guys prefer the bar at the top the other way? Okay.

I have a lot of slides, so I'm going to go ahead and get going and let people catch up. We're going to have to go pretty fast; this ended up being a bit of a lightning talk stretched out over an hour. I ended up with a lot more material on Kaggle than I anticipated.

Since I can never make a decent opening slide in R Markdown, I just went ahead and made one manually a few slides in. There was recently a Kaggle competition where you try to identify humpback whales. I found out about the competition far too late to actually join it, but I was having fun looking at all the images, poking around in R, and doing some edge detection, and I put that together and thought that would be a good way to open things up.

All right. We already covered who's heard of R and who's heard of Kaggle, and it seems like pretty much everybody here has. Is anybody here a ranked Kaggler? Do we have any Masters or Grandmasters by chance? Okay. And how about anybody here on the R core team? You are? All right, now I've got it. Okay, looks like I'm going to have to leave the talk.

So my goal is not to try to recreate all the great material that's out there for learning how to use R. I'm going to basically try to give you the meta-knowledge so you can go from being a beginner to an intermediate as quickly as possible. The "not reinventing the wheel" part is me not showing you basic tutorials that you can find on dozens of websites. We're going to focus on a few important concepts; we're going to cover a fair bit of R terminology, not in depth, just as we use it here; there are a few common misunderstandings I'm going to touch on; and we're going to get into some packages, specifically packages that are useful for Kaggle contests. And that's pretty much it.

I haven't put this up on my GitHub yet, or on the SCALE site; I've been tweaking things. When I originally created it, I had several different scripts that I'm pushing together into an R Markdown document, and that has been a little more difficult than I anticipated. I've intentionally not put the library() calls at the top of the files, so that people who are new to things will see "here's library(reshape2) and I'm using some feature from it; here's ggplot2 and I'm using some feature from it," just to make it a little easier for people to match functions with packages.

All right, so let's jump into Kaggle. Kaggle is an online community; this is the Wikipedia definition now: an online community of data scientists and machine learners. That's a broader definition of what Kaggle is than if you'd heard about it five or six years ago, when people just said "that's the machine learning website where they have competitions." And that's really what they're well known for.
Depending on whether you look at Wikipedia or some other sources, Kaggle was founded in April 2010; that's what it says on Wikipedia, but I've seen other documents where Anthony Goldbloom, one of the founders, says they were founded in 2009. So I don't know if they had a quiet startup period he's referring to and then went public in 2010, or what the story is. In March of 2017, Google acquired them, presumably as part of their cloud machine learning platform strategy. They currently have more than one million members. At the time I pulled data from their competitions page, they had 301 past competitions and 16 current competitions. Since then, three of the competitions went from current to past; last time I checked, it may be four by now.

Okay, why should you care about Kaggle? They have some really cool competitions; we'll get into that in a little bit. If you're trying to use machine learning to further scientific research, they're definitely a great resource, and a lot of good science has happened because of Kaggle. We'll get into some tips from the Grandmasters, but it's a place where you see the papers on different topics in machine learning, and it's kind of the place where these things actually get implemented. A lot of the top people say that until you've done it in a competition, in a lot of cases you don't really understand what's in that paper; you haven't worked through the hard part. It's also a great website for people looking to work in data science, whether you're starting out fresh or already working in a tech position and transitioning over. Another interesting thing they've done recently: they have Kaggle Kernels, and now there's a Kaggle kernel where you can use one of the Titan X GPUs up in the cloud. So you get a nice little setup for free if you don't have your own GPU, or if you don't want to buy one or set one up.

Okay, so here's my plea for the whales. I personally think some of the coolest Kaggle contests deal with the environment, with animals, things like that. In the opening slides I mentioned there was a humpback whale identification challenge. People were literally trying to identify humpback whales based on their flukes: apparently humpback whales have unique characteristics in their flukes, and the goal of the contest was, given two images taken at different times, is this the same whale or not. There were two versions of this challenge: one was a playground-level competition (we'll get into that terminology in a minute), and the most recent one was a featured-level competition. I read some of the forum comments by the host of the competition, and he was just blown away by how many images the Kagglers were able to identify that they hadn't been able to in the past.

There are several other interesting science ones. But maybe you're not into the environment or animals and that type of thing; maybe you want to save humans. There's a bunch of interesting medical competitions available too, so I listed these out depending on who's interested in what. One of the current ones is the histopathologic cancer detection competition; it's in the playground section.
What they're trying to do is make this cancer detection project as approachable as the MNIST digit recognition problem, which is kind of cool. So hopefully we'll see some progress in this area as well.

In terms of prize money, a lot of people have heard about this. These are the five largest prizes that have been awarded by Kaggle: the TSA passenger screening algorithm challenge awarded 1.5 million dollars in total prizes; the Zillow Prize was 1.2 million; the Data Science Bowl 2017, which was another cancer detection problem, was a million-dollar prize. So if you have the talent and you've got a good team, you could actually scoop up some pretty good money too.

One of the things I did was go through their competitions page, pull the data from it, beat on it in R and Emacs for a little while, and come up with my own little data set to try to explain what happened in past competitions. So here are the different prize categories Kaggle has awarded, broken down by year. I'm not super talented with ggplot2, so I couldn't get it to stop putting decimal points in my years at the bottom there; apologies for that. You can see the money category; I added a "z" in front of the label so it would keep it all at the bottom. It's gone up and down, but it's leveled out at twenty-plus competitions per year that award prize money. More recently there's been a push for these job positions; you see those start popping up in the 2014 through 2016 range. The newest thing is the swag they've been offering a lot more of lately, often for the recruitment contests.

Okay, so for their research competitions, they often have data that's not necessarily tabular or normal image data; things start to get a little fuzzy, and it doesn't follow their normal competition format. The idea with the knowledge competitions is that in some cases you're actually working on ImageNet and other academic projects sort of indirectly, so you have the opportunity to possibly be invited as a speaker at a conference, things of that nature. Those contests are in some cases a little more on the bleeding edge of machine learning techniques.

Kaggle is one of those things I found out about a long time ago and kept putting on the back burner, never really getting involved. Part of the reason I wanted to give this talk was that it would force me to actually get started with it. One of the original things that got me interested was something that happened in 2011 or 2012: NASA and, I believe it was, the Royal Astronomical Society created a Kaggle challenge where you were basically trying to measure the degree of ellipticity ("ellipticalness" isn't really a word) of galaxies. And within a week, yes, in less than a week, a PhD student on Kaggle actually beat the existing state-of-the-art algorithms. I was like, whoa, that's kind of interesting, and that's fast too, right? A week. And from there you just saw continued improvement. Anthony Goldbloom does a lessons-learned video, and I'll have a link to it in a little bit, where he shows the rate of improvement in this contest: you immediately see this drop in the score, and then it levels out as people (actually, it might be my next slide; no) find the signal in the noise.
Another reason I'm interested in Kaggle is that so many of the top Kagglers seem to have really nice data science jobs, and pretty much their choice of positions. One person in particular has been written up quite a bit; the link here is to a Wired article. There's a guy named Gilberto, and I'm not going to try to pronounce his last name; you can see it in the middle there: Titericz. He was an electrical engineer who just started doing the contests on the side, became kind of addicted to doing machine learning, and within, I think it was a three-year period, he became the number-one ranked Kaggler. They have tiers in the progression system, which we'll get into in a bit, but after two competitions he went from Novice to Master. So he had this exponential increase in his ranking and his progression. And last I've seen, he still works at Airbnb. There was another gentleman they talked about whose salary, in part because of his Kaggle credentials, tripled in three years, which is pretty phenomenal. We could all use some of that, right?

Okay, so for the most part the Kaggle website is fairly easy to navigate, and everything is laid out sensibly: competition information; data sets, which we're not going to cover; the kernels; the discussion forums, which are very important; and then there's the triple-dot menu with a whole bunch of different information. A lot of the help is right here in this one documentation page. There's a really cool blog, kind of cut off here at the bottom, called No Free Hunch; definitely worth a read. And then the user ranking information is here. So we're going to dig into each of these levels of their site.

The competitions. This is just the top of the page with the active competitions; you can actually see a lot, and this is where I curated all the past competition data from. It tells you whether it's a featured competition or not, any prize money, and the number of teams currently signed up. In general these tags tell you what sort of topics are involved: is there tabular data, are you going to do classification, are you dealing with sports data or finance or what have you. Once you click the link, it brings up the actual competition page.

There are several different types of Kaggle competitions, and they're all listed. Based on the 301 past competitions I had data for, they broke down like this. As far as I know, all the money competitions were "featured" competitions, but there are all kinds: there are several getting-started ones, and there used to be Masters competitions; we'll dig into these in a little bit. Breaking the competitions down by year, we see that the playground category became popular in this time range, and featured competitions are kind of their bread and butter; they'll always have roughly twenty per year that pay money, right?

Recently, they had this analytics competition for the NFL. This competition was a little different: it wasn't so much a machine learning competition as looking at data and trying to suggest changes to the NFL's punting rules, the rulebook, in order to reduce concussion injuries. And basically you were giving a presentation to, I guess, executives at the NFL, and they were the judges.
So apparently there's the potential, I guess, for Kaggle to move in this new direction, a little bit away from traditional machine learning competitions and probably more toward statistics and medical-science-type work. That'd be something to keep an eye on.

Okay, so on their website, when Kaggle lists the types of competitions, they tend to have these wordy descriptions of what they are. But what they say about the featured competitions is largely true: you're going to have prize money, and you're going to be up against formidable experts; we'll get into that with one particular competition in a little bit. It's hard to quantify the jump in expertise between, say, a playground competition or a research competition and a featured competition, but it does seem to be pretty significant. Oh, and some of the early featured competitions didn't offer prize money; they introduced "kudos" eight years ago, back in the early days of Kaggle.

All right, so here's the breakdown of the prize money they've given out over the years, topping out in 2018; I believe both the TSA competition and the Zillow Prize ended in that timeframe, so that's the big jump to 3.3 million.

Okay, so one of the things that became pretty popular on Kaggle, and I was surprised at how many of these competitions there were, is this idea of recruitment competitions. Basically, a company hosts some data relevant to its business, you get a look at it and work on it, and if you do well in the contest, you're potentially getting an interview at, you know, Facebook. Facebook was on there several times; I was really surprised. Walmart's on there a bunch of times; there are a couple of finance companies; Allstate for insurance; Yelp; Airbnb. So quite a few companies are hiring Kagglers in almost a very direct manner this way, apparently.

What do they say about the research competitions? "More experimental." Again, I don't know how to quantify that; I looked around at some of them, and they just seem to have data sets that are less well defined, from what I can tell. These competitions don't offer prizes or points; we'll get into that in a little bit.

Okay, and here's what I was referring to about that jump in the level of expertise. This person, I forget her full name, her first name is Olga, placed second in the original whale identification competition, which was in the playground category. Then she entered, I believe it was eight months later, the actual featured whale identification competition that offered prize money, and her rank dropped significantly: she went from second place to 1,043rd. It just shows that once the money is there, more talent shows up.

Okay, so I'm not super familiar with the playground competitions. We'll get into the getting-started ones next, but they describe playground as a level up from getting started. They seem to look pretty interesting; again, the one I mentioned, the histopathologic cancer detection project, is currently ongoing. They're offering quite a few of these competitions, so maybe Kaggle saw a need for moving people from getting started to featured in a more progressive fashion, not just a huge jump.
Okay, so the getting-started competitions. These are the ones most people are familiar with, because there's a ton of tutorials on them. For some reason, on Kaggle's site they show up under past competitions, but they're definitely running, most of them until 2030 or some crazy date. There's the digit recognizer problem with the MNIST database of digits; the original one that I remember is the Titanic, where you predict survival of people from the Titanic; and the most recent one is the housing prices advanced regression techniques one. What these competitions do, in order to entice people to enter and keep the incentive going, is use a rolling leaderboard: if you submit to one of these contests, your score stays on the leaderboard for two months and then just naturally falls off. No prizes or points, just a fun learn-how-to-use-Kaggle kind of thing.

Okay, every year they have two different annual competitions: in March they have the March machine learning competitions, and in December they have a Santa-themed competition. One of the things I did with the data: they have their tags, and they had a basketball tag, but nothing saying clearly whether something was one of the March competitions or one of the Santa competitions. So here I just said, show me everything tagged basketball. And I thought it was funny that they had a Kobe Bryant shot selection competition, but no pass selection competition; apparently that's not an option.

All right, so here are the Santa-themed competitions. Until I dug into this, I wasn't aware they had these competitions. I imagine they're not super well attended, being in the holiday period, but they've been going on for quite some time.

Okay, and they used to list these on the website (I don't know if they do anymore; now it's just in the documentation): limited-participation, invite-only contests. Typically, I imagine now you'd have to be a Grandmaster, but in the past you had to be a Master or higher. They ran six of them: three in 2013, three in 2014. For some of them they show the money they offered; there are two where they list "USD" as the tag but don't tell you how much money was offered as a prize, which I thought was kind of interesting.

And on their site they actually list three different types of competitions. We're going to dig into these, and really, I call it two and a half. There are the classic competitions, where they give you a data set, you train your model, you take their test data set, which isn't labeled and from which they've withheld a certain amount of data (we'll jump into the percentages of how much they're withholding), and you submit your predictions, classification labels or regression values depending on the contest, back to them; then they tell you what's going on on the public leaderboard, and ultimately you see something on the private leaderboard. Later on, they added kernels competitions, where you essentially submit code: they'll still give you the data so you can develop the code, but they run it, in a kernel on their site, and you don't actually submit a data file. And then they have what they call two-stage competitions. As far as I'm aware, you can have a classic competition in one stage and a kernels competition in the next, right? So it's not really a third different kind of competition.
It's just two stages of a competition.

Okay, so I kind of jumped ahead, but in the classic format you're going to build models locally, right? They describe this as being by far the most common type; when I looked at the past competitions, it ended up being 96.68% of them. So yes, by far.

Then the two-stage competitions, like I had mentioned, are just two parts, and usually stage one is a feeder: if you do well enough in stage one, you get an invite into stage two. The first competition I saw that had this was the Zillow Prize, which opened in May of 2017, had a private round that started in February of 2018, and closed three months after that. This was an interesting competition in that they don't actually have the test data set for the private leaderboard ahead of time; they used actual housing sale data. There's another one where they used stock market data. So they're actually acquiring real-world data as the competition progresses; over a period of time, that's how the competition progresses.

Okay, so the kernels-only competitions. There have only been three featured kernels-only competitions, which kind of surprised me, because they make quite a bit of noise about these competitions; you'll read about them a lot. There are seven playground competitions that use the format. One of the ideas, if you listen to some of the panels with Kaggle Grandmasters, is that they make a point of saying it really does help to have a good computer. One of the guys has two computers, one with two GPUs and one with four GPUs, so he can train his models, right? So he's definitely at an advantage over the average person when it comes to training deep learning models on images. Part of the reason for the kernels competitions was to level the playing field. And it was probably also, well, I imagine most people here are pretty comfortable in Linux, being at SCALE, but there are probably a lot of data scientists who don't want to set up a system with a GPU either.

The three featured kernels-only competitions they ran were these; I threw them up there just so you can see them. Unfortunately, I ran into some issues trying to get this to work, but I want to eventually go back and show how to use their GPU kernel system. I think it's such a nice feature to be able to go in and run models on one of their GPUs that I'm going to go back and fill in the slide; for now I just left a placeholder.

Okay, so if you're going to compete in a Kaggle competition, you're going to be part of a team. Even if you compete as a solo competitor, you're going to be a team of one; you always have to have a team. This is in their documentation, and I'll give you sort of the Cliff's Notes on some of these rules. Whether you have a team of five people or a team of one, typically you can only do five submissions per day; it doesn't matter. On the next slide I'll talk about merging teams and things like that. On teams with multiple people, one person is the team leader and everybody else is just a member. My team-ops slide covers some of the basics: you can change your team name, and quite a few people have funny team names; they just kind of have fun with it.
It doesn't have to be your member name; that's the default, but a lot of people change it. For some reason, anyone on the team can change the team name, not just the leader, which I thought was kind of funny. I don't know why they did that.

At a certain point in most of the competitions, you'll hit a merger deadline, after which you can no longer coalesce teams. You'll see that this person has done really well with one type of algorithm and someone else has done well with another algorithm, they get talking in the discussion forum, and they decide to form a team and try to create an ensemble model that does better than either one, that type of thing. So at a certain point, they stop mergers. You can't merge if the combined team would be over the size limit, obviously, and there's kind of a funny gotcha: if you've done your five submissions in a day, you cannot merge that day. So if you're getting close to the merger deadline and you're planning on merging, keep an eye on it and leave a submission slot open so you can.

Let's see. Oh, okay. So on the Kaggle page there are the user rankings, right? There are essentially two ways that Kaggle ranks people. First, they rate you; I'm not going to get into the details. Their user rankings are based on a scoring algorithm; it's not like Elo in chess, but similar by analogy, in that they rate people against each other. And then they have this other system, the one with Grandmasters; they call it their progression tiers. There are five levels of progression tiers, and they track them across competitions, kernels, and discussion; on the rankings page, they show whichever tier is your highest across any of those three. They also show the medals people have won in competitions. Here's the account for Gilberto, from the slide a while back, currently ranked second. They show the points each person has earned. So it's another fun way of getting people involved in the competitions.

All right. So say you decide to go for it and you want to become a Kaggler. They have their progression system; here are the five tiers for competitions. To go from Novice to Contributor, you register, you do one submission (any of them, I believe), and you do an SMS phone verification, and now you're at Contributor status. If you earn two bronze medals, and bronze stretches out pretty far down the leaderboard, you jump up to Expert. Like I mentioned previously, Gilberto jumped straight to Master, so apparently the rules were different when he did this: after two competitions, where he placed third and eleventh. Nowadays you have to do well in at least three competitions to jump to the Master level, and a gold medal, as we'll see in a minute, can be pretty hard to get. For Grandmaster, you need five gold medals, and one of those gold medals has to be earned on a team where you're the only member.

Okay. A lot of the Kaggle tips and tricks I picked up came from these panels hosted by H2O; they must have five or six of them. I focused in particular on two of them: the first one I listened to was the H2O World 2017 panel, and the second one, where I picked up a lot of the techniques, was the 2018 panel they hosted in San Francisco. In the first panel, I was surprised at the numbers; these were five people who were somewhat older Kagglers.
Three of the five were using R as, like, their first language; it's what they went with. That was kind of surprising; I thought it was going to be four or five Python users and maybe one R user. So that was good to hear, and they actually had really good things to say about R. In particular, one of the guys, I forget his first name, I think his last name was Landry, was really into data.table and ggplot2. Gilberto, the one who's ranked second now but who was kind of the poster child, actually got his start differently: he was an EE, so he just started using Matlab, which he was familiar with, and its neural network toolkit, and that's how he did his first two competitions, the ones that took him to Master. That was kind of cool too; for those of us who've used it, Matlab is an interesting language.

There's a fairly popular talk called "Lessons Learned from Two Million Machine Learning Models" by Anthony Goldbloom, one of the founders. If you listen to this talk, he'll sort of give you the impression that in the competitions in 2011, 2012, 2013, everybody won using random forests; then deep learning happened, and all the image competitions started being won with deep learning algorithms; and now it's either a deep learning algorithm or, if you have some kind of tabular data and you're doing regression or something, one of the GBMs, gradient boosted models like XGBoost. From what I can tell, based on how the top Kagglers actually approach problems, that's a giant simplification. You might use one of those two families of algorithms, but how you use it is very important.

So here are the top tips I was able to distill from those guys. It's very important to understand the competition, especially the evaluation metric for the competition. If you don't understand that metric, you're not going to do well in the competition; it's almost guaranteed. All of them seem to create a baseline model right out of the box: they'll do a k-NN, you know, k-nearest neighbors, or a simple decision tree, whatever is easy, just to get an idea of a baseline performance they can work from. Then, apparently, most of them skip from that beginning past the middle part, where you'd be iterating over a lot of models, and jump straight to: how am I going to do cross-validation? By far the most important thing to the top Kagglers seems to be coming up with an offline cross-validation scheme, understanding how and why your model is improving, and making sure you're not overfitting. And then, after they have those two things in place, the baseline and the cross-validation, they'll start understanding the features and doing some feature engineering: let's try this algorithm or that algorithm and see if we're improving, that kind of thing.

Gilberto went into the cross-validation part in a fair bit of depth. He actually tries to do what he calls offline validation; I'm imagining an image contest, where he's creating extra data for himself to verify things. And his point was that he spends one or two weeks just on this task, right after spending a few minutes doing a k-NN or something like that.
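Since "establish the validation scheme first" is the advice every panel repeats, here's a minimal sketch of what a local holdout looks like in R. This isn't from the talk: iris stands in for a competition training set, and 1-NN stands in for whatever baseline you'd actually use.

```r
set.seed(42)

# iris stands in for competition training data; Species is the label.
full <- iris

# Hold out 20% of the rows as a local validation set.
holdout <- sample(nrow(full), size = floor(0.2 * nrow(full)))
train.part <- full[-holdout, ]
valid.part <- full[ holdout, ]

# Quick-and-dirty baseline: 1-nearest-neighbor on the numeric columns.
library(class)
pred <- knn(train.part[, 1:4], valid.part[, 1:4],
            cl = train.part$Species, k = 1)

# Score it with the same metric the competition uses (here, accuracy).
# If improvements here track improvements on the public leaderboard,
# the validation scheme is doing its job.
mean(pred == valid.part$Species)
```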
Then he tries to ensure that his model tracks: if he sees a 10% gain in his offline validation, he wants to see that same gain on the public leaderboard. Then he knows he's on the right path; that's his sign that "my offline validation is good." Only then will he start generating the models; apparently he's known for these very complex stacked models and tons of hyperparameter tuning.

One of the Grandmasters joined one of the talks late, so I just call him Grandmaster Number Four. He came in, and apparently he hadn't heard what was already said, because he repeated it, confirming it, right off the bat: most important thing, establish the right validation scheme.

And all the top Kagglers talk about this: there's a lot of good information in the forums. If you do join a competition, keep an eye on the competition forum; every competition has one. Not only will you catch things like external data being posted there, but also any changes in the competition, early, so it just keeps you on track. A lot of people are also posting things like "hey, I tried this and it worked well" or "I tried this and it didn't work so well." So they're doing what they call human grid search, where they look through the forums just to get a sense of what everyone else is doing.

He also made a big point of reusing pipelines from previous competitions, so for him it's kind of an iterative process: "I used this image classification pipeline on a previous competition, it's a lot like this current one, so that's my baseline; I'm going to start from there and work up." And he suggested something else. I had an architect I worked for who would say, in a very Spanish accent, "let the machine do the heavy lifting," and this guy reminded me of him. The idea is: don't spend a ton of time eyeballing things. If there are 100 features in a data set, let a decision tree tell you which features are the most important; let some other method identify important features. Don't create scatter plots until you're blue in the face; let the computer do some of the work for you.

This is the humpback whale identification competition page. You don't even have to join a competition; if you click on one, you get to this page. There are several parts to it. The overview is very important; the evaluation metric is right under the Evaluation tab. So spend some time on the description and evaluation and understand them if you're going to jump into a competition. Based on analyzing three of the past competitions that recently ended, it was pretty clear that some people did not understand the evaluation metric; we'll get into that in a second.

Over here, jumping back real quick, there's the Data tab. Click on it; this is a very important one too, to get the general idea of all the data that's there. A couple of things to focus on: if you're kind of limited on computing resources, six gigabytes of data might be a little bit much, right? There was one competition that had a terabyte of images; that's probably way too much. You want to understand what your constraints are, and also get a sense of how much time you can actually dedicate to a competition.
If you see a giant data set and you're running short on time, don't jump in.

Based on the tags on that initial competitions page, I pulled which competitions say they have text data, image data, or whatnot. Image data has obviously become popular, but there's still a fair bit of tabular data. One of the Kagglers on the panels was complaining that there weren't enough tabular data competitions anymore and that his ranking was dropping; that was his area of expertise. The thing to note here is that Kaggle doesn't label all their competitions very well, right? Out of 301 competitions, I haven't bothered to add these up, but they're nowhere near 300 total. So take this with a grain of salt; there might be way more image competitions or tabular competitions or time series competitions than suggested here.

Okay, so if you're in a competition, it's very important to understand the difference between the public leaderboard and the private leaderboard. Once you do a submission, your result shows up on the public leaderboard almost immediately, and that gives you a rough idea of where you stand in the competition. Now, anybody who's submitted to one of these getting-started competitions or a playground competition knows there are big changes between what happens on the public leaderboard and when Kaggle actually uses their holdout validation data and re-ranks everybody on the private leaderboard. How you perform on the public leaderboard has nothing to do with how you do on the private leaderboard, and all the prizes, everything, are determined by the private leaderboard. So if you overfit your model, you're going to drop drastically, and we'll see a couple of examples of that.

This is just an image of the leaderboard. You can see the public leaderboard and the private leaderboard, and here's where they indicate how much of the data is used for the public versus the private side. We'll run through the current contests and show that split information. Most of the getting-started ones, I believe, are 50/50, but the featured ones tend to vary a fair bit.

Okay, and according to Kaggle's site, and I just threw these up here, there are three tips for using the leaderboard. If you're in a competition, you probably should keep an eye on it and see how things are changing; if you drop all of a sudden because everybody's submitting new stuff, it's time to go to the forum and see what's going on.

This is just the link to their discussion forums. Like I mentioned, a lot of Kaggle Grandmasters consider these very important, and if you check their profiles, you can see that almost all of them have been on Kaggle every day, reading the discussion forums. It's kind of interesting.

There are also a couple of other items on Kaggle. There's Kaggle Learn; from what I can tell (I haven't done any of them), they're kind of short, task-oriented lessons you can sign up for. The courses are very Python-focused; most of them require you to know their initial Python course.

Okay, so here are the current competitions; let me decipher this for you. This column is the type of competition: F is featured, R is research. There's more, but this is all I could fit on one page.
K is whether it's a kernels competition or not. Points: do you receive ranking points or medals for the progression system? The NCAA competitions are marked as two-stage competitions, but they're not typical two-stage competitions: the first stage is supposed to be a fun, kind of warm-up stage using past data, and everybody goes on to the second stage, so that's what the asterisk is there. Then, whether or not the competition allows external data. From what I've seen, almost all the competitions (this one didn't, but almost all of them), if you click through on the far right to the rules and just do a find in your browser for "external data," the typical policy is something along these lines: you're allowed to use external data from public sources, but you've got to make it known and available, with a URL or something, in the discussion forum for the competition. Then it's fine.

Then, whether it's classification or regression. So this one's binary classification; a couple are regressions. Oh, and this one is a multi-class problem where you score one to five; that was a regression. This one I'll have to look at again. Two regressions here. And I've got my legend here; this one's also binary classification.

Then the metric used to evaluate the models. The LLs are log loss; AUC is area under the curve. Surprisingly, if I remember right, this one was something like Matthews correlation coefficient; I wasn't familiar with it. Many of the regression competitions are root mean squared error, and many of the image classification contests have you trying to maximize area under the curve.

Then the rough size of the contests. One of the things I think is nice about Kaggle is that the data is approachable. It doesn't push you into that joke of "what does a data scientist do? Spends 80% of their time beating on the data and 20% of their time complaining about the data." It's small enough that you're not going to be put off from competing.

And the LB split is just how they split their leaderboard. The reason these ones are 100/100 is that they're literally determined by the outcomes of the March Madness tournaments. And there's Don't Overfit on the page, one that's kind of interesting; it's in the playground competitions. They give you something like 250 rows to train on, and a very small percentage of their data set is used for the public leaderboard; they keep something like 90% of it for the private side. So they're really pushing you to come up with a good model; if you overfit at all, it's going to drastically change your outcome.

These three classification problems are scored with simple accuracy measurements, and I realized they withhold 75% on them. And of the three competitions that recently ended, two were root mean squared error, and for the whale one it was kind of an interesting formula: basically you could list up to five whales, in order, for whether you thought the image matched this whale or that whale; you guys can check it out.
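Since understanding the metric matters so much, here's a sketch of two of the common ones in R. These are the textbook definitions, not necessarily Kaggle's exact implementations.

```r
# Binary log loss: actual is 0/1, predicted is a probability.
# Clipping keeps log() finite; it's also why wildly overconfident
# wrong answers get punished so hard by this metric.
log.loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

# Root mean squared error, used by many of the regression competitions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

log.loss(c(1, 0, 1), c(0.9, 0.2, 0.6))  # ~0.28
rmse(c(3, 5), c(2.5, 5.5))              # 0.5
```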
Okay, out of those three recently concluded competitions: the in-the-money placings went down to fifth and third; gold medals went down to 14th, 18th, and 17th respectively; then silver medals, bronze medals, and the total number of teams that entered each one. It's kind of interesting that in the Google Analytics customer revenue challenge, there were a lot of bronze medals given out relative to the number of teams. So keep an eye on it; on the leaderboard page, they'll show the number of medals they're giving out. Keep an eye on that if you're trying to earn medals and move up the progression system.

So these are... oh, okay. This is the point about doing cross-validation at home and understanding what's going on. If you submit your score and it comes out in scientific notation while everybody else is at 0.88, you're doing something wrong, right? This person has way more error; all of these people do. And literally, to create the next charts, I had to remove, I think, 50 people from this one just because they skewed the outcome so much, and 25 from another one scored with root mean squared error. I don't think these people realized. Here's the same thing for the other competition; it had a similar leaderboard, with most of the scores coming in at less than one.

So this, which is not my attempt at Pollock-type art, is what happens once the leaderboard changes. These points represent people who fell 600-plus positions, 800-plus positions; these people jumped 200-plus positions between the public and the private leaderboards. Right? When I first saw this competition, I thought this was as bad as it gets; I was like, wow, this team fell 800-some positions. It's not. In this one, the leaderboard between public and private completely changed; so many people jumped past the original public leaderboard leaders. And then it got even worse: in this one, it was almost like the public and private leaderboards just swapped. First place fell to near last, last place jumped to the middle, and it just kind of rotated.

When you look at the scores, again in kind of an aggregate fashion (these are the positions, and these are scaled; this one was accuracy, the other two I scaled), you tend to see this pattern: you have a couple of plateaus, and if you turn it into a density plot, you get this bimodal sort of distribution. Originally I was calling it students and grad students, or students and pros. Here I had to eliminate the 25 or 50 people in order to even see it; otherwise it was all smashed flat because of the people who had done so poorly. Same thing with this one. I thought it was kind of interesting: a lot of salaries tend to fall into this sort of distribution; a lot of people coming out of college, their talent level in their field tends to fall into this distribution. You look at different measurements of skill, and it's like stacking those same density plots on top of each other.

Okay, so across these last three competitions, there were 6,700 teams that competed, and 70 of the teams competed in all three of them. So, how did they do?
So, the vertical axis is the index of each of the teams, and the x-axis is their placings in the competitions. A few of them did not so good, over here in the worse-than-a-thousand spots. If we zoom in, we see there were a couple of teams in particular with good placings, this team in particular; and this team was kind of variable, but they ended up getting a gold medal in one of their competitions. And these are the medals that the teams that placed in all three competitions scored: 6, 15, and 18 medals in total for each competition.

Okay, so that covers the Kaggle part of it, and we're going to jump into R now. I'm not sure if everybody knows this: R is not the original, right? There was S, developed at Bell Labs before R, and R is the free implementation of S. Dr. John Chambers was the primary person working on S at Bell Labs; he received an ACM award for his work on S. Apparently in the 70s, to do statistics you just wrote Fortran routines and did your statistics in that manual fashion. His goal: turn ideas into software, quickly and faithfully. Down here I have the other people that helped him; this was back in 1976. So R has a long history, and if you look at a lot of the help pages for different R functions, you'll see references to S and to the original manuals Dr. Chambers wrote.

What is R? I imagine most of you know, since you raised your hands, but it's a lot of things. Most of us recognize it as a programming language. There are tons of packages and additional features; there are interactive environments for it; there are servers like the Shiny web server. So it's kind of a broad answer.

Why should you use R? R to me is a very expressive language that can get a lot of work done in very few lines of code; we'll talk about that in a second. Quite a bit of this I took from Hadley's book, Advanced R. R has deep-seated language support for data analysis: concepts like missing data are handled in the language; being able to use formulas is handled in the language. R was created in a period when Scheme was very popular, and we saw the beginnings of what we now call functional programming (if you look at what was called applicative programming back in the 70s, it kind of came out of that era), so we have our apply functions. There's just a lot going on in R, and great support.

If you look at the language features: interpreted, dynamic, functional (obviously not functional in the way the Haskell programmers would mean it, but you can use it in a very functional manner); object-oriented systems are available in base R; it's easy to connect to Fortran and C routines; and there are very cool, very powerful metaprogramming facilities, so you can list the variables that are available, you can get variables by name, you can remove them, all kinds of things like that.
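As a quick aside, here's a tiny sketch of those metaprogramming facilities, since they go by fast; it's all base R.

```r
x <- 42
y <- "hello"

ls()             # list the variables in the current environment: "x" "y"
get("x")         # fetch a variable by its name as a string: 42
assign("z", 99)  # create a variable from a string name
exists("z")      # TRUE
rm(x, y)         # remove variables
ls()             # just "z" now
```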
R is definitely inspired by Scheme; we see that in that most of the time we use R with lexical scoping semantics, but there's also dynamic scoping, which Hadley mentions and which I had never used. You can also use R as a general linear algebra and matrix calculation tool. R is quick, and the Fortran underneath is fast; that's one of the ideas.

This is my example of very few lines of code: go out, grab a box score using a URL, and just display it on the screen. Try doing that in Java; it's easily 100 lines of code. It's just very easy to write scripts that do very useful things. (There's a sketch of the general shape of this after the book list below.)

How do you run R? Again, this is me trying to give the meta-knowledge to those of you who don't know; I imagine all of you are familiar with this. So here I have links on installing R on different platforms. I'm not sure at this point if RStudio is still the most popular way of running R, or if Jupyter notebooks might be more popular now; I think in our community, R Markdown and RStudio are probably still the two go-to ways of using R. Quite a few people have started using Shiny Server; it kind of makes me smile when I hear about a new project using Shiny Server, it's very cool. And if you're part of the groups that have to take your code to the data, you've got the in-database support: there's PL/R for Postgres, and SQL Server has its own ways. In 2015, Microsoft purchased Revolution Analytics, and all of a sudden you saw R start popping up on Microsoft sites. And, my example got cut off here, but you can run littler and do actual R in a command in a shell.

Okay, so again, I'm going to skip a lot of the obvious stuff, stuff you guys know: operators, operator precedence, the single quotes versus double quotes thing. If you want to learn the really basic stuff in R, I recommend R for Data Science. It's free online, and it's by Hadley, who's kind of like the Madonna of the R community. He covers base R, ggplot2, the idea of tidy data and tibbles, data types, how to use pipes; it's a pretty comprehensive resource, it's quite good, and I like it.

If you're already at the intermediate level and have most of the language concepts, I highly recommend Advanced R; it's going to try to take you from intermediate to advanced very quickly. You'll come across closures, how to use special operators to create your own infix functions, a lot of concepts. It's one of those books a continuing R person could probably reread every year or six months: "oh, I forgot about that," or "that's what he meant by that." It's very deep.

I was actually surprised that the caret package has a free online book that covers how to use a lot of machine learning algorithms in R. I just went there to try to learn how to use caret, which we'll touch on a little bit, but it covers quite a vast array of material, which is pretty cool.

For actual books you can hold, ones that aren't free online, I think R in Action is probably the best one I've come across; it's very comprehensive, and I like the writing. Then, if you're interested in deep learning, there's Deep Learning with R: it shows you how to use Keras, and it's basically the equivalent of the Deep Learning with Python book, but with all the semantics rewritten for the R library.
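The actual box-score code isn't reproduced on the slide here, but the general shape of the few-lines claim is something like the following; the URL is a placeholder, not the one from the talk.

```r
# Pull a CSV straight off the web and look at it. In Java this is a
# small project; in R it's a couple of lines. (Placeholder URL.)
scores <- read.csv("https://example.com/boxscore.csv")
head(scores)
summary(scores)
```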
Let's see... okay. I've done R training at a couple of jobs I've worked at in the past, and Java programmers, more traditional programmers, aren't used to the dot not meaning something special. In R, and base R especially, you see the dot all over the place; it's simply used to separate words, essentially the same as an underscore. Don't get hung up on it. Personally, I like using dots more than underscores, so a lot of my code is written that way; code with underscores just looks a little more like Python or Ruby, a little more modern.

R does this interesting naming mnemonic that I think trips up a fair number of people: often you'll have a kind of base function or idea, and then they'll throw a character on the front. You see it with the apply family: the basic apply works with data frames or matrices in a two-dimensional manner, so you can say, okay, I want to go down the rows; lapply is for lists or vectors, when you want to return a list; and the s in sapply is for simple, meaning it simplifies the result. You see it in the distributions as well: dunif, punif, qunif, and runif all deal with the uniform distribution, and the norm equivalents all deal with the normal distribution. The letter at the beginning tells you what you get: d gives you the density, p the distribution function, q the quantile, and r random values; likewise for binomial or exponential distribution values. To me it makes sense and I like the idea; I know some people who like verbose languages like Java seem to be thrown off by it. (There's a short sketch of these mnemonics just below.)

One of the things that I see a fair bit, maybe not from people in this room, is people using the equals operator to do assignment, and there's a lot of minutiae, we'll say, surrounding this. In general, you're always going to want to use the left-arrow operator, <-, to do assignment. The right-arrow operator, ->, is for when you're in the interpreter just screwing around with something and all of a sudden you need to save a value: hit the up arrow, append the right-arrow assignment to whatever variable name you want, and you're done. It makes working at the command line easy. And save the single equals sign for when you're defining default arguments in function definitions, like at the bottom here, or passing arguments by keyword. I don't want to get into the details, but in my experience, if you do this, you won't run into some really prickly corner cases; Hadley covers some of it in Advanced R, and you can find a bunch of it on the R mailing lists. So let's just skip over that.

So, vectors. There are basically three types of homogeneous data structures in base R: you have atomic vectors, you have matrices, you have arrays. I think the important thing to notice here is that they tend to be named after the similar math concepts: vectors are one-dimensional, like in linear algebra; matrices are two-dimensional; and once you jump to three dimensions and beyond, it becomes an array, not a tensor. That keeps the memorization and naming easy. An important point that I don't think a lot of people catch: there are no scalars. What looks like a scalar is actually a vector, just a vector of length one. That means anything that expects a vector, the apply functions or whatnot, works on it naturally.

Also, people coming from, again, Java and other traditional programming language backgrounds notice that a lot of R functions either take a lot of arguments or return a list with a lot of data in it, and they're not used to that. I think it's really easy to get used to, and really convenient, compared to having to hunt all over the Javadoc to figure out how to do a bunch of little things.
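Here's the short sketch of those naming mnemonics promised above:

```r
# The apply-family prefixes.
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)    # apply over the rows (margin 1) of a 2-D object
lapply(1:3, sqrt)   # l for list: always returns a list
sapply(1:3, sqrt)   # s for simple: simplifies the result to a vector

# The d/p/q/r prefixes, shown on the normal distribution.
dnorm(0)       # d: density at 0
pnorm(1.96)    # p: distribution function, P(X <= 1.96), ~0.975
qnorm(0.975)   # q: quantile, ~1.96
rnorm(3)       # r: three random draws
```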
So str() is your friend: figure out the structure of whatever is being returned, see all the names, and go from there. And then, maybe this is the Python-before-R in me, from using pandas and NumPy, but you get used to using the dollar operator (I'll call it an operator; I'm not sure if it's technically that) on your lists to access things by name. So if you have competition data, or a data frame with a column called price, it's an easy way to access it without having to put the name inside square brackets.

Something some of the R programmers I know weren't familiar with is this idea of the recycling rule. In the expression on the slide above, the 5 is a vector, and R is recycling that value ten times in order to do the computation. The same thing is happening in the second expression, but instead of recycling a single value ten times, R pairs one through ten on the one vector with c(1, 0) recycled five times; that's how we get all the even numbers there. If the first vector were one to eleven and the other was still c(1, 0), you'd get a warning saying the lengths aren't multiples of each other, because the shorter vector gets cut off before the end of the computation. (There's a sketch of this at the end of this part.)

In my opinion, you have to know three packages in R for plotting. You have to know base R plotting, just because it's fast, it's easy, and you'll see similar semantics in other R packages. You have to know ggplot2: two or three of the Kaggle Grandmasters I mentioned use it, and I think it's kind of universally considered one of the best plotting libraries available, based on this idea of the layered grammar of graphics; you can find the free paper on it, and it's covered in the books I linked earlier. And, in my experience, you need to know googleVis. When I've worked with data for my bosses or other people, charts done in ggplot2 or regular R plots didn't land; I've gone back the next day with the same information in googleVis and gotten a thumbs up. I don't know; people like seeing what they're familiar with. Also, in my experience: if you're going to do mapping data, Leaflet is a really cool library, and mapview is cool. Lattice I don't know super well, so I won't try to describe it. And shinydashboard is something new that looks really cool; I've done a fair bit of dashboards in the past, and they were kind of a manual process, and this will apparently save you quite a bit of time, so I'm just going to throw it up there so people know about it.

In terms of working with your data: everybody I know that does serious R programming is using stringr all over the place. The original string manipulation functions in R are not always consistent in the way their arguments are ordered or how they operate. stringr is based on what Hadley saw happening over time in Python and Ruby and a bunch of other languages, and takes the best ideas from them. It's vectorized, so if you have a vector of strings, you can just pass the whole thing with some pattern and have it return true or false for every element. Working with dates is fairly easy; I know there are some very complicated corner cases with dates, but I've been fortunate not to have to deal with them, so it's been easier for me; it may be different for you if you're dealing with a lot of issues, as there's a lot of prickliness with dates at times.

The reshape2 package is really cool. There's this concept of wide data versus long data: wide data would be something where each column is some kind of feature and each row is some other thing. reshape2 lets you melt that data into a long format, where you have a row identifier, a column identifier, and the value for each of those; and then you can cast it back, passing a function, so that if those things repeat, you can aggregate them in some way. For people familiar with SQL, in some ways it's kind of like GROUP BY with aggregate functions, basically as a one-liner.
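Since melt and cast are easier to see than to describe, here's a tiny sketch; the little data frame is made up for illustration.

```r
library(reshape2)

# Wide data: one row per subject, one column per measurement.
wide <- data.frame(name   = c("a", "a", "b"),
                   height = c(72, 72, 65),
                   weight = c(180, 182, 130))

# melt() turns it into long format: identifier, variable, value.
long <- melt(wide, id.vars = "name")

# dcast() goes back to wide, aggregating repeats with a function --
# much like SQL's GROUP BY with an aggregate.
dcast(long, name ~ variable, fun.aggregate = mean)
```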
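And here's the recycling sketch promised a little earlier; the vectors match the ones described on the slide.

```r
(1:10) + 5        # 5 is recycled ten times: 6 7 8 ... 15
(1:10) + c(1, 0)  # c(1, 0) recycled five times: 2 2 4 4 6 6 8 8 10 10
(1:11) + c(1, 0)  # still computes, but warns: lengths aren't multiples
```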
The reshape2 package is really cool. It deals with this concept of wide data versus long data: wide data would be something where each column is some kind of feature and each row is some other thing, and reshape2 lets you melt that data into a long format, so you end up with a row identifier, a column identifier, and the value for each pair. Then you can pass a function, so if those pairs repeat, you can aggregate them in some way; for people familiar with SQL, it's kind of like GROUP BY with aggregate functions. (There's a short sketch of melt, together with the loop comparison, right after this part.) How am I doing, four more minutes? I have negative four? All right, I'll go really fast.

Okay, functions. R is, in my opinion, the best language for function definitions. The difference between a named function and an anonymous function: for an anonymous one you just write function, your arguments, and whatever you want it to do; for a named one you do the same thing you do when assigning a value to a variable, just use the assignment operator. You'll see lambdas, anonymous functions, used a lot in R, since R is a very functional programming language: all over base R you just pass in whatever function you want to perform the aggregation or whatever the operation is. So, at a talk on R last year, the speaker brought up, as I often do, that for loops in R can be very expensive, with inconsistent performance, so most R programmers favor the apply functions over for loops pretty heavily. The speaker actually updated his slide and showed a case, and it was an edge case, where the for loop was just a little bit faster than the apply function. This is just a contrived toy example, but I wanted to point out a few things. With the traditional for loop method, inside the function you have to define a vector to hold your values, you have to keep track of the index, and you have to write the loop itself. That becomes a one-liner with vectorized operations in R: it's shorter, most of the time your code runs faster, and it's much easier to parallelize, which is a nice feature of R. In this case the difference wasn't an 8x factor, I think it was 8.4 times faster.

Another misconception is that there aren't good machine learning or deep learning libraries available in R, which just isn't true. I broke the link, but it goes to the CRAN task view for machine learning: you'll find R libraries for TensorFlow, libraries for Keras, XGBoost, C5.0, which was the big decision tree algorithm when I was in grad school, randomForest, and many others; there are something like a hundred items listed there.

Okay, and I'm going to go really fast through this: an example from the digit recognizer challenge, in R, with just a KNN model. I'm using the FNN library, the fast nearest neighbor library. The dataset comes with 42,000 rows; this is just the little top-left corner of it. The label for each item in the training set is in the first column, and after that it's all pixel data: 28 by 28 columns' worth of pixel values. When I first used the FNN library, to my surprise it wasn't fast, at least not on this dataset. So, iterating quickly, I tried it on a subset of the data and figured out pretty quickly (this got cut off on the slide) that the cover tree version of the algorithm was very slow, while the brute force algorithm was way, way faster.
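A minimal sketch of both ideas; the data frame and its values are just for illustration:

    library(reshape2)
    wide <- data.frame(id = c("a", "b"), height = c(1, 2), width = c(3, 4))
    long <- melt(wide, id.vars = "id")                 # one row per id/variable/value
    dcast(long, id ~ variable, fun.aggregate = sum)    # back to wide, aggregating repeats

    # the for loop version: allocate, index, loop
    squares_loop <- function(v) {
      out <- numeric(length(v))
      for (i in seq_along(v)) {
        out[i] <- v[i]^2
      }
      out
    }

    # the vectorized version: one line, usually faster, easy to parallelize
    squares_vec <- function(v) v^2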
So, having figured that out, and since I didn't want to put all that computation in my slides and risk it blowing up, I handcrafted the results here. I started from a for loop, then moved to lapply: here we're running k-nearest neighbors for k from 1 to 12 and recording how long it takes. Then we start introducing packages to parallelize the code: here's the parallel package, library(parallel), and how fast that ran; then four cores with snow, which stands for Simple Network Of Workstations, where you can specify the cores (I have the code down here in a second), and how fast that runs. It was really easy, with very few changes to the code. Here's the for loop implementation; here's the same thing with lapply; here I load the parallel package, change nothing but lapply to mclapply, use the defaults, and get, what was the bump, almost twice as fast, not quite. Here's the implementation using snow: I tell it to use four cores on my laptop, I have to hand it the training set, the test set, and the KNN function I'm using, and then it's the same thing again with parLapply, capital L. Rerun it, and now I'm, what was it, five times faster, I think. For this example up here, just to be clear, I did run the computation for k from 1 to 12 on my laptop, but it scrolled off the side of the screen, so I changed it to 1 to 8; that's why you see the 218 seconds there. And you have to remember to stop your cluster afterwards.

Then I just pumped out a table of how well KNN did for the base model. I tested on 2,000 random values; on the diagonal are the ones it got correct. Things like fours tend to get confused with nines, and nines with fours. I was surprised that for the different values of k on this dataset, with these features, you can run KNN and get a pretty high accuracy score; kind of surprising. There's another bar plot of the same thing. Now, that's me doing it manually: here I'm doing cross-validation where I break the data into a bunch of folds by hand, which traditionally you don't want to do. When I did this across the whole dataset, with 42 folds, it came out with an accuracy of 96%. That's not going to put me in a bad location on the leaderboard, but not near the top either; this is just the base model to get things going. I wanted to do a big focus on caret. Unfortunately the digit recognizer dataset was too large for caret on my machine, I kept getting errors, so I had to split it into smaller pieces. But you can get caret to do this whole thing: trainControl sets up the folds for you, and you drop that into the train function with whatever algorithm you want. I had to drop down to 10,000 items. For tuning parameters, you can tell it to try up to 10 different values of k, and it hunts for the best value for you with a grid search. Unfortunately, the KNN algorithm in caret isn't the one from FNN, and it's much slower, so I'm conflicted about which I'd use in practice, especially for large datasets, but the functionality is great. In case I didn't mention it, caret stands for Classification And REgression Training; it does a lot of the work that you'd otherwise have to handcraft for cross-validation.
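A minimal sketch of that serial-to-parallel progression; the data objects and the knn_for_k helper are hypothetical stand-ins, not the talk's actual code:

    library(FNN)        # fast nearest neighbor library
    library(parallel)   # ships with R and includes the snow-style interface

    # hypothetical helper: accuracy of KNN on held-out data for a given k
    knn_for_k <- function(k) {
      pred <- knn(train_px, test_px, train_labels, k = k, algorithm = "brute")
      mean(pred == test_labels)
    }

    ks <- 1:8
    acc_serial <- lapply(ks, knn_for_k)      # plain lapply, one core
    acc_fork   <- mclapply(ks, knn_for_k)    # same shape, forked workers (Unix)

    cl <- makeCluster(4)                     # snow-style cluster on four cores
    clusterEvalQ(cl, library(FNN))           # load FNN on every worker
    clusterExport(cl, c("train_px", "test_px", "train_labels", "test_labels"))
    acc_par <- parLapply(cl, ks, knn_for_k)  # note the capital L
    stopCluster(cl)                          # remember to stop your cluster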
Here's basically the same code, start to finish, using the house prices dataset with caret: trainControl, then train, going through the iterations. And I wanted to add this: there are cases where you do want for loops in R. I was going to say it's when you intentionally want a side effect, but I don't like that language. In my experience, the place you want a for loop in R is when you're manipulating a bunch of columns in a dataset. Say you have a dataset with a bunch of factor columns, and some machine learning algorithm you're using requires integers: you can convert the whole thing in place, as opposed to doing column one, column two, column three, when you might have a hundred columns. (A small sketch of that case follows the questions below.) The other place I think for loops are useful is when you're creating a giant number of plots. That's it; any questions?

When you were talking about the offline validation strategy, is that splitting off an evaluation set, or just cross-validation in general? So, when he mentioned the term offline validation strategy, or cross-validation strategy, it was the first time I'd actually heard the term, and he only went into a little detail in the talk. There's an interview in a blog post on Kaggle where I think he gets into it in quite a bit more depth; I haven't had a chance to read it. He goes to pretty great lengths to create a second test set, kind of an artificial one, that he tries to make very similar to the one he's seeing on the leaderboard. It sounded like most of the other grandmasters on that panel were okay with just understanding, say, that we're doing root mean squared error for this regression and the nuances of that, and making sure their cross-validation on the training set is standard and matches what they're expecting. But Gilberto in particular seemed to put a lot of effort into creating his own validation set, and that may have a lot to do with the fact that his whole thing is apparently building large neural networks or stacked ensemble models, something along those lines. He made a comment that on weekends he'll run a script and literally tune hyperparameters on different models 24/7: my computer's running all the time, he said, I'm not at my computer all the time, but it's looking for stuff all the time. So I think it probably has a lot to do with that too. Thank you. Other questions?

This might be a bit of a naive question, but have you ever seen a Kaggle competition that had a latency requirement, for, say, real-time detection: the amount of latency it takes to make a prediction successfully, with some kind of accuracy minimum? No. In the classic competitions, which are something like 96 percent of all the competitions, you take their test dataset and submit back a data file with the labels or whatever is appropriate for the competition. The closest thing Kaggle has is a contest where you have a computation time restriction, and they do that intentionally to try to level the playing field. But as far as I know it's just, you have two minutes or something along those lines of total computation time, nothing like each value having to come back quickly, which would favor algorithm A over algorithm B in some cases; you'd have to be aware of it, but none that I'm aware of. Any other questions? All right.
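The sketch mentioned above: converting every factor column to integer codes in place with one loop (df here is a hypothetical data frame):

    df <- data.frame(a = factor(c("x", "y")), b = 1:2, c = factor(c("p", "q")))
    for (col in names(df)) {
      if (is.factor(df[[col]])) {
        df[[col]] <- as.integer(df[[col]])   # same line of code for all 100 columns
      }
    }
    str(df)   # a and c are now integer vectors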
Hi everyone. We still have probably four more minutes before I get started, but I wanted to quickly introduce myself; I'll do it again once the talk actually starts. I'm Lucy, and I like to open these talks with a couple of dad jokes from niceonedad.com, so please feel free to check it out on your own, it is a wonderful website. The first one: why do crabs never give to charity? Because they're shellfish. All right. I'd like to give a big shout out to all the sidewalks, for keeping me off the streets. Does anyone else have a go-to joke they want to share quickly? Yeah? Oh, one of my favorite more technical jokes: I was going to dress up as a UDP packet for Halloween, but I was afraid no one would get it. Gets them every time. Oh no, my computer just went to sleep, I think. Yay. As this slide might suggest, my slides are available on the tubes at slides.lucywyman.me, and I think they're probably linked from the SCaLE page as well; I saw a nod, which I will take as affirmation. Cool. Did anyone else go to the Corey Quinn talk yesterday on negotiating your salary? Okay, good, only one person, because I watched that and thought, I will never be that funny in my life, ever, so I'm glad everyone's expectations are appropriately set for me. I gave this talk over the summer and had a couple of jokes sprinkled in, and watching the video back, it's just crickets; it's a little painful to watch. You can definitely look it up on YouTube: just nothing. Kind of like this, actually. All right, well, I do have a longish preamble, so I think I'm going to get started, and hopefully people trickle in and don't miss actual content.

Welcome to Introduction to Blockchains; the slides are available at this URL. Let's get started. In the course of this talk we're going to talk about what blockchains are, starting from first principles: this is meant to be a really introductory lesson on what a blockchain is, and it's meant to be pretty academic, so there's no "use this particular blockchain" or anything like that; I don't even work with blockchains professionally, this is really just something I'm interested in. We'll start pretty low level and talk about the building blocks of blockchains, then about blockchains themselves; we'll talk about when you should and should not use blockchains, not just with examples, but providing a framework for thinking about what is and isn't an appropriate use case; and then we'll talk about different ways people are using blockchains now. As a caveat, I try to avoid talking too much about cryptocurrencies, but cryptos are also the main use case for blockchains being used in the real world, so I might say "coin" a lot, and I apologize in advance. Really quick: my name is Lucy Wyman,
and I'm at Lucy C Wyman on Twitter. I'm a software engineer at Puppet, where I work on a tool called Bolt: it's a really great open source command line tool for running Puppet code or scripts over SSH or WinRM; I'd definitely love for you to check it out, and come talk to me afterwards if you want to know more. And this is just a picture of me; my boyfriend's cute, my dog is also really cute, but he hates having his picture taken, so pretty much every picture looks like that.

Okay, so, first a handful of principles, or basic building blocks, that you're going to need to know about; apologies in advance if you already know these, but maybe it'll be a good refresher. The first one is a hash function, not to be confused with a hash object. A hash function takes input of any size and produces a fixed-length output: I can put in a short string, or I can put in all of Moby Dick, and either way it outputs, say, a 32-byte string. So no matter the size of the input, the size of the output is always the same. In addition to that, we can guarantee the output is relatively unique to the input. What I mean by that: the input space for this function is a lot bigger than the output space, so there will definitely be what we call hash collisions; but with a hash function we can guarantee that it's infeasible to actually find those collisions between two meaningful inputs. Basically, if I knew your password was foobar, it is infeasible that I could computationally find another input to the hash function that collides with foobar. So "the output is unique to the input" means it's infeasible to find collisions on your own, and we can act like the output is unique even though technically it isn't. And collisions are super rare: git uses hashes for all of its commits (is everyone familiar with git? show of hands? okay, lots of nods), and someone calculated that it would take something like 6,000 years for there to be a collision among the hashes of git commits; basically, no collisions. In addition to all of that, a hash function is idempotent, which just means that no matter what time of day it is, what system I'm on, or what my identity is, the output will always be the same given the same input; I can run it anywhere, at any time, and the same algorithm always produces the same output for the same input. And lastly, most importantly, it is infeasible to invert. Again, infeasible does not mean impossible; it means it would take too much computational power for any human or computer we currently have to do it within a human lifetime. So given just the output, you can't know what the input is, unless you guess it and get really lucky. People look like they're with me; keep going. There's lots more to learn about hash functions, definitely check them out. A really quick demo, in a Ruby REPL: I take a string and look at the length of the output of SHA-256, which is a pretty common hashing algorithm, and then I use a different, longer input: longer input, same length of output, every time.
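A minimal sketch of that REPL demo, using Ruby's standard digest library (the inputs are made up):

    require 'digest'

    Digest::SHA256.hexdigest('foobar').length
    # => 64 hex characters, i.e. 256 bits, regardless of input size

    Digest::SHA256.hexdigest('Call me Ishmael. ' * 10_000).length
    # => 64 again; all of Moby Dick would hash to the same length

    Digest::SHA256.hexdigest('foobar') == Digest::SHA256.hexdigest('foobar')
    # => true: deterministic, same input gives same output anywhere, any time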
The next topic I want to cover really quickly is public key encryption. Public key encryption is not directly related to blockchains, but in reading a lot of the introduction-to-blockchain materials, a lot of them start out by introducing it, so I want to make sure I quickly cover it, because it is related. The idea behind public key encryption, also known as asymmetric key encryption, is that you have two keys: one public and one private. There are two nice features we get from this. Anyone can encrypt a message with my public key and know that only I can decrypt it with my private key; it's a way anyone can send me a message securely, knowing I'm the only person who can read it. And similarly, I can sign a message with my private key, and anyone can verify with my public key that I'm the person who sent it. (There's a small signing sketch a bit further down.) This is how package signing works: when you install a new program on your machine, most of the time the repository has signed it with a key, and you can verify with their public key that the package really comes from the repository it claims to come from. It's very reliable and very widely used: my internet connection right now is using public key encryption, SSH uses it, and we have a really high degree of trust in it. In particular with RSA, a pretty common public key algorithm, we can trust that as long as I'm the only person who has my private key, I'm the only person who can sign with it, and nobody can fake having it. Again, lots more to learn here.

The way this factors into blockchains, and I'll talk about it a little more later in the talk, is establishing that a block came from a specific person: a lot of the time people put their public key as part of a block, and that's how you can establish ownership of a particular block. But I'm getting ahead of myself: what is a block? Boom. A block is just an object with a couple of different attributes. The very first one is a cryptographic hash, which has all of those properties we talked about before, the main one being that it's infeasible to invert. With a normal, non-cryptographic hash function you don't necessarily have that guarantee, but most of the time, when people say hash function, they mean a cryptographic one. What we do is take a hash of the previous block in the chain, and that's the secret sauce of blockchains; we'll keep coming back to this idea. This is how a blockchain stays secure: the input is the previous block, and if that block changes at all, its hash becomes completely different. So by storing the hash of the previous block, I can't modify that block without updating the hash stored in the next block; that changes the next block's hash, which I'd then have to update in the one after it, and so on. I'll cover this again, but: cryptographic hash of the previous block, secret sauce of blockchains. Then we also usually store a timestamp, because blockchains are typically time-based: you can only append, so the chain is necessarily chronological, and the timestamp helps verify that. And then you can have some data, which can be pretty much any data. Typically it's transactional in nature, because of that append-only property, but it can be pretty much anything, and like I said, the data will often include the public key of the block's owner.
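The signing sketch mentioned above, using Ruby's standard openssl library; a minimal illustration, not anything from the talk's slides:

    require 'openssl'

    private_key = OpenSSL::PKey::RSA.new(2048)   # generates a private/public key pair
    message     = 'this block came from Lucy'

    # sign with the private key...
    signature = private_key.sign(OpenSSL::Digest.new('SHA256'), message)

    # ...and anyone holding only the public key can check the signature
    public_key = private_key.public_key
    public_key.verify(OpenSSL::Digest.new('SHA256'), signature, message)  # => true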
So, I have a really quick example; it's on my GitHub, under blockchain-example. I have this little class here, and it's pretty small, whatever, eight lines. All I have is an index for where the block sits in the blockchain, a timestamp, some data, and then I also store the hash of the block itself, so that I can verify nobody has modified the previous block; I'll show a verification algorithm later, but that's just there to check that nobody's tampered with our data. (A rough reconstruction of the class appears a little below.) There are blockchain examples out there that are a lot more fleshed out than mine; mine is really just an example of the blockchain data structure. There are plenty of examples that also cover the networking, the consensus algorithms, and the distributed nature of blockchains, and they're available in pretty much every language you can think of; these are a couple of examples.

Then I do want to talk about the chain part. A blockchain is really just a data structure, and I like to think of it as a glorified linked list with only an append function. The reason it only has append is that if I were to prepend something, the previous hash of the first block would change, and again I'd have to do the whole dance: the previous hash stored in the next block is now wrong, so rehash, update the next block's stored hash, rehash again, and so on. That is computationally infeasible to finish before other people keep adding to the chain: I'd have to modify the whole chain faster than people are appending to it, which, especially with sufficiently large chains, is just not possible. So I can't prepend, and I also can't modify a block in the middle of the chain, by the same principle: if I modify the data in any block, I have to rehash it, update the previous hash in the next block, rehash, update, et cetera. For that reason, we can only append to the blockchain.

Blockchains are also often decentralized slash distributed slash public, and this is because the main feature blockchains provide is verifiability, and verifiability is only really valuable if you can do the verifying yourself: if you're trusting another entity to verify for you, then really you're back to trusting that entity. A lot of people think the main value of blockchains is exactly in their being decentralized, distributed, and publicly available, so everyone can see the blockchain and everyone can verify that none of the data has been tampered with; in theory, if there were a single party who could tamper with the data, they could just flip a bit and rehash, and no one would be around to catch it. So blockchains are usually decentralized, but they don't have to be; again, this is really just a data structure, and there are, not tons, but a handful of really valid use cases for private or semi-public, gated blockchains. Another way to think of a blockchain, if someone says, hey, I know you went to that talk, what's a blockchain: you can just say, a distributed digital ledger. That's the tweet-length version of how people describe blockchains. And this is a cool GIF from How It's Made, of making a chain.
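Her exact class isn't reproduced in this transcript; here's a hedged Ruby reconstruction based on the attributes she describes (index, timestamp, data, previous hash, and the block's own hash):

    require 'digest'

    class Block
      attr_reader :index, :timestamp, :data, :previous_hash, :hash

      def initialize(index, data, previous_hash)
        @index         = index
        @timestamp     = Time.now
        @data          = data
        @previous_hash = previous_hash   # the link back: the secret sauce
        @hash          = Digest::SHA256.hexdigest(to_s)
      end

      # everything except the block's own hash, as a string, for hashing
      def to_s
        [index, previous_hash, timestamp, data].join('|')
      end
    end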
Side note: I don't know if you all have ever spent time on Giphy watching how they make crayons, or Peeps, the little marshmallow chickens; it's wonderful, I highly recommend it. Okay, so here's a nice visualization of what I was talking about. It might be a little hard to read, but each of these blocks has an index, a timestamp, some data, and a hash, and that first block just has a zero as its previous hash, which is pretty common: you have kind of a nil block as your first block. Then each one links back, so if I were to change the first block, the whole hash of that block would change, and then I'd have to update block one, rehash, and so on. I know I keep repeating it, but if you get one thing out of this talk, it's this visualization idea.

Okay, here's a little more of my blockchain example; this is the rest of the class. I have the actual chain, which initially contains just the first block; my first block is, like I said, a bunch of zeros and some nothing data. Then I have a function for adding a block, which takes what it needs from the previous block, makes a new block appropriately, and adds it to the chain. For this example you'll see that my data is "so-and-so voted for Hermione Granger": this is going to be a vote counting system. Blockchains wouldn't really be used for gating who gets to vote, but they could be used to count the votes and to verify that nobody's vote was tampered with and that each vote was actually submitted, so you could go and verify that your vote was counted. This is one of the main use cases people talk about as potentially viable for blockchains outside of cryptocurrencies; we'll see if it actually happens, I wouldn't hold my breath.

So let's quickly switch over and look at those two things. We've already seen the block object, so I won't dwell on it; this is in Ruby because it's the language I write all day. I have this to_s function, which matters a little: it's how I turn my block object into a string so it can be hashed. I don't really know how, quote, real blockchains do it, but this is the index, the previous hash, the timestamp, and the data. I don't include the hash of the block itself, because if I hashed the hash, the hash would change, and it just wouldn't make sense; so I take the index, previous hash, timestamp, and data, and that's what gets hashed, with SHA-256, which is everything we just went over. Oh, and then I have this verify function: it checks that the previous hash stored in the next block matches the hash of the block I'm checking, and then I also recompute the hash, just to make sure, so I verify not only that a freshly generated hash hasn't changed, but that the hash stored in the block hasn't changed either. And then I have this little example script: I make a new chain, I've got a couple of people who are going to vote, I add blocks to the chain (everyone votes for the same person in this one), I print the whole chain, so you'll see what the output looks like, and I verify that everything actually registered. So, run that little script.
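Again hedged, a plausible reconstruction of the rest of the class from her description: the chain itself, add_block, and verify:

    class Blockchain
      def initialize
        # a nil-ish genesis block: zero previous hash, nothing data
        @chain = [Block.new(0, 'genesis', '0')]
      end

      def add_block(data)
        prev = @chain.last
        @chain << Block.new(prev.index + 1, data, prev.hash)
      end

      # every consecutive pair must link up, and each stored hash must still match
      def verify
        @chain.each_cons(2).all? do |prev, nxt|
          nxt.previous_hash == prev.hash &&
            prev.hash == Digest::SHA256.hexdigest(prev.to_s)
        end
      end
    end

    chain = Blockchain.new
    ['Harry', 'Ron'].each { |name| chain.add_block("#{name} voted for Hermione Granger") }
    puts chain.verify   # => true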
And you can see, again, the visualization: this is the hash of the first block, and the stored previous hash of block one is the same; similarly, the hash of this block is the same as the previous hash of the next block, and so on. I'll pause for questions. Yes, we will get there, don't worry, this isn't the end of the talk; it's just a lot to absorb. A little bit: the question was whether I'm going to cover how this gets distributed, and the answer is, a little bit; I won't get into the details, but I'll talk about consensus algorithms, if that's your interest. And sorry, you're saying that the trust is basically using the same certificate systems we use with websites and packages? Yeah, kind of, that's part of why we trust it. One of the ways to establish ownership of a block is asymmetric key encryption, and that is part of how we verify websites. So, I use Firefox, you should use it too: if I go, let's see, ooh, "my connection isn't secure," more information... no, this is not what I wanted... somewhere in the Firefox info page it shows which algorithms it's using... oh, I see, that's why it wasn't showing. Okay, so here we can see how my connection is actually encrypted. This ECDHE is elliptic curve Diffie-Hellman, which is how the keys are exchanged, and then RSA, with AES-128, is the public/private key part of it. So public key encryption is not the only thing securing my connection, I guess, is the point I was trying to make; there are more complex layers to SSL certs. That took longer than I thought it would. Cool, cool; everyone's like, I'll save my questions for the end, since you're probably going to answer them.

Just an observation: elsewhere we see URLs or addresses as pointers, and the interesting thing here is that the only pointer back to the previous block is a hash. In other words, that's why they talk about having the whole chain: if you don't have the whole thing, you won't know where to find a block, because there's no way to get to it if it's not in some data structure you're also carrying, the array, I guess? I mean, you could have the blockchain stored somewhere else, so it's not that you can't get at it if you can reach it somehow. The idea of decentralization is that if there's only one copy and it gets compromised, it's impossible to verify that it was compromised; but if everyone has a copy and one person modifies theirs, they can't say, oh, I have the real copy, because everyone else will say, no you don't, all of ours match and yours doesn't. Right, we'll keep going; there will hopefully be time for questions at the end. Boom.

All right, now that we understand the building blocks of blockchains, let's talk a little more about how they work. When I go to add a block to a blockchain, usually I make some kind of transaction, and my transaction gets bundled with many others, because there are so many happening at once. Then I generate a block for that transaction, or that bundle of transactions, and I try to submit it to the blockchain: I add it to my personal copy, and then I have to submit it to the true blockchain, because there is always only one true blockchain, called the authoritative chain. And there are a lot of blocks that
will try to get added and will not make it; they're called orphaned blocks, and they literally just stop existing: eventually my personal chain gets overwritten by the authoritative chain. But what happens to all of those transactions when my block gets orphaned, do they just not go through? No, not at all: they get resubmitted in another bundle of transactions, and that keeps happening until one of my blocks is accepted. So, the way we determine the one true chain is through consensus algorithms. Because the chains are distributed, two users can update their chains with a new block at the same time, or before the network can update everyone, and then there are two technically correct copies of the chain, and one has to win out. This is reconciled by having what's called a higher-value chain. What is value? Good question: it's usually determined by the consensus algorithm. There are many types of consensus algorithms, but two main ones that popular blockchains use. A really common one is proof of work, where you prove you've done computational work to have your chain accepted as the authoritative chain; I'll go into more detail about proof of work in the next slide. Another is called proof of stake, where you basically bet an amount that you're willing to put at stake in order to have your chain become the authoritative one, and whoever has the most at stake wins out.

Part of the consensus picture is that blockchains must be Byzantine fault tolerant. What this means, and it connects to the 51% attack: if I add a block to my personal copy of the blockchain, I can just claim, oh yeah, this is totally the authoritative chain, and you want enough other people in the distributed network to say, no, it's not. The idea of Byzantine fault tolerance is that if I come in with bad data, say a claim that my friend gave me a hundred dollars when they really didn't, everyone else can say no; but maybe I convince a couple of other people to get in on it with me, and they also update their local chains and say, well, we also have the copy Lucy has. The point is that you can limit bad actors: there's some mathematics showing that with two-thirds good actors, you can tolerate up to one-third of your network being bad actors. And the idea behind the fifty-one percent attack is that if a bad entity controlled fifty-one percent of the network, they could declare whatever blockchain they wanted to be the authoritative chain, and no one could really say otherwise, because they'd hold the majority of the network. Yeah? Sorry, say that again? You're saying that if someone tries to change existing information, it will show; but what if it's real, new data, how does that get accepted? Yeah, so, I won't say you can submit any data you want: there are different ways of verifying the data you submit, but that's a separate problem from the blockchain itself. The blockchain only guarantees that you haven't modified any previous blocks; it makes no guarantees about newly submitted data. And this is a big thing.
There are ways you can make it so people can verify they voted for who they wanted to vote for, but you can't verify that there isn't someone standing over their shoulder saying, hey, vote for this person, or that they haven't been coerced in some way; blockchains can only make so many guarantees. And the same controls we currently have in place that stop me from submitting a transaction I'm not allowed to make would also apply on a blockchain, so I couldn't just submit that new transaction. So, you're adding data in a distributed network: when my friend gives me a hundred dollars, does that reach only one node, or does the same add-data request go to multiple nodes? It goes to one node that I'm directly connected to, and then, through network magic, hand waving, each machine has multiple copies of the blockchain, they determine which one is the authoritative chain, and that gets distributed to all the other nodes. So everyone should end up with the authoritative chain: maybe not at the same moment, but eventually every node reaches the same consensus about which chain is authoritative. Before you made a block, you might have had multiple little transactions going into the data piece of that block; say the data is five different bank transactions of money moving, and your distributed network has a thousand nodes: do all of them see that tiny piece, the ten dollars moving? Yep: everything is recorded publicly. It can be encrypted, so depending on your blockchain's use case it won't necessarily be public knowledge, but everyone will be able to see that your friend gave you five dollars. Do pretty much all client transactions add one transaction onto the chain, or is there aggregation going on? I know in Nubis there's some kind of model where they bundle a bunch of stuff together, but I don't know how they build those mega blocks. I don't think there's netting-style aggregation, like, I got a hundred dollars but at the same time lost a hundred dollars, so it cancels to zero; there might be use cases where you can aggregate data somehow, so I guess my answer is maybe, and I don't know, but I think usually it's just multiple transactions from many people getting bundled together. Yes? About the update function: I generate a block locally and submit it to the consensus chain, and I lose; I might win a second round, but since whoever won got added as the latest block, now my hashes have to be updated to point at the newest one? Yep, the new ending block: it just goes through again as a new block with the same data. Your computer gets the authoritative chain and then generates a new block, with the same transactions, based on whatever the new last block is, and that all usually happens automatically; like I said, it's not that your transaction won't go through, it'll eventually get added. On that same topic, how do you avoid starvation, the situation where my job is always lower priority
than everyone else's, forever? Oh, actually, let me go to the next slide. One last point on consensus first: any given block initially has a relatively high probability of not being included, and that probability decreases exponentially as time goes on. No, sorry, it's actually the next slide that addresses your question. So I want to get into proof of work; like I said, this is a pretty common way of determining the authoritative chain. Let's see how we're doing on time, okay. Proof of work is basically a lottery, and it'll take me a minute to get there, but to answer your question: because it's a lottery, anyone can win, and your transactions get bundled in with many other transactions, so once someone else wins a block, you just make a new block, and eventually yours gets added to the ledger, precisely because it's kind of random.

So, with proof of work, you start with a nonce, typically a 32-bit number, and that nonce is part of your block; going back to my blockchain example, I would have a nonce here, just a 32-bit number (I'm not sure why I didn't add nonces, but anyway). I start with a nonce, I hash the block, and I check whether the hash is under the target, which is just a number prefixed with a certain amount of zeros. I think at the moment for Bitcoin that number is 32, but it gets updated every two weeks to keep block generation time relatively stable; I think it's around 10 minutes on average now for computers to find a nonce. So: we add the nonce to our block, we hash the block, we check whether the hash is under the current target, that is, whether the generated hash has the required leading zeros, and if it doesn't, we change the nonce somehow. Usually you just increment it, but it can be anything: you can walk the Fibonacci sequence, change it however you want. Since every block has new data in it, the hash will always be different, and so the nonce that brings the hash under the current target will be different every time. The reason this is random is that any number might be the winning nonce, so it's partly whoever gets there first; but on average it takes a predictable amount of computation, so you can generally prove that a certain amount of work went into generating that nonce. Yes? Will there be only one nonce, or could multiple nonces meet the requirement? There could probably be multiple nonces; I don't know for sure, but probably. Cool.

All right, I've kind of already covered this, but because each block contains a hash of the previous block, the thing I keep harping on, you'd need to rehash and update and rehash and update, which is computationally infeasible at the rate most blockchains are updated. So we can say with near certainty that blocks older than some point are safe. If I take a block from near the end of the blockchain, I might be fast enough, given enough computational power, to modify it, update the rest of the chain, and have that become the authoritative chain; but as a block moves further and further back, it becomes less and less likely that I'd have enough computational power to rehash everything that has been added since. So, as you pointed out, it doesn't
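A toy proof-of-work loop in Ruby, to make the nonce hunt concrete; a minimal sketch, with difficulty counted in leading zero hex digits, far gentler than Bitcoin's real target:

    require 'digest'

    def mine(block_string, difficulty = 4)
      nonce = 0
      loop do
        hash = Digest::SHA256.hexdigest("#{block_string}#{nonce}")
        # "under the target" here means: enough leading zeros
        return [nonce, hash] if hash.start_with?('0' * difficulty)
        nonce += 1   # incrementing is typical, but any strategy works
      end
    end

    nonce, hash = mine('index|prev_hash|timestamp|data')
    # new data in the block means a different winning nonce every time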
ensure that the data is valid at all; all it can really guarantee is that the data hasn't been modified. Let me talk a little about decentralization. There are a couple of benefits to decentralizing your blockchain. There's no single point of failure: again, if I only have one copy of the blockchain, I have no way of verifying it. And there's no central authority that has to be trusted: the benefit a blockchain gives us is trust that the data hasn't been modified, and either we can verify that ourselves, or we have to trust another entity. Maybe we do trust that entity, maybe they're our employer and we kind of have to, but the guarantee is much stronger if it's publicly verifiable; it provides some of that security through publicity. There's also a related question of whether blockchains must be public, and I really like this quote from Wikipedia: "an issue in this ongoing debate is whether a private system with verifiers tasked and authorized by a central authority should be considered a blockchain," followed by five citations; I think the citation list alone connotes what a sensitive topic this is. Personally, I think blockchains are just data structures, and there are valid use cases for private blockchains; I know some people are using them for things like identity management in a workplace. Opponents argue that private chains don't really support that data verification, so they aren't protected from operator tampering, and so are basically just linked lists that you don't have access to. Words are hard. There's a handy visualization I'll open up; it might be a little hard to read, but it goes through different cases of who can see your blockchain and whether that type of ledger or blockchain is appropriate; definitely recommend checking it out.

Okay: when should you use blockchains? Sorry, I'm running a little low on time, but "Blockchain beyond the hype" is a really great resource for deciding whether a blockchain fits your use case. It goes through it step by step, it's clear, it's well researched, and they cite all their sources. I think I have a summary here. If you need your data to be accurate, secure, and immutable, those are the three main characteristics, a blockchain might be for you. If you have an ever-growing dataset, and append-only is okay for your use case. If you have mainly digital assets. If you want to remove intermediaries, all the people who would traditionally verify transactions or keep track of them, a blockchain might be a digital way to do that and keep the ledger yourselves. If you have primarily transactional data. And if your contributors don't trust each other, because again, the main benefit of a blockchain is that everyone can verify it, so you don't have to trust all of the actors. So, a few bullet points of pro-blockchain use cases. And there are a couple of industries or sectors where blockchains seem to be of particular interest: digital currencies, probably the one we've all heard of; crowdfunding, and there's a crowdfunding site that uses the blockchain as a ledger to record everyone's
donations; prediction markets are a thing; and then the voting use case I talked about: verifying that you're registered to vote and registered correctly, plus ballot counting and verifying that your vote hasn't been tampered with. Okay, when shouldn't you use blockchains? A quote I really liked from a talk I listened to: blockchain solves a social problem, not a technical problem. Blockchains are just linked lists, and if you just need a linked list, that's probably a lot simpler to implement. Good reasons not to use a blockchain: because it's the new shiny, or because it's a buzzword; or if you don't want your transactions stored forever, because they will be stored forever. Basically, don't use it just to use it; I guess that's the takeaway.

Okay, I wanted to go through a couple of use cases outside of cryptocurrency, so I made this little award show called The Blockies. The most charitable use case I was able to find is the World Food Programme's Building Blocks: it's how they keep track of the aid they send to various communities, and people can pay for various resources just using their smartphones. It's not actual money being exchanged; it's kind of a World Food Programme points system, so to speak, so it's sort of a digital currency, but that's how they track resource movement and people's needs around the world. The most democratic use case I found is Follow My Vote, a voting platform: you cast a vote, it gets stored in the blockchain, and it provides some of those features of people verifying who they voted for, everyone having copies of the blockchain, et cetera. The most up-and-coming, and one I'm personally really excited about, is supply chain tracking. IBM is actually working on this, and if you go over to their booth they have people working on it. It would let us verify where goods come from, which is really valuable for things like making sure your diamonds aren't blood diamonds, or, more recently relevant, food contamination tracking: if everything were on a public ledger and we could see that this infected romaine lettuce came from this farm on this day, we wouldn't all have to throw out our romaine, just everyone who had romaine from that farm in that time period. We're probably a long way off from that, but it's a potential use case I think is really valuable. And then the cutest use case: CryptoKitties, oh my gosh. It's just a website where you breed cats, each one's unique, they get added to the chain; that's it. All right, there are lots of other use cases, but I have about eight minutes left and want to leave time for questions. So: why aren't blockchains widely adopted yet? Lots of reasons. We'll pause here for questions; I think this is about it. Yeah, let's mic you up. Given that some of these blockchains grow rather rapidly, how big do they get, and how do you handle that: who can keep a blockchain of gigantic size? I think Bitcoin now is at something like 45 gigs, I want to say. If you're sufficiently motivated, you can store that; it's mostly just text data, you're not usually storing videos on the blockchain, so that does
make it a little easier. I don't know yet what happens if it gets to be petabytes; I think it will just get harder to add to the chain, but I don't know. Yeah? I heard you could use them for contracts: if you wanted to share your medical information, but only with the clinical trial you're taking part in, you could limit it, share it from this day to this day, or only share the results? So, smart contracts are a thing. I don't fully understand how they're related to blockchains; I did learn a lot about them and they seem really cool, being able to say "if this, then that" for a contract, so that if you go back on your contract, then, because it's digitally verified, I'll know about it. I want to say it was Ethereum, but maybe some other organization that does blockchain things, that came up with smart contracts as a way of automatically verifying a contract was upheld, and punishing the party if it wasn't, and I think that's how the two got conflated; but maybe they're more related and I just don't know. Sorry, he's coming with the mic. Okay, so, proof of work: it sounds like the key thing is that target. Going back to when people said, oh, I made 50 Bitcoin on my laptop, and now it's really hard: what I'm getting is that zeros have been added to the target over time, so it's harder to hit; it started out where anybody could just, boom, get it, and over time more zeros were added, so it's harder to get there, and that's the proof of work; is that the picture? Yes, the target is totally artificial, and it gets recalculated every two weeks to balance how hard it is. You want it to get harder in order to limit the number of Bitcoin out in the world, because you get a little bit of Bitcoin when you mine a block, and generally, for other cryptos too, you don't want infinite currency, and you want to reward people for the amount of work they're doing. To reiterate for the microphone: that applies to cryptos, usually; Bitcoin was my example, but the target doesn't have to be artificial like that, you can set it however you want, and you don't have to make it harder over time. I'm sure this will do more to demonstrate my ignorance of the process than anything, but from a security standpoint, is there a way to slow the update of the chain such that you could conceivably have enough computational time to alter the chain without the network's knowledge; for lack of a better term, a DDoS on the chain itself, slowing the update so you'd have enough time to alter it, even in a small segment? Yeah, I mean, you'd need that majority, going back to the 51% attack, I think; if you got enough nodes on the network, and control of them, then yes, you could take over the chain and update it, but you'd need a lot of computational power to do that for a sufficiently large chain. Won't blockchains be in danger with the emergence of quantum computing, which has the power to actually do this sort of stuff? Say that again? Quantum computing. Oh yeah, quantum computing: as it becomes more
of a thing, I believe it could crack, reverse, a hash, something that would take a lifetime today, in five minutes or so. Yeah, so I actually talked to someone who works at IBM's quantum computing lab, Paul McKinney; he's a wonderful human, if you ever get the chance to meet him. He gave a talk last year on where quantum computing is at, and he compared quantum computing in 2018 to where what we think of as modern computing was in the 1930s. So it's really early days, they're pretty much useless now; but 70 or 80 years from now that could definitely be the case, we'd have to rethink a lot of how we interact with each other, and it would definitely break the blockchain, among other things. Other questions? All right, well, thank you very much. I'll also hang out afterwards, so feel free to come talk to me about anything.

Testing, can you guys hear me? Great, okay. I think it's three o'clock, so I'm going to start, if that's okay. My name is Dima, and I use the shell a lot. Let's see, does this work? Is that too small? Yes, it's too small... okay, I can work around that, give me one second... how about that, is that better? Great. So, I use the shell a lot, and I wrote two tools to help me in my own day-to-day work. I've been finding them very useful, and I've been trying to proselytize a little bit, because I think other people would find them useful as well. There's plenty of documentation on these things; I could stand up here and read you the manuals, but I don't think any of you want that, so instead of slides I'm going to do live demos, so you can see the things work in real time; feel free to ask me questions as I go. The two tools: one is called feedgnuplot, which reads data from a pipe and makes plots, and the other is called vnlog, which is essentially a wrapper around awk, and is able to manipulate textual data in very useful ways, as you'll see. Everything's free software, and everything is in Debian stable; the plotting tool is the much more mature of the two. A few more things: these are in the spirit of Unix, in that they're both collections of very simple tools that you can string together with pipes. They're very generic, and they're both meant to work with the rest of the normal Unix tools. That means neither is a perfect choice for any one application, but they're a good choice for a whole wide range of them, and the whole idea is that it's all standard, so you can very quickly generate pipelines that solve your problem and move on. The overarching philosophy is to create as little new knowledge as possible: as you'll see, most of these tools don't just work like the tools they wrap, they are those tools.

Okay, the plotting tool has the more obvious applicability, so I'll talk about it first. Let's visualize my standard input; let me actually do that. Perfect. Before we can plot anything (you cannot see that, right? good), let's make some data to plot. I'll just run seq 10, which generates the numbers 1 through 10, and there they are. This is a normal tool, we all have it on our machines. And in its most basic form, I can take that and send it to the plotter, nothing
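A minimal sketch of that first demo (the option spellings follow feedgnuplot's documented interface):

    seq 10                                  # prints 1..10, one number per line
    seq 10 | feedgnuplot                    # plot those points, default settings
    seq 10 | feedgnuplot --lines --points   # same data, drawn as lines plus points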
else. If I do that, I get this. It's scaling things in a funny way, but that doesn't matter: there you go, you have points, 1, 2, 3, 4, 5, and so on. We didn't ask for anything specific, so we got the default settings. We can do fancier things: if I want lines and points, I can ask for that, and there you go, lines and points. Very simple. I should say that most of the options to this tool feed directly into gnuplot, without any interpretation, which means that if you know how to use gnuplot, you already know how to use this tool, and if you don't, you get to learn two tools for the price of one. Very common things like lines and points do get specific options, like --lines doing the right thing. As a demo of passing things through: if I want something fancier, say lines and points with a particular point size and a particular point type, I can just do this. My tool does not know what this string does; it passes it straight down, which means the tool is as powerful as the underlying one. If I run it, I get circles, I get lines. It might look cryptic, but it isn't, because that's just what gnuplot does. Furthermore, if this tool isn't sufficient for whatever reason, or if you're trying to debug your command, you can pass --dump, and it will emit the gnuplot script that runs inside. If you do that, here's the script: it's a normal gnuplot script, so you can save it to a file and hand it to gnuplot and it'll work, or you can add to it, whatever you like.

Okay. Since this reads data coming in from a pipe, all the normal pipe things work. For instance, if you're reading a pipe carrying real-time telemetry from something, you can plot the data as it arrives and get a real-time plot that shows you something interesting. I can make up an example: a little loop that generates some numbers, sleeping after each one, so it produces data at one hertz; it sends that to the plotter, I pass the option to have the plot re-update itself as it goes, and give it some particular styling, doesn't matter. If I run this, what you get is, if this works, there we go: this is updating at one hertz, and it traces out a sinusoid. Made up, obviously, but you can see it working. Now let's change it into something less made up and more obviously useful. This is a laptop; it has a bunch of temperature sensors in it, and they measure an actual thing, so let's read those and plot them in real time, and see if we can make something interesting. If I read this magic file, it gives me the current temperature readings: it says "temperatures:", and then each number is some particular temperature probe. In order to plot this, you need to get rid of the "temperatures:" part, but the rest is just numbers, so I can plot them. So I do exactly what you'd expect: a little while loop; in the loop I read the file, strip the "temperatures:" label, print, sleep, and pipe the whole thing to the plotter. If you do this, you get this, same as last time:
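A sketch of that loop. The sensor path is an assumption: on ThinkPads, /proc/acpi/ibm/thermal prints a line like "temperatures: 46 43 ...", which matches the output described here:

    while true; do
      # drop the "temperatures:" label, keep the numbers
      sed 's/^temperatures://' /proc/acpi/ibm/thermal
      sleep 1
    done | feedgnuplot --stream --lines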
If I run this, I get the plot: it's updating at one hertz; I can ask for different update rates, but one hertz is the default. The machine isn't doing anything interesting right now, but I'll make it do something interesting in a minute. You can see there's a whole bunch of sensors sitting at minus 128; those sensors clearly aren't real, they're outputting bogus numbers, so let's get rid of them and make this a little nicer. I read the file, pipe it to awk to print only the columns that I know have data, sleep, and send it to the plotter. We can also do some nicer things: we can label the axes, we can put a title in; you can get as fancy as you like. When you do that, you get... this thing is misbehaving, not exactly sure why, this is my Emacs, it's not happy, hold on... there we go, okay, fine. The same thing as before, but now we have labels, we have time, and you can zoom in and do all the normal things. I can prove to you that it's doing an actual useful thing by spinning a core. If I go over here and make bash run an infinite loop, one of my cores is now sitting there evaluating "true" over and over again, and you would expect something to get warm. There we go, so that works: the data shows that clearly the tool is measuring something real. It also tells us that the CPU's temperature sensor is sensor zero; I didn't know that before, but now I do, so we learned something. If I turn the loop off, in a few minutes you'll see the temperature go back down, so that's a useful thing. Funny aside: when I did this earlier, it was actually probe 7, so something changed in the last hour; someone should look into that. Okay, so you can imagine what you might actually want to do: you might have a long-running process that logs temperatures, and that process might want to log the data to a file, so you might have hours or days or years of data, and you might want to plot it in various ways. So what I can do in my shell is this; let me make it bigger, this screen's resolution is different from what I expected. Right, so I have the same loop that reads the temperatures, but instead of plotting, I just send it to a log file. That thing is now going in the background, and you can imagine it's been running for months, just accumulating data. And I can go back and plot it, same as before. There we go: this is the data I just gathered; not interesting anymore, but that's how it is. But since it's just normal files and pipes, I can give the file to tail -f, which instead of reading the whole thing reads just the last little bit, and as more data is written to the file, it plots it, in real time. This way I get the best of both worlds: I log the data to a file, and I also get the real-time telemetry from it. So if I do this, and there's something with the Emacs again, there we go, that's the same thing. I get both logging and visualization together with just normal tools I didn't have to invent, and that's pretty nice. So yeah, this is pretty useful, and it took no effort.
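A sketch of that logging-plus-tail combination, with the same assumed sensor file as before:

    # log the readings to a file in the background...
    while true; do
        sed 's/^temperatures://' /proc/acpi/ibm/thermal
        sleep 1
    done > /tmp/temps.log &

    # ...and re-plot the tail of that same file as it grows
    tail -f /tmp/temps.log | feedgnuplot --stream --lines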
We just did the normal thing. One question that comes up is: why are you doing any of this data analysis in the shell, when you could use Matlab or Python or Excel or something? The reason is that we already live in the shell, and it's super general: you can hook it up to all sorts of other stuff that Excel wasn't designed to talk to, and that's the big one here. And you can get as fancy as you want. You can do remote telemetry: you SSH to a remote server, read web server logs, compute statistics with awk or something, and with SSH the pipe just transfers over, so you can remotely monitor stuff, and that's pretty useful. So this was just the bare minimum of the tool; it can do a million more things. You can manipulate the sizes of things, the colors of things, the symbols; you can make plots in three dimensions; you can make files that plot themselves, which is pretty useful, because you can just send someone a data file, they execute it, and it plots itself; you can make PDFs. It's fancy. All right, so this is useful and so on, but if you think about the temperature example a little more, there are some things that would be really nice if it did. If I've been logging for months and I go back and look at the file, I don't know what the data means anymore: what do these columns mean? Is this Celsius or Fahrenheit? What's my logging rate? Right now I know it's one hertz, but I'm going to forget. So comments in general would be nice, field labels would be very nice, timestamps would be good. And this is a good segue into the other tool and how it resolves these sorts of things. What you've just seen is what I used to do for years before I wrote this toolkit: generate data, use awk to munge the input in some way, and then do whatever you want with it. vnlog, which is this other toolkit, lets you do that in a much nicer way. So first off, what is this? vnlog is a data format, and it's 90% something that would just work with awk anyway. It's an ASCII table: the record separator is the newline and the field separator is whitespace, which is what awk does by default. You can have comments: any line that starts with a hash mark is a comment, and the first comment labels your fields. So here's an example: this file has a time column and a temperature column; there are some numbers, some more comments, and some more numbers. That's it. And since hash is a comment, you can give this to any tool that knows about that convention: numpy's loadtxt will just work, you can load these into Excel or whatever you like, so you're not tied into a particular workflow, which is useful. And it's just text: you can write these with printf if you want. The toolkit provides some nicer ways to do it, but you don't have to use them if you don't want to. And again, as before, everything is pretty close to being a wrapper around something else, so you don't have to learn much new; the learning curve is very, very friendly.
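To make the format concrete, a tiny temperature log might look like the following; the legend line naming the fields is the first hash comment:

    # time temperature
    1552000000 47.0
    1552000001 47.5
    # spun up a core here
    1552000002 61.0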
I could say more about the format, but I don't need to. The toolkit has some libraries to help read and write these things, and some command-line tools to do various manipulations, and the tools are much more interesting than the libraries, so I'd like to do some demos with them. All right, let's do the logging the same way as before, but before we actually read the device, let's write our comments: echo a comment saying here are my temperature fields, and then the rest is the same, writing into the file. Let me do that in my shell. Okay, that's going; we're now sitting there gathering the data. The extra line is just a comment, so I can feed this file to the exact same plotting command as before; it doesn't interfere at all, and it just works, same as before; I didn't have to change anything. But since the file has this extra little header, I can do more things. If I tell the plotter that this is a vnlog, it picks up the labels. Look at the legend: before, the legend said 0, 1, 2, 3, because all it could tell me was which column the data was in, but now it tells me what the data actually is. Here the data is just temperatures, but you can have a million different things, and useful labels are very important; and I didn't have to tell the plotter "column zero is this, column one is that", which is nice. And I can do much fancier things. This is a made-up example: let's say I want to pull out just the column for the temperature sensor sitting on my CPU, let's say I want Fahrenheit for some reason, and let's say I want to rename it. I'll get into more detail about this later, but in its most basic form the filter tool is able to select particular rows or particular columns, do these manipulations, rename things. This piece selects columns: here I select a column called CPUtempF and set it equal to this expression. This tool is actually a wrapper around awk, so the expression goes into awk directly; the only change it makes is that where you name a column, it converts that name to the corresponding $N field reference. And that's it, so this thing is as powerful as awk is: if you want to know what it supports, the answer is "man awk". If I do this, there you go: you get the one column, and here are the numbers. And note that the format of the output is exactly the same format, so you can chain these things together: the output still has the comment that tells you what the fields are (they're different now), and it still has the data, so you can feed it to whatever the next thing is. For example, I can take the file I have now, pull out that one column converted to Fahrenheit, and plot it. This is 100% the same plotting command; it just sees a different thing, with a different name, and you can see it's the same data as before, but in Fahrenheit. Okay, we're good. Great.
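A sketch of that column-picking step, assuming the log's legend names a column cpu_temp; as I understand the tools, vnl-filter's -p option picks (and can rename or transform) columns, and feedgnuplot's --vnl option makes it read the vnlog legend for its labels:

    # pick one column, converted to Fahrenheit and renamed; the expression
    # is handed to awk nearly verbatim, with 'cpu_temp' mapped to its $N
    vnl-filter -p 'CPUtempF=cpu_temp*9/5 + 32' < /tmp/temps.log |
        feedgnuplot --vnl --lines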
Okay, so I want to do another demo of something completely different; it'll show a few more of the analysis tools, and then hopefully I'll finish quickly and take questions. I'm a roboticist, so in the day to day you talk to a lot of robots that move around and produce telemetry, and you might have a vision system that tries to look at them and find out where they are. One common set of tools, and these aren't mine, but I'm going to demo them, is called AprilTags, which are similar to QR codes. Here's an example set of AprilTags; this image is from the website of the lab that developed them. They're similar to QR codes, with the difference that they encode a lot less information than a QR code, but they do it much more robustly: the tag can be messed up, there can be a whole bunch of noise, but you can still see the tag, figure out which tag it is, and compute exactly the coordinates of its corners in the image. These things are great; I have nothing to do with them, I didn't write them, but they're useful. So what we can use my tools for is to figure out just how robust these things are to noise, because if you're going to use them, that's actually pretty important: how dark can you make the image? So let's do that. They have a lot of demo applications; this is the AprilTag detector, and I can ask it to produce vnlogs. I'm giving it an image, the one I showed you; I run it, and bam, there it is: it produces a log file with a whole bunch of stuff in it. I should say that in my day-to-day work I now use these log files for everything; I routinely make log files with hundreds of columns, and I can't read them, but the tools can, and they can pull out whatever they need. This one isn't that bad, relatively speaking: you get the path to the image (here there was just one image), you get how many detections there were (seven here), the hamming and the margin, which are various metrics the detector gives you, the ID, which says which tag it is, and the various pixel coordinates. One problem with this is that as a human it's not very easy to read, because the columns are unaligned. So I can feed it to the aligner tool, which keeps the data the same but adds whitespace to align the columns. If I do that, I get the same data, but now I can sort of see what's what, which is nice. Useful. Another thing I can do as a human is eyeball it: there are some numbers with coordinates, they sort of look fine; I could manually open the image and poke around at the coordinates to see if they're correct, but it would be nice to visualize the whole thing at once, so you can directly see if it's doing something incorrect. So I pull out just the data I want to visualize, in this case the ID and the center coordinates of each tag. There's a reason I'm picking them in this order; ask me later if you want to find out, it's all in the documentation, but trust me, this is what you want. When you do that, these are the results: some x coordinates, some IDs; I think you can see where this is going.
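Sketched out, those steps could look like this; apriltag_demo and its --vnlog flag are stand-in names for the detector, while vnl-align and vnl-filter are the vnlog tools, and id/cx/cy are assumed column names:

    # hypothetical detector invocation emitting detections as a vnlog
    apriltag_demo --vnlog tags.jpg > detections.vnl

    # pad the columns with whitespace so a human can read the table
    vnl-align detections.vnl

    # keep only the fields we want to visualize
    vnl-filter -p id,cx,cy < detections.vnl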
Then I can take this, pipe it through the plotter, the same tool as before, and tell it to use that image, which overlays the plotted data on top of the image, so I can clearly see whether it did the thing I wanted, plus a couple of options that don't matter. The result is this: this is the image, and these circles are what got plotted. They're kind of hard to see, so let me actually turn the image off. These circles are the centers of the detections, and the colors tell you which tag it is; for example, yellow is tag number ten. And this is kind of funny: why are there two of those? Let me put the image back... ah, you can see these tags are mounted on a cube, and there really are two of the same tag visible on its sides. So it actually did the right thing, and here you can see it identified them properly and put them in the right place. AprilTags are good. Another thing these tools give you: if you were doing this manually, you would maybe spot-check two of the detections, but here I checked them all. If some of them were wrong and I was doing this without the tooling, I might have missed them; here I know they were all right. Okay, I don't want to get too deep into the details. Let's see how robust this thing is to noise. In real life I would have gathered a whole bunch of images in different lighting conditions, some of them dark, and looked at those together. Here I'm going to fake it: I'll write a little loop that uses ImageMagick to change the brightness and contrast and add some noise. So I write my loop, and here are the images it made, and I can flip through them. Looking at these as a human, some look pretty terrible; if the detector failed on this one, I wouldn't hold it against it, but we can check. I have a separate image for each setting, and the actual contrast number is in the file name, so this one has a contrast of minus 30. And since I'm faking it, I can also write a little loop that produces a vnlog of the contrast for each image: it just tells you the path and the contrast number. In real life I would have a little tool that actually looks at each image and measures how bright it is, but for this talk this is fine. Okay, so I have that; now let's also run the AprilTag detector over all the images. I could make a loop and have a separate file of detections for each image, but the choice I made here is to make one file with everything in it. I run it, it crunches for a while, and there we go: this tells me about the detections across my whole dataset, and fortunately I have tools, so as a human I don't actually need to read it.
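That batch step might be sketched like this, with made-up contrast values and the same stand-in detector name as before:

    # make darker/lighter variants; the contrast value ends up in the filename
    for c in $(seq -40 10 40); do
        convert tags.jpg -brightness-contrast "0x${c}" +noise Gaussian "tags_c${c}.jpg"
    done

    # run the detector over all the variants into one combined log
    apriltag_demo --vnlog tags_c*.jpg > detections.vnl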
Okay, so I have two logs: one has detections, one has contrast numbers, and I want to compare them together in some way, so I'm going to join them. GNU coreutils has a join tool; how many of you have actually used it? Nobody... one person... three people, okay. Before I wrote this, I knew the tool existed but barely used it, because it's kind of a pain: you have to match up the columns by number. Now I use it all the time, because I have vnl-join. It's a wrapper around join, so 90% of what you need to know will come from the join man page, which is not mine, thankfully. And as with all these wrapper tools, the main difference is that any place where you would give it a number, you give it a name instead: rather than saying take these files and join column 3 and column 7, you say join on "time", or whatever you're doing. Another nice advantage is that if your log files change, and time used to be in column 3 and is now in column 5, my command stays the same; that actually turns out to be very important, because the numbers are very easy to get wrong. Anyway, I call vnl-join, joining on the column that tells me which image it is, here are my two log files, and I send the output to a file. If I do this, bam: the path, the detections, and at the end one more column, contrast, which came from the other file. Now I have one log file with all my data in it, and I can visualize things together. This is an inner join, for all you database people; the join tool can do outer joins and various other things, read the man page, it's surprisingly powerful actually. And as before, the output is also a valid vnlog, so you can chain it and send it to other tools, all that wonderful stuff. And just because I can, let's run it through the sorter: it works the same as the system sort, except I'm sorting on "contrast", not on $2 or whatever. There you go. I'm not actually using the sorted output for anything; this is just to show that you can, and I do this all the time now, which I did not anticipate. All right, let's see how well the detector actually works; let's look at the contrast. Here I'm doing a filter: I pull out just two columns, the contrast and the detection count; the plus means throw out all rows that don't have data in that column; and I send it to the plotter so we can see it. If I do that, I get this: the x is the contrast and the y is the detection count. It looks like for contrast values either way too low or way too high you see only six tags, but for the others, seven. So it mostly works across the whole range: even in the darkest image it didn't find zero or one tag, it found almost all of them, which is kind of shocking. This already gives us an indication of how good this thing is, but we have a lot more data, so we can look at more things.
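A hedged sketch of that join-and-plot pipeline; path, Ndetections, and contrast are assumed column names:

    # inner-join the two logs on the image path
    vnl-join -j path detections.vnl contrast.vnl > joined.vnl

    # detection count against contrast; --domain uses the first
    # column as the x coordinate
    vnl-filter -p contrast,Ndetections < joined.vnl |
        feedgnuplot --vnl --domain --points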
So first, let's just confirm that it really did work in the darkest image, because that image is awful. I take the joined file and pull out the detection coordinates, but also only the rows for which the contrast is the darkest one, minus 40; that gives me only data about the darkest image, and I send it to the plotter as a sanity check. There we go: indeed we have a terrible image, and it actually kind of works. You can see that the missing detection was this horrible one in the corner, the one where it was a little ridiculous that it found it in the original image in the first place; if I zoom in, with the original image it was able to find the detection here as well. So that's the difference between the 6 and the 7, and that one looks awful, so I'm okay with it not being found. Okay, so apparently it worked. One remaining question is how consistent the hits are: these are all really the same image, so if this thing worked perfectly, the coordinates it gives for a tag should all be exactly the same. There's noise, so they won't be, but we can visualize how bad it is, and I can do it this way. This is pretty much the same command as before, except here I don't care which tag it is, I just care where it is, and I'm taking all the images together. If I zoom in on an arbitrary one, ah, there's a spread for this particular tag, so it's not perfect, but it isn't terrible either. This screen's resolution is awful, but if I measure the distance from, say, here to here, that's 0.2 pixels, which is pretty good. And I can turn off the image if I want, and the color tells you which tag it is. One thing you might discover doing this is an outlier: you might see one detection that's way off, and with plain crosses you can't really tell which tag it belongs to. But I can plot things with labels. If I do that (and it helps if I do this; sorry, this is a screen thing, you don't need to worry about it), it's the same plot as before, except instead of crosses it prints the ID of each detection, so if there happens to be an outlier, it tells me which one it is, and then I know what to do. Right, so that's done. Okay: we saw that across a pretty wide range of contrasts the detection counts are pretty good, and we saw that the spread looks reasonable; we could do more analysis, draw histograms, and look at the distributions if we wanted, but I think this is fine for this talk. One last thing we can do: if we look at the actual output of the detector, it gives you some error metrics, a hamming distance and a margin. So we can ask the detector how well it thinks it's doing, by its own metrics, and we can look at that against the contrast. I plot the contrast and the margin all together, and you get this: it looks at all of the images and all of the tags, the x-axis is the margin, how well the detector thinks it did, and you can clearly see a parabolic trend: in the images that weren't too dark or too light, it thinks it did a little bit better. And that's it. As before with these demos, the tools are more powerful than this and there's more you can do, but feel free to ask me stuff or look at the documentation; there are demos online. That's it, any questions?
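For reference, the pipeline behind that last margin-versus-contrast plot might look like this, with the same assumed column names as before:

    # the detector's own confidence metric against image contrast,
    # one point per tag per image
    vnl-filter -p contrast,margin < joined.vnl |
        feedgnuplot --vnl --domain --points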
Let me get you the mic if you have questions. "Will you share your org-mode file on GitHub?" I can; where should I... maybe in the repo? Yeah, let me do that, I'll put it in the repo. "Do you have any examples where you're doing, like, top 10 in a column, or counts, or averages, something like that?" I don't have examples handy, but you would do it exactly how you would do it anyway. If you wanted the top 10, you would do a sort and then a head; if you wanted averages, what I would do without this is write a little bit of awk to keep a running sum and divide at the end. And I should say that with the vnl-filter tool, all the modes you've seen me use so far only pick out rows and columns, but since internally it just generates awk and runs it, you can use that mode directly: if you pass --eval, you quite literally give it a bit of awk, where the only difference is that instead of saying $5, you say "time" or whatever the column is called. So that's how I'd write the awk, but inside vnl-filter, so I don't have to remember which column is which. "Could you display the top of the document that shows where the repo is?" That is a great idea; right there. "Can you do histograms?" Oh yeah, every day, and it's a trivial thing; I'll just do a quick one for you: seq 100, pipe it to the plotter, tell it which column to histogram, that's the one column we have, and say bin width one. So what do we expect? Yeah, boom, there we go. And while we're on the subject, let's make it slightly more interesting: let's say I want $1 times $1. I'm not totally sure what to expect; let's find out. What did I do... yeah, wrong quote, there we go. Okay, so this isn't interesting: low counts everywhere, I need a bigger bin width; hold on, 50. Ah, okay. "Anything else? Well, thank you very much." "Oh my god... just kidding, I'm kind of a plant: I've heard that this is also in Debian?" Oh yeah, both are: feedgnuplot has been in Debian stable for a few years, and vnlog is in sid; we're going to do a new release in a few months, probably, and it will be in the next stable. Thank you very much.

The last talk for this track will be in 50 minutes, back in this room: "Swimming in the Countywide Data Lake", on how LA County gets all of their data ready to process.

One, two... hello, can you hear me? Okay, good, awesome. We'll give it just another minute here. "I don't know if you have anything special to say at the beginning, or if you want me to just rock and roll; however you want to go." We can rock and roll; all the other speakers, when I asked if they wanted to be introduced, looked at me like I had grown a third head and said "don't, why would you do that?", so I didn't bother to ask John either. Everybody, this is John Neil; he's going to be talking to us about the countywide data lake, and I will let him proceed. He did update these slides to be technical as opposed to governmental, so hopefully this will give us the perspective we need. Yeah, thank you. My name is John Neil; I work for the County of Los Angeles as a program manager over a data science and analytics team, as well as electronic forms, enterprise content management, BI, all kinds of stuff like that. I'm going to talk a little bit about a
journey that we're on with trying to provide an analytics platform. I've done this presentation before; it started out super technical, and over time I've been giving it as executive-type talks, so I had to dial some of that back. It's not super-duper technical, but for the executives I normally talk to, when I start talking about product names, you can see it. So I'm going to walk through this, give you an idea of what we're doing and what our challenges are, and hopefully you can get some interesting information out of it. First of all, really quick, about L.A. County. L.A. County is really big, and even I sometimes lose sight of the fact that we serve 10 million constituents; it's 4,000 square miles and 88 cities. We had a conversation yesterday with some folks from the city of Santa Monica, and they were saying "oh, but we're working with the city of L.A., which is so much bigger", and both L.A. City and Santa Monica are completely contained within L.A. County. So it's big; we have a huge budget, we have a lot of employees, et cetera. The department I work for does the technical stuff, development; we run a data center for the entire county; my particular department also contains the Department of Public Works, so we fix potholes; and we're the contracting agent for the entire county. So there's a lot of different areas my department covers, and it's really important that we're trying to become the trusted partner and provider of choice. That ties back into what I'm going to talk about, because if you don't offer something that people can use and want, they go someplace else, and the places they go aren't always the best places. So before I dive into the nuts and bolts, let's talk about the challenge we were trying to solve, because if you start with the technology first, which we like to do because it's cool, you miss the point: the technology is only there to serve and to solve problems. I have to keep that in mind all the time, because I just want to dig right into the bits and bytes. So this was, and still is, the issue we deal with at the county: we are data rich, with data everywhere, but analytics poor. And I point this out because of the variety of divisions and departments we have to service: family and well-being, public works, parks and recreation, beaches and harbors, mental health facilities, general government operations, the treasurer-tax collector, the registrar-recorder who runs the voting systems, public safety, the sheriff, the district attorney, the fire department; you name it, across the entire gamut, it's crazy. And every one of those is a silo of data, and of problems that can be solved with data. So that was the problem statement: how do we solve this? One of the ways is to think big: it's a big problem, so think big things. And this is sort of the agile manifesto approach: think big, start small, learn fast, and iterate quickly. That's a challenge at the County of L.A., I'm just going to tell you right now. I'll come back to "think big", and the "start small" is kind of sort of
what this platform is. We have sort of a vision statement, if you will: we wanted to create a flexible, scalable, and cost-effective analytics platform to meet our customers' needs. And the unwritten piece, it's not in there, is that I didn't want to do it with a one-vendor solution; we were going to try to do this on open source. It's not written into our vision, but it informs it quite a bit. So, our vision: every area has its own silo of applications and data, along with their own data warehouses, their own analytics if any, their own reporting sitting on top of it all; that's what things look like now. What we're looking to do, and again, this is a vision, is to have one place for all this data, at least for the purposes of analytics, and some sort of sandbox, an area where you can do exploratory data science and analytics without disrupting the existing ETLs, data warehouses, and so on. Here's another view of that vision. And by the way, having something called a data lake offers you all kinds of opportunities for every water-based metaphor you can imagine, so I've managed to find as many graphics and pictures as I could that have water in them: oceans, harbors, reservoirs, it never ends. The idea is to have a place where you can ingest data of all kinds, a confidential area where you can do exploratory data science and try things out, and then a place with published results, stuff that your customers, the consumers of the data, can rely on. That's the vision. Here's the reference architecture we use to meet this vision; you're going to see it a lot in this presentation, and I'm going to point out all the different bits of it. Let me lay out the high-level pieces, and then we'll dig in. You have sources of data, and by the way, just about any data platform is going to have these general pieces: data sources; ingestion, a way to get the data in; a place to put the data; visualization and analytics on it; security on it; governance; a catalog, so you know what's in it; and ultimately this stuff needs to be consumed, both by modern tools and, since the county's got nothing but legacy crap everywhere, by that too. And this model transcends where it physically is; it could be in your basement if you want. Now, this slide is an eye test, so I'm not going to read every piece of it, but basically we came up with 13 functional requirements in order to deploy this, and we're going to talk about most of them in the ensuing slides; the green ones are the first ones we tackled. So what do you need requirements for? It's so that you can have this and not a dump, a place where you just throw stuff in without any kind of control, a wild-wild-west kind of situation. So let's talk through some of these individual requirements and pieces, and what we used or are using, and try not to go too fast on this part. Usually I skim through
the specific requirements pretty quickly, because I have to do this talk in 20 minutes; since I have more like 40 or 45 minutes, I pushed in a bunch more slides, so I still have to talk fast. The first one is this part right here of the diagram: data storage. That's a key component. You need to be able to put data from many sources in a single location for analytics use; you have to have the infrastructure and flexibility to scale it out quickly; and it needs to be cheap and plentiful, because we're talking about potentially terabytes and petabytes of data. What we did for that is use HDFS, the Hadoop distributed file system, which is kind of the big-data standard. Some of these are what we use and others are just what you could use: open source Apache Hadoop, direct-attached storage with commodity hardware, so super cheap hard disks and lots of them. Hadoop has built in the expectation that you're using crappy hardware that's going to fail all the time, so it keeps the data triple-redundant, and it's self-healing in the way it works. But let's be honest, there's cloud storage too, and we have to address that; you can't ignore it. In each of these cases I try to call out the common options where I can, and to speak to what we're using for our particular case: we use Hortonworks Hadoop. And it's going to be interesting to see what happens, because Cloudera and Hortonworks just merged, but Hortonworks is probably the best distro of HDFS and the Hadoop stack. Hortonworks is to Hadoop what Red Hat or Ubuntu is to Linux: nobody goes out and downloads the stuff directly from the Apache servers; you get a distribution somebody has built, because it's just easier, and it's the same with Hortonworks. Now, once you have a place to put everybody's data, the worries start: you're going to be able to see my data, and I don't want that. And oh, wait a minute, you're going to put that in the cloud? Oh my god, you know, everybody's going to see it, right? So governance helps tackle this problem; there are several things in this presentation that talk about that. But you need to be able to provide governance on these data sets. We want to use our existing data stores. We want to be able to define tags that control which users can access which data, which information is accessible. You want to be able to create a catalog with user-defined tags in it, so that I can make up my own pieces of this as a user. And you want to have data lineage. A lot of what happens in a data lake is you take some data from over here, because you can put anything you want in it; you take one data set and another data set, and maybe some external information, and you mash that together and create value out of all that data. But at a certain point, when you look at the result, you want to be able to say: where did that come from? How did you put these things together? What data lineage allows is that you can still track back to the root place where the data came from.
That's really important when you start expanding the value of the data by enriching it with additional data. So that's the governance aspect of it. The open-source pieces we use for that in the data lake are Apache Atlas and Apache Ranger. Atlas lets you define metadata about your data, and Ranger gives you access and entitlements: it lets you into the data lake and determines what you're entitled to. Those two things work together so that you can tag the data and say, I only get to see data from these three sources; when I log in, even though all the data is there, all I see is what the data's been tagged for me to see. And it gets even more granular than that; actually, I'll wait, because I talk about that on another slide: the ability to have three people see the same records, but only see certain parts of them, based on their entitlements. Okay, so you've got to get the data in, so you have to have ingestion. First, let's talk about the sources. Prior to this construct that we built, we were pretty good with structured data: the county has a lot of databases, data warehouses, extracts; we have data all over the place, but it's almost exclusively structured. So, operational data, our ERP, our data warehouses, all the departmental apps, even external data that comes in from places like the DMV and other external entities. But we really weren't good with things like video and images; we store a lot of PDFs and documents, but we didn't have any good way to do anything with that information. And when you get into more modern types of data sets, like social media interactions and mobile apps: no way to deal with that. And no real way to deal with streaming data: sensor data, Internet of Things data. We run hundreds of buildings in the County of L.A., and we have no visibility into their environmental systems, because we don't have a way to collect and deal with that data. This platform allows us to do that. So those are the sources; then, the ingestion piece. This lets you onboard and bring in data, and set up continuous ingestion of data streams: you can bring in video feeds, Twitter feeds, regular batch updates from your data warehouses and from sources like your ERP. The other part is the ability to deal with large and near-real-time data. Big data is kind of defined by what they call the three Vs: variety, velocity, and volume. The county has volume; we're okay at dealing with variety; velocity, maybe not so much. But those three things are what help you define what big data is, and you need a way to ingest it. For our particular use case, we use Sqoop, a lot. Sqoop is open source, and it's really great for batch ingestion, for setting up scheduled workloads and so on. We also have the capability to support Flume and Kafka as part of our ESB, our enterprise service bus technologies; I don't know if anybody's really using those yet. So let me just do a quick disclaimer: I don't want to represent that our particular organization is using every one of these pieces to its full capacity. In some cases it's a matter of providing the capability: we have the tools, we've tested them out, they work.
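For a flavor of what one of those batch-ingestion jobs looks like, here is a minimal, hypothetical Sqoop import; the connection string, credentials, table, and target path are all made up:

    # pull one relational table into the lake's raw zone over JDBC
    sqoop import \
        --connect 'jdbc:oracle:thin:@erp-db.example.gov:1521/ERP' \
        --username ingest_svc \
        --password-file /user/ingest/.dbpass \
        --table VENDOR_PAYMENTS \
        --target-dir /datalake/raw/erp/vendor_payments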
Do I have use cases that are really using Kafka or Flume? No, not yet; and that's the key: not yet. But without this capability you couldn't do it at all, at least inside the county. The one we use a lot is NiFi. It's also open source; originally developed by the NSA, I believe. It's really good at creating visual workflows for data. It's open source, but supported mainly by Hortonworks, and that's actually a theme you see a lot in the open-source world; I'll come back to it when I start talking about cloud stuff. Some things are quote-unquote open source, but literally the only contributors are one company. NiFi is not that way, but you do see it a lot, and it's a caution. Cool. So, data security and controls. I talked a little bit about this as well, but notice that this is the security zone: data comes in from these sources, and the people who own this data, and the people who want access to it, need to feel comfortable that their data is secure. So all of the security in our platform sits over the top of all of these pieces: the data storage, the analytics toolkit, the consumption zone. We're going to talk about data obfuscation in a minute, and the other two, but I think it's super important, because there is a perception of "you're throwing all my data in here; I need to make sure it's kept safe". So we need to be able to define specific roles, role-based security. There's some county-specific stuff here: we have single sign-on, and people don't want to sign in 15 times, so whatever we do has to support single sign-on against our existing Active Directory. And, again a little county inside baseball: our employee IDs all start with an E, and our contractor IDs all start with a C. This slide is left over from when I used these presentations to say we're using existing credentials; so if you look at these slides later and go "E-number, what's that?", that's what that is. We provide the ability to view or modify only the data you have rights and access to. Again, a big one, because people hear "data lake" and think, oh, we're putting all this data in one place, but you need to set up rules and roles and rights, just like with any other data structure. And this administration needs to go over the entire stack, the entire middle part that the circle went around; it needs to control all of it. Obviously audit logs, tag-based access for user interfaces, and row-based security. The example of this is PII and PHI: health information, and personally identifiable information like social security numbers; that's information we hold about people, so how do you deal with it? One of the ways is that on ingestion you can tag pieces of data with a data type: you can say PII, and these are all the fields that are PII, social security number, birth date. And then access follows from those tags and the tags you're allowed to have: our payroll people are certainly allowed to see somebody's social security number, for various reasons, but Joe Blow, who's just doing a report on all of the employees, doesn't get to see that. The tags are what define who can see what.
In a sec we're going to talk about obfuscation, but there's a way to do it such that you can see that a value exists without seeing what it is. So, tag-based security is a big component of the Hadoop ecosystem and the tools that sit on top of it. Essentially, you've got Apache Knox, which is a single point of authentication into Hadoop; you have Apache Ranger, which we talked about before, which provides the access and entitlements within; and then we have encryption of data at rest, and Kerberos, to keep the platform secure. So that's our security. Now, obfuscation, this little piece right here. This was really important to us. One of our first use cases was around fraud: one of the things we wanted to find out was whether a county employee was also a vendor. In some cases that's a perfectly legitimate thing, but if they happen to be working in the procurement area and they're also a vendor to the county, that could be a problem. So how do you deal with that? We still want to be able to search for whether an employee is a vendor, and the way you do that is: we have their taxpayer ID, we have their social security number, and the taxpayer ID is the social security number for a person. So one of the searches we did was to match, and this is not massive data science here, we're matching social security number against taxpayer ID number. But people were really concerned: oh my gosh, you're going to let the programmer, or whoever is working on this, see that information. So what we used was obfuscation, which takes these keys, these numbers, your social security number, and turns them into an unreadable one-way hash. But it always hashes the same way every time, so the value sits in the data lake as your social security number, but as you're searching and looking at it, it's obfuscated on the way out. That way you can do searches and match on it without exposing the information. You could also choose not to make a field available at all, and there are ways to show only the last four digits, so that capability is super important. And that is done directly in Ranger: row-level filtering and dynamic masking are part of Ranger, and also part of the Hadoop file system.
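Ranger handles this masking natively, but the underlying idea is easy to see in the shell: a keyed one-way hash maps the same input to the same opaque token every time, so equality matches still work. The SSN and key here are made up:

    # the same SSN always yields the same digest, so employee SSNs can
    # still be matched against vendor taxpayer IDs...
    echo -n '123-45-6789' | openssl dgst -sha256 -hmac 'secret-key'
    # ...but without the key, the digest can't be reversed to the SSN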
Okay, we're pretty close to the end of the different components, just a couple more. The data consumption zone: this is one of the bigger pieces, because this is where your data ends up after you've done all your fancy analytics on it; you have to have a place to report on it. So, basically, easy connectivity for analytics software. This is one of the things I had a hard time explaining to my customers and partners at the county: the data lake itself, honestly, is not an end-user system. It's not a place you go and wander around in freely; it really is an expert system. The place most developers, most BI people, most visualization and reporting folks deal with is this data consumption zone; it's what comes out of the data science, the end product. So: the ability to get at whatever data you need access to; access zones for different roles, again, what you can see and what you can't; and self-service access to various areas. And you're going to find these pretty familiar, because this is the world most people live in: your data consumption zone is Postgres, Mongo, MySQL; Apache Hive is basically a SQL front end on top of Hadoop. I mean, this stuff is pretty familiar, right? Hopefully, MySQL and so on. And then finally, the analytics toolkit. This is where the actual applications that you write live. Multi-tenancy: we've got 37 different departments, so we provide a toolkit that lets them point their tools at the data. A sandbox, as I mentioned: a place where you can play around and figure things out. And the ability for analysts to select from a variety of county-wide approved analytical tools. This is one of the places where it gets a little interesting, because on one hand I don't think it's my job to dictate what tools you use to access the data; on the other hand, it can't just be every tool, any tool you can think of, because at a certain point we've got to draw the line and say here's what we're going to support and here's what we're not. But I want that decision to be made by the people using the platform, not by me. So, basically: if your tool can talk to Hadoop, you're good. That's my feeling about it, and that's what we've basically built here: a place for people to come, point their tools at it, and have at it. If you can deal with Hadoop and this platform, you're good, and I would say most tools can these days. So, this world: I'm not even going to go too crazy, but this is Python and the various libraries that go with it, Matplotlib, SciPy, NumPy. We've got a whole bunch of visualization tools: ggplot, Plotly. We're also using notebooks: Zeppelin and/or Jupyter; I haven't forced people down one pathway. Our particular team uses Zeppelin, but you can use whatever tool you want for packaging up your analytics in notebooks. This slide could change every week, seriously. Of these tools, I think R and Python are the ones: if I were going to give you a directive to go learn something you don't know, and you want to do analytics, learn R and Python; between those two things, for now, you could probably take over the world. So that's the architecture; how am I doing on time? Perfect, a couple more minutes. We've covered the architecture and the functional requirements and the pieces; let me show a little of what we actually did in our particular organization. Of all those tools, here's what we have and what we selected; again, this chart is our reference architecture. So we built this platform, and we support all the open-source tools on it. We essentially bought an eight-node cluster, kind of a big-data starter kit, from Cisco. Depending on what world you're in, it was either super cheap or kind of expensive, but I think the whole thing cost us about $80,000. And it's bare-metal hardware, not virtual: eight nodes that add up to 90 terabytes, for about $80,000, and you can add terabytes to it at about $1,000 a clip. And we'll talk about the cloud in just a second, because that's the question that comes up: why would you do it there? There's a lot of questions around that.
It's an eight-node cluster: five compute nodes, 160 cores, 512 gigs of RAM. And another thing about the way this works: an analytics job that runs has access to all of that computational power if it needs it. It's not like you only get one slice; if a job needs 160 cores and wants to load a 500-gig database into memory to do its processing, it can. So here's the diagram: here's our Hadoop cluster, with data sources coming in, and our key components. We use Hortonworks; we're using NiFi and Sqoop for ingestion; we're using Atlas and Ranger for governance and security; Knox for security around the whole thing; and the whole thing is Kerberized. And then the way this works is you've got this data in here, and we have VMs that attach to it to do their work. I called out these tools because they're some of the ones being used; there are actually people using SAS, and we have a SAS connector, for the crapload of SAS that we have at the county, which is kind of frightening. So, here is our analytics portfolio: again, Hortonworks Hadoop; RStudio and Python; we're using both Zeppelin and Jupyter notebooks; we use Apache Spark; we're using Hive for straight database lookups and SQL-type stuff. We're using a lot of Power BI right now: the county uses Office 365, and Power BI kind of comes along for the ride, so a lot of people are using it. We're having issues with it, though, because Power BI is in the cloud and this stuff is not, so there's some friction there. Again, Apache NiFi for ingestion. And then I have SAS, Cognos, and WebFOCUS on here; I have to put these in because there's a bunch of this still at the county, a lot of legacy stuff, so the ability to figure out how to support that is super important. So, what about the cloud? My question is: all right, which one? That's the problem we run into at the county: who's going to win that discussion? The benefit of going with an open-source platform is: whatever, pick which one you want. By using tools that are open, you're not locked into one cloud or another. So in our roadmap, though we haven't done it yet, is the ability to burst and scale into the cloud where we need to. Because it does take me a while: if I need more than 30 terabytes right now, I can get it in a month or two; but if I need 30 terabytes tomorrow, or this afternoon, I can scale up into the cloud to do that. That's the idea behind our approach. And part of the concern was that we needed to start in a spot where my internal customers were comfortable, and initially they were not comfortable in the cloud at all. At all. So, okay, here's an option for you. Now, as time has gone on, they're a little more comfortable, and we're probably going to have more and more people pushing in that direction. You can do all of the stuff I described in Azure: you have Spark, and they have what they call HDInsight, which is basically Hortonworks with a Microsoft wrapper on top. All the different pieces of this, you can do; the functional requirements I laid out do not change, regardless of where you do this. Amazon's the same way, right?
You go to AWS, they have a data lake, they have machine learning, you can move data in and out, and you can do analytics, of course; all this stuff is super huge. But let me put a "but" in here, and I kind of talked about this earlier. Open source, right? That's what this whole conference is about, basically. Both Amazon and Microsoft will tell you that their tools are completely open source, and they are. What they are not is interoperable. You can do whatever you want on Amazon; just recognize that this is Hotel California: you can check out, but you can never leave. Once you get your data up there, once you start running your analytics in their structures, that's where you are. And that may be okay. I'm hedging my bets a little bit, because we've got a religious battle going on back at the office around Azure versus AWS. Google isn't really in play right now, neither is IBM, neither are the others at the moment; those two are shaking out to be the winners, and I'm trying to keep us open enough that at some point we may just give it all up and go there. Okay, so, challenges. Number one, this is driven by use case. The technology is fun, but you've got to make sure you're solving a problem; otherwise you're just messing around. Again, be wary of monolithic platforms and lock-in. All the vendors now have their data lakes and their own analytics platforms. I'm dealing with a company right now that has basically stitched together a bunch of open-source tools, packaged it up, and is selling it as a piece of software. And when I ask, what about the analytics? "Oh, that's proprietary special stuff, that's our secret sauce." That scares me, because once you're in that world it's really hard to get out. Data sharing agreements: the challenges I had here weren't technical, they were "I'm not letting you see my data". You know: your data's basically public, it's already in the open data portal. "Well, I'm sorry, I'm not letting anybody else see my data." Getting people to put their data in there is a big challenge, and part of that is: can you share it? That's a whole other talk, by the way. Data quality is a problem. When the county was talking about licensing cannabis after legalization, we tried to grab data about cannabis from before legalization, and the data's terrible; it's just terrible. We built beautiful dashboards on it, and they lie, so you've got to have good data. The cloud: it's an opportunity and a threat. If you're a big organization that already has a data center and infrastructure, why not use what you have? But if it were just the folks in this room starting our own company, why the hell would we build this on-premise? Why would we go buy servers and manage them? We would totally do this in the cloud. So it depends; it's easy to say it's both an opportunity and a threat, for us anyway. Data science is iterative; our customers are not. I ran into this as well: it's like, okay, we did a run through your data, can you tell me what you think? "Well, we'll meet with you about that in about three weeks." Okay, three weeks later: here's your data.
Okay, so challenges. Number one, this is driven by use case, right? The technology is fun, but you've gotta make sure that you're solving a problem; otherwise you're just messing around. Again, be wary of monolithic platforms and lock-in. All the vendors now have their data lakes, they all have their own analytic platforms. I'm dealing with a company right now who has basically stitched together a bunch of open source tools, packaged it up, and is selling it as a piece of software. And when I ask, what about the analytics? Oh, that's proprietary special stuff, that's our secret sauce. It scares me, because once you're in that world it's really hard to get out. Data sharing agreements: the challenges I hit weren't technical, they were "I'm not letting you see my data." You know: your data's basically public, it's already in the open data portal. Well, I'm sorry, I'm not letting anybody else see my data. Getting people to put their data in there is a big challenge, and part of that is, can you share it? That's a whole other talk, by the way. Data quality is a problem, right? We put a bunch of data in there around cannabis: the county was talking about licensing it, so we tried to grab data about cannabis from before legalization, and the data's terrible. I mean, it's just terrible. We built beautiful dashboards, and they completely lie, right? So you gotta have good data. The cloud: it's an opportunity and a threat. If you're in a big organization that already has a data center and already has infrastructure, why not use what you have? But if the folks in this room were gonna start our own company, why the hell would we build this on-premise? Why would we go buy servers and manage this? We would totally do this in the cloud. So it kind of depends; it's easy to say it's both an opportunity and a threat, for us anyway. Data science is iterative; our customers are not. I ran into this as well, where it was like, okay, we did a run through your data, can you tell me what you think? Well, we'll meet with you about that in about three weeks. Okay, three weeks later: here's your data. Okay, a month from now we'll meet again. And it's like, if you want to be able to iterate and go fast, your customers need to be able to do that too. And the county is old and stodgy, so.

So: think big, start small, learn fast. My takeaway is that think big is easy, start small is easy-ish, and it's important to find a champion, somebody who can help drive this. Moving fast: I started this project two years ago, and I'm kind of where I thought I'd be a year ago, so it gets a little disheartening. But it's there for people who want to use it, and I think that's the important thing. And I found this this morning and had to put it on here: happy Super Mario Day, March 10th. You only get to do this today, sorry. So actually, before I do questions, really quick: since there's some donut stuff in here, I want to point out Data and Donuts. Nina Kin and I work with Data and Donuts; it's a monthly speaker series. Also Hack for LA, Maptime LA, and LA Counts. Check them out; they're fantastic LA resources. So with that, woo-hoo, all right, I took my 20-minute presentation and turned it into 40. We're good. Well done. Thank you. Any questions? Yes? My question is, are you nuts? Yes. I don't want to put you on the spot. Go for it, go for it. Do it, no, it's nothing crazy.

Do you see potential in this tying into the open data portals, to have closer to real-time updates of government open data? And have you seen any interest from county departments who would want to do that? So the answer is: the open data portal, I look at that as one of our data sources, and it could really be an output too, right? It's a matter of finding departments that want to do that. And the answer is that we have had interest in people loading their data in, and you can tag it; that's where the tagging comes in, right? The data comes in, you tag it as being open, and now all you gotta do is expose all the open data and push it up to the open data portal. It is in the roadmap. Is it being done yet? I would love to change the approach that your department takes for its open data — Nina works at the county, that's why I keep pointing at her — and do exactly what you just described: put all the data in the data lake, pull just the open data out, and expose it to the world. Which department wants to do it? Possibly the Department of Public Works, yes.
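Just to make that tagging idea concrete: Atlas is the real tagging layer here, but purely as an illustration — with a made-up catalog table and export path, not what we actually run — the tag-and-publish logic looks something like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical catalog: one row per dataset, with a classification tag.
    catalog = spark.sql(
        "SELECT table_name FROM datalake.catalog WHERE classification = 'open'"
    )

    # Export only the open-tagged datasets as CSV, which is what most
    # open data portals ingest.
    for row in catalog.collect():
        (spark.table(f"datalake.{row.table_name}")
              .coalesce(1)
              .write.mode("overwrite")
              .option("header", True)
              .csv(f"/exports/open_data/{row.table_name}"))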
I just want to get your opinion on where the project stands as far as maturity and adoption — where you are, where you want to be. So, the program right now... I'm trying to figure out how I want to describe this. The fact that it exists has opened up the conversation for us, right? Before, nobody understood it, nobody knew what it was for. Now I've got that part, and as we've shown value in it, we're getting more and more customers. We're starting to use it as our sandbox; well, it's more than a pilot, it's a sandbox environment. And everybody's pushing, well, why don't we just use Azure for all this stuff? That's kind of where my challenge has been in the time since I stood this thing up. But some of the folks who are focused on security are now looking at us going, oh, wait a minute, we can do this on-premise and it is actually secure. So I would say that we are in the adoption phase right now for a couple of departments, and it needs a lot more. I wish I could say it was a massive success, but I'm not going to lie about it. Most of what's holding it back is people not understanding what it's for.

You mentioned that the county is data rich and analytics poor, so are you hiring for analysts and data scientists or other positions? Yes, and there's a two-part answer to that. The problem we have is, if you look at all the job classifications in the county — and trust me, they go from accountant to scuba diver to, you know, undertaker, really — there's no data scientist in there, there's no data engineer, there's not really even a business analyst, which is also kind of part of this. So we're in the process of creating new positions to do that. We are hiring for those roles; we're hiring people in and seating them against classifications that don't really describe what they do. If you look at the goals of the county, one of the biggest is to become a data-driven organization: data-driven justice, data-driven everything. The major initiatives — homelessness, justice reform, child welfare and child endangerment, and so on — are all driven by data. So they recognize the need for it, and now, because the county's just big and slow, they're finally figuring out, oh, I guess we should have positions that tie into that. So the answer is yes, we are hiring for that. Unfortunately, the hard part is that you won't see the term data scientist or data engineer or data analyst in any of the job listings, so it's a messaging problem. Does that help? A lot of them are under apps developer, IT specialist, and systems analyst roles. And DBA: we do have DBA positions that we're turning into data analyst positions. Most of them, though, are systems DBAs, which is good; people who can tune a database, we appreciate them. But the way the world is going, we need application DBAs — people who can munge data, who understand the data itself, not just the inner workings of the database.

With such a wide set of departments to serve, where are you on the spectrum between your organization having to partner and do a bunch of upfront work to get somebody onboarded, versus self-serve, where you have tools that look for Social Security numbers and flag stuff? I wonder how you're gonna scale onboarding. Right, so it's use case driven. It's not like, hey, we're gonna go get a department and do all their stuff; it's, what specific problem are we trying to solve? With the auditor-controller, we did a fraud use case where we pulled in procurement data and accounting data, and now we have that data, and we secured it. That use case had to match on vendors' taxpayer IDs, and when the use case needs something, then we deploy that piece. We're working on some risk management analytics as well, so we're loading our CEO risk management data; again, a lot of it comes from old sources. So the real answer is that onboarding comes with a use case, as opposed to, hey, come dump your data in here and see what we do with it.
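To give a flavor of what that taxpayer-ID matching can look like — and this is a hedged sketch with hypothetical table and column names, not the actual fraud logic — two obvious first passes are payees with no matching registered vendor, and one taxpayer ID registered under several vendor names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    proc = spark.table("datalake.procurement_vendors")   # taxpayer_id, vendor_name
    acct = spark.table("datalake.accounting_payees")     # taxpayer_id, payee_name

    # Payments going to a taxpayer ID that never appears in procurement.
    unmatched = acct.join(proc, on="taxpayer_id", how="left_anti")
    unmatched.show(20, truncate=False)

    # One taxpayer ID behind multiple vendor names is another flag.
    dupes = (proc.groupBy("taxpayer_id")
                 .agg(F.countDistinct("vendor_name").alias("name_count"))
                 .where("name_count > 1"))
    dupes.show()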
And there is — I didn't include the slide in here, but — just an absolute crapload of data analytics use cases at the county. I mean, we did a really high-level survey and I came up with like 40 or 50 that you could fill this thing with data for. So we're picking those use cases off with people who are willing. You have to be a willing customer; we tried forcing some people and they wouldn't, they had a hard time with it. So, anyway.

Sorry, but there was one question I wanted to ask you even before you got to the talk. As different departments are producers and consumers of data, how do you calibrate the interaction between departments? So, the way the county is structured, there are clusters, right? You kind of saw it on the earlier slides: there's a justice cluster, there's general government, there's public safety, and so on. And the CEO's office kind of sits over the top of that. We do have a CIO, and an office of the CIO, and we do have a data management, information management program which kind of lays out the rules of the road for that, and we're running on a data maturity model. But that interaction between departments has to be use case driven, right? So in cases like homelessness — well, child safety is a big one, right? You have to deal with folks in mental health, folks in our Children and Family Services program; how do you put those things together? And there's a lot of pushback on sharing data. People will go, well, by code we're not allowed to share that. What's happened is we have to bring County Counsel in to be the arbiter, and our chief data officer to be an arbiter, of what's shareable and what's not shareable. What we've discovered is that in most cases they just say that because it's a pain in the ass to share the data. It's "it's my data; I'm afraid if you look at it, you're going to find something that I didn't find" — there are a lot of reasons why they don't want to share. But in a lot of those cases, what has to happen is our CIO, our CEO, and County Counsel have to come in with a hammer and say "thou shalt," right? So we get data sharing agreements on a use case basis, and that's actually happening with our homeless initiative right now. I don't know if that helps. It can be very contentious. Some departments are more open to it, and with other ones, man, it's like World War Three over this stuff.

I see Nina has her hand up, but I have a quick question about that. Do you share with the state? Because I can immediately imagine, in your fraud use case, if I file an LLC and I get a taxpayer identifier, you would need to cross-reference through the state in order to catch that kind of fraud, right? So how do you do that, or do you? So we don't. I mean — now I know how to make money. Yeah. Well, no, the answer to that, at least for the purposes of this, is we don't currently, but until now we really didn't have the capability. Now we actually have the capability, and it's a matter of bringing the business processes along for the ride. I'm not saying that doesn't happen at all, but it doesn't happen in any systematic way. We do have a fraud investigation unit that would find that kind of shenanigans, I just want to let you know.
But it's not automated and it isn't proactive. And again, there are so many potentially really cool use cases around that, because one of the things we're trying to do is promote local small business. We don't want to hire the big giant MegaCorp 9000 company; we want to bring in small, local, minority-owned, veteran-owned, women-owned businesses. And we want to use data from places like you just described to help drive that; that's the way you get that data. The pushback we get is, oh my gosh, you're gonna spy into my world and my life — and we're not trying to do that. When the use case dictates, it makes sense.

So there have been some local governments, as well as state governments, who have codified data sharing and open data into actual policy. Do you see anyone pushing to do that for LA County? I haven't, actually, unfortunately. I know our acting chief data officer has awareness of it. We are actually hiring a chief data officer — I think there's an opening right now — and I'm fairly certain they're gonna bring somebody in from the outside to do that. But I think it needs to be brought up more. To my knowledge, I don't see somebody driving that as a function. Maybe you could do it. No, but I think our chief data officer has his hands full right now just trying to get our information management program laid out.

I have a question about healthcare in particular. I just joined a small startup focused on medical records, and I've heard the data's kind of all over the place — like MUMPS, and FHIR and HL7 haven't really been very widely adopted yet. So what's your experience been like with integrating all of that? The second part of the question would be: you mentioned having very well-structured data, but I've heard that when it comes to medical records, a lot of the important information is unstructured — doctors' notes, reports, lab results. So I'm curious about those two. Well, again, this is where it gets a little aspirational on my part, in that until now we had no way to deal with that. Now at least we have a platform that we can do that stuff on, and people are starting to think that way, where before it was just unreachable. The healthcare folks kind of are in their own world. A lot of times they hide behind HIPAA: oh, it's HIPAA, no, no. And I always like to point out in our meetings with them that one of the A's in HIPAA is about accessibility, right? It's not about putting giant walls up in front of your stuff. I have not been dealing much with the healthcare area at the county because it's so big; I mean, we have one of the largest public healthcare and hospital systems on the planet, I think. It's one of those "think big, start small" things: our healthcare area is not small, so we've been focusing on more bite-sized chunks, to be honest with you.

Cool, all right, cool. Thank you very much. Awesome, hey man, thank you. Thank you very much for sticking it out to the end; I appreciate it. I don't know where the local bar is, but that's where I'm gonna be.