Thanks, Michael, for inviting me to speak, and thank you all for the opportunity. I want to give a quick primer on data science. The idea is that there are a lot of developers in the room; before the last talk, Michael asked how many web devs and how many app developers there were, but he didn't ask how many data science people. One, okay, that's good. Don't tell them that I'm saying everything wrong. And there's at least one data engineer, right, Ann? Any other data engineers? Okay, good. Also, don't tell them. So I'm actually not really a developer. I've been coding in some form for about 20 years, but I wouldn't really call myself a developer; I wrote my first unit test about three years ago. I come from an academic background, and there it's very different: you code in order to get something done, to solve a problem, and I had never heard about testing your code until a few years ago. So what I want to do is, first, give a primer on what data science is, a very general description with a few examples. Then I'll talk about the standard approach that a lot of data science teams take, and then a slightly more developer-based approach. A lot of people in data science are not from a developer background; a lot of them are academics like me, and while they have to do a lot of coding, they don't take very rigorous approaches to it. So I want to talk about a more developer-based approach to data science. And then there's also a kind of shameless, well, not really a pitch, since I'm not selling anything; it's a plug. It's a side project that we've been working on at our company, and hopefully there are people interested in it, because we're looking for partners. A warning for the talk: my advice is opinionated and fairly non-standard.
If you follow my advice, your mileage may vary, so take what I say with a grain of salt. First, what is data science? There's a famous, or at least well-cited, Venn diagram from Drew Conway showing it as the intersection of three pillars: hacking skills, math and statistics, and substantive expertise, or subject-matter expertise. With the intersection of any two of those, you don't get the full picture of data science, but when you intersect all three, you get into the data science realm. And it's interesting that it's not computer science or developer skills there; it's really hacking skills. A lot of data scientists really are just hacking at code, and not always in a good way. The other thing about data science is that anything that refers to itself as a science is not a true science. Chemistry, biology, and physics are true sciences. Data science is not a science; computer science is not a science; materials science is not a science. It's been a very good branding exercise for data science, though: these people have existed for many, many years, but in the last five or ten years they've created a nice little bubble that's expanding. So, some applications of data science. How many people here work in e-commerce? A few, not too many. Well, you're all familiar with e-commerce as consumers. The types of data science problems you encounter in e-commerce include recommendation engines: when you go to Amazon or Lazada, they say people who browsed this item also looked at this one, or people who purchased this item also purchased that. A slightly different aspect of that is analyzing your customers' decision journeys and trying to nudge them toward more frequent, and maybe higher-value, purchases. One example I've heard is from RedMart: some customers make fairly infrequent purchases, and they're purchasing on RedMart mainly for heavy items.
If you want to buy a bag of rice or a 24-pack of beer, you don't want to carry it home yourself; you want it delivered. How can you convince those people to make, say, weekly purchases instead of buying just the very heavy items? Here's an interesting application for web devs. As I guess most of you are aware, JavaScript bundles are getting larger and larger, and if you don't split them, the initial page load when somebody lands on your page can take a long time. On a mobile network, if somebody has to wait 30 seconds for the page to load before they even see anything, or before there's any kind of interaction, a lot of people will just close the page and go somewhere else. You can code-split your JavaScript bundle to avoid that, and the question is: how do you decide where you split the bundle, and what do you prefetch? If you're on page A, how do you know whether to prefetch page B or page C in anticipation of where the user is going next? One way to analyze this is through Google Analytics data. You can get a kind of flow diagram where you start from the root page, and as you go toward the right, users can go to the videos page, the stories page, or the pics page of this fictitious website; from there, there's some probability of users going on to other pages on the site. So how do you decide what you bundle together and what you prefetch? In this case you can see it pretty easily: you might want to bundle the root and the videos page together, and since the majority of users who land on the videos page go to pics afterwards, you just prefetch pics once they land on the videos page. Somebody put together a pretty cool package that will actually do this: it analyzes your Google Analytics data and then does the code splitting for you. It's an alpha; I haven't tried it myself yet.
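The prefetch decision can be sketched numerically. This is a minimal illustration, not the library just mentioned; the page names, counts, and the 0.5 threshold are all made up for the example, and a real version would pull transition counts from the Google Analytics API.

```python
# Sketch: deciding what to prefetch from page-transition counts.

def transition_probabilities(counts):
    """Convert raw next-page counts into per-page transition probabilities."""
    probs = {}
    for page, nexts in counts.items():
        total = sum(nexts.values())
        probs[page] = {nxt: n / total for nxt, n in nexts.items()}
    return probs

def prefetch_candidates(probs, threshold=0.5):
    """Prefetch any next page whose transition probability clears the threshold."""
    return {
        page: [nxt for nxt, p in nexts.items() if p >= threshold]
        for page, nexts in probs.items()
    }

# Invented counts mirroring the talk's flow diagram: most users go
# root -> videos, and most videos visitors go on to pics.
counts = {
    "/": {"/videos": 800, "/stories": 150, "/pics": 50},
    "/videos": {"/pics": 600, "/stories": 100},
}
probs = transition_probabilities(counts)
print(prefetch_candidates(probs))  # {'/': ['/videos'], '/videos': ['/pics']}
```

With these numbers the rule reproduces the intuition from the diagram: bundle or prefetch videos from the root, and prefetch pics from videos.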
There's a very nice blog post about the types of considerations that went into this library. The previous examples are for more commercial uses, but there's also interesting data science work that can be done for things like sustainability and social responsibility. I work with a non-profit called DataKind; the idea is to bring data scientists, on a pro bono basis, to volunteer for social-impact organizations. One of the organizations we're working with is called Pelagic Data Systems. They install GPS devices on small fishing boats: they go into usually very rural areas, supply fishing fleets with GPS devices, and can get pings up to once every second. With that kind of resolution of data, it's a data science problem to see when boats are actually fishing. When are they going out to their fishing grounds? When are they actually fishing? When are they coming back? How do you classify that data? And once you've classified it, you can look at questions like: is there any overfishing going on? Is there a more distributed pattern of fishing that could be recommended to the fleet? Okay, and finally, the company that I work with: it's a very small consulting company. My partner here is Eddie, and he's the software engineering guy; he keeps me on the straight and narrow on that side. We've been working with one of the world's largest consumer electronics manufacturers, covering Southeast Asia and Oceania and also the Middle East and North Africa regions, and we're focusing on marketing. Things like: who are the customers most likely to buy a specific device within some window? What is the next device they're likely to buy? That kind of thing. Through this project, we've worked with a lot of other data science teams.
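To make the boat-classification problem concrete, here's a minimal sketch. The speed thresholds and state names are invented for illustration, not Pelagic's actual method; real work would fit thresholds to labelled tracks, or model the whole track with something like a hidden Markov model rather than labelling pings independently.

```python
# Sketch: labelling GPS pings by boat speed (illustrative thresholds only).

def classify_ping(speed_knots):
    """Label one GPS ping from the boat's instantaneous speed."""
    if speed_knots < 0.5:
        return "at rest"      # moored in port or drifting
    elif speed_knots < 4.0:
        return "fishing"      # setting and hauling gear tends to be slow
    else:
        return "transiting"   # steaming to or from the fishing grounds

# A made-up track: leave port, steam out, fish for a while, steam back, moor.
track = [0.1, 0.2, 6.5, 7.0, 2.1, 1.8, 2.5, 6.8, 0.3]
states = [classify_ping(s) for s in track]
```

Once pings are labelled like this, the downstream questions (fishing effort per area, signs of overfishing) become aggregations over the labelled segments.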
So not actually with the clients' own teams, because they don't have their own; they usually work with vendors, so we're working alongside other vendors. We've encountered a lot of other patterns of working. The standard approach that people take, which may not be very familiar to a non-data-science or non-data-engineering crowd, is using what they call notebooks, and the most popular now is Jupyter notebooks. The idea is similar to a REPL, except that you preserve the cells. This is a very small picture, but you have different cells where code can be executed, and the difference from a REPL is that those cells are preserved: you can go back to a cell and execute it again, and you can have presentational elements like plots. This has been a very standard way for data science people to work on data. So, some of the issues: we played around with this for a while, and the issues we had are these. Behind the nice interface, these notebooks are stored as JSON files, and the problem with that is that they don't show up very well in GitHub. If you have an existing notebook checked into GitHub and you change a few cells, it's kind of hard to see from the git diff what changed in a given commit. It's hard to have modular code, though that is getting better in the beta, JupyterLab, which makes it easier to call external scripts from your notebook. And you're not very incentivized to have unit tests. Okay, another aspect of the standard approach in data science teams is that you start with data engineers. Data engineers engineer the data: they clean it up and set up pipelines to transform it into a nice, clean data set that data scientists can work on and put into a model.
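To see why notebook diffs are noisy, it helps to look at what's actually in the file. Here's a minimal sketch of the .ipynb layout; it's simplified from the real nbformat schema, but the essential point holds: code, outputs, execution counts, and metadata all live in one JSON document.

```python
import json

# A heavily simplified .ipynb skeleton: a dict with a "cells" list, where
# each cell carries its source plus metadata and (for code cells) outputs.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "source": ["x = 1\n"], "metadata": {},
         "execution_count": 1, "outputs": []},
        {"cell_type": "markdown", "source": ["# Notes\n"], "metadata": {}},
    ],
}

# The whole thing is serialized as one JSON text blob, so re-running the
# notebook can change execution counts and outputs and dirty the git diff
# even when no code changed.
serialized = json.dumps(notebook, indent=1)

# Pulling out just the code is possible, but git's line diff doesn't do this.
code_cells = [c["source"] for c in notebook["cells"] if c["cell_type"] == "code"]
```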
So some machine learning model, for example. Once the data scientist, usually working in a notebook, has a model, it's thrown back to the data engineers, who productionize it. They take that model and say, okay, now I have to bring it to scale, make it work on gigabytes or terabytes of data, or maybe serve real-time interactions from the model. This is, I guess, pretty similar to having a separate sysadmin and developer. You have a lot of dependencies, so the data scientist is waiting for the data engineer, and if it's not done right, you have people who aren't fully utilized all the time. And if there's any kind of miscommunication between the data scientist and the data engineer, then you have friction there. What we've found is that it's much more expedient to have those roles combined in one person. Machine learning libraries are becoming easier and easier to use at scale, and data computation engines like Spark are becoming easier and easier to use. It shouldn't be that a data scientist cannot code; the data scientist should code well and should be able to code for production use. Another aspect of a developer-driven approach to data science is actually having unit tests. Of all the data science teams that we've worked with, I don't think we've talked to a single one that actually wrote tests. For a developer community like this, that should be shocking: you have big data pipelines processing a lot of data and doing really complicated stuff, and it's very common not to have tests. So yeah, I encourage that. So to tie this back into the discussion, maybe let me go back to a point that I kind of missed.
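As a tiny example of what a test could look like in a data pipeline: suppose the pipeline has a cleaning step for a raw mobile-country-code field. The function here is hypothetical, not from any real project, but the shape of the test is the point.

```python
# Sketch: a unit-testable cleaning step (hypothetical function).

def clean_mcc(raw):
    """Normalise a raw MCC field: strip whitespace, reject non-3-digit codes."""
    value = str(raw).strip()
    if not (value.isdigit() and len(value) == 3):
        return None
    return value

# The kind of test that's rare on data science teams: pin down the edge
# cases (whitespace, wrong length, non-numeric junk) before the function
# runs over millions of rows.
def test_clean_mcc():
    assert clean_mcc(" 525 ") == "525"   # stray whitespace is tolerated
    assert clean_mcc(525) == "525"       # numeric input is accepted
    assert clean_mcc("52") is None       # too short
    assert clean_mcc("abcd") is None     # not a code at all

test_clean_mcc()
```

With pytest, the `test_` function would be discovered and run automatically; the value is that a silent change in the cleaning logic fails loudly instead of corrupting the downstream model.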
So not only is it rare to have data scientists write unit tests, it's also rare for data scientists to use Git properly. Every data scientist or data engineer that I've talked to knows that they should, but a lot of teams will end a project with a single commit to master. That's the bad practice, right? A single commit to master, often with passwords and credentials thrown in there; it's very common. So one thing is that data science people should use GitHub more, use any kind of version control more for their code. Moreover, it's strange that there is no good version control for data. GitHub has become the standard platform for collaborating on code, but there is no standard platform for collaborating on data, though there are many attempts at data collaboration platforms. One question you might have is: why not use GitHub for data collaboration? At first glance, you might think this is a viable way to go. Here's a data set from our client work, an MCC-MNC table: the mobile country code and mobile network code. If you're an app developer, you can get these from your users to see which operator they're using. And when you put a CSV file into GitHub, it actually parses it correctly; you see it as a table. Here we have, in the first columns, the two-letter and three-letter country codes, then the MCC value, the MNC value, and then the carrier. Now, let's say you decide you don't actually need the two-letter country code, so you're going to delete that column. What will that look like in GitHub? Every single line is changed. That's space-inefficient, first of all, but it's also not quite right.
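You can see this concretely with a tiny version of that table. The rows here are invented stand-ins, not the client data; the point is what a line-based diff reports after one column is dropped.

```python
import difflib

# A tiny illustrative slice of an MCC-MNC table.
before = (
    "cc2,cc3,mcc,mnc,carrier\n"
    "SG,SGP,525,01,Carrier A\n"
    "SG,SGP,525,03,Carrier B\n"
)

# Drop the first (two-letter country code) column from every row.
rows = [line.split(",")[1:] for line in before.strip().split("\n")]
after = "\n".join(",".join(r) for r in rows) + "\n"

# A line-based diff, like git's, sees every line as removed and re-added.
diff = list(difflib.unified_diff(before.splitlines(), after.splitlines(),
                                 lineterm=""))
changed = [d for d in diff
           if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))]
# One logical change (a column deletion) becomes 6 changed lines:
# 3 removed and 3 re-added.
print(len(changed))  # 6
```

A column-aware data diff would instead report a single operation, "column cc2 deleted", which is the gap between code diffs and data diffs being described here.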
You're just deleting a column; yes, it is a change to every line, but it's a different kind of change, right? Data diffs are not the same as code diffs. And in many cases, data needs to be visualized for anybody to get a sense of it. With code it's slightly different: you can describe what code does, but it's hard to describe what data is unless you can visualize that data. The other thing is that GitHub sets, in our thinking, too high a technical bar for everybody who works with data. Many more people work with data than write code: many business users work with data. Think of HR data, finance data, all of these types of data, with non-developers working on them; GitHub is a high bar for them if it is to be the data collaboration platform. Some of the applications for a GitHub for data: within data journalism, which has been growing around the world, it would be nice to have a kind of standard place where data journalists can store their data and refer their readers to it, so that readers can play with the data themselves. Within the academic world, in the last few years there have been a lot of cases in sociology research, psychology research, and nutritional research of academic fraud, where people falsify data or cherry-pick results from their data; they data-mine their experimental results in order to engineer findings that are more interesting. If there's a higher bar for reproducibility, so that academics need to show their data, that should make academic fraud less common. Another thing is crowdsourced data sets. There are a lot of data sets that are, I think, important to have but difficult for individuals to collect. So, an example of this: Ann is from Coding Girls, right? And one thing that's important is diversity in tech.
One measure of diversity in tech is the diversity of speakers at meetups. So you could compile a data set of this: all of the junior dev meetups, the Python user group meetups, the data science meetups, the JavaScript meetups, with a table of speaker names and whether they're female or male. It would be a lot of work to compile, right? But it's much easier if you have a way to collaborate on that kind of data set. And once you see that data, once you see that it's, I don't know, 20% or 30% female, then you have to confront that reality: what are we doing wrong that there's such a lack of diversity in this community? Or maybe it turns out to be more even than expected. Either way, that's something that can be a learning. So thank you very much. I'll take questions, and we're also hiring if you're interested. Thank you very much.