Oh my, well, I hope you feel like that at the end as well. I feel like, of all the people talking on the pulpit, I'm the one who's not preaching to the choir, because I'm talking about corporate data science as opposed to indie data science, which I think is what a lot of you folks here are more interested in, in terms of vocation but mostly in terms of what works and what doesn't. Unfortunately that's not my background; my background is very much in the corporate world. I have several reasons for it. Some of them I'm not super proud of, they're a little bit cowardly. Some of them are: well, if I'm gonna go down this route because I need health insurance, then I'm gonna make a change from the inside and become the fifth column.
So here's sort of my feel-good slide about why I think that this is an appropriate approach for this division of labor. I think people are really quite happy when they can afford to buy the Asian pear instead of the McIntosh apple just because they want to, and not because it's cheaper or more expensive, and I think there are different drivers for different humans. Some humans really are driven by that competitive monetary component of data excellence and technical excellence, and some of them are not. After much therapy I'm kind of in the middle: I kind of like being able to afford things, but I also don't think that should be contingent upon whole swaths of other humans not being able to do what they want. So it's a challenge. I'm in therapy; I'm sure many of you are as well. But my point here is that there exists a sweet spot where you can find somebody who has a need for what your talent is.
That person is also willing to pay you. Those normally don't go hand in hand, and it's because you have a vocation that solves a particular problem, and it's a problem that somebody has said is valuable enough to them that they will help solve it, be that the Sloan Foundation, the Gates Foundation, or the Republic of Brazil, which is where I'm from. So this is sort of the genesis for why I wanted to tell you guys a little bit about what it feels like to live behind the curtain, and I'm gonna let you guys see Oz, the corporate curtain.
So anyway, this is one of my favorite slides. I don't know if you guys know who the Grugq is. Any hands? Oh, this is awesome. So I will be the person who introduces you guys to the concept of the Grugq. The Grugq is a security engineer. He talks quite a bit. He is very well known in the field and nobody knows his actual name. I tried quite hard to figure it out, but of course I was very outmatched. So I'm gonna leave his handle there and, as an exercise to the reader, figure out who he is. But he's very smart and he does a lot of counterintelligence, if you're into that.
But the point of this slide is that context in data is everything. So how many of you can see the cow? Don't tell your neighbor if you see the cow. Okay, let's flip it. Who cannot see the cow? Right? But here's the thing. Once you see the cow, you won't be able to unsee it, ever. I came across this picture 16 years ago and I still use it because it's phenomenal. For those of you who can't see the cow, that's an eye. That's another eye. That's his or her ear. Ear right here. Some nostril action down here. And so that context is really what you're seeking: the ability to see the whole and identify the parts that are descriptive, that allow you to get a greater picture of the problem or of the solution or of the data set, to stretch the metaphor.
But this is in essence what we're doing behind the scenes in corporate America, corporate Europe, corporate geography, which is very similar to the work that happens in the indie world. So since data without context is useless, why don't I give you guys a little bit of context? My background is: I'm a recovering data scientist. I have been doing data science for many, many years without a PhD. It's possible. And what I have learned over the years is that controlling numbers and controlling code is not impossibly difficult; it's doable, but it's very difficult. But the thing that got me really excited to get up in the morning was getting people to do those things: programming people, which sounds Machiavellian, but it's really not. It's the meta level, at which I'm not the one telling zeros and ones what to do, but I am bringing together a team of people who can produce a whole that is bigger than the sum of the parts.
Now the difference between indie data science and corporate data science is that you work for the man. It's not personal, it's not academic, it's a business, and it's cutthroat and it's cruel. And it takes an inhuman amount of effort to live it and still maintain your kindness and your integrity and your soul. To really get at the point: how can you do this in a way that maintains your integrity?
So these are the three things that I'm going to talk to you guys about. How is it that we data science in industry, and what are the similarities with indie data science? Then, how do you build effective teams that can solve both industrial problems as well as civic problems, government problems, legal problems, with a data bent? And then lastly, I'll talk about what you do to keep that going, so that it isn't something that disappears at every cycle of regeneration of whatever administration or whatever management upline you're dealing with.
So some things don't change. That's the first thing between the two. You still have to deal with data collection, you still have to deal with data exploration, you have to clean it, you have to transform it, ETL it, model it, validate it. You've got to communicate it somehow, you've got to deploy it, you've got to understand it after you've deployed it so that you can undeploy it and redeploy it after you've done all sorts of modifications and edits to it, like when the update notification says "small bugs fixed" and it's a 122-megabyte update. It's still the same.
I don't know if you guys know Eduardo Ariño de la Rubia. Ed is with Domino Data Lab, and he gave a really cool presentation at TDWI a month ago, actually. It's recorded. You should watch the whole thing. He's brilliant. But this is the part that I want to focus on. And it's not even his slide. He stole it. So I'm stealing it from him. This is Szilard Pafka's slide on the cycle of understanding and all the bits and pieces that data science teams end up being responsible for, whether they want to or not. It sort of falls on them. And if you're really lucky, you have a team of data engineers. I've heard some awesome names today: data plumbers, data... what was the best one? I'm not going to remember. I'm sorry. But there were some really good names. But this is essentially what we're doing, either in industry or independently.
Here's where it changes a little bit. The piece that is a little bit nuanced in the industry world is the introduction of production software, be it production software that you sell, that you put a price on and put on shelves, or production software that powers the internal processes within your department or cross-departmentally. An example here would be improving the logistics with your contract manufacturers, when you're talking to folks who are putting your robots together across the world.
So where does data science need to fit once you're behind the curtain? I'm telling you guys all of this a little bit because I think maybe you want to go work for industry at some point, a corporate entity at some point, and I'm absolving you, because there are good guys out there. So find those. But also because there's a lot of information that is nicely transferable. This is a particularly transferable one, so let me tell you a story about siloed data. What a lot of independent data scientists have to deal with, and we've heard lots of stories about this here today, is that a particular organization has some of the data, and then a particular department within the civic organization has some data, and then the town hall has some data, and then the police department has some data. The exact same thing is mimicked in the microcosm that I work in. You have finance, you have IT, you have ops, you have legal, you have marketing, you have engineering, you have product. This is sort of a poetic-license amalgamation of all the bits that are present at a post-startup stage, and this is where data science has to work once you are inside, which means whatever domain expertise you think you had goes out the window 70% of the time.
You may be really, really, really good at parsing and doing NLP on contracts so that you can identify areas that increase your risk of takeover or any kind of financial problems. But that's not the first project that you're going to be working on, because that's not the highest priority to the business, and the business dictates that you're now going to be working on improving a legacy product and making it more resilient and stable. Yay! All that money that went to law school? You can't use it anymore. At least not for a while.
So at almost every organization that I've worked at, there are PDFs and emails, and CSVs if you're lucky, and proprietary spreadsheets, and they're strewn across departments, locations, fiefdoms, levels.
There are people who left and stored them, and somebody forgot the password for where the thing was stored, which is great, because at least there was a password. You wouldn't believe the amount of data that is just floating around internally, and everybody's just like, "Oh, well, we're all one company. What could go wrong?" And then you have to tell them that people sometimes break up with a significant other, and they become slightly dissociative, and then they have really interesting ideas that they can follow through on, and the enemy is within. So you have to protect your data from your own people. When you're working on altruistic projects that's less of a concern; when you're working in corporate America, you have to lock this stuff down. And I would have cursed. This is magical. There's a power here.
All right. Some things change quite a bit. So once you go into these large organizations... Here, it's complicated sort of on purpose, because that's the message that I'm trying to send: it gets complicated very quickly because of Gall's law. Things start simple, and then you start adding appendages and adding dependencies and using them for purposes that they weren't intended for. So up there I have the architectures of Facebook, Pinterest, Quora, Airbnb, and Twitter. Very, very, very complicated data infrastructures require data analytic skills that allow you to navigate through all of those different mappings of how the data operates, and there's some skill there. There's definitely some programmer and data analyst skill, but there's also some overarching skill of how you build the best symphony with all of your folks, your violinists, who are really great at dealing with queuing data ingestion pipelines into your computational lambdas.
I live in the serverless world now. So there are people who are really good at understanding all of that piping and plumbing, and those people are worth their weight in magical fairy dust; very hard to find. Then you have folks who are incredible analysts, but they come from the school of thought that I came from, which is: who cares, truth is truth, just give me the numbers and I will find truth. And after a couple of years you eat a lot of crow and you realize that's not how it works.
Because what ends up happening is you start having these unicorns. Did it flash? I put the special animation there because unicorns deserve special animation. So what ends up happening is you hire this unicorn who is capable of doing everything, and you don't change any of the culture beforehand, and so the unicorn shows up for a meeting and everybody in that meeting is like, "Why is the unicorn here? We're not ready for the unicorn. We have all of these other problems that we have to deal with, and this unicorn is just going to make everything complicated and difficult, and it's all going to be about this one person, and they're going to be the 10x developer, and they're going to be the stopgap through which every decision has to pass." And all of a sudden you've created an incredibly dysfunctional team. In fact, what you want to acknowledge is that you have blended objectives.
So I'm going to wander a little bit, and I think that might be a problem with... let me see if I can not wander. This is so hard. I'm such a wanderer.
Okay, so up here I have a two-by-two, where most projects are divided into these quadrants. You either have your team working on novel things, things that are not on the market, that are super secret, that you can't open source; they're the crown jewels. And then you have, sometimes the same team at a different point in time, or sometimes half of your team, working on something that is very legacy, very understood. It's your moneymaker; it's the thing that you issue pager duties for, right? Like, this one is the one that cannot go down, ever.
And then you have this other dimension to this two-by-two. One end is doing R&D, which are things that are inherently uncertain; they're iterative. This is the side where it's much harder to explain to senior management why so much money is getting sunk in there. Very smart senior management understands the ROI, the return on investment, that you get here, but junior ones don't, and they continue to challenge what the point is. And then you go from these very ad hoc, iterative processes all the way to production scale. When you hit production scale, it has to be resilient, it has to be reliable, it has to be able to scale, and it has to never, ever, ever crap out. Unless a vendor of yours, who will stay unnamed because I love them and they give me discounts, goes offline; then, you know, that's not good.
So the first thing that I want to say in terms of the differences, but also learnings that can be applied in the independent data science world, is that when you are working on something that is at production scale, you tend to work on it from a very legacy mindset of "don't break it, for the love of God, and just make it run at scale." And my position is that that's a huge waste of resources, because that is something that is really well understood, and you could push the limits there. You can take that very well understood process and reframe it as an
R&D process, and that will generate things like improved optimizations of the code and improved performance of the code. When was the last time that code got refactored for the 13th instantiation of the architecture, which you have no control over because, if you're lucky, you have a dedicated team of data engineers fiddling with that? And in the other direction, you also have legacy projects that you could push into production, and you can improve the delivery into production. You can introduce things like continuous deployment, because obviously the first time this was done it was a version 0.8, and it works, it makes money, and it's going to go to version 1.0 in two fiscal years. No: you can actually take all of that embedded knowledge and turn it into novel approaches for using the legacy understanding.
Likewise, the same thing happens on the top left, where you're usually doing things in R&D that are going to end up becoming intellectual property that the company is going to hoard. But what you can do is push it towards intellectual property that is not core to the business but that is valuable to the community, and that's where you start to open source. So there are definitely places where the Darth side of the Force, me, and the Jedi can work together.
So that's what doing data science in industry is. I have all sorts of other things that I am very excited about, none of which I can talk to you about, because we haven't launched them yet and they haven't been vetted by PR. It's the name of the game; I signed up for it.
Okay, so how do you build effective data science teams that are capable of delivering on that promise? Who's playing the bingo, the data science bingo? Anyone? You can mark Drew Conway's Venn diagram on your boards. There you go. So the data scientist archetype is this archetype of a magical beast that can just do all of the hacking, and knows all of the math, and has studied statistics
with rigor, and has substantial domain expertise from having spent 65 years trudging through the jungles of soybean trait introgression. That's what it turns into. Did it flicker again? Yeah? Okay. And my point there is that this is bull. I mean no offense to Drew. Drew drew that Venn diagram for a very specific reason, which I've stolen and bastardized, so this is not on him, but it sets up my joke really nicely, which is that unicorns have a very specific intuition, which is not accurate, because it is not the cumulative experience of the rest of their... So when you drop a unicorn into a room of data normies (I'm a data normie, so I mean that with the utmost respect), it just simply doesn't work very well. Because what ends up happening is... unicorns don't exist. They don't. You have llamas, which we might or might not be able to pet soon enough, in less than 24 hours. But you have zebras and you have horses and you have giraffes and you have gazelles, and you have all sorts of different beasts that offer different perspectives, that bring different gifts to the team, that look at things in ways that you'd never imagine.
Which brings me to yet another piece of insight that's really key. Who here has heard of super chickens? Oh, wow. So xkcd has this one magical little drawing where he talks about the joy of introducing a room of very smart people to a concept that they haven't seen yet, just by virtue of serendipity. So get ready, and I get to do it. This is great.
So, super chickens. There was an evolutionary biologist, I think at Purdue, William Muir is his name, and he studied chickens, like you do. He was interested in productivity, to see if there's something that he could do about the combination of different personalities, different genomes, phenotypes interacting together. And his measure, as you might imagine, was the number of eggs that they would, uh, yield. Is that the right verb for it? Okay. I did soybean
stuff, I did stuff with plants; I never worked with actual feathered soybeans. Anyway, the chickens live in a group, and if you put a lot of high-egg-laying chickens in a group of moderate- and low-laying chickens, the average sort of goes slightly up and everybody's fine and no big deal. But if you select the best-laying chicken out of lots of populations and you create a population that is artificially inflated with only super chickens, they kill each other in a matter of weeks. It's true research. It takes six generations of this culling and all the super chickens are dead.
So my takeaway from this is that the 10x engineer is real. There exists such a thing as a single super chicken that can elevate the output and the morale and the integrity and the personality and the joy of a team. But if you put too many super chickens in there, then you create these pernicious relationships between them that actually detract more than they add to what it is that you're trying to achieve.
I'll give you another example. I didn't want to be cruel, but that's Charles of Spain. Charles of Spain had no voice in how he came out, right? He was born and, just like all of us, he just fell into whatever his vessel was going to be. I'm not going to complain. I know, I'm white and I pass, and I have all sorts of opportunities that are granted to me because of the way that I... I'm the whitest Brazilian you will meet, and yes, I am Brazilian, born and raised. So what happened here is the same thing that happens genetically and phenotypically to royal families that do a lot of inbreeding for wanting to maintain purity. They're doing the same thing that they did with the super chickens, essentially. And so what ends up happening, if you do that in your team (so not chickens, not inbreeding, but in your team), if you only bring in people from the same universities, from the same programs, the same disciplines, the same domains, the
same age groups, the same demographics, the same original geographies, who speak the same computer languages, who speak the same natural languages, who have never gone through what it means to have to transliterate what's happening on screen to their mom, right? When something crazy goes on the news and my mom is asking me what's going on, I don't tell her exactly word for word what's on the TV, because they're going to be using idioms and they're going to be using jargon and shorthand that doesn't map to her mental map of the vernacular, and I have to do those ETLs on the fly. That's something that you need to have at your disposal in a team-wide environment, so that every single player is bringing something. Or not necessarily; I mean, it's a traveling salesman problem. It is not trivial to compute if you actually want to go out and see how many of each kind, and at what level you start having diminishing returns. I'll leave that as an exercise to the reader; I've not gone through that. But personally, my heuristics are from the last four teams that I have built from the ground up to deliver on data science promises that companies make to the street, and this is by far the method that has worked best for me.
And I think one of the things that you want to avoid... obviously you don't want to cause intellectual inbreeding; that's usually not something that anybody sets out to do. But what does end up happening is you have these meritocracies, which aren't necessarily meritocracies, and that's a whole other semantic discussion of whether or not that's a curse-McCurse word that I would use, but I'm not, because holy ground. I don't believe in meritocracies. I think they are a ridiculous exercise in self-pretension, but I'll talk over drinks with you if you want to discuss it. But survivorship bias is, I think, something that we really don't talk about much. Because what you're trying to do as you build a really solid kick-ass, kick-butt
team (I'm from Boston, this is so hard) is that... you know, there's this cartoon by David Whitaker, which I find just incredibly insightful, which is that we think of what we know as just this fraction of all that is knowable, and then we just join in everybody else's as knowable, as if it's this Matrix-like goo soup that you can just plug into and go, "Oh, I can fight kung fu now." That's not how it works. Everybody has a finite amount. Some of the bubbles are a little bit bigger, some of the bubbles are a little bit smaller, but really what you have is some little small bubbles here about one topic, and another bubble here, and another bubble here. And when you have lots of people together, what you're trying to do is plug the gaps. You're trying to figure out what those edges are, and if you have any edges that don't overlap, that tells you who's the next person that you need to hire. Personally, I don't hire for race, I don't hire for gender, I don't hire for mathematicians, I don't hire for engineers. I hire for what it is that my team needs, and then I bring that person in. And I interview lots of people who meet those needs, but that goes into the job description: "I want somebody who has experience managing a power plant in Saudi Arabia." Don't ask me why I needed that. I did. I found him. But you need to really figure out, of the things that you're being asked to deliver, what are the gaps, what are the lacks of overlap, and build that.
Now, as you build your teams, there's more than one way to skin a cat. And why would you want to skin a cat? Because they're adorable. But the joke aside, it's CSV conference; I needed a cat picture. I just did. But this is to remind me: there's a really interesting research paper from Amy Edmondson. There's a Google research team doing things on just people skills, and Amy Edmondson was heading one of these, and she found that the thing that makes teams more competent, with the definition
of competency being meeting all of the measures as defined by the overlords of Google (it's a well-established test, whether I agree with the parameters or not)... they did find that size of team didn't correlate with improved outcomes. The average seniority of the people, or the variance of seniority within teams: none of that mattered. It turns out that you could control for all of that and not have any significant impact on the outcome. The one thing that made a difference was psychological safety and low anxiety. So what you're doing is building a team that is capable of feeling psychological safety and feeling low anxiety. Now, I'm a wimp, so I need a lot of tolerance for my level of neurosis, but not everybody does, and I have been in teams that don't abide somebody with my level of neurosis, and we don't work well together. That's fine. The point is to identify that soon enough, to find a team where that's exactly the skill or the ability that will plug a hole in the team.
Have you ever heard of those testudo formations from the Romans, like in Sparta, where they have the rectangle and everybody has their shield, and they form like a perfect cube, and everybody's protected from the incoming arrows and everything? What happens is you have one person that just drops their shield, and there's a perfect entry point for things to go wrong. So as you're building your team, the whole point is to create one of those, where it's almost hermetically sealed and the arrows of your enemies can't make it through.
Again, mistakes can be made. It took me a long time to get to a point where I am better at building these teams, at coaching these teams, at telling teammates that they've grown too far, too high, and they need to move on to bigger and better pastures. That's a really hard conversation to have, because that opens up your overlap hole again, and you have to go find somebody else that
will be a good match for that team. So sometimes you do... You guys see what's wrong with this picture, by the way? Yes? Okay, good.
So, a couple of mistakes that can be made: what are the common blind spots that you have as you're building blended teams? Data analysis... you know, at some point we will have better definitions of what it is that we do. I think it's between eight and 30 years for some kind of coalescence on the nomenclature, but in the interim I'll keep talking. So, a couple of common blind spots that we have.
Software engineers. This is not one to one; I'm not saying everybody in column A has what's in column B, that's not it. But software engineers can often have blind spots in terms of sampling. They're like, "Just give me the data and I will engineer it beautifully, and I will plug it in and I will do it." There's this assumption that by the time they receive the data, it has been looked at, cleaned, ETL'd, whatever. And so if you have somebody who performs that kind of work and works together with that software engineer, that blind spot goes away like that. It's magical. They suddenly feel in a very visceral way that "this is my colleague, and the thing that I don't understand that this person does is this thing that I take for granted."
Deep learners. They have a slight lack of rigorous statistics, funnily enough. I know that sounds insane; totally true. Because deep learners are coming at it from a computer science bent, which sometimes makes it so that they don't have rigorous analysis, and I'm talking about math analysis and topology and those kinds of concepts that they can use as they articulate what they're doing. And so you have a lot of folks saying, "Well, these models are not understandable, these models are not understandable." And that's not true. These models are not easily understandable. I don't know that we have the technology or the expertise to get there yet, but give it 10 years. These models are
understandable. We just need to learn how to chunk them, just like we've done with everything else. I mean, the government is not understandable; the government is ultra complex. But you separate different pieces and you federate, and suddenly... I think we're going to have AI sociologists. I think we're going to have a discipline of people who go and investigate why it is that AI makes decisions the way it does, the same way that we have people who study humans and why it is that they make decisions the way they do, when the rationale for their decisions is also below their level of awareness. I think that's how I would like to think about AI: they're computers, they know why they're doing it, it's just very complex and very probabilistic, and that makes it difficult to understand. That's a tangent, sorry.
PhDs. I don't have one, so, not that I'm biased or anything. What they lack is ship-it-ness. They don't have a ship-it bone in their body. Y'all are special, because you care for truth. You are doctors of the philosophy of your domain, right? I get it. That's what I want from you. But I also want you to realize that for the magic to take place, humans at the end of the line have to purchase a physical or digital product that has been launched and marketed and priced. It has to leave the bench. So if you have somebody who doesn't have that kind of background or expertise, like me, that's the kind of conversation that you can have, and you can have that conversation with intensity, because I can feel that pain in a much more visceral way than the person who's not feeling that pain. And that's why it's important to have all of those different kinds of players on your team.
Managers... I'm going to jump that. Well, statisticians, because I'm one: we suck at unit tests. We're just like, "But it's intuitively obvious that, by the setup of this equation, it follows that... What are you asking me about?" "I'm asking because I don't have a PhD in statistics." And I'm like,
"No, neither do I," et cetera. But it's something that we really suck at, until you work with a computer scientist who tells you the benefits and why, and all of the nuances that go into it, and how to set it up, and "here are the six git commands that you need to learn; don't worry about it, just learn these six." And you work together in the way that a true organic team works together, versus two professors debating it with swords.
And then the last one: managers. Hi, I'm a manager right now. Recovering data scientist, but now manager, which means I'm useless, because my drive is to keep the team running and to make money so that I can reward my team, which is really close to p-hacking. So I have to constantly worry that I am not asking my team to p-hack the data that they are so excited to show me. I tell them, "No, the answer needs to be this," and then they're like, "No, Angela, the answer can't be whatever we want it to be. The answer is..." So beware of managers. Even when we try really hard, we're also susceptible to these blind spots that everybody on the team has, and that's the point.
So what you're doing when you're hiring, for the topologists and analytical geometers in the audience, is a multiple-objective decision analysis problem, where you have a multi-dimensional set of things that you want to optimize. You don't want to find the one person who has the best perspective on everything, the most senior, who has exposure to the greatest number of disciplines, who speaks the greatest number of languages. You want to find the distribution that has the compound value that outweighs the rest. It's a very complicated computing process. I wouldn't advocate actually doing it. But once you have a team that you've inherited, and you've assessed them and you want them to stay, as you're going to augment that team I would say that you should use a framework like this to make sure that you're not creating aggregations of skill at the
expense of skills that you're not going to have represented in your teams.
So how do you interview for these things? These are my favorite questions. Essentially, I want to know: how self-aware are you? That's the key that I want to get at, because if you're not self-aware enough to tell me what you're good at and what you're not good at, then I can't fit you in the slots that are good and not good, and then I can't hire you. So I want to know about your past projects, I want to know what you learned, and I want to know whether you can tell me about it. So I'm testing their communication, I'm testing their understanding of the modeling, I'm testing their understanding of the whole process that things have to go through. But I'm also trying to understand where they're not excellent, not because I'm then going to not hire them, but because I'm going to have a much better understanding of what price I'm paying for that balance that I seek. Does that make sense?
So then the last piece. After you've built this team that is well balanced and works well together, great, now what? The end? Drinks? No, I wouldn't do that. I wouldn't do that to Karthik. So I'm going to go back to Gilar's Pavka's slide, which I think is really phenomenal, because this is the next step. This is what we need to do; this is the how. You need to apply all of the skills that we know onto the processes that we need to walk through. As you're dealing with data collection, you need to worry about that data collection and the biases that can be introduced. As you're dealing with cleansing and transforming data, you need to worry about your data engineering: do you have any control over what's happening there? How many black boxes are doing your ETLs for you? In terms of modeling and validating, there are a lot of ethical concerns that go into that, where perhaps having ethicists or sociologists or ethnographers pipe in can give insight as to where things are maybe veering a bit off from where you would want to. In
terms of communication, you have to worry about security: security of your data, security of your message, security of your customers' data. And because this is the business side of the world, you have to worry about what the ROI is when you deploy. ROI is return on investment; I'm sorry, jargon, but it's important, because this is what fuels the ability to keep having more and more fun things to work on.
So your data team's responsibilities are, and this is set in stone, and this I print out and give to everybody who joins my team... "We're a data science team, and you know, the unicorn, and everybody's special," and it takes a couple of, I was going to say days, but it's weeks, to get everybody off their high horse. It does. There's nothing to it. And then you get them to understand that they need to focus on data quality, data collection, data engineering, security, ethics, communication, and the return on the investment from their shareholders, their benefactors, whoever they may be.
And so why do you have to care about those things? Why are those cares incumbent upon a data team? Why aren't those cares incumbent upon the investment management team or the PR team? Well, usually because nobody has cared about it this way until now. And I want to give my perspective on it, which is: it's not that they are nefarious. It's not that they haven't cared about this because they're evil and they want the world to burn. It's just that they're not aware. It's not what they do, it's not what they've been trained to do, it's not what they've been trained to worry about. You have many folks who are, and those are great to have on your team as well. But if you only seek folks who have that skill set already developed, then you lose this amazing opportunity to take somebody on who doesn't understand that yet and figure out what it is that the team needs to worry about. They'll bring fresh perspectives, and they will challenge the status quo processes, and they will say: does this still... this
process that we instituted in the dark ages of 2014, does that still apply to our data pipelines in 2017? So it's not that no one cares; it's that folks are unprepared to care. Everybody wants to be a Care Bear; they don't know how. Have you guys seen the Care Bears? Yeah? Come on. So the problem with the Care Bears: there was one episode where they all went and did their big thing that comes out of their belly, and they ruined everything, because it wasn't the right thing to do. They destroyed this nice little pond with their magical energy, and the pond became over-magically ponded, and, you know, it was terrible. And that's just because they didn't have enough information. They didn't know what it was they needed to do, and they came in and said, "How hard could it be? We're just gonna fix everything." And "how hard can it be" are the death-knell words in anything, because it's hard.
So when should you start caring, now that you know that you have to start caring, and it's on you, and it's not on anybody else yet? Unless you work in a company large enough that they've already had data leaks and then went out and hired people to worry about it. But in companies that haven't gone through it yet, it's incumbent upon you to care, and it's incumbent upon you to care now. I'm not gonna do Arnold because I'm no good, but you can feel it, right?
Because when you come in, what you expect as a new member is this magical data lake that is shaped just so for all of your needs. All of the formats are gonna be standardized. You're gonna have continuous integration already available. Permissions and access controls are in place, so people who shouldn't be seeing information are not seeing information, and everything is sometimes even air-gapped, so that you don't have exchange of data that shouldn't be there. You have your IoT ingestion taken care of, because half of your IoT information is coming from the Northern Virginia pipes, some of your IoT data is coming from
Tokyo, and you have to munge all of that together and normalize it, and everything is already done. There's a process, there's a script, you press a button, everything is magical. Performance is measured, and you have a dedicated teammate who's worried about performance, so they know when there's a dip. There's automated data quality control and automated anomaly detection algorithms. You have APIs exposed to the rest of the business. If you're advanced enough, you have APIs exposed to the rest of the world as open APIs, for which you've already gone through the whole process of aggregating and anonymizing. It's beautiful, right?
Because this is what actually exists: it's this sewage pipe full of plumbing. You have this little lake over here that's a little dark blue; you don't know if it's got a little bit of cobalt. You have this other little blob over here, looks a little green, might have too much iron in there, and I don't know how safe it is. You're gonna have to go in there with your scuba gear and investigate, spelunk a little bit. Is this healthy? Am I gonna have to clean it before I can make any use of it? Is it worth the effort of cleaning it all? That yellow one, I don't know if that's sulfur. I don't want to go; I don't have the sulfur swimming gear. I mean, you get the point. What you end up inheriting has been built for other purposes, unless your company started out as a data company, and that's very lucky. In my role, for instance: the last three companies that I joined, I joined to start a data science program. They didn't have one. They had all of these other things. And even though I'm being facetious and saying, like, "job security," what I think this picture says is that this is a good company to join. I'm giving you tips, if you ever want to leave the indie world, make some money, then come back, which is my plan. What happens is they build their infrastructure, their data lakes, their processes for the job that they
were asked to do, and until somebody came in with the mandate of looking at data and worrying about data and treating data as an asset, and a liability potentially, things already look like this. Part of having that data science discipline become something that is not an afterthought but truly part of the strategy of the leadership of the company means recognizing that those guys didn't over-engineer anything, right? To me, this means the teams that came before deserve rounds of applause, because they didn't over-engineer. They didn't build more than they were asked to build; they built to the specifications of what they had been asked. And now part of that is transforming what's extant, the legacy work, into something that you can now automate and transform into the pre-Willy Wonka chocolate factory work.
Now, if you've done this a few times, or if you've worked in companies that have forced you to feel this pain, you will realize that the heart can get really cold when all you've known is winter. And I'm here to tell you that there is something after winter. It's really easy to care: all you have to do is leave breadcrumbs.
So imagine this scene in accounting; I'm gonna paint you a picture. A new person comes into a company and they go into HR, and HR goes, "Oh, yep, you're in accounting, go talk to accounting," but they don't say where accounting is. So the new person is walking down the hallway trying to find accounting, and they're like, "Do you know where accounting is?" "Oh yeah, it's right there." And then you get into accounting and you go, "What should I do?" and they're like, "Oh yeah, here's your onboarding," because we've been doing this for 300 years. We know ledgers and credit and debit and lines of things and assets and liabilities. These are well-understood processes. I'm not trivializing them; they're not easy, but they are known. In data science, what ends up happening is somebody goes, "Why did we..." The analog would have been somebody goes
to an accountant at the company and says, "Why did we divest that 30 million dollar asset?" and the answer is, "Oh, well, yeah, I think Janet might know something about that. There was a git push somewhere." Like, no, the government would not allow that to happen. You need to have forms upon forms filed with the SEC about the rationale and why it is that you're going to do these things. We don't have those kinds of processes in data science yet. I think if we keep pushing the envelope and having black-box algorithms that judge whether somebody should be sentenced to life in prison or five minutes in the yard, that future is going to approach rapidly. But the way to do it right now, before there is a consensus as to what data science is as a discipline from an academic standpoint, is documentation. Documentation is how you program a business; you leave those breadcrumbs. Documentation is to business as code is to product. If I have to go and ask a human being why a business decision was made, somebody already dropped the ball, because that decision should have been recorded somewhere, and there should be a rationale for what the decision was intended to do, so that it can be tracked and verified and we can learn from it. Because not every decision is going to be perfect, and the hubris that only the perfect decisions get documented is a little bit silly: if we don't know the rate at which we make bad investments, we can't improve on those, we can't understand better how those get made.
And so what is it that you should document? When I say leave breadcrumbs, what breadcrumbs? Well, motivations for projects: not just results from projects, but what set you down this path, because if the project blows up, the idea didn't blow up, and you can turn left and try again and maybe find a better result. The reasons for you to believe your hypotheses: previous papers, preprints, blog posts, tweets, whatever it was that sparked the reason for you to pursue something. Dead ends: document the
stuff that blew up. Document stuff so that when you win the lottery and you leave, and you don't take your group with you and they're all left behind, they're able to follow in your footsteps and figure out, "We shouldn't spend the same three months that Bob spent on this, because Bob told us that it didn't work." I wouldn't prioritize those types of projects as the kinds of things that you show your management upline, but have it for yourselves. Paused projects, for the same reason, and conjectures for things that you should do next. Rules of engagement, taxonomies, things that will help the group. So this is more about documenting things that help the group, not things that help HR or the leadership team or anything like that.
And so what will you want to know about the data that you're documenting? You want to know what format it's in, including any transformations and any assumptions. You want to know how fast it's moving, at what volume, at what compression. You want to know from whom it's coming: what's the chain of custody around that data? You want to know if it's trustworthy, whether you have the chain of custody or not. You want to have some hypothesis on the pre- and post-processing that this data has gone through, and you want to know where it's from. You want to know information about the performance, the scale, any access controls that are in place, and the speed at which it moves through your infrastructure.
And that's the data side. The business side: what does the business really do? Did you know, I'm not joking, McDonald's is not a food company? McDonald's is a real estate company. Look it up, I'm not kidding. How does it make money? How does it create value for its customers? Who are the customers? Who's buying? Who's paying? Who are the competitors? Who's regulating? What assumptions are baked in? And importantly, do you have enough data to validate your answers to these questions, or are you just parroting the party line? It's very easy: during the onboarding
brainwashing that takes place, you just take things as gospel, I say from the pulpit. And you should challenge that. Don't strike me, but you should challenge things that are said from the podium.
And so this is something that's near and dear to me. I think I talked about it a little bit already, in that you want to optimize across all the multiple dimensions of skill, but I think having junior team members is a great forcing function. Junior team members most times don't know where in the multiple different skill dimensions they fit. So even if they fit, in many dimensions, closer to the genesis point, they're a forcing function for your more senior developers, senior data engineers, senior product managers, because they are not jaded and they can ask questions that all the senior people have forgotten to ask.
And that brings us full circle. This is how you build teams that work to solve a problem, whether that problem is a civic problem, an indie problem, or a corporate problem. And with that, I wanted to thank you all very much for spending a couple of minutes listening to me, and I hope it was useful.