 with Mike at one of his pair programming sessions, and I was like, okay, is there anything I can do for you? Because pair programming was quite fun, I'd never done it before. And he was like, okay, why don't you try this? Now that I've got this job, I was like, okay, yeah, let's go ahead.

Okay, so, yep, I'm here to talk about how to get a job as a junior data scientist. These jobs are actually quite rare, and they have gotten slightly rarer, so you have to bear with me, but basically it's still possible as long as you don't constrain yourself to just the title "data scientist". And there'll be lots and lots of anecdotes ahead, because this basically reflects my own experiences.

Yep, so a little bit about me: I am at the Singapore government, I just started, actually, as a junior data scientist. And before I go any further, you saw the disclaimer just now, but yes, this is very important. These are my own views, not anyone else's, and especially not my employer's.

Okay, so today this is roughly what I'm going to cover. I'm going to talk about what companies want from data scientists and some differences versus the more common data analyst JD. What a data scientist actually does. Some resources and portfolio examples, if the second point doesn't actually scare you enough. And then pathways to that first data science role and questions you should ask.

The other thing is that I'm not sure how many of you are actually familiar with what a data scientist actually does. Can I have a show of hands first? Yeah, so okay, basically not the majority. If you get bored, you can start digging your nose or put your hand on your head or something, so I know to speed up. I'm trying to make it entertaining for everyone.

Okay, so before I did this talk, I thought I would do a bit of homework. So I went to JobStreet and I scraped a bunch of job descriptions and I put them into a word cloud.
The word cloud is pretty, but I didn't use code to do it, I just chucked it into wordclouds.com. Yes, because it's prettier than the Python one, I'm sorry. So unfortunately, there were only 85 positions open on JobStreet when I scraped it at midnight two nights ago. So I took about 40 and I did some very minimal pre-processing, and clearly what jumps out at you is that people seem to want machine learning, experience, models, business. In one corner, I think you can see Python, yay. And you can also see SQL if you look hard enough. And I think that's MATLAB somewhere. I don't know, yeah, okay.

So that's data science. What about data analysts, which actually have about 10 times as many job descriptions available on JobStreet? Basically you can see things like business, experience, analysis, and email. I don't know why. Somewhere in there, there is Excel. You can also see Python in the bottom left, sorry, the bottom right-hand corner, and actually quite a bit of SQL, which is great.

So yep, companies seem to basically want experience, but if you don't actually have experience and you might be doing something completely different, what can you do, right? You have to get a bit lucky. But there is a line in the song that says you stay up all night to get lucky. So I like that take on it. You basically have to prepare a lot and just keep applying.

So now that you know how competitive it actually is, what does a data scientist actually do, right? Okay, quite briefly. This is just a generalization, because there are a lot of different companies, and for each company the role could be a little bit different depending on the business needs. Okay, so what do data scientists do? They get the problem statement from their managers or whichever stakeholders, and then the data, which can come in many different forms. Yep, lots of variety.
So then you clean the data to the best of your ability and you try models to predict some outcomes, whatever is in line with the problem statement from your manager. You update your stakeholders and go back to step one. So it's kind of project-driven that way. Also at the very bottom, you'll see that I've put down that you have to keep up with research for new ideas. Basically, for more new toys to play with, although don't quote me. But the field is moving very, very fast, so yep.

Okay, so how do data scientists stack up against analysts? You can see in the scientist column that you tend to look at models. You tend to be more project-based, and the requested skills tend to be Python, sometimes R, SQL databases, and keeping up with research. Analysts tend to do more business-as-usual things. For instance, Grab has a lot of analysts looking at KPIs and monitoring them with dashboards. Or they churn out reports, which they can do with SQL or in Tableau, and which basically are just multiple dashboards exported into a PDF file, because this is reality. Not everybody has a license in real life. So frequently requested skills are, again, databases, quite often Excel as a back end, and data visualization tools such as Tableau, Qlik, and a number of others. D3 is not that common.

Yeah, okay, in reality, this is taken from February. I think it's possibly gotten worse. So this tweet is from a lady who is a data scientist, and she put a poll on her Twitter asking how much of your time you spend cleaning data. The overwhelming response was that most people with the title of data scientist spend over 60% of their time cleaning data. And only 23% spent 60% of their time analyzing and predicting, and there's still some work to do in production, so Daniel, we'll look for you at some point.

Yeah, so this is also reality. This was from my previous job.
We were trying to figure out how come two versions of the same thing didn't give us the same results. In fact, they gave us completely opposite results. Fortunately, this is open source, it's from my own GitHub, so you can look at it if you want. But basically what was happening was that the original guy who wrote the package said, okay, we have this, and then the guy who wrapped it in R said, oh, you know, when I was writing the wrapper, I thought it would be good to change the random number generator. And so the results are completely different. Yeah, that will become your life sometimes. It depends. We couldn't figure it out, that's why we asked. But I would have been able to figure it out eventually if I had looked through the source code, maybe. Except one was in C++, that would have been fun.

Okay, so also reality: understanding this paper. The second page is actually what you would have to put into code. Basically, this is how you would want to formulate some features. So you have to formulate your features to look something like that.

Okay, so this is also reality. This is at the student level, a new paper from Stanford's CS224 class. It's about the BERT model, which just came out in October 2018 and has already been integrated into Google Search. So you can see it's quite competitive. The students are doing what looks like a fairly good job to me, but I'm not an expert. So basically, you end up constantly studying.

So after seeing that, how many of you are still interested? You'll be spending most of your time doing data cleaning and crying tears of blood at, you know, this, right? Because besides data cleaning, you still have to understand this and sometimes put it into production, and do some differentiation so you can change the loss function, which actually happened. So how? Still interested? Or do you want me to go home? Okay, so if you are still interested, actually, this is the cool stuff.
I really like this. I don't know how many of you have seen this. Oh, shoot, sorry. Yes, so this is a car park space detector built with Mask R-CNN as the object detector, and the tutorial is actually here. You just have to search for "car park detector". I think you should be able to find it. Lots of copycats, yep. And it's in under 500 lines of code, but the trick is how to collect the data and label it and, you know, train it such that you produce results. That is the hard part, right? Oh, sorry.

Okay, then, or you could try doing things like challenges. This is a challenge from a long time, well, actually not so long ago, by Grab. This one is computer vision, this one I think is graphs. And then this one, I'm not too sure, I didn't really look at it. I thought this was quite interesting because these are common real-world problems.

Okay, so some tools that you might want to consider picking up, oops. So Python plus your library of choice, because Python has the most things. It's like a buffet, you just decide on what your favorite things are. The basic stack is usually pandas, numpy, sklearn, keras, and pytorch, which covers machine learning, deep learning, and data cleaning. Data cleaning is actually very, very important. Possibly more than anything else. Also SQL, but then you have to figure out how to run a database, which is okay, Ken.

So within deep learning itself, or machine learning, you have to pick a specialization, because you saw the Grab thing, right? It's actually three different things. No one, frankly, has the time to cover everything. So you pick one and you stick with it, and possibly you do a lot of Kaggles, so you figure out what people are doing and what the latest research is. Because there's a mad rush to identify state-of-the-art methods, then you can just go, hey, I win because my model is bigger than yours, which is a bit of a problem, but sometimes it works.
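To make the pandas and sklearn part of that basic stack concrete, here is a minimal sketch. It is not taken from the talk or its slides: the table, the column names, and the `clean` helper are all made up for illustration, and logistic regression is just one arbitrary choice of model. It only shows the rough shape of "clean the data, then try a model".

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical messy data: a duplicated row, missing values, inconsistent labels.
raw = pd.DataFrame({
    "age": [25, 32, None, 41, 41, 29, 55, 38, 47, 23],
    "city": ["SG", "sg ", "KL", "SG", "SG", None, "kl", "SG", "KL", "sg"],
    "monthly_spend": [120.0, 80.5, 60.0, None, None, 95.0, 30.0, 150.0, 45.0, 110.0],
    "churned": [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning: drop duplicates, normalise text, fill gaps."""
    out = df.drop_duplicates().copy()
    # Make "sg ", "SG" and "sg" the same category; label missing cities explicitly.
    out["city"] = out["city"].str.strip().str.upper().fillna("UNKNOWN")
    # Fill missing numeric values with the column median.
    for col in ["age", "monthly_spend"]:
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)


cleaned = clean(raw)

# One-hot encode the categorical column so the model only sees numbers.
features = pd.get_dummies(cleaned[["age", "monthly_spend", "city"]], columns=["city"])
X = features.astype(float).to_numpy()
y = cleaned["churned"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

With a toy table this small the accuracy number means nothing. The point is only that most of the lines are cleaning, which matches the poll mentioned earlier in the talk.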
So then one of the other things that you can do, which I think not a lot of people are doing, is end-to-end machine learning and deep learning projects. One of them is the car park detector thing, but try not to copy from it, because it's very obvious. Then, obviously, after you've done all of that, you want to use GitHub to show that you can use version control. You might also want to consider open source contributions, which is something I'm personally working on. If you can, you try and win a challenge or two. Then all of that together gives you enough experience to stand a good chance: when you get to the interview, you have something to talk about for at least 10 minutes.

Okay, so anyway, resources I like, which I personally found useful. I wish someone had told me this when I first started, because I had to Google and spent a lot of time on Reddit figuring out what to do. So Twitter is something that a lot of people actually use. I think OpenAI is active on Twitter. OpenAI is also active on Reddit. This is Kaiming He, responding to some criticism of Kaiming He on Reddit. I think this was in April, I can't tell. Okay, so then GitHub, you can also see this. Papers with Code. Oh, the two guys just now are responding to me. One works at Amazon. I think the other one works at Criteo, which is basically an ad tech company. So I'm not getting paid to say any of this, but I like PyImageSearch and the guy who did the car park detector. And I also like fast.ai.

So if you are still interested, this is the slightly more controversial part: pathways to that first data science role. It might not be that straightforward. Okay, so I went to school. I'm actually still in school. It's been about two and a half, coming to three years. I'll only graduate probably, hopefully, end of the year, or maybe not, next year.
Yeah, so the unintended consequence of going to school for a career change is networking. I know everyone has a different starting point, and not everybody might be able to go to school like I did. So some of you are going to boot camps and some of you are trying to self-teach. Personally, I felt that going to school was actually worth it because of the networking, but you might beg to differ. The other thing was that going to school allowed me the time to do all these projects. These are some of them, and under quite a lot of time pressure, because each semester is about 13 weeks, and you basically only have about a month and a half to deliver, because you're learning about the latest developments in that domain, like NLP or, sorry, computer vision, during the other seven weeks. So I might be biased, because this is my personal standpoint.

The other thing I want to address, which emerged a little bit when I was talking to some of you, is: could the field be saturated? I personally found it very hard to get a junior data scientist job. It was difficult for me. It would have been a lot easier for me to settle for something in business intelligence. Or, you know, not really doing data science but data engineering, or data cleaning for computer vision, which is really a big unaddressed problem, apparently.

But this guy is pretty qualified and he just posted today. Okay, I'm going to summarize his wall of text. He's basically a machine learning engineer who has a PhD in an unrelated thing, basically physics. And he's like, okay, so I have a few papers and these buzzwords. In the first paragraph, he has a bunch of buzzwords. I don't know if you can see them. He's like, oh, I've created a GAN, and, you know, basically a GAN that was better than the best thing before GPT-2, which is a very powerful text generator. And he's got a bunch of things, but he's like, I can't get a data scientist job.
What do, right? So it's incredibly, incredibly competitive. And then the other problem is this. This is actually a graph taken from someone's GitHub repo. He scraped people's job positions from before and after they did a master's. Basically, anyone below the dotted line took a pay cut after. You can see quite a lot of people did that, especially the people earning more than $5,500 a month, which is where the two lines intersect, right?

So after you are aware of all of this, if you still want to do it, these are some potential starting points. It depends on what you like. Personally, I like coding. So I tried research; I was actually a research engineer. Basically, a research engineer implements algorithms for the research scientist, or basically the data scientist. If you like visualization and are good at it, data analyst and BI specialist roles are a lot more common. Basically 10 times as common as your normal data scientist role. If you have good backend fundamentals, you might want to consider machine learning engineer or big data engineer.

I don't think I have that much time to talk about this, but we can go back to this later. Basically every role has its problems. The data architect might have unpredictable monthly bills, which we saw just now. And the data analyst will definitely have problems with using Excel as a backend, because that's quite common in many, many places. Yeah, Excel, let's go.

Okay, so after you've picked your specialization, how do you find the right one? I mean, job. This is actually a slide from nine years ago by SoftBank. The link is given below. I made it bigger so you all will believe me. You can go to the link later and find it. Okay, this is their 30-year vision. I shit you not, it happened. Sorry.
Okay, so the whole point is that if you did a bit of due diligence on SoftBank, maybe recent developments regarding SoftBank might not be so surprising, because they had some companies which I think reflected this sort of ethos. So that aside, I hope I made the point about doing your own due diligence. Make sure you apply, and prepare for rejections. And when you are doing the whole process, make sure you understand your strengths and weaknesses. This is very, very generic, right?

But personally, I like to ask factual questions about the job. Like, what do you use as a backend? If I had asked this question, I would not have made the mistake of accepting a role, or of spending so much interview time on a company that eventually just uses Excel. Or, is there version control? Do they know what that is? Or, what two things do you spend the most time doing? If they say meetings, what are you gonna do, right? Okay, so personally a green flag for me is a technical interviewer, because I like being technical. Other people might not like it, so this is my thing.

Okay, so in summary, I don't know if this has been 20 minutes. Get your hands dirty, know what the code does. Please don't just say "model selected for best accuracy", because candidates have been rejected this way. Not by me, but I get a lot of feedback about this. I'm like, I'm sorry, but you know. Then the second point is that there are many paths into data. The field is constantly evolving. And we can go back to that slide later to look at roughly what the different paths are. And then when you are doing the interview process, it's very common sense, but do your own due diligence and ask a lot of fact-checking questions, otherwise like this, okay, like this.

Yeah, okay, questions? If not, then I think that's it. Thank you. No worries.