Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015.
Okay, welcome back. We are live in Silicon Valley for theCUBE. This is our flagship program; we go out to the events and extract the signal from the noise. I'm John Furrier with my co-host, Jeff Kelly, Wikibon's chief analyst for big data. It's the end of the day, Jeff. You ready for the second wind here? Come on. We're at the end of the day. We're at the midway. Okay, Jike Chong, the head of data science at Simply Hired, is here with us. Welcome to theCUBE.
Thank you.
And a graduate of Carnegie Mellon, you're involved. You've got a PhD. You're a doctor. Do you take "Doctor"? Do you take "Doctor Jike"?
Sure.
Okay, let's see, two doctors in one day, two PhDs. A lot of PhDs in the big data world, obviously. A lot of computer science, math, a lot of stuff going on. So, tell us a little bit about what you're working on. I want to get your perspective on machine learning. Is it hype? We heard two counterpoints today: it's just, you know, hyped up, and it's the next big thing. Tell us what's going on.
Yeah, so I joined Simply Hired about two and a half years ago to head their data science practice. And Simply Hired, just a short intro, is a search engine, a job search engine. We serve about 34 million people every month. And the position there really started with the focus that we really need to help job seekers in that really dark time, right after the financial crisis. In the 2008 to 2010 timeframe, if you look at the statistics, we were really in that huge valley in terms of the unemployment rate, or rather the employment rate. And that's the position I'm in.
So what about the practical matter? Because Simply Hired's been around, we've been following that platform for a while. Unstructured data and Hadoop, you guys were around on the front end of that as a company.
So the LAMP stack, obviously; you probably have a lot of MongoDB possibly in there and other stuff, LAMP stack, normal databases. But what has been the evolution of big data? Because now with unstructured data, the database model's changed. Have you guys transformed that? Can you give us some insight? What has that enabled for value?
Yes, so part of the value that we provide is to turn a lot of unstructured data online, like the job postings that we aggregate from online sources, into a structured form such that people can search for it. And in that process, we aggregate about 10 million job postings every day. There is that process of turning all that data into a structured form in a Lucene index. We're using Solr now. And in that process, we have a workflow that converts all that into a database for our search index. Now, some of that workflow is dependent on how well we can extract the structure out of the unstructuredness of random blurbs of text that people post online. And on the analytics side of it, there are over 100 million job seeker events that we monitor and generate on our website every day. We dump all that onto Amazon S3 as the logs come in, and then analyze that continuously and dump those into Redshift. So we found Amazon Redshift was amazing, but we didn't get there first. We were using Hive queries first, which were effective, but didn't give us the kind of speed the data science team needed to be productive.
Did you also find the admin side was difficult and challenging? I mean, HBase, Hadoop, Hive, these are all awesome tools.
Yes.
Performance, extra bulk as needed on the hardware side, provisioning, and also labor. Did you find that to be an issue?
Yes.
Is that the reason why you went to Amazon?
Well, those were issues in terms of having to dedicate resources in order to handle that. Well, we have been using all those tools on Amazon clusters all along. But what really made the difference was two key decisions.
One is democratizing who can have access to the data; having something that someone familiar with SQL queries can use is one thing. And another side of it is how fast they can have access to it. Before we adopted Redshift at the beginning of 2013, well, we had to write Hive queries, and a large amount of the data scientists' time was spent in the ETL state or situation, rather than the analytics situation or the learning situation. So that was a huge help. But once we had Redshift, and we were one of the first adopters of Amazon Redshift as soon as it came out of beta, we were able to get a whole team of interns on it, and the interns loved it. And two of those interns actually became full-time employees in our data science team after that.
So, on the machine learning question: competitive advantage, native, born in the cloud, born in the app world. I mean, where does that all fit in?
So for the machine learning side, there are multiple aspects of machine learning being done at Simply Hired. There is the natural language processing side, where we're bringing structure out of unstructured data. A typical challenging problem is, I mean, the job posting space is highly complex. The government, in its coarse classification, classifies all the jobs into 1,110 categories. And we actually classify them into even more refined categories underneath, in order to, for example, estimate what the salary would be for a job, which we are then able to use to filter and assess and make job search results more relevant. So that's the natural language processing, unstructured-to-structured side. There's also this other side about the business, about user interactions, about personalization. And for those, because we're generating the events, we're generating the signals, they are structured to start with.
So it's a little easier to analyze from the structured-versus-unstructured side, but a little harder to bring out the real knowledge from it, because there is a lot more interpretation involved. And right now, we have several efforts on how to turn those into return on investment. So for example, increasing relevance based on previous job seekers' behavior on our website was a project that gave us a 6x return on investment and over a million dollars in additional revenue, just with that one project alone.
Are you guys doing any external signal extraction? Obviously search engines, other sites, and what's the strategy? How do you look at that? So you have your working data of your existing profiles.
Absolutely.
Are you guys doing any cross-correlation between that, and can you share some ideas that you guys are working on?
No, that's a great point. For salary data, for example, it's highly contentious. Well, where do you get the real data about salary? Well, the government-
I don't realize what their salary is, but I would tell you too is what I'd say, come on, you know.
But that's an important one, right? That's a very important one. So there are statistics, for example, from the government; the government does surveys, for example for unemployment benefit purposes. And then there are also other data sources that are user generated, like people posting their salaries on sites like Glassdoor or Salary.com. And then there are other ones where the employers are including some amount of salary information in the job posting itself. So with all these sources, for the same job title, there are varying types of data that we get. And when we were doing some of this benchmarking, we saw that, well, self-reported data are usually higher than average. And the government sources are somewhat okay, but somewhat lower than what we usually expect. I'm not saying anything about the intent there.
And then for other sources, like the self-reported ones from the employer side, well, that was an interesting piece. What we found was that they were consistently higher than average, and it sort of makes sense, because why would you show how much you're paying if you're paying less than average, right?
Right, but you take that all into consideration when you are saying, here's what the average salary is for this position or whatever. You know that some of the self-reported salaries are going to be higher, so you take that into consideration, and that's part of what data science is all about.
It's part of understanding the landscape and always being critical about any data sources that we get.
So we were talking briefly before we came on air about building a data science team and trying to find that perfect data scientist that has all the skills you need, from statistics to the communication skills to the domain knowledge, et cetera, and you take a little bit of a different approach, probably born out of real-world experience. Talk a little bit about how you approach that at Simply Hired.
Yes, so building a data science team is a challenging one, especially if you look at the job board out there at the conference. All those companies like Uber and Google and all those different companies are competing for the same pool of resources in the Silicon Valley location. So at first, you know, we were looking for data scientists that have a broad range of experience, but those were actually really hard to find in the first place, and even when we find them, they may or may not come to a particular company, because there are so many different companies competing for them.
Yeah, or if they come, they may not stay long, because they're going to get another offer next week.
That's right, yeah. So what triggered it for me was DJ Patil's conversation about data science as a team sport, that it's really about building a team.
So now we have experts in data science who lean more towards natural language processing, data scientists who are highly competent and are also really good software engineers, and data scientists who are PhDs in physics or materials science or other fields, where they're bringing a very distinct set of knowledge to the space. So for example, for our operations field, for how we can optimize our internal processes with the different channels that we're bringing in, it's an optimization problem. And that's actually not a very common data science, machine learning kind of background. But we still need it. And it's that variety of expertise that we're beginning to be able to form, and we're seeing that we're reaching critical mass in the speed and the velocity of producing new products and services.
So it may cost you more because you've got more bodies, you've got more people, but it increases productivity to such an extent that your ROI essentially on that model is delivered very quickly, and then it's onwards and upwards.
Absolutely, and one little secret there is that all these data scientists, they're smart people. So they may have a certain background that they can bring to a project to start with first, but given the landscape of the specific company, they can adapt as well. And sometimes it's actually a benefit for a company our size (Simply Hired is about 150 people), where we're providing the data scientists an opportunity to work with the VP of marketing, the VP of sales, looking at real-world problems, and allowing them to grow beyond their original scope of knowledge.
They can grow their careers, but you also bring up another good point: with data science as a team sport, some of those team members are the business people, people who are not data scientists. Talk about how you broker those relationships inside of Simply Hired.
So that's actually one of the things I tell my data scientists even before they become data scientists at Simply Hired: there's something they need to be prepared for. I mean, many of these people are very smart, at the top of their class, and have been very good at communicating with their peers who are data scientists or who are scientists in their field. But then when they come to the industry, they're beginning to talk to people who have no idea what they're talking about. So what you're speaking about, Jeff, it's definitely a challenge, but there's a certain characteristic that we're looking for.
But can you talk about some of the cool stuff that's going on in and around your world? Talk about the future of things like neural networks, and what you see with social media. The internet of things is people too, and people are things. Thing One and Thing Two is a famous book I used to read to my kids, but it's not just sensors, it's just probes out there. You're searching all kinds of data; there's active data, there's passive data. Are we in a distributed neural network that's just going to turn into one global tissue of knowledge? What's the deep learning vision? As the PhD, how do you look at this? What do you attack first? Where's the sequence? What's the progression that it'll look like to get to that AI, that value that people talk about?
That's a really deep question and we've only got two minutes. Go. I'm only kidding.
Well, the challenge here is really, how do we communicate that knowledge? I mean, each one of us has this wetware, like this neural network within ourselves. But the thing is, training between people continues to be a challenge. So how do you pass information from one person to another, one data scientist to another, the data science team to the rest of the organization, and the organization to the rest of the world, in terms of what data science products and services you can expose to the rest of the world?
So in terms of the fundamentals, the deep neural network training side, part of the advance in the past few years is Geoff Hinton's technique for being able to train the deep neural network without a lot of training data, self-training the deep neural network, which is sort of what humans do, right? We get some feedback, but often we don't know what we're trying to learn when we're two years old. But another area of it is the speed at which the machines are learning. So the GPU, the networked training algorithms, the Google Brain, and all those technologies allow us to learn much faster. And then one step higher, those learnings then need to be adapted to the products and services. Right now, my research team is working on bringing some of the more abstract algorithms from Google Research Labs, like word2vec, into how to make those algorithms work for helping people find jobs. And then those, when they become products and services, become the channels with which we can communicate to the rest of the world, in terms of linking together how different companies and how different industries could work together.
Well, essentially, the algorithms are one thing, but it's how you apply them that's really where the value comes in.
And the speed of communication between the people who have that knowledge.
We've only got time for one more question. I wanted to touch on what you're seeing at the show in terms of what's getting you excited. You mentioned you were at the Hardcore Data Science day yesterday. What did you take away from that event?
The industry is evolving. Like what many have observed, in the previous few years, people were talking about the technology, and that has been building up. But one of the areas that I did my PhD in was how to help domain experts better use highly parallel computing platforms.
When I was doing this 10 years ago, I was looking at high performance computing, but now everything is high performance computing; even your laptop or your iPad has a GPU with the capability of yesterday's supercomputer. So now with those platforms, what is still a challenge is how those platforms can actually help the applications. And what came out of yesterday's Hardcore Data Science track was that there are a lot more intelligent people who are experts in the application space now realizing that if they make their infrastructures more broadly adaptable, in other words, make their applications into application frameworks, then a lot more people can take advantage of the underlying platform. And I think six or seven talks out of the day's talks, the majority of them, actually were talking about application frameworks, and many of them may not even have mentioned the words "application frameworks."
And well, that's one of the themes we've been talking about over the last couple of days: where are all the applications? And part of it is, as you said, developing a platform that enables other people to continue to build out applications. And that's part of the reason we haven't yet seen a huge kind of proliferation of these kinds of applications.
Doctor, thanks for coming on theCUBE.
Thank you.
Of course I have to say that, it's a PhD. Appreciate it. I mean, Carnegie Mellon. Just a quick plug, I'll give you the final word. The most exciting thing that you're jazzed about, looking at the ecosystem; certainly the show here is just a context of what's out there. What are you really excited about?
I'm actually really excited about the government role in all this. I mean, this morning at the keynote, we saw DJ Patil, the first chief data scientist of the United States of America. And I personally experienced how the government is really looking at how to use data to help.
Last year alone, I represented Simply Hired and went to the White House to take part in data jams on how to bring new technologies to help people find jobs. And through that process, many government officials, people in the Census Bureau and the Bureau of Labor Statistics, were excited about helping job seekers find jobs. And now, as we see the employment rate increase and unemployment drop, I see that there is a large part of the government role in that particular ramp, and I see the effort there. And it's a pleasure to have been a part of it.
Well, thanks for coming on. I really appreciate it. Again, DJ is awesome. I mean, the first chief data scientist. You've also got a lot of action going on under the administration. A lot of tech folks in DC now; certainly the ex-VMware guys are over there. And you've got some Google folks in there now. So the CTO is fantastic, that's Megan Smith, and whatnot. So this is theCUBE. We're live in Silicon Valley, extracting the signal from the noise. It's open. It's a great environment here. Great stuff. Open source is leading the way. Certainly a lot of change in data is helping the government and businesses. We'll be right back after this short break.