Live from Cambridge, Massachusetts, extracting the signal from the noise, it's theCUBE, covering the MIT Chief Data Officer and Information Quality Symposium. Now your hosts, Dave Vellante and Paul Gillin.

Hi everybody, welcome to MIT CDOIQ. We are in Cambridge, Mass. This is theCUBE, SiliconANGLE Wikibon's continuous coverage of MIT CDOIQ, which is MIT's chief data officer conference. Nikola Havanovich is here, he's the Director of Data Science at Intel. Nikola, thanks for coming on theCUBE, good to see you.

Absolutely, thank you for inviting me and having me.

So, your role at Intel. Data science is relatively new terminology that gets thrown around. It could be argued that data science has been around for a long time, but you're increasingly seeing that title take shape, certainly over the last three, four, five years. How did it come about at Intel?

So I think, generally speaking, the role of the data scientist was bound to happen. As for Intel, a lot of people think that Intel is just a hardware company, but that's not so. We've been in software for a number of years, everything from investments in the development of open source Linux at the beginning of the 2000s, then around 2005, 2006, when virtualized environments came about, Intel made a significant investment in VMware and their line of hypervisor products. Then around 2009 we saw the move to the cloud, and, as you know, last year we made a significant investment in Cloudera and Hadoop. We believe Hadoop is a major inflection point, one that's going to dwarf all the previous inflection points, because Hadoop's ability to ingest heterogeneous types of data sources, structured, unstructured, and semi-structured, and its increasing ability to analyze those sources effectively, will keep changing business and how it's done.

So what does a data scientist do at Intel? You analyze data, obviously, but talk more about how you apply that.
So our organization is a customer-facing organization, right? We're part of the Data Center Group and we work in a number of verticals, so our team works for a number of different organizations, and we do a lot of interesting work, everything from the healthcare and life sciences arena to FSI, manufacturing, you name it. As a matter of fact, I'll be giving a talk later this afternoon, and one of the use cases is from the healthcare and life sciences arena: predicting readmissions for healthcare providers. A lot of hospitals are interested in doing that because, for example, if a patient gets treated for a specific condition, let's say pneumonia, gets released from the hospital, and then 30 days later comes back to the same hospital with the same condition, obviously something went wrong and the condition wasn't treated properly the first time. That's called a readmission. And if that happens, the hospital doesn't get reimbursed for either the first or the second visit, which has significant financial implications. So a lot of healthcare providers are really looking for new ways of improving their quality of service, providing the right type of service to patients in order to reduce these situations. So we developed a set of advanced analytics, analyzing their different internal data sources, and we also brought in a number of external data sources to enrich what they had internally, to build an analytical model that helps them predict and prevent readmissions.
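A readmission predictor like the one described could be sketched, in very reduced form, as a logistic scoring function over patient features. Everything below is illustrative: the features, weights, and bias are invented for the example and are not Intel's actual model.

```python
import math

def readmission_risk(features, weights, bias):
    """Logistic score: estimated probability that a discharged patient
    returns within 30 days for the same condition (toy model)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: [age / 100, prior admissions, length of stay / 10]
weights = [1.2, 0.8, 0.5]   # illustrative values, not learned from real data
bias = -2.0

low_risk = readmission_risk([0.30, 0, 0.2], weights, bias)   # young, no history
high_risk = readmission_risk([0.85, 3, 0.9], weights, bias)  # elderly, 3 priors
print(low_risk, high_risk)
```

In practice such a model would be trained on the hospital's internal records enriched with external data sources, as described above; the point here is only the shape of the prediction step.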
When you think of companies traditionally on the short list of major players in the Hadoop market, Intel is usually not among them, your investment in Cloudera aside. Why should Intel be on that list, or how should people think of Intel when it comes to Hadoop and this inflection point you're talking about?

So I think Intel is constantly earning its right to be a trusted advisor in the market. When we deal with our customers, we often build a relationship in order to establish ourselves as a trusted advisor. Additionally, we try to be market leaders in developing new types of technologies and driving the market toward them, letting the market take advantage of these new technologies as they arise. I hope that answers your question.

Well, are there certain technologies you can point to, open source technologies perhaps, where Intel has been a major part of the development effort?

Yes, absolutely. Through our investment in Cloudera, we developed a number of security features for Hadoop that have now been integrated into Cloudera's roadmap. One of them is called Project Rhino, which enables cell-level security access controls for HBase, which is a database for Hadoop. This is something that was developed by Intel. And by the way, everything that we do, we put into open source. We are 100% open source, so Project Rhino is an open source project, anybody can take advantage of it, and Cloudera has integrated it into their roadmap. In addition to that, we do a lot of things at the hardware level to make sure that the processes that underpin Hadoop performance run best on Intel architecture.

So when you mention cell-level security in HBase, I can't help but think of Accumulo. I don't know if you're familiar with it.

Yes, I am.

So help us understand how those fit together. You've got multiple open source projects.
Is it just sort of put out there? Can Accumulo utilize some of the Rhino technology? Can HBase utilize some of Accumulo? How does that all fit together?

So Accumulo, and Sqrrl, the company behind it, a local company here, right? That's their own separate product that they are going after, developing, and supporting. HBase is part of the standard Cloudera distribution. If you look at Hortonworks, they use HBase; they don't use Accumulo. You can use Accumulo on top of those distributions if you choose to, but it's usually not part of the standard distribution, and the same goes for Cloudera: they have HBase as their default standard database, a kind of key-value store, and they support it. But if you decide to run Accumulo on top of it, nobody will prevent you. The capabilities provided by Rhino are similar, in terms of cell-level security, to those provided by Accumulo, but they are for HBase, and with HBase being a standard part of the Hadoop distribution from these major providers, that's the value.

So how does this all roll up to a business for Intel? I don't think of software. I mean, is software just a big part of your business that's not commonly talked about, or does this come back somehow to the hardware the company designs to support these kinds of technologies?

Very good question. So at the end of the day, it does come back. First of all, it's all about the customer's experience. We want to enable our customers from a variety of perspectives. From the business case, of course, we want all of these things to run on Intel hardware and Intel architecture. We're driving these major markets, and the big data market is a multi-billion dollar market that keeps growing. So by enabling customers and the adoption of these technologies, our hope, of course, is that it will run on Intel architecture.
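To make "cell-level security" concrete, here is a toy in-memory sketch of the access model behind Accumulo and Project Rhino: every cell carries its own visibility label, and a read succeeds only if the caller holds that label. This is not the HBase or Accumulo API, just an illustration of the idea.

```python
class CellStore:
    """Toy key-value store where each cell has its own visibility label,
    in the spirit of Accumulo / Project Rhino cell-level security."""

    def __init__(self):
        self._cells = {}  # (row, column) -> (value, required_label)

    def put(self, row, column, value, label):
        self._cells[(row, column)] = (value, label)

    def get(self, row, column, user_labels):
        value, required = self._cells[(row, column)]
        # The read succeeds only if the caller holds the cell's label.
        if required in user_labels:
            return value
        return None  # access denied: the cell is invisible to this user

store = CellStore()
store.put("patient42", "diagnosis", "pneumonia", "phi")  # protected cell
store.put("patient42", "ward", "3B", "public")           # open cell

print(store.get("patient42", "diagnosis", {"public"}))          # denied
print(store.get("patient42", "diagnosis", {"public", "phi"}))   # allowed
```

The key point is that authorization is attached per cell rather than per table or per column family, so two cells in the same row can have different audiences.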
So without asking you to pre-announce products, can you speak to how these might manifest themselves in Intel chips in the future?

You're talking about the hardware?

Well, things like optimizing Hadoop, or securing Hadoop.

Okay, thank you. So yes, there are a number of things we're doing. Essentially what we're trying to do in these cases is push all these very intensive jobs that are typical of Hadoop, the I/O, the encryption and decryption, the heavy analytical cycles, down to the silicon. To give you an example, there is a set of instructions called AES-NI that's now a standard component of Intel chips, which allows you to do encryption much, much faster, at the silicon level instead of the application level. To give you some numbers for comparison: if you were to encrypt your data as it comes into your environment at the application layer, you could suffer anywhere from an 8 to 12% penalty. If you do it in silicon, it's probably between 1 and 3%, which is a much easier thing for an organization to accept to keep itself safe. Usually it's not going to hinder your production process.

What's Intel's perspective, or your personal perspective, on Spark? You know, for batch, a lot of people talk about how at Google and others, MapReduce is sort of old news now, they've moved to Spark, and even Spark maybe being old news now with in-memory types of technologies. What are your thoughts on Spark and its ability to change the nature of Hadoop from batch to near real time?

So, I think obviously it's extremely useful, right?
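The overhead numbers quoted above work out roughly as follows. The 10 Gb/s ingest baseline is a made-up figure for illustration; only the 8-12% versus 1-3% penalty ranges come from the conversation.

```python
def effective_throughput(base_gbps, overhead_pct):
    """Throughput left over once the encryption overhead is paid."""
    return base_gbps * (1 - overhead_pct / 100)

base = 10.0  # hypothetical 10 Gb/s ingest pipeline

# ~8-12% penalty encrypting in the application layer (midpoint 10% shown)
software = effective_throughput(base, 10)
# ~1-3% penalty with AES-NI doing the work in silicon (midpoint 2% shown)
hardware = effective_throughput(base, 2)

print(software, hardware)
```

On those assumptions the difference is 9.0 versus 9.8 Gb/s of usable ingest, which is why a 1-3% silicon-level cost is "much easier to accept" than encrypting in the application layer.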
You eliminate the unnecessary need, in many cases, to write data back to disk between the mapper and the reducer and then read it back for the reducer. That's an unnecessary bottleneck; anytime you hit a hard drive, that's a problem. So from that perspective, Spark is extremely useful. This in-memory kind of distributed processing is extremely useful. One thing I can say, and this goes back to the previous question from the hardware perspective, is that what Intel has also developed, and is developing, is the capability of using solid state drives to complement main memory. So imagine that instead of having, you know, 256 gigabytes of RAM, you now have a terabyte of main memory. That means your Spark processes can run fully in main memory. And to add to that, of course, SSDs are never going to be as fast as main memory, but they're going to come very, very close, because they're going to sit in the PCI Express slot.

And they're persistent, of course, yeah. Intel got out of the memory business many years ago. Are you saying that you're getting back into it?

So now we're coming back. We have our Intel brand of solid state drives and a technology which allows you to use these solid state drives to expand your main memory and enable some of these capabilities.

Is this technology you're building yourself, or are you outsourcing it to other companies?

This is something we're building ourselves, probably in collaboration with some other parties.

I mean, Intel's approach has always been to create standards, promulgate those standards throughout the industry, let the whole thing grow, and then sell your technology into that. I presume it's a similar roadmap or playbook for this initiative, this sort of SSD-as-memory-extension?

Right.
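The mapper-to-reducer handoff he describes can be sketched in plain Python. In classic MapReduce the intermediate (word, 1) pairs would be spilled to disk between the two phases; this toy version keeps them entirely in memory, which is the shortcut Spark's model exploits. It is an illustration of the data flow, not the Spark or Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, like a MapReduce mapper.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Group and sum entirely in memory; a disk-based shuffle would
    # write these pairs out and read them back between the phases.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop needs Spark", "Spark needs Hadoop"]
counts = reduce_phase(map_phase(lines))
print(counts)
```

Because the generator feeds the reducer directly, nothing touches a drive; scale that shuffle up to terabytes and the saved disk round-trip is exactly the bottleneck the interview calls out.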
I think this is a very natural and very useful initiative, because as you want to do more in-memory types of analytics and bring that capability to Hadoop, well, nothing is really truly real time; things are pseudo-real-time or near-real-time. But in order to get there, there are certain steps you need to take. Allowing people to have these large amounts of main memory to support their Spark processes or Flink processes, that is very crucial for the future.

There's a lot of talk about Moore's law in the industry. For years people would say Moore's law is coming to an end, Moore's law is dead. But I think the industry generally has underestimated human innovation and the ability to do things, whether it's cores or whatever; the basic concept of doubling performance every 18 months has been achieved. So what do you say? What's the scuttlebutt inside Intel when you see an article in Wired magazine saying Moore's law is coming to an end, or hear it from other prominent people or your competitors? Of course, IBM doesn't sell x86 anymore, so they're happy to talk about that. What's Intel's take?

So we are believers in Moore's law. As you mentioned yourself, never underestimate the innovation and the brilliance of people; I think they will be able to come up with new ways. Even if the typical paths run into physics, once you get down toward a nanometer there are some physical constraints you cannot avoid, there may be other opportunities we will be exploring in order to address that, keep doubling the performance, and keep growing the power to support the ever-growing demand of doing things faster on more data.
I mean, Intel has marched to the cadence of Moore's law for many decades and it's driven innovation in the industry. You're increasingly investing in software. Is there a software analog to Moore's law?

Well, from the data perspective there sure is, right? The data keeps growing exponentially. Of course, unless there's a World War III, then all bets are off, but until then the data will keep growing. I think one aspect of the way the data is growing: we all know people describe big data in terms of the three Vs, volume, velocity, variety, and maybe some people add other Vs, but one important V often gets left out. It's the value. Because even though the volume of the data keeps growing exponentially from year to year, the value present in that data unfortunately doesn't grow exponentially. So you're facing situations where you have to compute and go through more and more data, much faster. That's, in a way, something you have to live up to if you want to continuously maintain a competitive edge for your business or organization.

So Hadoop is sort of this filter, this bit bucket, a filtering mechanism, if you will. Does Spark need Hadoop more, or does Hadoop need Spark?

Well, that's a tough question. I think they complement each other really well. You need distributed storage and you need distributed compute, and you need them side by side in order to do things quickly. You want to bring your compute to the data, not the data to the compute, so you want them together. From that perspective, I think it's an equal partnership. But then, let's not forget, I've heard people say that Flink is going to do to Spark what Spark did to MapReduce, so who knows?

You are a real live data scientist, and a lot of companies are hearing that they should hire data scientists right now.
On the other hand, there are those who say that data scientists will eventually become obsolete, that software will essentially do the work you do now for the end user. What's your perspective on that?

Okay, so in my vision, and this is also part of Intel's vision, you do need the right type of big data platform to analyze these humongous troves of data that arrive, and that you have. In order to do that, you need the right type of platform built. We see four key layers to that platform. The first is the infrastructure layer, unsurprisingly: your networking, storage, and compute. This is a very important layer, because if you're later running analytics on your data in an iterative fashion, and during each iteration most if not all of your data sifts through the network, you'd better have a good network backbone to support that. The layer on top of that is where you store the data. That's where you bring the data in, integrate it, munge it, and make it available for analysis. After that, once the data is readily available for analysis, you need the right set of tools to take advantage of it. But not only that, you need the right type of context: you take these analytics or algorithms, and you need the right context to apply them to the right types of data and execute. So even if, in several years, they don't call us data scientists but call us something else, there will still be a need for the person who knows which algorithm to pick in a specific context and how to apply it to these types of data to get value for the organization. So I do believe the data scientist role will continue to exist, even if it gets renamed.

Well, whatever they call you, according to Hal Varian, they'll call you sexy, right?

I'll take that.

But how do you see it evolving?
I mean, as you move up the stack, does the data scientist become more of a strategist over time and less of a technologist? How does that evolve?

It depends on the definition. There are some key things a data scientist needs to possess. The context of the business you do your analysis in is obviously very important. But also your kind of background knowledge: machine learning, data mining, graph analytics, some statistics and probability, and just being open-minded. Quite often you have to be unconventional and function outside of the box. It's not about doing different things, it's about doing things differently. You have to be innovative and always come up with new ideas. From that perspective, that's obviously an important quality to have. So those are the different components to look at when somebody's looking for a data scientist. Now, a lot of organizations fear that data scientists are not, let's say, readily available, that there's a shortage of us, something like that. Truth be told, I think it's the same as with any other profession: there are a lot of doctors, but finding a good one is challenging. For data scientists it's somewhat similar. As with any other profession, the universities and educational institutions keep producing, and have been producing, people with expertise in machine learning, data mining, graph theory, statistics and probability, all the applied math type of areas. That hasn't changed. I think the key part is the context, the knowledge of the domain you're working in, and that comes with experience.

That's a great point. So what are some questions a company looking to hire a data scientist should be asking itself before it goes out looking for people like you?
Well, I think it depends on the company. It's not an easy question to answer, as I described.

Obviously the domain is important: what do you want to do, what do you want to accomplish. But what are some other questions?

Whether they are technically astute, whether they are willing to learn. I would say that I've seen data scientists who don't assume right off that they have the answer to the problem, who have a more open, more agile mindset, do better than people who think, okay, I'm just going to fit this template to every problem. I've seen that work better. Other than that, there are a lot of standard procedures you still follow: does this individual fit into the culture of the organization, and other standard things. But from the technical perspective, I think machine learning, data mining, graph theory, those types of areas are key to have.

Just quickly, your PhD is in machine learning, and I think what machine learning is is not well understood by many people. Can you define it? What do we need to know about machine learning?

Okay, so I think there are a couple of portions to it. You have data mining and machine learning, and simply speaking, the bulk of the work done in these two areas is around either analyzing and exploring the data, and that's more on the data mining side, all your exploratory functions and algorithms that help you understand the data, and things like that. With machine learning, you try to identify specific patterns, to learn certain things that are present in the data, in order to be able to predict in the future whether something is going to work out one way or not. So I think it's all about learning patterns and understanding patterns.

All right, so we have to leave it there, we're up against the break.
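That definition of machine learning, learning patterns from data in order to predict, can be shown with one of the simplest possible learners: a nearest-centroid classifier that summarizes each class by its average point and predicts by proximity. The data and labels below are invented for the example.

```python
def centroid(points):
    # Mean of a set of 2-D points: the "pattern" learned for one class.
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def learn(labeled):
    # "Learning" here is just summarizing each class by its centroid.
    return {label: centroid(pts) for label, pts in labeled.items()}

def predict(model, point):
    # Predict the class whose learned pattern the point is closest to.
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(model, key=lambda label: dist2(model[label], point))

# Invented 2-D feature vectors for two classes of patients.
training = {
    "healthy":    [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "readmitted": [(4.0, 4.2), (3.8, 4.0), (4.1, 3.9)],
}
model = learn(training)
print(predict(model, (1.1, 0.9)), predict(model, (4.0, 4.0)))
```

The exploratory, descriptive half he attributes to data mining would be the step before this: looking at the points to see that two clusters exist at all.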
Nikola, thanks very much for coming on theCUBE. My last question, though: your role here at MIT CDOIQ, what's your take? You're giving a talk this afternoon. What's your impression of what's going on here?

So far it's a great event. This is my second time attending, so I'm really happy to be here. I really enjoyed the keynote speaker this morning, a very interesting body of work that he presented; I was really impressed by that. Overall, I think we have a really great audience, great attendees, and I look forward to the rest of the event.

That's great. Well, thanks to you for coming on theCUBE. We really appreciate your time and your insights.

All right, keep right there. We'll be back with our next guest right after this. This is theCUBE. We're live from MIT CDOIQ in Cambridge. Right back.