The Cube presents On the Ground.

Hi, I'm Peter Burris. Welcome to an On the Ground here at Oracle Headquarters with SiliconANGLE Media, The Cube. Today we're talking to JP Dykes, who's one of the master product managers inside Oracle's big data product group. Welcome, JP.

Thank you, Peter.

Well, we're going to talk about how developers get access to this plethora, this miasma, this unbelievable complexity of data that's being made possible by IoT, traditional applications, and other sources. How are developers going to get access to this data?

That's a good question, Peter. I still think that one of the key aspects to getting access to that data is SQL. And so that's one of the ways we're driving this, right? Trying to figure out: can we take the Oracle SQL engine and all the richness of SQL and analytics and enable it on all of that data, no matter what the format is, no matter where it lives? How can I enable those SQL analytics on that? And then obviously we've all seen the shift in APIs and in languages; people don't necessarily always want to speak SQL and write SQL queries. So how do we then enable things like R? How do we enable Perl? How do we enable Python? All sorts of things like that. How do we do that? And so the thought we had is, can we use SQL as the kind of common metadata interface and the kind of common structure around some of this, and enable all these languages on top of that through the database? So that's kind of the baseline of what we're thinking of in enabling this for developers and large communities of users.

So that's SQL as an access method. Do you also envision that SQL will be a data creation language as we think about how to envision big data coming together from a modeling perspective?

So I think from a modeling perspective, the metadata part, we certainly look at it as a creation language, or a definition language is probably the better word, right? How do I do structured queries? Because that's what SQL stands for. How do I do that on JSON documents? How do I do that on IoT data, as you said? How do I get that done? And so we certainly want to create metadata very much like a traditional database catalog, or if you compare it to a Hive catalog, very much like that. The execution is very different, right? It uses the mechanisms under the covers that NoSQL databases have or that Hadoop and HDFS offer. And we certainly have no real interest in doing an insert into Hadoop, because the transaction mechanisms work very, very differently, right? So it's really focused on the metadata areas: how do I expose that? How do I classify and categorize that data in ways people know and have seen for years?

So the data manipulation will be handled by native tools, but some of the creation, some of the generation, some of the modeling could now be handled inside SQL, and there are a lot of SQL folks out there who have a pretty good affinity for how to work with data.

That's absolutely correct.

So that's what it is. Now how does it work? Tell us a little bit about how this Big Data SQL is going to work in a practical world, okay?

So we talked about the modeling already. The first step is that we extended the Oracle Database catalog to understand things like Hive objects or HDFS, kind of where does stuff live, right? So we expanded that, and we found a way to classify the metadata first and foremost. The real magic is then in leveraging the Hadoop stack.
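To make that catalog idea concrete, here is a minimal sketch, assuming a Big Data SQL-enabled Oracle Database reachable through the python-oracledb driver. The connection details, table name, and Hive object name are hypothetical, and the access parameters are illustrative rather than exact syntax.

```python
# A minimal sketch: register a Hive-backed external table in the Oracle catalog,
# assuming a Big Data SQL-enabled database. Names and credentials are hypothetical,
# and the access-parameter syntax is illustrative.
import oracledb

conn = oracledb.connect(user="demo", password="demo_pw", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Only metadata lands in the Oracle catalog; the rows stay in the Hive table
# on the Hadoop cluster and are read there at query time.
cur.execute("""
    CREATE TABLE sensor_readings (
        device_id    NUMBER,
        reading_ts   TIMESTAMP,
        temperature  NUMBER
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_HIVE
        DEFAULT DIRECTORY DEFAULT_DIR
        ACCESS PARAMETERS (com.oracle.bigdata.tablename=iot.sensor_readings)
    )
    REJECT LIMIT UNLIMITED
""")
```

Once a definition like this exists, the table can be queried and joined like any other table in the catalog, which is the "common metadata interface" described above.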
So you ask a BI question and you want to join data in Oracle, transactions, finance information, with, let's say, IoT data, which you're going to reach out to HDFS for. Big Data SQL runs on the Hadoop nodes. So it's local processing of that data, and it works exactly as HDFS and Hadoop work. In other words, I'm going to do the processing locally, I'm going to ask the name node which blocks am I supposed to read. That all gets run: we generate that query, push it down to the Hadoop nodes. And that's when some of the magic of Big Data SQL kicks in, which is really focused on performance. It's performance, performance, performance. But that's always the problem of federated data. How do I get it to perform across the board? And so what we took is-

Or predictably.

Predictably, that's an interesting one. Predictable performance, because sometimes it works, sometimes it doesn't. So what we did is we took the Exadata storage server software, with all the magic as to how do I get performance out of a file system, out of IO, and we put that on the Hadoop nodes. And then we pushed the queries all the way down to that software. And it does filtering, it does predicate pushdown. It leverages formats like Parquet and ORC on the HDFS side. And at the end of the day, it kind of takes the IO request, which is what a SQL query gives, feeds it to the Hadoop nodes, runs it locally, and then sends it back to the database. And so we filter out a lot of the gunk that we don't need, because you said, oh, I only need yesterday's data, or whatever the predicates are, right? And so that's how we think we can get an architecture ready that allows the global optimization, because we can see the entire ecosystem in its totality: IoT, Oracle, all of it combined. We optimize the queries, push everything down as far as we can, right? Algorithms to data, not data to algorithms. And that's how we're going to run this performant, predictably performant, on all of these pieces of data.

So we end up with, if I got this right, let me just kind of recap. We've got this notion that for data creation, data modeling, we can now use SQL, understood by a lot of people; that doesn't preclude us from using native tools, but at least that's one place where we can see how it all comes together. We continue to use local tools for the actual manipulation elements. We are now using SIMD-like structures so that we can push the algorithm down to the data. So we're moving a small amount of data to a large amount of data, because that brings the cost down and improves the predictability, but at the same time we've got metadata objects that allow us to anticipate with some degree of predictability how this whole thing's going to run and how it's all going to come together back at the database. Got that right?

Got that right.

All right, so the next question is, what's the impact of doing it this way? Talk a bit, if you can, about how it's helping folks who run data, who build applications, and who actually are trying to get business value out of this whole process.

So if we start with the business value, I think the biggest thing we bring to the table is simplicity and standardization, right? If I have to understand how this object is represented in NoSQL, how in HDFS, how did somebody put a JSON file in here, I now have to spend time literally digging through that, and then does it conform? Do I have to modify it? What do I do? So I think the business value comes out of the SQL layer on top of it. It all looks exactly the same. It's well-known, it's well-understood.
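As an illustration of what "it all looks exactly the same" means in practice, here is a minimal sketch of the kind of federated join described above, assuming the hypothetical sensor_readings external table from the earlier sketch and a hypothetical local orders table. The date filter is the sort of predicate that can be pushed down to the Hadoop nodes so only the needed rows come back to the database.

```python
# A minimal sketch of a federated BI query, assuming python-oracledb and the
# hypothetical tables described above (orders in Oracle, sensor_readings in Hive/HDFS).
import oracledb

conn = oracledb.connect(user="demo", password="demo_pw", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# An ordinary SQL join; the filter on reading_ts is the kind of predicate that
# can be evaluated on the Hadoop side, so only yesterday's matching rows
# travel back to the database.
cur.execute("""
    SELECT o.customer_id,
           COUNT(*)           AS readings,
           AVG(s.temperature) AS avg_temp
    FROM   orders o
    JOIN   sensor_readings s ON s.device_id = o.device_id
    WHERE  s.reading_ts >= TRUNC(SYSDATE) - 1
    GROUP  BY o.customer_id
""")
for customer_id, readings, avg_temp in cur:
    print(customer_id, readings, avg_temp)
```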
So it's far quicker to get from "I've got a bunch of data" to actually building a BI report, building a dashboard, building KPIs, and integrating that data, right? And there's nothing new to this. It's a level of abstraction we put on top of this, whether you use a REST API or, in this case, SQL, because that's the most common analytics language. So that's one part of how it will impact things. The second is, and I think that's where the architecture is completely unique, we keep complete control of the query execution based on the metadata we just talked about. And that enables us to do global optimization, right? And if you think this through a little bit and you go, okay, global optimization sounds really cool, what does that mean? I can now actually start pushing processing around, I can move data, and it's what we've done in the Exadata platform for years. Data lives on disk; oh, Peter likes to query it very frequently, let's move it up to flash, let's move it up into memory, let's twist the data around. So all of a sudden we've got control, we understand what gets queried, we understand where data lives, and we can transparently start to optimize exactly for the usage pattern a customer has. And that's always the performance aspect. And it kind of goes to, I think, the old saying of, how can I get data to an end consumer as quickly as possible, when they really need it? That's what this does, right? How can I optimize this? If I've got thousands of people querying certain elements, we move it up in the stack, get the performance, and all these queries come back in sub-seconds. Regulatory stuff that needs to go back to five years of data, let's put it in cheaper areas and optimize for that. And so the impact is cheaper and faster at the end of the day. And all because there's almost a singular entity that governs all the data, that governs the queries, that governs the usage patterns. I think that's what we uniquely bring to the table with this architecture.

So I want to build on the notion of governance, because one of the interesting things you said was the idea that if it's all under a common set of interfaces, then you have greater visibility into where the data is, who owns it, et cetera. One of the biggest challenges that businesses are having is the global sense of how you govern your data. If you do this right, are you that much closer to having competent overall data governance?

I think we were able to take a big step forward on it. And it sounds very simple, but we now have a central catalog that actually understands what your data is and where it lives, in kind of like a well-known way. And again, it sounds very simple, but if you look at silos, that's the biggest problem. You have multiple silos, multiple things are in there, nobody really knows what's in there. And so here we start to publish this into a common infrastructure layer. We have all the technical metadata, we track who queries what, who does all of those things. So that's a tremendous help in governance. On the other side, of course, because we still use native tools to, let's say, manipulate some of the data or augment or add new data, we're now going to tie a lot of the metadata that comes from, say, the Hadoop ecosystem back into this catalog. And while we're probably not there just yet today on end-to-end governance, everything out of the box, here we go.

And probably never will be.

We probably never will, you're right, right?
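As a small illustration of treating the database catalog as that single, well-known place to look, here is a minimal sketch, assuming python-oracledb and the standard Oracle data-dictionary views. It simply lists each external table a user can see and which access driver serves it; connection details are hypothetical.

```python
# A minimal sketch: use the Oracle data dictionary as a consolidated view of
# which tables are served by which access drivers. Credentials are hypothetical.
import oracledb

conn = oracledb.connect(user="demo", password="demo_pw", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# ALL_EXTERNAL_TABLES records every external table visible to the user,
# including the access driver type (e.g. ORACLE_HIVE, ORACLE_HDFS, ORACLE_LOADER).
cur.execute("""
    SELECT owner, table_name, type_name
    FROM   all_external_tables
    ORDER  BY owner, table_name
""")
for owner, table_name, type_name in cur:
    print(f"{owner}.{table_name} -> {type_name}")
```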
And I think we took a major step forward by just consolidating it and exposing people to all the data they have. And then you can run all the other tools, like crawl my data and flag anything that says SSN or looks like a Social Security number, right? All of those tools are still relevant. But just having the consolidated view dramatically improves governance.

So I'm going to throw you a curveball. Not all the data I want to use is inside my business or is being generated by sensors that I control. How do Big Data SQL and related technologies play a role in the actual contracting for additional data sources and sustaining those relationships that are very, very fundamental to how data is shared across organizations? Do you see this information being brought in under this umbrella? Do you see Oracle facilitating those types of relationships, so that introducing standards for data sharing across partnerships becomes even easier?

I'm not convinced that Big Data SQL as a technology is going to solve some of the problems we see there. I'm absolutely convinced that Oracle is going to work towards that. You see it in some of the acquisitions we've done. You see it in the efforts of making data as a service available to people. And to some extent, I think Big Data SQL will be a foundation layer to make BI queries run more smoothly across more and more pillars of data. If we can integrate database, Hadoop, and NoSQL, there's nothing that says we can't also say, oh, and by the way, storage cloud.

And if you have relatively common physical governance, then if I have the same physical governance and you have the same physical governance, now it's easier for us to show how we can introduce governance across our instances.

Absolutely. And today we focus a lot on HDFS or Hadoop as the next data pillar, right? Storage cloud, ground to cloud, all of those are completely on the roadmap for Big Data SQL to cover as well. And so if you have data as a service, let's call that cloud for a second, and I have data in my database and in my Hadoop cluster, again, it all now becomes part of the same ecosystem of data, and it all looks the same to me from a BI query perspective, from an analytics perspective. And then how do I get data sharing standards set up and all of that? Part of that is driving a lot of it into cloud and making it all as a service, because again, you put a level of abstraction on top of it that makes it easier to consume, to understand where it came from, and to capture the metadata.

So JP, one last question. Oracle OpenWorld is on the horizon. What are you looking for, and what should customers be looking for, as it pertains to this Big Data SQL and related technologies?

I think specifically from a Big Data SQL perspective, we're going to drive the possible adoption scope much, much further. Today we work with HDFS and we work with Oracle Database. We're going to announce certain things, like Exadata to commodity Hadoop will be supported, we'll have SuperCluster support, and we're going to dramatically expand the footprint Big Data SQL will run on. For people who come to the Big Data SQL or analytics sessions, you'll see a lot of the roadmap looking much further forward. I already mentioned some things like ground to cloud: how can I run Big Data SQL when my Exadata is on-premises and the rest of my HDFS data is in the cloud? We're going to be talking about how we're going to do that and what we think the evolution of Big Data SQL is going to be.
I think that's going to be a very fun session to go to.

JP Dykes, a master product manager inside the Oracle big data product group. Thank you very much for joining us here On the Ground at Oracle headquarters. This is theCUBE.