Live from the San Jose Convention Center, extracting the signal from the noise, it's theCUBE, covering Hadoop Summit 2015, brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, and by Attunity. Now your hosts, John Furrier and George Gilbert.

Okay, welcome back everyone. We are here live in Silicon Valley at Hadoop Summit 2015 at the San Jose Convention Center. This is SiliconANGLE's theCUBE, our flagship program, where we go out to all the events and extract the signal from the noise and share that with you. I'm John Furrier, the founder of SiliconANGLE, here with my co-host, George Gilbert, Wikibon's new big data analyst; look for his reports out there, a lot of great research coming in. Our next guest is Matthew Hunt, R&D chief at Bloomberg. R&D is the organization inside Bloomberg into which all of its development falls. Welcome back to theCUBE.

Thank you.

Some 4,000 engineers. So Bloomberg is a pretty successful media company because of its technology. I talked to a guy just the other night: if you're a trader and you're not connected to Bloomberg, you're behind; speed is of the essence. You guys have obviously shown that for many years, but staying up to date is critical for you. So what are you looking at right now? People use Bloomberg for sourcing information, certainly news, et cetera, but the key is traders: they want the edge, they want the data fast.

You know, speed is obviously a part of it, but really the way to think of us is as an information and data company driven by technology. There are more than 300,000 people with Bloomberg terminals that they use all day, every day, as an essential part of their trade. And what keeps that going is that we provide a variety of information and analytics that they see as essential. And so what do we see and what do we work on?
And what are our concerns? Well, part of that is looking at the size, scope, and scale of our infrastructure. We have several hundred thousand applications supported inside the Bloomberg terminal, backed by tens of thousands of databases, everything from pricing and risk and portfolio analytics to news and storm tracking. And how can we simplify the ways in which we provide that and make our systems more powerful?

I think you guys, to me, are the poster child. We were inspired by Bloomberg at SiliconANGLE: data-driven as well, and no real advertising on our site. We do sponsorships for theCUBE, but for the most part a lot of our data comes in off social data. So we see the value of data, but what I love about Bloomberg is that you really are a shining example of a company that has been successful with data and information. At the same time, that's the pressure point of what people want today: real-time, in real apps. But you have legacy, too. You have to change the airplane engine out at 35,000 feet without crashing. So how do you do that critical part? How do you deliver real-time on legacy infrastructure that was optimized for something else, and how do you make the transition to modern infrastructure?

Right, well, that's obviously something we think about all the time. It's a hard and sophisticated problem, and if you look at Bloomberg, this is not a new problem for us. We started 30 years ago building physical hardware terminals, and we've had to morph our systems over time as new and more powerful technologies became available. But of course, you also have to support the older systems that you have. And one of the powerful and exciting things for us now is the power of consolidation. It's really a vista that's only opened up fairly recently.
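The consolidation being described, serving real-time ticks and historical queries from one system instead of two, can be sketched minimally in Python. This is a toy illustration with hypothetical names, not Bloomberg's actual architecture: a single store that merges a live in-memory buffer with historical data at query time.

```python
from bisect import insort
from collections import defaultdict

class UnifiedTickStore:
    """Toy consolidation of a 'ticker plant' (real-time) and a
    historical store behind one query interface (illustrative only)."""

    def __init__(self):
        self.history = defaultdict(list)  # symbol -> sorted [(ts, price)]
        self.live = defaultdict(list)     # recent ticks not yet flushed

    def ingest(self, symbol, ts, price):
        # Real-time path: append into the in-memory buffer, kept sorted.
        insort(self.live[symbol], (ts, price))

    def flush(self):
        # Periodically fold the live buffer into the historical store.
        for symbol, ticks in self.live.items():
            for tick in ticks:
                insort(self.history[symbol], tick)
        self.live.clear()

    def query(self, symbol, start_ts, end_ts):
        # One query spans what used to be two systems:
        # merge historical and still-live ticks in the window.
        ticks = self.history[symbol] + self.live[symbol]
        return sorted(t for t in ticks if start_ts <= t[0] <= end_ts)
```

In a real deployment the two paths would be distributed across a cluster; the point of the sketch is only that one interface can now front both workloads.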
It was once the case that the systems you needed for real-time price ticking, ticker plants, for example, versus historical analytics were two completely different systems, driven by their performance requirements. I could extend that out to dozens and dozens of other systems, but they don't have to be separate anymore. And part of what we look at is how we can combine and consolidate a lot of this.

So consolidation, not so much from a cost perspective; it's really both cost and, mostly, functionality, you're saying. You can actually get the best of both worlds.

It's all of the above, in theory, right? And the question is, you know, in theory, theory and practice agree; whether they do in practice is always the problem. Obviously functionality is important, and this is related to complexity. The more systems you have to support, the harder and more complex they are, and complexity everywhere kills. If you can have fewer, faster, simpler systems, they're easier to support and easier to change. And so that helps you develop new things in addition to reducing the burden of the older stuff.

But you said something really key. In the past, you needed real-time and you needed historical. One gave you the context and one gave you the freshness.

Yep.

What makes it possible to combine them now?

Faster machines and better software, essentially.

Can you give us, well, the faster machines we can grok. What's the software?

Sure. So, you know, it used to be the case that some of these systems, say at a hundred terabytes, were very, very difficult to build. These are still problems too large to fit comfortably on a single machine, but they're managed readily by modest clusters. Now, writing distributed software is hard if you don't have a framework to address that for you. Bloomberg historically developed all of its own software, from databases to network protocols, because at one point it had to.
And what we see is commercial and open alternatives continuing up the stack in terms of capability and functionality. And the reason these systems can be combined is that now there are frameworks that allow distribution without us having to design and implement the whole thing.

And what are some of the ones, I'm assuming you haven't standardized on one, but what are some of these distributed platforms, these frameworks, that you'd use for different use cases that combine the real-time and the historical?

Sure: Spark, obviously, HBase, Solr. We have the largest Solr deployment, I think, in the world at this point. These are all open systems that we've been able to adapt and strengthen for our purposes. And in fact, we don't really see them as separate things; they're part of a broader whole.

I was actually going to ask about Spark and HBase, or Spark versus HBase, in the sense that with HBase you're building a pipeline with different components in the Hadoop ecosystem, whereas with Spark you're just sort of calling the different personalities of that same engine. Are you using the two systems that way, or are you using Spark as a discrete workload in a Hadoop pipeline?

Right, so we don't see these as having neat, clean lines between them. Some of these divisions are created by the marketing concerns of the companies involved. We see them as very complementary parts of a broader whole. Spark is a distributed computing platform that also unifies a number of kinds of functionality that required different systems in the past, from streaming to in-place analytics. On the other hand, where does the data come from, right? You still need a database. So a distributed database is an excellent companion to Spark. And then you still need storage.

And you still need storage, right?

You need a file system. Those are the irreducible parts of a distributed computing framework.
You need storage, a place to put stuff and read things from; you need a database; and you need a way to do computation. And so for us, it's part of a big picture.

So talk about the storage in that whole stack, because those critical building blocks enable a lot of innovation. But where you store the data, where the data gets locked in, so to speak, has been a contentious topic; we've had it come up all week. Does the app dictate the tooling, and the tooling dictate the storage? And what if I want to move? I don't want to have to replicate data across multiple things. So if I have a Hadoop world, say, hey, I love Hadoop and everything, but I might have Spark sitting on, say, Mesos down the road. I have no idea, I just made that up, but let's say that happened. I don't want to replicate the data, but if I store it on a commodity drive, I'm good, right? Or is there a better architecture? These are the complexities we're trying to tease out. What's your take on that strategy or architecture?

Right, so the problem is slightly different for us than it might be for a smaller firm. I mean, if you think about our problem, we have nearly 4,000 engineers, and so part of our challenge is designing our infrastructure and architecture to abstract as much of the complexity as possible away from the everyday developer. You shouldn't have to know details about the internals of Hadoop or Spark in order to get your job done. And that's really the purpose of infrastructure. Infrastructure, if well designed, again, in theory, lets you swap different components in and out. So if you wanted to use Mesos instead, you could still make that work. Easier said than done, but that's how it ought to work. Sounds really swell on paper, and the question is how close you can get to that in reality.

Well, I've got to ask you, hold on just one second, while we're on that point, on programmable infrastructure, this is interesting.
So I have an infrastructure abstraction layer that abstracts away some of the complexities. But now, with DevOps and infrastructure as code, I want you to comment on that trend. Maybe you guys are looking at it that way, but with virtualization, with containers, now I'm a developer. I agree developers shouldn't have to write extra code to do stuff, but they should be able to interact with the infrastructure. So this is where the infrastructure-as-code argument comes in. Okay, what does that look like? So, okay, abstract it away. Is it just function calls? Am I pushing? Am I polling? That's an interesting area that's emerging. What are your thoughts on this whole programmable, programmatic infrastructure?

Wow, so that's a lot of topics. I'll try to take them in turn as best I can remember them, and if I forget one, you'll obviously prod me. So infrastructure as code, and DevOps, is something that's key to us. That's about the deployment side of the problem. You get a lot of machines; they come in. How do you set up, configure, and manage what's running? It's easy to set up a small number of boxes, but when you have 10,000 machines and they're changing all the time, how can you manage and stay on top of that? You need a way to express your infrastructure in a reproducible and testable way. Now, obviously, we're big believers in Chef, though there are other systems that can accomplish the equivalent, but you have to pick one. Then there's the question of what your strategy is to run on. Are you deploying to bare metal? Are you using virtualization? Or are you using containers? There are strengths and weaknesses to all three, and we actually use all of the above. Containerization and virtualization are conceptually identical; they have different performance characteristics and maturity. Containers have less overhead but are also less mature in terms of their security infrastructure. So there's a trade-off.
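The "reproducible and testable" property he attributes to infrastructure as code comes from idempotent convergence: you declare desired state, and the tool acts only when actual state differs, so running it twice is safe. A minimal Python sketch of that idea follows; the class and function names are hypothetical, and real tools like Chef or Puppet implement this at far greater scope.

```python
class FileResource:
    """Toy Chef-style resource: declared desired state for one file,
    converged idempotently (illustrative; not Chef's actual API)."""

    def __init__(self, path, content):
        self.path = path        # where the file should live
        self.content = content  # what it should contain

    def converge(self, node_state):
        # Act only when actual state differs from desired state,
        # so repeated runs produce the same result (idempotence).
        if node_state.get(self.path) == self.content:
            return "up to date"
        node_state[self.path] = self.content
        return "changed"

def converge_all(resources, node_state):
    # A 'run' applies every declared resource and reports what changed;
    # the same run list can be replayed on 10 or 10,000 nodes.
    return [r.converge(node_state) for r in resources]
```

Because the second run reports "up to date" everywhere, the same description can be tested in a VM or container before it ever touches production machines, which is exactly the verification workflow he describes next.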
So we do a lot of OpenStack, we're big OpenStack supporters, and we also obviously have Docker and Mesos and lots of bare metal. And we use a lot of virtualization for testing. How do you verify that your infrastructure-as-code scripts run? You use VirtualBox and large, powerful machines.

So the answer is, it's a big question, and the thinking is still developing. There are a lot of different use cases, too many to just throw one general abstraction layer at. There's no single abstraction; there are abstraction layers.

So it's actually both. One is the physical deployment layer to all these machines, right? The other half of that is: what is a developer's interaction with, and perception of, that?

Yes.

So obviously, for testing, they can request virtual machines or containers and use those, or sometimes they have a set of machines they're given. But we also provide a lot of APIs. How do you listen for an incoming tick stream? We have an API for that, right? And higher-level languages for performing financial calculations. And so that's the other side: the deployment layer is the bottom half, but you need the top half too in order to have an appropriate level of abstraction.

I want to ask about that top layer. Now that you have a new generation of distributed infrastructure, and it raises the level of what a developer has to think about, is it changing what Bloomberg as a company is, and can deliver, in terms of value to customers? What you could do five or ten years ago versus what you can do today: does that change your mission? Does that change the value you deliver?

Right. So it doesn't change our mission. Our mission is providing information and news to our industry, and I don't think that mission is going to change. What it does mean for us is more and faster and better, right? So we can do ever more powerful things.
And you can see that throughout the arc of our company. We started off by providing single-security analytics: anything you wanted to know about a stock or bond or an option or another instrument, you could learn about through the Bloomberg, which was revolutionary, since that information wasn't electronic prior to us. But things like analyzing very large portfolios and risk calculations are much more intensive, or storm tracks and which companies in the supply chains in a storm's path are affected. These are much more powerful analytics in terms of the compute power required, and we keep being able to provide more and more of those things.

When you talk about storm tracking there, do you mean you're looking at what's going on in the physical world that would impact a security or portfolio?

Absolutely.

So this is where, I guess, the data scientists are pulling in different data sets and coming up with richer and richer models. But if you're doing that work, to whom do the profits accrue: to you, or to the hedge funds who apply those models?

Well, in general it's not a zero-sum game, right? People pay us for our service, and so we profit. And if we provide better information to our customers, they profit too.

But that type of calculation sounds like what they used to do, broader than a single security. Long-Term Capital Management: let's calculate risk out of everything. It didn't quite work out that way, but they had a broad model of the world, and it sounds like your model is going beyond just a single security.

We provide many things that go beyond a single security, starting with portfolio analytics, which is the group I work in most of the time. But that's not the same as competing with hedge funds on their own algorithms. We're a provider of news and information.

You're a supplier to those hedge funds.

We're a supplier.

Yeah, and you have real-time users.
So your job right now, and I can grok this, is you have a business that works. The model's great. You've got real-time quotes and all the data in the official business. But now your data sources for real-time significantly increase; you can take in more inbound real-time data sources. The weather thing is a great example, right? Okay, you have the weather data; now you can look at the impact, say economic or regional, and you can get more data sources back into Bloomberg. So I guess the question is, what is your choice of technologies for that? Is it a web app, native, cloud? I mean, I can just see Bloomberg saying, oh my God, we can bring all this into our model: external data, other data sources. So you've got to unify the data and do all those kinds of things. I might be oversimplifying it, but what do you see there? How do you attack that problem technically?

Yeah, so we have a huge volume of inbound data, and we have lots of databases for various kinds of analytics. And what can we use to unify those? We see a lot of the distributed systems being developed here as a big potential part of the solution. It's not necessarily everything; we're large enough that we have essentially a little bit of every system.

Are you migrating in a certain direction in terms of the analytic foundation that you're using?

We're big backers of HBase and Spark and Solr, and we think the parts of the ecosystem you see being developed here are definitely part of our long-term solution.

And what would you like to see in HBase and Spark and Solr over the next few years that would keep that migration moving, or even accelerating?

Right, we're very excited by the trends and the pace of improvement we've seen to date. I guess one of the important things is just watching the interfaces between these products improve, and their ease of use and interactions strengthening.
Most of these products were designed for a very specific, more limited audience: obviously, I need to index the internet without going bankrupt, if I'm Google. And sometimes this argument is oversimplified by people saying, oh, it was all developed for batch processing, which is actually not true; Bigtable wasn't developed for batch at all, and that was part of the initial versions of these systems too. But the classic case, the five companies that have hundreds of petabytes to process, is not the most common use case. There are many people that have 100 terabytes or 10 terabytes, problems too big to fit on a single machine. So you have to have a cluster, and thus you have all the problems related to a cluster: how do I distribute my load? Where do I store things? How do I manage failover? Is it consistently fast, all the time, on every machine? How efficient is it? How easy is it to set up and use? Do these things work well together? There's more progress to be done on all of these fronts.

All these things are classic enterprise software hardening.

Yep, that's a great way to put it.

Yeah. I want to get your thoughts; we have a couple of minutes left, two minutes. Share with the folks out there who are watching, or who are in the enterprise. They might not be on the bleeding edge like Bloomberg; you guys are touching everything, kicking all the tires, 4,000 engineers, an impressive operation, doing all kinds of great stuff. A lot of folks out there consolidated in the '90s and have nothing left but a skeleton crew and some outsourced stuff, and that's now being reinvested. So that's the old story: I've got to hire more developers, I have to essentially transform my enterprise IT and complete global infrastructure, from consolidated bare bones to a big billion-dollar budget, and I've got to keep executing and running. What's your advice to the guys out there dealing with this environment?
The Hadoop ecosystem, the emergence of the cloud, native mobile apps on the horizon: this could change their business. What's your advice to them?

Yeah, some of the things are obvious. If you're a startup, use the cloud, not necessarily because it's cheaper in the long term, but for the advantage of simplicity in management; I think lots of people are seeing that, and the holdouts are the people who have regulatory problems, say you're in the healthcare business and you have HIPAA restrictions. Second, understand the problem you're trying to solve. There are lots of problems that can be addressed quite simply through technologies that have been around forever; not everybody needs some kind of magical whiz-bang thing. And really, this goes back to the enterprise hardening: how do you make it all just work and make it simple to set up and use? That's a big part of the direction. You shouldn't have to know 500 different pieces.

Don't overcomplicate it.

Don't overcomplicate. Complexity kills, no matter who you are, so keep it simple.

And the whole DevOps thing, just baby steps, how to approach POCs, you guys probably do a lot of that. Same kind of advice: get small wins under your belt, move it along.

Yeah, I mean, it's all about making things work in production, in reality. That's what engineering is all about, versus fancy hand-waving PowerPoints. Try it out, see if it works for you; if it doesn't, that's all right. An advantage of a lot of these DevOps-related systems is that they're a lot easier to use than they once were. They're taking off really quickly, and they're a lot easier to figure out and learn. You can take a Docker tutorial in a day; you can go through a Chef tutorial or a Puppet tutorial quite quickly. And if you have more than five machines to manage, it's worth the time.

Final comment, I'll give you the last word: what's this year's show about? Share with the folks who aren't here what's going on this year.
What's it about? What's the walk-away vibe, the summary of the discussion?

Continued enterprise hardening, and the future of where these systems are going, HBase 2 and 3, right? So where is the future leading, and how are these systems becoming integrated? And, as you can see, the continued rise of Spark and of analytical systems merging into the ecosystem.

The fog is lifting; you're starting to see the straight and narrow now.

Yeah, I think it's much clearer now to most people where this is all heading and how it's going to come together. The way in which Hadoop works may change, which parts are in it, but we'll continue to call it Hadoop, and it's part of the future.

It's just evolving. It's beautiful: real growth, Hadoop growth, standardization, great development, applications, workloads, just a great confluence of cloud, mobile, and data. We'll be right back with more from theCUBE after this short break. Stay tuned.