Okay, we're back. This is Dave Vellante. We're live in New York City at Strata Conference plus Hadoop World. This is Dave Vellante of Wikibon.org, and I'm here with Jeff Kelly, who's the big data analyst at Wikibon. He's been called the number one big data analyst in the industry, and we're going to focus the discussion here on some of the practical applications: how customers are actually applying technology to get a business capability and create business value. I'm here with Venky Rangashari, who is the global head of development at Thomson Reuters. Welcome, Venky. On your card it says you're head of QED Velocity Analytics. Everybody's talking about speed, real time, velocity. So before we get into that, tell us a little bit about Thomson Reuters. I mean, everybody knows the company, but set it up, and then your role there.

Sure. We work with the financial and risk business unit at Thomson Reuters. We get data from 400 stock exchanges, currency markets, commodity markets. We aggregate and consolidate that data and provide it as a data feed to our customers, but we also provide analytics on top of it. Thomson Reuters has been in this business for a long, long time, the last few hundred years, and the financial and risk business unit, which I'm a part of, focuses on real-time market data for financial services companies.

Okay, and can you talk a little bit more about your role and your team?

Yeah, I'm part of the content technology organization. I head up the analytics business, so it's a software-as-a-service offering for analytics on demand. Customers such as banks, brokerage houses, and hedge funds, when they look to do their pre-trade and post-trade analysis and their investment strategies, use our solution to get to the financial analytics. So look at analytics as a set of statistical functions that a trader would use to determine their financial investment strategy. At the base we have a large content database that ingests the data in real time from the stock market, and then we put our analytics solutions on top of it.

So let's talk about Hadoop. We're here at Hadoop World. This is the third year theCUBE has been here. I think it's four years of Hadoop World. It's growing. You're based here, right?

I'm based in Silicon Valley.

You're on the West Coast. Okay. Thomson Reuters is headquartered here, of course, right. So we had Mike Olson on before. He said the first Hadoop World was 500 people, then 800, which was the first year we came, then last year was 1,400 or 1,500, and now it's closer to 3,000. So it's really starting to grow. But you guys were early on into the Hadoop space. Talk about that. Take us back a little bit, you know, from sandbox to production.

So I'll take you through what we used originally, right? In the context of financial data, the type of data we deal with is called time-series data. This is time-delimited data; look at it as stock quotes, purchases, any information which is called ticker data, right? And relational databases didn't solve this kind of problem. You look at the rates of data that we get: we typically get data that peaks at about two to three million messages per second, also called ticks per second. At that data rate, relational databases don't scale.
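To make the tick-rate point concrete, here is a minimal Python sketch, not Thomson Reuters' actual schema, of the append-only, time-partitioned layout that time-series tick stores typically use. The Tick fields and the (symbol, day) partitioning key are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, date

@dataclass
class Tick:
    """One market-data message ('tick'): a trade or quote at a point in time."""
    symbol: str
    ts: datetime
    price: float
    volume: int

class IntradayTickBuffer:
    """In-memory intraday buffer, partitioned by (symbol, day).

    Ingestion is a pure append, with no index maintenance or transactional
    overhead, which is why layouts like this can keep up with millions of
    messages per second where a normalized relational schema would not.
    """

    def __init__(self) -> None:
        self._partitions: dict[tuple[str, date], list[Tick]] = defaultdict(list)

    def ingest(self, tick: Tick) -> None:
        # Ticks arrive roughly in time order, so appending preserves order.
        self._partitions[(tick.symbol, tick.ts.date())].append(tick)

    def query(self, symbol: str, day: date) -> list[Tick]:
        return self._partitions[(symbol, day)]
```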
So 13 years back, we built a proprietary database. One of the companies that Thomson Reuters acquired on the West Coast had built its own proprietary database, which stored the data in a very flat-file type of format, very optimized from an ingestion and query perspective. So that's kind of the background. When you have your own proprietary database, you also have your share of problems: scalability, improving it. We're not an Oracle or a Microsoft with SQL Server; we couldn't staff at that level to do it. And at that time, we didn't have as many options out there. So slowly, over the last couple of years, we've been looking at Hadoop, at Cassandra, and at the other NoSQL and big data players. We did a couple of experiments and were pleasantly surprised by the results we got.

Okay, so essentially you use Hadoop for batch, and you use Cassandra for real time now.

We use Cassandra for real time, and we use Hadoop for all our historical data. You can segment the data into two pieces, right? You have near-real-time data, which is zero to six months of data, and then you have historical data, which is six months to 12 years of data. We use Cassandra for a lot of the real-time applications. And then you also have historical data that people run queries across, spanning multiple years; we use Hadoop for that.

Interesting. Could you add a little color to that, maybe with some examples? What are some of the real-time applications you're doing on Cassandra? And then maybe we can tie that to the more batch-type, deep historical analytics, and how they complement one another.

One of the prime reasons we chose Cassandra was its ingestion capability. When you get data in real time, you take it or you lose it, right? You get data at two or three million messages per second, and the first piece of the puzzle is to ingest that data and put it in a database. When we tested Cassandra against other NoSQL stores and even Hadoop, Cassandra had a unique capability of being very optimized for insertions. When we tested our ingestion rates, we were able to get loads in the hundreds of thousands of inserts per second, even with a nine- to ten-node Cassandra cluster. That was almost twice our current capacity, so that was part of the reason we chose Cassandra. The other part is how efficiently we can provide analytics on top of that. The data is distributed across multiple nodes, and when a quantitative analyst in a banking house runs a function, say a volume-weighted average across a bucket of stocks like the NASDAQ 100 for the last six months, we provide that using Cassandra. If somebody says they want it over the last two years, then we have to take some data from Cassandra and some data from Hadoop and stitch it into one query. Most of the data queried falls within the last six months. And keep in mind the way we do our data strategy: the intraday data is in memory. In terms of size, a day of intraday data from the New York Stock Exchange is about 400 gigs. So we hold the data in memory and do batch writes into Cassandra. The eventual consistency that Cassandra offers is ideal for our use case, right?
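To make the query stitching concrete, here is a minimal Python sketch of how a volume-weighted average price (VWAP) request spanning both tiers might be served, assuming a six-month cutoff between the two stores. The recent_store and historical_store interfaces are hypothetical stand-ins for the Cassandra and Hadoop sides; this is not Thomson Reuters' actual code.

```python
import itertools
from typing import Iterable, Tuple

Tick = Tuple[float, int]  # (price, volume)

def vwap(ticks: Iterable[Tick]):
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    value = volume = 0.0
    for price, vol in ticks:
        value += price * vol
        volume += vol
    return value / volume if volume else None

def fetch_ticks(symbol, start, end, recent_store, historical_store, cutoff):
    """Serve one logical time-range query from two physical stores.

    Data newer than `cutoff` (about six months old in the setup described
    above) lives in the Cassandra-side store; older data lives in the
    Hadoop-side store. A range that straddles the cutoff is stitched
    together from both, in time order.
    """
    if end <= cutoff:
        return historical_store.scan(symbol, start, end)
    if start >= cutoff:
        return recent_store.select(symbol, start, end)
    return itertools.chain(
        historical_store.scan(symbol, start, cutoff),  # older slice
        recent_store.select(symbol, cutoff, end),      # recent slice
    )
```

A two-year request for the NASDAQ 100 would then amount to calling something like vwap(fetch_ticks(...)) per symbol and aggregating across the bucket, with the cutoff logic hidden from the analyst.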
We serve a lot of data out of the cache, right?

So can you compare? You talked about how, before, you had your own database. You actually acquired a company that provided that for you, so that gave you some competitive advantage. But it was hard to sustain, because you had to do all the R&D, all the support, all the training. How has life changed with Hadoop and open source? How have you used the ecosystem to replicate that competitive advantage today?

The simple thing is, using Cassandra, we're able to do almost 2X more ingestion, and we're able to scale out the infrastructure. In our old iteration, the only way we could scale was vertically: we used to buy more CPUs, more memory, bigger machines and bigger boxes. Chasing chips, a lot of solid-state memory, to do the work. With Cassandra, we're able to scale horizontally, and that was a critical component; going with Cassandra, I think we solved that problem. The second piece is the fact that we had large amounts of historical data lying on tapes and in flat files and file systems. We are a data company, and we always want to monetize this by giving the right data and analytics to our customers. We couldn't do this before. Today, with Hadoop, we can put it all into Hadoop, and if somebody says, "I want 10 symbols, the last four years of data, a volume-weighted average, and these criteria," I can pull the data and provide it as a package to the customer. Try searching for that across two to three petabytes of flat files; it would take days. So that's an interesting challenge where we were able to take advantage of some of the big data technology and solve that piece of it.

I want to switch gears just a little bit and talk about some of the applications that you're building on top of Cassandra, because we're hearing a lot at this conference about big data applications, finally, I'd say. At Wikibon we've been doing a lot of research around this, and there just hasn't been a lot of activity. Ultimately, it's those applications that bring the data to life and help the end user gain insights and take action. So how do you go about the actual application development process? It sounds like you get specific requirements from your clients. That certainly requires you to respond quickly, be very agile, and develop applications that sit on top of Cassandra. So take us through how you do that.

Yeah, what we provide is an application platform. A lot of financial services customers are moving toward R as the programming paradigm for analytics, right? So we use R as a platform, and the quantitative analysts within a bank do their strategies and their functions in R, which then takes the data from Cassandra or Hadoop and serves it out. Keep in mind that a lot of the statistical formulas and strategies are intellectual property for a bank, because those are the algorithms they build their business on. We don't see those; they're able to keep them in a proprietary space in a customer-owned area, and we provide a bucket of generic analytic functions on top of that.
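The split between customer-owned strategies and vendor-provided functions can be sketched in a few lines. The platform described here uses R; what follows is an illustrative Python sketch of the idea, with all names hypothetical, showing proprietary functions living in a per-customer namespace alongside a shared generic library.

```python
# Shared, vendor-provided analytic functions, available to every customer.
def vwap(ticks):
    """Volume-weighted average price over (price, volume) pairs."""
    value = sum(price * vol for price, vol in ticks)
    volume = sum(vol for _, vol in ticks)
    return value / volume if volume else None

GENERIC_FUNCTIONS = {"vwap": vwap}

# Customer-owned strategies: registered into an isolated, per-customer
# namespace so one bank's intellectual property is never visible to another.
CUSTOMER_FUNCTIONS: dict[str, dict] = {}

def register_strategy(customer_id, name, fn):
    CUSTOMER_FUNCTIONS.setdefault(customer_id, {})[name] = fn

def run_analytic(customer_id, name, ticks):
    # A customer's own function may shadow a generic one of the same name.
    functions = {**GENERIC_FUNCTIONS, **CUSTOMER_FUNCTIONS.get(customer_id, {})}
    return functions[name](ticks)

# Example: a bank registers a proprietary signal, then runs both it and
# the generic VWAP against the same tick data.
register_strategy("bank_a", "last_price", lambda ticks: ticks[-1][0])
ticks = [(100.0, 50), (101.0, 30), (100.5, 20)]
print(run_analytic("bank_a", "vwap", ticks))        # generic function
print(run_analytic("bank_a", "last_price", ticks))  # proprietary function
```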
Now, how do you use DataStax?

We use DataStax as the underlying tick database.

Okay, so it's not the native Apache distribution; it's essentially DataStax Enterprise.

DataStax Enterprise, with a combination of Solr and Hadoop.

Okay, so you use Solr for search?

We use Solr for search, Hadoop for the historical data, and Cassandra for the last six months. Yes. And DataStax's ability to provide the monitoring and provisioning aspects helps with ease of operations, right? If you need a new Cassandra node, DataStax has a UI-driven tool that can create a new node, rebalance nodes, and provide monitoring across all the processes there.

So DataStax gives you the management infrastructure, so you're able to, I guess, focus on other things.

Sure. It helps us sleep better.

Right. Good. So what do you make of the conference this year?

The conference? Exciting to see a lot of new players, a lot of new entrants in the application space, and in the visualization space we've been seeing some interesting new entries, so it's great to see a good amount of excitement around the big data space.

Is that your main area of interest, visualization, trying to get more users able to see the data? Is that a challenge of yours?

Yeah, that is a challenge for us, and it's still early days in terms of visualization tools, so we're looking for some neat visualization tools to build on top of our analytics solution. That is a key challenge for us.

Excellent. All right, thank you. Well, thanks very much for coming on theCUBE, sharing your practitioner knowledge, and good luck going forward. We really appreciate your time.

Thank you.

All right, keep it right there. We'll be right back. Just like yesterday, it's a crazy lineup here, so keep it right there. This is theCUBE, SiliconANGLE.tv's continuous coverage of Strata Conference plus Hadoop World, and we're live. We'll be right back.