Okay, welcome back everyone, we're here live at Percona Live. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. I'm joined by my co-host Jeff Frick with theCUBE. And our next guest is Tim Callaghan, Vice President of Engineering at Tokutek. Is that right, did I get that right? Tokutek? Tokutek. Tokutek, okay, depends how you want to word that. Welcome to theCUBE. Thanks for having me. So you're famous, internet famous, you and your brother, we were talking online, your brother Mark Callaghan works at Facebook. And you're VP of Engineering at Tokutek. You guys are SQL geeks. Would that be a fair statement? I think that's fair. In the past there's always been this East Coast, West Coast rivalry: Celtics and Lakers, Shaq and Kobe, and Mark and Tim. So tell us what's going on. So MySQL, obviously, we're huge fans of open source, it's what we live and breathe every day. It's been great innovation, but now we're involved in multiple stacks, integrated stacks, different philosophies, different religions, if you will, around scale and development, engineered systems like what Oracle's doing. Do you make it more open, scale out, open source? A lot of things, and a lot of software is going to be tying it together. But MySQL's been a big part of the success of a lot of startups, certainly Facebook, which you mentioned, where your brother works. But everyone's talking about scale, right? And WebScaleSQL kind of points to this data-first philosophy where companies have a data-driven business model. That's the real time. What does that mean under the hood? So take us through what all those trends mean for the database under the hood. I think what's interesting, and what we're seeing more and more at this show, is that MySQL is infrastructure, and it's open, and it's something lots of people are building on top of.
So just within the expo floor itself, there are several companies working on the scaling issues. There are several companies working on availability. Tokutek is a storage engine vendor, so we make an alternative to InnoDB to try to get better performance and compression without having to change your application whatsoever. And then you see lots of contribution in other ways as well. There's monitoring, VividCortex just launched, and so MySQL is kind of at the center of all this, but there are lots of businesses and open source activities going on to let people keep their data in MySQL without having to go to a new solution, and to improve it based on the efforts of others. Okay, so take us through in your mind, from an industry standpoint, as someone who's in the trenches and has a lot of experience. The trend in storage has been a big part of supporting the lack of memory. Now you have memory with persistence and flash, non-volatile compression, and you're seeing some expansion of that addressability. Now storage is the problem. So talk about how you guys play in that with the storage engine. What does this all mean? Why is it a game changer? Why is it an opportunity? Why is it a challenge? It's actually interesting in multiple directions. So one is this move toward flash. Flash is fantastically fast, but it's costly. And not only is the cost per megabyte significantly higher than spinning disks, but the durability is something you have to be concerned with. So something very good about Tokutek's software is not only do we do compression, but we write quite a bit less data to disk. If you look at some of Mark's blogs recently, for example, he's focusing not just on the size of a workload on disk, but on how many megabytes for a particular benchmark have been written to disk as well. And that's the big key to flash: it's not just that it's more expensive, but how long will it last?
And in the keynote today by Fusion-io, the other point was that one way flash is getting cheaper is not that it's getting smaller, it's that it's getting less durable. To bring the cost down, you get fewer write cycles. So cheaper flash is going to wear out faster if you don't have a storage technology that just plain writes less. Well, durability is an interesting point, let's drill into that for a second. A lot of people look at flash and get enamored by it because on paper it looks good, right? But durability is a big issue. So how do you guys manage it? You manage between the storage piece and memory. What specifically is the key issue in managing durability? Okay, the way we manage it is Tokutek was built on a data structure called a fractal tree index, which is very different from a B-tree. And one of the goals of the fractal tree is we amortize many, many operations before we actually write something to disk. So with a traditional B-tree, if your data doesn't fit in memory, when you put a row in a table, you'll be writing to disk almost immediately. Whereas with a fractal tree, you might get a thousand or 5,000 insert operations, for example, for one single write to disk. It's fundamentally very different in terms of how it interacts with storage. And part of the reason is, we've ended up in a flash world, but the fractal tree index was created in a world of spinning disks. So the goal back then wasn't necessarily durability, it was the cost of an I/O. Now an I/O has become essentially free with flash, although flash is expensive and doesn't last forever. And so the world kind of spun in your direction. You built it to make storage more efficient, and all of a sudden you've become a key element of the flash world. That's what you're saying? Yep.
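The write-amortization idea Tim describes can be sketched in a few lines. This is a toy Python illustration, not Tokutek's actual fractal tree code: a B-tree-like store dirties a page on every insert, while a buffered store batches inserts in memory and flushes them in one write. The class names and batch size are invented for the example (the 1,000 matches the figure he cites).

```python
# Toy sketch of write amortization (hypothetical, not a real fractal tree):
# compare disk-write counts for immediate vs. batched flushing.

class UnbufferedStore:
    """B-tree-like behavior: each insert writes to disk right away."""
    def __init__(self):
        self.disk_writes = 0

    def insert(self, row):
        self.disk_writes += 1  # a leaf page is dirtied and written immediately

class BufferedStore:
    """Fractal-tree-like behavior: inserts accumulate, one write per batch."""
    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = []
        self.disk_writes = 0

    def insert(self, row):
        self.buffer.append(row)
        if len(self.buffer) == self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.disk_writes += 1  # one write covers the whole batch
            self.buffer.clear()

a, b = UnbufferedStore(), BufferedStore(batch_size=1000)
for i in range(10_000):
    a.insert(i)
    b.insert(i)
b.flush()
print(a.disk_writes, b.disk_writes)  # 10000 vs 10
```

The three-orders-of-magnitude gap in write counts is exactly why write amortization matters for flash wear, not just for spinning-disk seek costs.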
What are some examples you can point to where you've worked with customers on use cases that are highly optimized for what you guys call big data? Okay, so insertion speed is certainly a use case we see quite a bit. In the world today, we see gathering data from sensors, gathering data for dashboards, logging network traffic, for example. Those are all high-insertion workloads. So we need to be absorbing the data quickly, but the data needs to be indexed as well, because inserting data into a database without being able to select it later is a problem. So we maintain indexes at high rates of speed. The other thing we do very well is compression. We have a couple of evaluations going on right now. The highest compression I'd ever seen was 20x, and we actually have a user right now getting over 30 times compression. They're building an appliance, and in that appliance, if they put in a 200 gigabyte drive, with 30 times compression it behaves as if it were a six terabyte device. Tell us how the company got here. You said the fractal tree index is nice; talk about how it was developed. Just tell the story of the company. Okay, so fractal trees were created by the founders, Michael, Martin, and Bradley, back in 2007 and 2008. It was patented, and that technology has evolved into what we call TokuDB, which is our MySQL product, as well as TokuMX, which is our MongoDB product. The initial concept and strategy got created, and then over the last six or seven years the work has gone into making it performant, giving it high compression, making it ACID. It does have ACID properties: it needs to be durable, it has to be able to recover from crashes and such. So many of those years have been spent adding more features and functionality to the basic technology. So Tim, you gave a keynote here on benchmarking, I believe, right?
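The appliance arithmetic above is easy to verify. A quick back-of-the-envelope check in Python, using decimal units (1 TB = 1000 GB):

```python
# Sanity-check the compression claim: a 200 GB drive at 30x compression
# holds as much logical data as a 6 TB device.

drive_gb = 200
compression_ratio = 30

effective_gb = drive_gb * compression_ratio
effective_tb = effective_gb / 1000  # decimal terabytes

print(effective_tb)  # 6.0
```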
I gave a presentation last year on benchmarking. Oh, on benchmarking. Okay, so it seems relatively straightforward and not something people would have to be reminded about. So what was the focus of readdressing the benefits of benchmarking? That was a very fun presentation for me. I've been a lifelong benchmarker. It's something that's in my DNA; I really like measuring things and making them faster. My proposal last year, which got accepted, was to take 20 years of benchmarking experience and turn it into a 50-minute presentation, and really get across some of the fundamentals that I think are often lost. And the biggest fundamental is to change one thing at a time. As soon as you introduce two variables into any benchmark, you have no idea which of those variables actually affected the performance of what you've just done. At Tokutek, we're small. We do frequent benchmarking just to make sure performance always gets better or stays the same and never goes down. And with automation and benchmarking, you can accomplish those goals. So there were a lot of best practices in that talk. It had beginner, intermediate, and advanced sections so everyone could follow along, and at a certain point I think I was going well beyond what people would normally do for benchmarking. But I enjoy it a lot. I do all the benchmarking at Tokutek on our products, comparing us to MongoDB or comparing us to InnoDB. Do you feel that as a skill it has slipped a little bit in the age of agile and fast moving and run, run, run? I can describe it in a term that Mark throws my way from time to time. He calls it benchmarketing. And I try not to fall victim to that, but it is certainly possible. I mean, anytime you see a vendor benchmarking their own software, when is the last time you saw a vendor show you the benchmark where they lost to their competition?
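The "change one thing at a time" rule Tim stresses can be captured in a tiny harness. This is a hypothetical Python sketch; the workload, parameter names, and baseline values are invented for illustration and don't come from any real benchmark suite. The point is structural: every run copies a fixed baseline config and varies exactly one parameter, so timing differences are attributable.

```python
# Minimal one-variable-at-a-time benchmark harness (illustrative sketch).

import time
from copy import deepcopy

# Fixed baseline: everything not being swept stays pinned to these values.
BASELINE = {"batch_size": 100, "rows": 50_000, "indexed": True}

def workload(cfg):
    """Stand-in workload: batched inserts into an optionally indexed table."""
    table, index, batch = [], {}, []
    for i in range(cfg["rows"]):
        batch.append(i)
        if len(batch) == cfg["batch_size"]:
            table.extend(batch)
            if cfg["indexed"]:
                for r in batch:
                    index[r] = len(table)
            batch.clear()
    return len(table)

def sweep(param, values):
    """Time the workload once per value, changing ONLY `param`."""
    results = {}
    for v in values:
        cfg = deepcopy(BASELINE)
        cfg[param] = v  # the one variable that changes between runs
        start = time.perf_counter()
        workload(cfg)
        results[v] = time.perf_counter() - start
    return results

timings = sweep("batch_size", [10, 100, 1000])
for v, t in timings.items():
    print(f"batch_size={v}: {t:.4f}s")
```

A real harness would add warmup runs, multiple repetitions, and variance reporting, but the invariant is the same: two variables per run means zero attributable conclusions.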
So in benchmarking, the practice is often that you come up with eight benchmarks and you win six. Well, the six are the ones you publish. The two are the ones you don't, and if someone else finds that in the wild, you have to come out and talk about it. I think it's important to talk about the use cases where your software works well and to admit where it doesn't, because the last thing you want is a user coming in and trying your software for the wrong reasons and getting bad results. That doesn't help you as a company, because when the user complains about it, it's actually going to be worse. Yeah, hopefully you're benchmarking for your own performance and your own feedback loop as opposed to benchmarking purely for the sales effort. Yes. Where is benchmarking out there right now? I mean, obviously benchmarking is being done, it's been going on for ages, but now with crowdsourcing, Jeff, that's going to put a wrinkle in it, because now people can call out these benchmarks. But we've seen the collapse of benchmarking. You don't see a lot of independent benchmarks anymore, either because no one can fund them or no one takes them on. So what's out there for benchmarking? Independent benchmarks, are there any? There are some. So iibench is an insertion benchmark that was created by Tokutek and is actually now owned and maintained by Mark. Sysbench is a benchmark framework that is owned and maintained by Percona. TPC-C has been around forever; you don't see a lot of that in the MySQL arena, but it is certainly a way to keep track of performance. I think the challenge right now is finding a place for benchmarkers to congregate and put their efforts together, because there are a lot of one-off benchmarks. It's putting a twist on it, for instance, right? Someone might say, hey, this is a general-purpose benchmark, but I want to uniquely put a use case in there. That's what you're referring to, right?
Yeah, it would be great if someone somewhere could come up with a few real-world use cases we want to model, and then one or more vendors could participate in building the benchmark and modeling that. The challenge there is that any vendor who feels like they're not going to do well in that benchmark is going to pull back and not contribute to the effort. Which, if you document that with live video and a crowd chat, then backing out is bad for them. Exactly. Come on, people, let's get on that. Jeff, we'll get on that. We'll make sure we look at the benchmark. Well, like Gary said, Gary Orenstein was on yesterday, some of the benchmarks are changing. What's relevant and how you're measuring performance is really changing. I think he talked about eBay now measuring web pages served per kilowatt. So as cloud gets more pervasive, as the data gets more pervasive, the actual things that people want to measure are changing out from underneath their feet. So I'm sure that adds another challenge to the whole thing. Yeah, and then there's what you're measuring in the benchmark. As I mentioned, we're getting to a world with flash where it's not just how small you are on disk, it's how much data you've written to that device, and in the background, how much garbage collection it has performed. So that's an aspect of benchmarking. We can use all the existing benchmarks, rerun them, and not just look at speed or size on disk, but look at megabytes written, for example. Yeah, benchmarks are a challenge. Tim, we'll get on that immediately. The Wikibon team will be on that. I've got to ask you some more industry questions because I want to get your perspective, because the database world is really hot right now, but it's been around for a long time. It goes through these cycles, you know, cubes, structures. Something hot comes in, then structure comes back again. The new comes in, then you get some sort of structure. The database never goes away: schemas, et cetera.
But looking at your website, you guys have customers in cloud enablement, social networks, e-commerce, data processing. I mean, it's almost like the modern era is old wine in a new bottle, as the expression goes. Data processing is a term that goes back to the mainframe, but it's really critical now. Real time, cloud is just infrastructure, social networks are just graphs and more data. What is the big trend right now? And how can you compare this real-time cloud and social networking to the distributed computing problems of the past? There are always comparisons with a twist. Could you dice it out for us? Are there any parallels to older paradigms that are now rearing their heads again? Okay. I think one way to look at that, or the way I look at it, is that I worked at VoltDB for two years, which is a Mike Stonebraker company; Stonebraker also founded Vertica. One of Mike's premises is that one size does not fit all. So let's build a database for in-memory computing, let's build a database for OLAP, let's build these search solutions. What's interesting in the MySQL ecosystem is the opposite: let's try, with one common foundation, to make it do as many things as possible. And there are going to be edge cases that just don't fit. There are likely some use cases where Cassandra is the best technology for a given problem. But you can certainly extend MySQL to go a long way. Look at Facebook, for example. They're doing all manual sharding. They're not using a sharding technology to get that done, because they've looked at the effort and they've mastered how to make it work for them. It's a very particular implementation; I don't think they would recommend everyone in the world go in that direction. But you can make MySQL do a lot of things, and over time people find more ways to make it do interesting things, or vendors come along, or open-source projects extend the functionality further.
So I'm kind of a fan of this: let's make MySQL do as many things as possible, instead of thinking we need to cut a different version of MySQL to handle, say, the OLAP problem. Because if you look around the MySQL market, there is no open-source OLAP solution. There's Infobright and InfiniDB, which are closed-source commercial solutions, but no one's written a storage engine for OLAP. And maybe at some point that'll happen. I think that would be... Well, that's an opportunity, right? So there are just two strategies. Do you have all these little siloed, use-case-built structures? Or do you go with a general-purpose platform that lets people do customization? That's pretty much what you're saying, that's the preferred approach in your mind. But what about the companies that don't have the expertise, like Facebook does? I mean, everyone wants to be the next Facebook. Well, I want to do what Facebook's doing. Well, guess what? You need talent. You need developers. Now, the good thing is MySQL's got hundreds of thousands of developers out there. So where's the developer community going to enable that, I don't want to say slow-moving enterprise, but the kind of normal business, the large enterprise? Not Facebook, not Google, not the guys who can whip up their own sharding approach with rooms full of PhDs. Not everyone can do that. Yeah, it's a tough problem to solve. It's good to see Oracle moving in the direction of making sharding easier with MySQL Fabric, for example. They announced it, I think, a year ago, and a couple of releases have gone out since then, so it's nice to see that they're trying to put something in the box. I would argue something MongoDB did really well was they put sharding in the box. They came out with their own implementation so people didn't have to do it themselves. One might argue about MongoDB's performance, you might not like it, it may not be a perfect solution, but it's there.
And Oracle is coming along later and adding that feature and functionality, which I think is fantastic. That will hopefully push and inspire the commercial vendors who are doing sharding to do better and work faster to stay ahead of that curve, and the competition makes everybody better. We're here with Tim Callaghan, a guru in databases. In the last couple of minutes we have here, I want to drill into some of the trends around big data. So define big data from your perspective; before we started, you gave me a good definition, and I want to get that out there. I also want you to comment on the new data sources that are coming in. They're new data sources only because we're now connected: with the internet of things you've got sensors, surveillance networks, social networks. There's much more of a graph model, and NoSQL-type databases are great for catching data that's unstructured, but at the end of the day it's got to connect in. So you have real time, you have social networks, you have new data coming in that quite frankly you can't really prepare for in the old model, the old world. How do people deal with that? What can they do, and what does that mean for the new guy making architecture decisions? Okay, as we talked about before we started here, my definition of big data is data that doesn't fit in RAM. So for a product like TokuDB, for example: InnoDB is a fantastic engine, but once the data is larger than RAM, you'll be gated in performance by your I/O. For us, as soon as the data no longer fits in memory, the product just keeps going and remains fast. The challenge with this big data problem is that data is arriving quickly and it needs to be indexed so people can do queries on it later.
So there's time series, there's data coming from sensors. It's about ingesting this data at high rates of speed, doing it on a reasonable number of servers, and then keeping the data indexed so you can answer the queries your users have, because it's tragic to bring the data in and then not be able to query it. Or to have to ingest it into MySQL just to push it out to something like Hadoop so you can run your reports in non-real-time, batch mode. So we're seeing lots of... It's hard to put that genie back in the bottle once you've ingested it, in this tragic situation. Meaning you have data and you really can't work on it. It causes some problems. What is some of that collateral damage? The other challenge is the rise of products like MongoDB. I'm ingesting this data at high rates of speed and the use case changes. For example, in MySQL, I need to add a column. I need to start ingesting new data on an existing table, and if rewriting that table takes days, weeks, or months, it's an impossibility. So you take a product like Mongo that doesn't have a schema change problem, and it's happy to ingest rows of data that now have a column that wasn't there before. So part of the rise of NoSQL, in my mind, is that. It's about complete flexibility on the schema and not having to anticipate every aspect of my use case: being really agile, being able to upgrade my applications in the middle of the day on a Friday without worrying about downtime, without having to put up a little message on the website saying, sorry, we're going to be down Saturday for an hour. Yeah, that's unrealistic. I mean, that's an unacceptable situation. So that's a positive. But what's the negative of ingesting too much or, as Gil at Factual and I were talking about a couple weeks ago, being data-full? I mean, at some point you're bloating with data, you're bursting with data. Is there an issue there? What's the consequence of that? Is there a consequence? There is one.
And the one consequence is that normally this data does have a life expectancy of usefulness. So ingesting at high rates of speed means at some point you're likely having to remove data at that same rate of speed while still continuing to ingest at probably a higher rate. And that's the part people oftentimes don't think about: I'm going to keep the data for six months, and it's arriving at a certain rate of speed. Well, when you hit the six months, the insertion speed is still exactly what it was, or better if you're successful. But you need to build in the ability to get rid of data. That's a really good point. And that's a mindset issue too. The old mindset of data warehousing and business intelligence was: I'm going to store data, park it out in the backyard, and then we're going to run some algorithms, do some data mining on it, and poof, answers come out. The lag is not a big issue. What you're talking about is acting on data when you have a real-time inbound ingest that's complicating the hell out of it. It makes the data dirty. So how do people solve that problem? The common way to do it in MySQL is partitioning. In MySQL, when you have these big tables that are going to ingest at high rates of speed, you partition the data by day, week, or month, and that makes it very easy to remove old data in one simple operation instead of having to delete row by row. Well, this is a great conversation here, getting in the weeds. I love that MySQL's relevance is still continuing. The developers are out there. Tim, I want to ask you to share with the audience your perspective on why MySQL is still important. Why is Percona Live such a great event, and what are the core conversations happening here on the ground? For me personally, it's the community. I went to Oracle OpenWorld for several years, just as an Oracle user, and it's a very different feeling. You walk around here.
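The partition-by-day retention pattern Tim describes can be sketched as follows. This is illustrative Python, not MySQL's partitioning implementation; the class and parameter names are invented for the example. Rows land in a per-day bucket, and expiring old data drops whole buckets in one operation, analogous to `ALTER TABLE ... DROP PARTITION`, instead of deleting row by row.

```python
# Toy sketch of partition-by-day retention (hypothetical, not MySQL internals).

from collections import defaultdict
from datetime import date, timedelta

class PartitionedTable:
    def __init__(self, retention_days=180):
        self.partitions = defaultdict(list)  # day -> list of rows
        self.retention = timedelta(days=retention_days)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def expire(self, today):
        # One cheap drop per expired day, regardless of row count,
        # instead of a row-by-row DELETE over millions of rows.
        cutoff = today - self.retention
        for day in [d for d in self.partitions if d < cutoff]:
            del self.partitions[day]

t = PartitionedTable(retention_days=2)
d0 = date(2014, 4, 1)
for offset in range(5):  # five days of ingest
    day = d0 + timedelta(days=offset)
    for i in range(1000):
        t.insert(day, i)
t.expire(today=d0 + timedelta(days=4))
print(sorted(t.partitions))  # the two oldest days have been dropped
```

The same shape carries over to MySQL DDL: partition the table by a date expression, then drop expired partitions on a schedule while ingest continues uninterrupted.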
There's a lot of enthusiasm. There's a lot of hiring going on by very big, interesting companies. At Tokutek, we open sourced a year ago at this conference, and a year has now gone by, and I've got lots of people coming up to me telling me how they tried our software, what their experience was, that they plan on trying it, giving us ideas for features. I did a birds-of-a-feather session last night on extreme MySQL performance. My expectation was to have people standing up as Facebook lookalikes telling me how fast their systems were, and other people standing up to learn from them. But instead, the first person who raised his hand was just getting started in MySQL. And the best part was we dissected every reason MySQL might be slow. So we still talked about extreme performance, but we picked it apart from a beginner level, and no one in the room was upset that this wasn't about people thumping their chests about how fast their data is. It's an authentic community, is what you're saying. It's not grandstanding; OpenWorld is about grandstanding, doing deals. This is developers. Yes, this is all about developers and operations. The developer market's changed. I've got to ask you, with the cloud, there's more going on here. Everyone wants to win the developers. Does that worry you a bit, in terms of all the spam that could come in, a lot of the FUD, poisoning the well, if you will, around the real movement, around what the developer community is? And second question: what is the hot thing for developers right now? What do they really want? That's interesting, given that we span two technologies: we have a nice NoSQL solution in the Mongo market and a MySQL solution. To your point about winning the hearts of developers, yet another thing Mongo has done really well is they've created a product that a developer can just install and get started with, and there's no database administrator.
There is a need for operations and administration, but it's different from having a formal release process where, to add a column to a table, you stage a release and do it very formally with checks and balances. In Mongo, the developer can just change the schema on the fly and keep going, and I think Mongo has won the hearts of developers because of that. It's easy to use. It's easy to use, but there's always been a tension between developers and database administrators, where a developer might want to make a change to a table schema and a DBA might look at that and push back because they don't agree with it. In Mongo there's no opportunity for that check and balance. It's just, here it comes, and... Yeah, and that's powered a lot of the DevOps culture, which is infrastructure as code, which is a big part of this trend, and that's good. So there were two levels of hassle for the developer: the DBA, who can be a bottleneck or a milestone, and the network guy provisioning servers. With DevOps, how does that all change, and how does MySQL look at that DevOps view? That's a good question. So, comparing and contrasting to the NoSQL model again: it often lands in the laps of DevOps with, say, a MongoDB application, where the app developer might have made a change and there's no control between those changes actually making it into production.
What's interesting in the MySQL world, and again this comes back to community, is that there are lots of DevOps people here from lots of different companies, some of which you could actually consider competitive with each other, all sharing information and ideas and making each other's lives easier, which is good for the community. It's an interesting world we live in, where Facebook is making many changes and creating many open source projects which are critical to their success, yet they're open sourcing them and letting the world use them. I think their model is that the world will help make those products better, and it's not worth keeping the code locally and not sharing it. And you see that with WebScaleSQL, where many of the bigger MySQL users are contributing back to a common codebase. Final question before we break: what do you think about Facebook's developer outreach? You have this F8 conference coming up on April 30th. Are they doing a good job? Are they targeting more app developers? What's your take on the Facebook developer community? I am kind of blown away by how Facebook does development. A few months ago they opened an office in Boston for the first time, and they had an invite-only event, which I went to. It was partially about recruiting, but partially about explaining how they get work done. They talked about their mobile development platform, and I think the number was over 200 developers at Facebook working on the phone apps. They mentioned how every check-in is code reviewed by two engineers who didn't write the code, within 10 minutes, and it's just stunning to me, that level of scale. And then how often they push code to the phones: releases were going out the door multiple times per day. So they've not only scaled in terms of headcount, but in terms of automation and testing and just pushing things out. I think they're doing some really interesting things. Do you think that's a bellwether for the future?
I think that iterating quickly, putting software out there, and improving and fixing problems as they occur is critical for success, especially in mobile or social. Tim, thanks for sharing your perspective here inside theCUBE. We'd love to get the data out of your head and share it with the audience. MySQL is certainly very important. Performance, I mean, it's just a game changer, and it's going to go to the next level. You're seeing things like Facebook and others leading the way, and you guys are doing a great job with your storage engine. Thanks for joining us. We'll be right back with our next guest after this short break.