So before we get started, can I get a show of hands? How many people here work for product companies? Almost everybody. Anyone in the services industry? Should I even ask that question, then? So we do have some participation. When you think of big data, are you dealing with actual problems related to the size of the data, or the volume of the data? So for volume, the velocity of change of data, let's call it that. And size? Cool. We're here to just learn about stuff, right? That way I don't have to answer any of your questions later. Awesome. Fair. But yeah, absolutely: your data can be poorly modeled, complex, hard to get to, there can be all kinds of problems, in addition to the ones we talked about. But I think as we talk about big data, there are certain other aspects beyond just the sheer size of it. That's what I was trying to get at. So with that, maybe one last thing: how many people are engineers? All right, product managers? Cool. All right, so we have an engineering-heavy audience. I'll try to answer some of your questions. Through this talk, having been an engineer myself, what I've tried to do is lay out some structure of what we did: what problem we faced, how we solved it, and what kinds of things we tried. Now, in a half hour it will be hard to fit in all of the nuances and differences and issues with each of the things we tried, but I'll at least share some insights into it. Hopefully those will be valuable to you.

So with that, the first thing: for us, the scale, how much data do we have? We have about 10 petabytes of data today. Actually, not today; that was a week ago. Someone asked me how much data we have, and I actually had to go look on the live side at some of the tools we have, and it ended up around 10 petabytes. And it's structured and semi-structured; I'm not going to go into the unstructured realm in this talk.

So like I said, our data is not just big, our data is fast. We get about 250 billion system events in a day. What are system events? These are things like your servers sending out events, your network devices sending out events, your applications sending out events. You're getting semi-structured data from all over the system. Your applications are also sending out very structured business events. We target 500 transactions per second. So think of transactions; I think an earlier speaker also alluded to this. Transactions even today rely on structured data; they rely on the properties of RDBMSs in building those OLTP systems to support volumes of transactions like this. But at 500 transactions per second, with data that may be complex and not easy to get to, you can well understand the magnitude of the problem we face. And on top of that, we're generating about 25 terabytes of data daily.

Anybody in a similar boat? All right, where do you work? OK, cool. Anybody at half the scale? A fourth? A tenth? All right, you guys have it good. You don't have a big problem, right? More manageable, huh? You do? Yeah, you do. You already said you're at almost the same scale, right? So imagine a geographically distributed system where we have to get something out of data that's flowing at this magnitude. Not only do we have the 10 petabytes, this is how fast it's growing, adding to that 10 petabytes every day.
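A quick back-of-envelope on those numbers, derived only from the figures just quoted; the per-event average size and the growth horizon are implied by the arithmetic, not figures stated in the talk:

```python
# Rough back-of-envelope on the scale quoted above; inputs are the talk's
# numbers, everything else is simple arithmetic derived from them.
events_per_day = 250e9          # ~250 billion system events per day
new_data_per_day_bytes = 25e12  # ~25 terabytes of new data per day
one_petabyte = 1e15

events_per_second = events_per_day / 86_400
avg_bytes_per_event = new_data_per_day_bytes / events_per_day
days_per_extra_petabyte = one_petabyte / new_data_per_day_bytes

print(f"~{events_per_second / 1e6:.1f} million events per second, sustained")
print(f"~{avg_bytes_per_event:.0f} bytes per event on average")
print(f"another petabyte roughly every {days_per_extra_petabyte:.0f} days")
```

That works out to roughly 2.9 million events per second around the clock, averaging on the order of 100 bytes per event, and another petabyte of data about every 40 days.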
By the way, if you have questions in the interim, feel free to interrupt me. Let's try to make it interactive if we can. And if the lunch was too good and too heavy and you don't feel like asking questions, that's fine too.

So what's the problem? There's a lot of data, right? But data by itself is just there. What you're really looking for is information, intelligence, out of that. And in our case, that information is in multiple places, almost everywhere. There are multiple data sources. Like you saw, there are business event streams, there are OLTP systems, there are system events. There's data in multiple formats; even semi-structured data has different structures that it follows. So information is actually fragmented across multiple places. But we do need to combine that and come up with one answer in the end, right? I think the keynote speaker showed you an easy way to write a query into a page and actually get, oh yeah, this is why you shouldn't sell black t-shirts in Bangalore. Unfortunately, our problem isn't that easy either, because we don't have coherent or single unified data structures, data sets, or types of data. Our problem is actually magnified by virtue of this fragmentation.

It's not that we don't have tools and infrastructure either. We have multiple tools built over time through the organic growth of the organization, and they work well. They work well for the use case they were built for. But when you try to unify these things, try to get a single coherent answer, you run into a challenge. Ad hoc queries, like I said, are not simple, and they're certainly not simple when they're not on one source. If I just had unstructured data, and this is not to pick on anyone else, but if I had to take the Mahabharata and mine it and create a view out of it, that's one source, that's one type of data. Yes, it's a complex problem to solve, but I have some constraints. Now, if I remove many of those constraints (okay, you can give me data in any which way you want, I have to incorporate it, I have to be able to process it, I have to give you an answer to whatever ad hoc query you ask me), I'd be lost, really. So what would you do as an engineer? I guess we'll get to it.

And I think you've probably heard this one before: despite all that, our problem was a little different. Everybody's problem is different here, right? Not the same problem as mine, not the same problem as the guy sitting next to you. We had to solve for concerns around security. We had to solve for concerns around fraud detection in real time. So when you talk about those 500 transactions per second, there are background analytical fraud models that run to find out whether that transaction should actually go through or not, and these have to complete within the SLA of that transaction. So getting a result in 20 seconds is not good. The result has to be in one second, perhaps under one second in many of the use cases (there's a rough sketch of what that kind of latency budget looks like just below). We have long-running transaction processing use cases, where you move away from that model of "it has to run immediately and give some result back to the user" to transactions that can run over days. Two completely different problems that have different constraints and limitations and different flexibilities in how you can solve them. And then we had a system command and control problem, which is around: is my application running? Is it doing what it's supposed to do?
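A minimal sketch of that kind of inline check under a latency budget; the function names, thresholds, and fallback policy here are illustrative assumptions, not the actual fraud system:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# One shared pool for the illustrative fraud-scoring calls.
_scoring_pool = ThreadPoolExecutor(max_workers=8)

def score_transaction(txn):
    """Stand-in for a real fraud model; returns a made-up risk score."""
    return 0.12

def inline_fraud_check(txn, budget_seconds=0.5, risk_cutoff=0.9):
    """Run the fraud score inside the transaction's latency budget.

    If the model cannot answer within the budget, fall back to a policy
    decision instead of blowing the transaction's SLA. The budget, cutoff,
    and fallback here are illustrative assumptions only.
    """
    future = _scoring_pool.submit(score_transaction, txn)
    try:
        return future.result(timeout=budget_seconds) < risk_cutoff
    except FutureTimeout:
        return True  # e.g. approve and flag for offline review

print(inline_fraud_check({"amount": 25.00, "currency": "USD"}))
```

The point is simply that the analytical work is bounded by the transaction's SLA, not by how long the model would like to run.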
So you mix all these varied concerns together, and it's the same engineering team that's looking to solve all of these. And normally we think of common solutions, we think of not doing too many things, right? So there was some challenge in actually coming up with a solution that fit all of these use cases. And don't get me wrong, we haven't perfectly solved some of these yet. We're still working through it. And if any of you think you have a solution, come talk to me, tell me what it is, I'll get promoted.

So again, like we talked about, there's data versus information. Out of all of this, we want answers about security, we want to address fragmentation in the solution, and we want to somehow think about things that people may want in the future, things that may come up in the future. A little bit of speculative engineering, perhaps. I know we also heard this morning about TCP/IP now working if you're on Mars and Earth. Yeah, I would have never thought of that; it takes a brilliant mind to think of things like that. But in a real business, we did have to account for some of those possibilities, some of the things that people may want in the near future. Because if you build something that's only good for today, by the time it comes out, it'll only be good for yesterday, not for today. Okay, and on top of that, we did care about the cost of scalability. So putting in thousands of servers, thousands of cores, tens of thousands, 600,000: not an option. Anyone who works in a product company has real constraints, right? You've heard the budgeting point of view before: we can't do that because there's no money to do that. It's a very valid constraint for businesses. So in every case, just throwing more and more machines at it doesn't work. It's not viable for everybody. And you've heard this from your business partners before: we want it better, we want it faster, and we want it cheaper. So the moment we go to them and say, give me $50 million and I'll give you something that works for two of your use cases, guess what the answer is going to be. You've heard those answers before, I bet.

So what did we do? As engineers, the first thing that comes to mind is: if the problem is too big, you divide it up. So let's try to split it into smaller problems. Transaction processing works today, and there are ways transaction processing is getting improved, so we'll pick the semi-structured data first. We want near real-time analytics; that's not too hard to get, depending on how you look at it. We want correlation between events. We want the capability to detect a variance in, say, business activity or an application's processing or performance (the sketch just below shows one simple way to think about that). And then we want to give people the capability to do search on the semi-structured data. So that was our smaller problem.
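A toy illustration of variance detection on a stream of event counts; the window size, threshold, and the rolling mean-and-deviation approach are my own assumptions for the example, not how the actual system detects variance:

```python
from collections import deque
from math import sqrt

def variance_alerts(counts_per_minute, window=60, threshold=3.0):
    """Yield (bucket_index, count) for buckets whose count drifts more than
    `threshold` standard deviations away from the rolling mean."""
    history = deque(maxlen=window)
    for i, count in enumerate(counts_per_minute):
        if len(history) == window:
            mean = sum(history) / window
            std = sqrt(sum((x - mean) ** 2 for x in history) / window)
            band = threshold * max(std, 1.0)  # floor avoids a zero band on flat streams
            if abs(count - mean) > band:
                yield i, count
        history.append(count)

# Example: a steady stream of ~1000 events per minute with one sudden spike.
stream = [1000] * 120
stream[90] = 5000
print(list(variance_alerts(stream)))  # [(90, 5000)]
```

The same shape of check works whether the counts are business events per minute or an application's response times.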
Questions? No? So how did we solve it? The first thing is a group of people sitting in a room: how do we solve it? Oh yeah, it's data, right? So we can stick it in Oracle and write queries on it. How many of you have run into that before? No one? No one's team has had the idea of putting data into a database? But that's the obvious first choice for engineering teams, right? Everybody knows SQL. Everybody's worked with database applications before. So the moment you try to solve a problem that's related to data, the first thing people normally jump to is: it's data, put it in a database, and put it in a relational database. And to be fair, bulk inserts are possible, so you don't have to insert each event as it comes in. There are ways to multiplex connections and write efficiently even into an RDBMS. So don't get me wrong, there are ways you can try it, and that's probably the first thing people think about. So we did the same thing. And when we did the evaluation: okay, we have 250 billion events; if you take the semi-structured parts of those and create a schema, which is fairly simple, not complex, and try to put the data in a database, you get thousands of billions of rows a day. And I think when we actually did the proof of concept, after about 15,000 inserts a second the thing fell over, and that was on something like a 12-core machine. So obviously not workable.

So okay, now what do we do? Everybody had heard of Hadoop two years ago, right? So let's use Hadoop. Perfect, let's go do it. So we borrowed some hardware, set up the actual cluster, set up the MapReduce, set up the query, tried the first thing: 30 seconds before the query would even start. 30 seconds. And this we did with about four terabytes of data. We have, in some cases, SLAs of one second, two seconds, four seconds to get answers. Probably not reasonable. Then what? Okay, on top of that, when we actually look at an RDBMS: six million dollars, and if you want high availability, you can multiply that by however many nodes you want redundant. You don't get much scalability; scalability is primarily vertical, vertical being a misnomer in itself. When you look at Hadoop, the tool set was limited two years ago. Now you do have a fairly mature tool set, but we would have had to build everything ourselves. And like I said, about 30 seconds before the processing would even start.

Then what? More ideas, more brainstorming. Okay, there are columnar databases; we heard about columnar earlier too. And let's quantify our need: we want five million messages per second on the load side, and we want 2,000 simultaneous queries while we're loading. So we picked a set of products and went and tried them out. We went through the regular product evaluation cycle: called the vendors, got some hardware, got some software installed, got some data in there, started trying things out. And we did that with both GigaSpaces and some of these other products. So why had we gone to columnar databases? Just an oversimplification of the difference we thought it would have from Oracle: an RDBMS serializes rows for persistence; a columnar database actually serializes the columns. So you get the benefit of things like run-length encoding for compression. If you have four A's in a row, you can store them as a count and the actual value, and you can take similar values and do all kinds of encoding on them.
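A toy sketch of that row-versus-column difference and the run-length encoding example; the table, column names, and values are made up for illustration:

```python
# Toy illustration only: the table, columns, and values are invented.
rows = [
    ("2012-07-01", "payment", "US", "A"),
    ("2012-07-01", "payment", "US", "A"),
    ("2012-07-01", "refund",  "US", "A"),
    ("2012-07-01", "payment", "UK", "A"),
]

# A row store serializes each row together, interleaving column values.
row_layout = [value for row in rows for value in row]

# A column store serializes each column contiguously, so similar values
# sit next to each other and compress well.
column_layout = list(zip(*rows))

def run_length_encode(values):
    """Collapse runs of identical adjacent values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [tuple(pair) for pair in encoded]

# The status column of four identical "A"s becomes a single (value, count)
# pair, which is the "four A's in a row" example from the talk.
for name, column in zip(("date", "type", "country", "status"), column_layout):
    print(name, run_length_encode(column))
```

Notice how the row layout keeps repeating values apart, while the column layout puts the four "A"s next to each other where one pair can represent them all.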
And it was promising. It was a lower-cost solution for hardware, software, and power. We wouldn't have to stand up thousands, tens of thousands of nodes to do this. It was designed for analytics-type processing, already a linear scale-out model, and it had HA support out of the box. And it had tools. Yeah, problem solved, right? Once again, the reality of the world: the cost. It just wasn't feasible. When we actually looked at all of the licensing and all of the hardware we would still have to buy, the cost was still a factor, perhaps more so than in the case of some of the RDBMSs.

So, as they say, necessity becomes the mother of invention. We had to keep looking. So for us: divide the problem. There are still multiple problems to solve; just recognize that you can't have the perfect, holistic, beautiful solution where one thing works for everything. It's cheaper to duplicate storage than to try to process everything in the same way. Put some reasonable constraints on how semi-structured we want things, separate the semi-structured from the completely unstructured, and then try to avoid unnecessary scale. If you take one problem and put one solution to it, then take that same solution and try to apply it to multiple problems, invariably you end up sub-optimizing for all of the problems, and then you need to support it at the scale that all of those different problems need. And as we all know from experience, it's cheaper to build something that needs to scale less than something that needs to scale more, in many cases.

So where did we end up? Our partners over at eBay had the same problem a few years before we did, and there's a custom multi-dimensional aggregation tool that we have now. It supports ad hoc queries on semi-structured data, and actually on pre-computed results in some cases. We get meaningful compression, using the sorts of encoding techniques from the columnar databases, and we have more than one tool for time series analysis; there's no one tool. Like I said, the most cost-efficient option for us was to go with something homegrown. And we're now in discussions with our peers to see if we can open-source this within the next year or two, however long it takes.
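The talk doesn't go into the internals of that tool, but here is a minimal sketch of the general pre-computed multi-dimensional aggregation idea; the dimensions, metrics, and function names are assumptions for illustration, not the actual design:

```python
from collections import defaultdict

# Illustrative only: roll events up along a few dimensions as they arrive,
# so ad hoc queries over those dimensions read small pre-computed cells
# instead of re-scanning the raw event stream.
cube = defaultdict(lambda: {"count": 0, "total_latency_ms": 0})

def ingest(event):
    key = (event["minute"], event["app"], event["region"])
    cell = cube[key]
    cell["count"] += 1
    cell["total_latency_ms"] += event["latency_ms"]

def query(minute=None, app=None, region=None):
    """Aggregate the pre-computed cells that match the given filters."""
    count = total = 0
    for (m, a, r), cell in cube.items():
        if (minute is None or m == minute) and \
           (app is None or a == app) and \
           (region is None or r == region):
            count += cell["count"]
            total += cell["total_latency_ms"]
    return count, (total / count if count else 0.0)

ingest({"minute": "12:01", "app": "checkout", "region": "US", "latency_ms": 40})
ingest({"minute": "12:01", "app": "checkout", "region": "US", "latency_ms": 60})
ingest({"minute": "12:02", "app": "checkout", "region": "UK", "latency_ms": 90})
print(query(app="checkout"))  # (3, 63.33...)
print(query(region="US"))     # (2, 50.0)
```

The aggregation work is paid once at ingest time, which is what makes ad hoc queries over the pre-computed results cheap even when the raw stream is enormous.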
So, my lessons learned; maybe not everybody in my organization's lessons learned. Hadoop is not synonymous with big data, for me at least. There are still significant challenges when we talk about ACID requirements, like for our transaction processing use cases. There are still challenges with the continuous stream of events, the speed of that data versus just the scale. There are tools now becoming available for real-time analysis and query, but again, it's abstractions on top of abstractions on top of abstractions. So what you really have to think about is what your problem is and how you can best solve for that problem. That was my lesson learned. Any questions? I know I probably ran over time still. We have time for a couple of questions. No questions?

Is it possible to elaborate a bit more on the specifics of the real-time response, how you all handle it?

So like I said, that's a separate problem. Right now we have several ways in which we are trying to address that, and one of those is an OLTP system. We're moving away from that to actually go to a very flat data structure, to go more columnar. But again, remember, in the end, that is an extended part of a transaction. Even though it's an analytical fraud model that will run, it has to run in the context of a transaction, in many cases.

One question. Hello. Either way, you have a mic, you can go first. Yeah, I have a mic. Hello, yeah, what was your experience with Vertica and Vectorwise while processing this?

So I don't think I have the numbers, I don't remember them off-hand, but in terms of the insert performance, it met our requirements, actually. In terms of the query performance, it did meet our requirements. We had about five million inserts per second and 2,000 queries while we were doing inserts, within the SLA of about three seconds, and those queries actually returned the new data that was being streamed in. So like I said, they met our requirements. That wasn't the problem there. But again, it comes down to: can I afford it? And especially at our scale, buying hardware and software to support our scale becomes a costly proposition.

Yeah, so, sorry, go ahead. So we did evaluate a data grid; sorry, I rushed through that. We did evaluate the GigaSpaces data grid. Again, the cost there was about four times, if I remember correctly, that of the columnar databases. To set up a GigaSpaces data grid at this volume, and I do remember that number because it shocked me even at the time, I would need 500 machines, fully custom, and it's implemented in C++.

One last question. When you talked about the inserts per second, how did you manage the insert? Was it always append-only? Pardon me? Was that managed as append-only? Yeah, so inserts, right, new inserts. There's just a stream of events coming in, and I'm inserting those new events. There are no updates. Okay, so it's always append-only? Yes. Are we good? Thank you so much, guys. Thanks for listening.