I don't know about the many years of experience, and I'm not sure the years actually teach you anything, but thank you. So what are we going to talk about today? As I was saying earlier, at PayPal we have several use cases where we deal with massive amounts of data: some streaming, some analytics, some a mix of both. What I'm going to cover today is one use case, site monitoring. You'd think a mature company like PayPal, whose business is very successful, would have figured these things out on day one. But we're the world's biggest startup. The world's biggest startup. And for those of you doing startups, you're in a startup center, you know that things like monitoring come afterwards. First it's about making money. You want to get a product out there, you want to get it into people's hands, and you want to make some money. So questions like how this thing will actually stay available, how it will scale, how we manage the system, how we operate it, take a little less importance at the onset. And if you have success like PayPal's, well, they come a little later than normal. So when I joined PayPal a couple of years ago, I was told we needed to figure out how to improve our monitoring capability. When I looked at it, our monitoring system, from the applications to the servers to the switches, the entire stack, was generating about 250 billion events a day and about 20 terabytes of data. Correct me if I'm wrong, but our application logging alone is now at that level, so the total has probably grown to about 350 billion by now, and it's growing on a daily basis. So how do you make sense of that big a stream of data? What do you want to do with it? When you talk about monitoring, that's a pretty wide-open scope, right? What is it that we really want to know? So we said, okay, there are a few things we're going to focus on. We're going to do real-time analytics. We're going to do some basic correlation. 
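As a back-of-the-envelope check on those volumes (the 250 billion events and 20 terabytes per day are from the talk; everything else below is derived arithmetic, nothing PayPal-specific):

```python
# Back-of-the-envelope throughput math for the volumes quoted in the talk.
events_per_day = 250e9        # ~250 billion monitoring events per day
bytes_per_day = 20e12         # ~20 TB of data per day
seconds_per_day = 86_400

events_per_second = events_per_day / seconds_per_day
avg_event_size = bytes_per_day / events_per_day

print(f"~{events_per_second / 1e6:.1f}M events/sec sustained")   # ~2.9M/sec
print(f"~{avg_event_size:.0f} bytes per event on average")       # ~80 bytes
```

So even the steady-state ingest rate is around three million events a second, before any query load on top of it.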
And I hope you all know what correlation is, right? If you have an event stream, you're trying to figure out which events are really related to each other, even though they may be coming from different parts of the system or different levels in the stack. Detection is also a big part of monitoring: how do you actually know something is wrong? And then, of course, search. Monitoring applications generate massive amounts of logs, and you need to go through those logs and find the little bit of information you're actually interested in. So of course the first inclination everybody has is: let's put all this data in a database, and then we can query it and figure out what went wrong. Easy. Everybody knows SQL, right? Well, I said, that doesn't sound right. 250 billion events, 20 terabytes a day. What else is out there? Everybody said, let's use Hadoop. Great. Okay. But then we actually did even a basic analysis and a proof of concept. I had a budget of $500,000, and to stick this thing in Oracle would have cost me $600,000,000 just in hardware. Scalability is a concern. Traditional RDBMSs scale vertically, and as you scale vertically, the cost of that server goes up and up and up. Okay, so Hadoop was supposed to solve that scalability problem: we can get some commodity servers, spread the load across them, and that will work. Well, what we found was that it takes about 30 seconds just to start a query. Obviously not going to work; I want real-time analytics. Don't get me wrong, we're not below 30 seconds even now, but we're getting there, and with Hadoop it wouldn't have been possible to get past the 30 seconds at all. So what do we do then? We needed about 5 million messages a second of ingest, and about 2,000 simultaneous queries on this data. 
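The correlation idea described above can be sketched very roughly: group events from different layers of the stack that share some join key and land close together in time. This is a toy illustration, not PayPal's system; the event fields and the fixed time window are assumptions for the example.

```python
from collections import defaultdict

def correlate(events, window_ms=500):
    """Group events from different stack layers (app, server, switch)
    that share a correlation id and fall within a small time window.
    A simplified sketch; real stream systems do this with approximate,
    windowed joins rather than a batch pass."""
    groups = defaultdict(list)
    for ev in events:                      # ev: (timestamp_ms, layer, corr_id)
        groups[ev[2]].append(ev)
    related = {}
    for corr_id, evs in groups.items():
        evs.sort()                         # order by timestamp
        if evs[-1][0] - evs[0][0] <= window_ms:
            related[corr_id] = [layer for _, layer, _ in evs]
    return related

events = [
    (100, "app",    "txn-1"),
    (120, "server", "txn-1"),
    (130, "switch", "txn-1"),
    (900, "app",    "txn-2"),
]
print(correlate(events))   # txn-1 spans three layers; txn-2 touches only one
```

The payoff is that a single fault shows up as one correlated group instead of three unrelated alarms at three levels of the stack.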
That was the benchmark we were talking about. We said, okay, let's look at what's actually intended for real-time analytics. Columnar databases were one of the things that came up, so we looked at Vertica, VectorWise, and ParAccel. Columnar databases, if you're not familiar with them, a simplistic explanation: instead of storing data in rows, they store data in column structures, where sets of columns then make up a table, and that layout is what lets them scale for analytic scans. The other option was a data grid: GigaSpaces, and some data appliances. Both of these work very well in terms of performance, and in terms of giving us the capability to do real-time analytics and scale out over time. Both promise the lowest-cost solution for hardware, software, and power, and they're designed specifically for real-time analytics use cases over streaming data; a lot of companies use them for this. They scale linearly, and since they're horizontally scalable, availability follows. We had an availability requirement of four nines, 99.99%, and they meet that requirement: with the right number of nodes, you meet the availability requirement. But again, do we have the money? All of these things come at a cost. And I think that's how solutions like Hadoop first came about in the first place: someone set out to solve a problem, solve it at low cost, solve it efficiently, solve it for all the parameters that matter. So these solutions, although suited for what we were trying to do, had a cost associated with them that we couldn't bear. Just as an example: if we scaled GigaSpaces linearly, we would need 500 nodes just to hold the data. 500 machines. You can imagine, a whole data center today runs about 2,000 machines. 
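The row-versus-column point above can be shown in miniature. This is a generic illustration of columnar layout, not any of the products named; the tiny schema is invented for the example.

```python
# Row vs. column layout, in miniature. An analytic query that touches
# only a couple of fields scans far less data in a columnar layout.
rows = [
    {"txn_id": 1, "country": "US", "amount": 20.0, "declined": False},
    {"txn_id": 2, "country": "FR", "amount": 35.5, "declined": True},
    {"txn_id": 3, "country": "US", "amount": 12.0, "declined": False},
]

# Columnar form: one list per column; positions line up across columns.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# "SELECT SUM(amount) WHERE country = 'US'" reads just two columns,
# never touching txn_id or declined.
total_us = sum(a for a, c in zip(columns["amount"], columns["country"])
               if c == "US")
print(total_us)   # 32.0
```

At three rows it makes no difference; at 250 billion events a day, scanning two columns instead of every byte of every row is the whole game.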
So I would be growing the data center footprint by about 25% just to get visibility into the data. What we ended up with instead was a custom multi-dimensional aggregation mechanism, which was not invented by us at PayPal: eBay had it working for about two years, and we picked it as part of the analysis. It's proprietary, so I can't tell you the details, otherwise I'll have to lock you in the room forever. But this is what we ended up picking and implementing on our side. There's an ad-hoc query system for read-only nested data, which is the most problematic kind of query when you talk about real-time analytics. We get meaningful compression by aggregating data rather than trying to store all of the raw data. And we split the different types of data into different data stores, so all of the time series data now goes into a dedicated time-series history store. Now, Hadoop is really synonymous with big data, of all the things we played with. And it does work, for simple batch processes that don't have hard real-time requirements. It's pretty efficient. And things like iterative processing are being added to Hadoop in the recent releases, as it goes to 1.0 and beyond. But for many tasks, anything that requires ACID properties, anything that requires continuous incremental updates, Hadoop is not a good fit. And those are the use cases we have. Someone earlier asked a question about the actual transaction processing: we use OLTP systems for transaction processing. We use Oracle, at the scale that we run at, at the size we run at, we still use Oracle, because an RDBMS is even now best suited for transactional types of use cases. So the lesson learned out of this: there is no one solution. 
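The compression-by-aggregation idea mentioned above can be sketched generically. This is not PayPal's or eBay's proprietary design, just the common pattern: collapse raw events into per-bucket counters keyed by a few dimensions, and store the counters instead of the events. The field names are hypothetical.

```python
from collections import Counter

def aggregate(events, bucket_ms=60_000):
    """Multi-dimensional pre-aggregation sketch: collapse raw events into
    per-minute counts keyed by a few dimensions. The compression comes
    from storing one counter per (bucket, dimensions) cell instead of
    every raw event."""
    cells = Counter()
    for ts, pool, status in events:        # hypothetical event fields
        bucket = ts - (ts % bucket_ms)     # floor timestamp to the minute
        cells[(bucket, pool, status)] += 1
    return cells

raw = [(1_000, "web", "ok"), (2_000, "web", "ok"),
       (3_000, "api", "error"), (61_000, "web", "ok")]
print(aggregate(raw))
# Four raw events collapse into three aggregate cells; at billions of
# events per day against a handful of dimensions, the ratio is enormous.
```

The trade-off is that you can no longer answer questions about individual raw events from the aggregate store, which is why the raw search path is kept separate.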
There's certainly a temptation to jump to conclusions and try to fit one solution to the problem space you have, but there isn't one solution. You have to pick what fits you best. That's a general principle I try to follow, and I think the same thing applies to big data. So, do we have any questions? I know we have 10 minutes; I hope I didn't use up anyone else's time. Any questions?

My name is Chandu Nayak. A lot of the things you've talked about seem to be largely structured data. But a company like PayPal must be generating a lot of unstructured data as well. How do you handle that, number one, and number two, how do you integrate it at all with structured data?

So, you're right, a lot of the data that goes into things like logs is unstructured. But our primary analytics use cases are based on at least semi-structured data. Outside of log files we don't have much truly unstructured data, and even the reason centralized application logging is in place is to put some structure on the data, so that it's not completely unstructured. The way we have it, it's a constrained set of fields. And yes, there's variance in the values that go into those fields, but it's not completely unstructured. We don't have major use cases where we feed in documents. Even our transactions: once a transaction comes in, again, it's constrained by the interfaces we provide. It's structured by what data we actually expect on a particular interface, and the same goes for the interchange with partners. So most of our data is at least semi-structured. And like I said, we don't try to marry it with structured data at the data tier; we do that more at the application tier. And I have one question. 
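The "constrained set of fields with free-form values" idea can be sketched as follows. The field names here are entirely hypothetical, invented for the illustration; the talk does not describe PayPal's actual log schema.

```python
import json

# A fixed, constrained field set: this is what makes the logs
# "semi-structured" rather than free text. Names are hypothetical.
LOG_FIELDS = ("timestamp", "pool", "machine", "event", "detail")

def log_event(**kwargs):
    """Emit one log line as JSON. Field names are constrained to
    LOG_FIELDS; field values are free-form."""
    unknown = set(kwargs) - set(LOG_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return json.dumps({f: kwargs.get(f, "") for f in LOG_FIELDS})

line = log_event(timestamp=1700000000, pool="web", machine="host42",
                 event="decline", detail="expired card")
print(line)
```

Because every line carries the same field names, downstream analytics can filter and aggregate on them without parsing arbitrary text, which is exactly the point of centralizing the logging.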
When you talk about real-time analytics here, what precisely is that real-time analytics, especially at PayPal?

So, for example, one of the prime use cases is declines. When someone submits a transaction, that transaction can get declined for a variety of reasons. Your credit card was expired. We detected that this is a fraudulent transaction. It's coming from an IP address in a geographic region that is not the region you normally transact from. And many parts of the system can decline the transaction, and external partners can decline the transaction. So declines are a business metric we track very closely, and in real time we want to know how many declines are happening, from which part of the system or from which external network. Another example: we have costs associated with processing. In some countries that we go to, our transaction volume is a fraction of a percent of the volume in the US. So if we have an outage for customers in that country, it will not be detected in the overall health of the system. But for that country, for that processor, it's important for us to know whether the transactions are going through. So, to figure out: in a country like France, with a particular processor who charges us less than another one, are all the transactions going through as they normally do? That's a real-time analytics case, not a transactional one.

So it's more OLAP than transaction processing? No, it's not transactional, right? It is analytical, yes, probably OLAP. But you can't just pull the data out of the warehouse every single minute or every single second; you can't pull that much data out of the warehouse, and you can't feed it to the warehouse at that rate either. So processing it as a stream, that becomes important. So, how are you doing it? Besides... 
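The per-country, per-processor point above is worth making concrete: a small segment can be completely down while the global decline rate barely moves, so the alert has to be computed per segment. This is an illustrative sketch, not PayPal's monitoring logic; the thresholds and field names are assumptions.

```python
from collections import defaultdict

class DeclineMonitor:
    """Sketch of per-(country, processor) decline-rate tracking.
    Thresholds are illustrative; a real system would use sliding
    windows and statistical baselines rather than fixed cutoffs."""
    def __init__(self, threshold=0.5, min_volume=10):
        self.totals = defaultdict(int)
        self.declines = defaultdict(int)
        self.threshold = threshold      # alert if >= 50% declined...
        self.min_volume = min_volume    # ...over at least 10 transactions

    def record(self, country, processor, declined):
        key = (country, processor)
        self.totals[key] += 1
        if declined:
            self.declines[key] += 1

    def alerts(self):
        return [key for key, n in self.totals.items()
                if n >= self.min_volume
                and self.declines[key] / n >= self.threshold]

m = DeclineMonitor()
for _ in range(10_000):                 # healthy high-volume US traffic
    m.record("US", "proc-a", declined=False)
for _ in range(12):                     # tiny French segment, all declined
    m.record("FR", "proc-b", declined=True)
print(m.alerts())   # [('FR', 'proc-b')], invisible in the global rate
```

Note that the global decline rate here is about 0.1%, which would never trip a system-wide alarm; only the segmented view catches the French outage.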
Of all the transactions you decline, how many are genuine declines and how many are false?

Can I lock you in a room? We're not telling you how we're doing it. No, no, seriously: there's a percentage of the declines we make that are false positives, and a percentage that are true positives, right? That is a business metric that is not publicly available. We don't even expose our risk rules, or anything related to those rules, even for test purposes, to our external partners. So if you're trying to test against our system, you don't ever hit a decline; there's no way to actually probe it. The percentages are there, we measure them internally, and they're under 5%. I will not give you the exact percentage, but under 5% are false positives, and that metric is tracked for our risk systems to drive continuously lower.

Why am I asking? Recent experience: two of my genuine transactions were declined.

So, one suggestion I have is that each speaker can take a couple of questions now, and then we'll open it up for questions afterwards. It may not be about your transactions with PayPal, preferably about big data, but come find us an hour later.

We'll be glad to help you and tell you why your transactions were actually declined. If you can give me your PayPal account... I don't need your password or anything, I just need something to start the investigation.