Hello. Welcome, and thank you for coming to my talk. I'm going to talk about high-performance transactional queues implemented on top of HBase. My name is Andreas Neumann and I work for Continuuity. What is Continuuity? We're building an application platform called the Reactor. It's built on top of Hadoop and HBase, and it makes it easy to develop, deploy, and manage your applications. It sits squarely on top of Hadoop and HBase, so we're on top of the open source ecosystem.

My agenda is the following: I'm going to motivate why transactions are valuable, I'm going to explain how we do transactions, and then I'll explain how our queues work in that transactional system. If I have time, I might talk about transactions in batch processing and end with a little outlook.

So the Reactor is an application platform built over Hadoop, HBase, ZooKeeper, MapReduce, and many other components of the Apache open source stack. Here I'm going to focus on HBase; we use HBase as a storage engine. The platform lets you collect data in real time and store it. Then you can process it in real time or in batch, you can store the results of that processing, and you can query that data through a unified API that we provide.

The way you process data in real time in this platform is called a flow. A flow is similar to things a lot of you have probably heard of: Storm, or S4, or Samza. There are a lot of flavors of real-time stream processing now. It decomposes the processing into small units, where every unit in the flow is called a flowlet. In Storm they're called bolts, I believe. Each flowlet receives events on an input queue, does something with those events (maybe some aggregation, maybe it stores something away), and then it emits events to its outputs. You can connect multiple of these flowlets into a graph, a topology; that's what we call a flow. Then you can run it, feed it with real-time events, and it does the real-time processing for you.

Most of the real-time stream processing engines out there don't really talk about what happens to the data. They just give you guarantees like: we will never drop events, we will reprocess if a failure happens, we can replay the events from a certain time. But what happens to the data that you've already written when an event gets reprocessed and you write again? Do you have to worry about making all your operations idempotent? How do you deal with these scenarios? In this platform we perform the operations of a flowlet in a transaction, and that means when a failure happens, none of its effects have actually happened. When the flowlet crashes, or when the power goes out, your data is always in a consistent state, because the transaction gives you ACID guarantees.

So this is what the processing of a flowlet looks like. You have an input queue with events on it. The flowlet reads them, and it can emit events to its output queues, and downstream there are other flowlets that consume those. Excuse me. Now of course the flowlet typically will not just do processing; it will persist some information that it has extracted from the event, or that it has aggregated. In our platform a table is called a data set; we have an abstraction called a data set that allows you to store information.
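To make that programming model concrete, here is roughly what a flowlet could look like. This is just a sketch with made-up interfaces, not the actual Reactor SDK, but it shows the shape: one method that is called per input event, a data set write, and an emit to the output queue, all of which end up inside one transaction.

```java
/** Hypothetical shape of a flowlet: consume one event, update a data set, emit downstream. */
public class WordCountFlowlet {

  interface Dataset { void increment(String key, long amount); }
  interface OutputEmitter<T> { void emit(T event); }

  private final Dataset counts;
  private final OutputEmitter<String> out;

  WordCountFlowlet(Dataset counts, OutputEmitter<String> out) {
    this.counts = counts;
    this.out = out;
  }

  /** Called once per event taken from the input queue; the whole call runs in one transaction. */
  void process(String line) {
    for (String word : line.split("\\s+")) {
      counts.increment(word, 1L);   // persisted to the data set
      out.emit(word);               // enqueued for downstream flowlets
    }
  }
}
```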
Now, what could happen is that you have two different real-time processors, two flows, and they both operate on the same data set. For example, your data set could be user profiles, and the same user is watching videos on Netflix and also making purchases on Amazon, and you happen to get both of those event streams. You have two flows that consume them, and both write to that user profile. So both of them write to the same data set; the data set is shared. But what happens if they try to update the same record in that data set at the same time? This is what we call a conflict. If we don't take care of this situation, the consistency of the data is at risk, because we don't know what happens: maybe flowlet 2 overwrites the effect of flowlet 1, maybe flowlet 1 overwrites what flowlet 2 did. Maybe the operation isn't even atomic, so some of flowlet 1's operations go through and some of flowlet 2's operations go through. And in a highly distributed, scalable system you may have hundreds of agents running at the same time, so it is very likely that eventually you're going to hit this situation, and you had better be prepared for it.

So that's why we have transactions. Whatever a flowlet does for a single event happens in a transaction. This includes the reading of an input event from its input queue, the processing and the data operations that it performs as a consequence, and then the emission, the writing of output events to its output queues for the downstream flowlets. All of this happens in one transaction.

So what does this mean? A transaction gives you ACID properties. The A stands for atomicity: everything you do in this transaction becomes visible at once, nobody ever sees a partial write from you, and either everything gets committed or nothing gets committed. So if you fail halfway, you crash, you run out of memory, whatever, then whatever you have done up to that point in the transaction will be undone; the system takes care of how that happens. The second one is consistency. Consistency means you will never end up with a partial state change because of failure conditions. The third one, very important, is isolation. We know that transactions can fail, and they may be rolled back when they fail. If other transactions could see the writes of transactions that are still in progress, they might draw conclusions from that data; but if that data gets rolled back, it was actually invalid, and again your data could become inconsistent. So we provide a strong isolation guarantee: you never see anybody else's uncommitted writes. And the fourth one is durability. Durability means that once you have committed your transaction, it is persisted in a way that you're not going to lose it, no matter what happens.

This is very similar to the traditional notion of ACID in relational databases. In relational databases, consistency refers, for example, to a table and its indexes: it can never happen that the table gets updated and its index doesn't. In big data it's not much different. Often we make two writes to two different tables, or to two different column families in HBase, and one of them might be an index for the other. HBase doesn't have that feature built in, but your application may have a pattern that is similar to an index. So this idea of consistency is equally important here.

So how do we implement transactions? We follow a style called Omid, which was introduced to the HBase community by a group at Yahoo, Yahoo Research actually, probably about two years ago, and we adopted that approach. And what we're doing is... who here is not familiar with HBase?
Okay, who is familiar with HBase? Okay, and half of you don't know whether you are or not. Okay, that's inconsistency, by the way; it's not ACID. So, HBase. I have no slides on this. HBase is the Apache open source implementation of Google's Bigtable. It's a columnar store built on top of HDFS, the Hadoop Distributed File System. It achieves durability and consistency of its persisted data by writing to HDFS, and HDFS has replication and failover and all these nice properties, so HBase uses that to get the same guarantees. I don't have the time to go into a lot of detail here, but an important feature of HBase is that every cell you write to a table can have multiple versions. Each version has a timestamp, and with that timestamp you can refer to it; the timestamps, of course, are ordered. When you do a read, you can give it an upper bound for the timestamps you want to see, and then you will only see data that is older than or equal to that timestamp. So this is a way you can do snapshot reads over an HBase table.

What we do for the transactions is hijack those timestamps to implement multi-version concurrency control. Every transaction is assigned an ID. The IDs are ascending, they only get bigger, similar to timestamps, so we get a total ordering of all the transactions. We then use that transaction ID as the timestamp when we write to HBase. And in order to achieve isolation, we read from HBase with a filter that filters out all the transactions that have not committed yet. Transactions that have not committed yet may have performed writes, which at this point are dirty writes; if we read them, that would be a dirty read, because we'd be reading something we don't know will actually persist. So for isolation we use an HBase filter that excludes those versions that have not been committed.

For consistency, for avoiding two different agents cancelling out each other's writes, we have to do something else. We have to prevent two different flowlets from writing to the same cell or the same row of a table at the same time without noticing that they're overwriting each other. In relational databases, this is typically done by locking. When you intend to write to a row of the database, you lock that row. Now nobody else can write to it; depending on your isolation level, most likely nobody else can even read it; and everybody else has to wait until you commit your transaction. That's how it's done traditionally.

We are taking a different approach. We believe that in big data, throughput is everything, and locking an agent that does real-time processing, making it wait for somebody else to give up a lock, means it's busy waiting. It's just idling, sacrificing its precious compute resources to wait for a lock. What optimistic concurrency control does is avoid those locks. Instead of locking, we simply allow concurrent writes to happen. And because they use different versions, different timestamps, and HBase supports multiple versions for each value, these writes can go into the table without a problem.
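To make that concrete, here is a minimal sketch of what the write and read paths could look like against the HBase client API. The column family and the method names around it are invented for illustration, and in the real system the exclusion of in-flight transactions is done with a server-side filter rather than just a time range.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Illustration only: use the transaction id as the HBase cell version. */
public class MvccSketch {
  private static final byte[] FAM = Bytes.toBytes("d");

  /** Write a value with the transaction id as the cell timestamp. */
  static void transactionalPut(Table table, long txId,
                               byte[] row, byte[] qual, byte[] value) throws IOException {
    Put put = new Put(row);
    put.addColumn(FAM, qual, txId, value);   // version == transaction id
    table.put(put);
  }

  /** Read the newest value visible to this transaction's snapshot. */
  static byte[] transactionalGet(Table table, long readPointer,
                                 byte[] row, byte[] qual) throws IOException {
    Get get = new Get(row);
    // Only versions written by transactions at or below our snapshot.
    // A production version would also exclude versions written by
    // transactions that were still in flight (not yet committed).
    get.setTimeRange(0, readPointer + 1);    // upper bound is exclusive
    Result result = table.get(get);
    return result.getValue(FAM, qual);
  }
}
```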
The only catch with using timestamps is that if you took, say, the wall-clock time in milliseconds, you could be unlucky: two parallel transactions could start in the same millisecond and get the same transaction ID. That doesn't work. So what we have, and now we're getting into the details, is an oracle that produces the transaction IDs. In a way they are timestamps, but the oracle needs to make sure it never issues duplicates, so they go up in increments of one. There are various ways of generating unique, monotonically ascending IDs.

Okay, so now we said we allow two transactions to modify the same value. What we have to do is detect when that happens, and when we detect it, we have to fail one of the two transactions. That's what optimistic concurrency control does. Instead of locking up front, it allows the writes to happen, but at commit time, every time we commit a transaction, we have to verify that it did not conflict with another transaction. If it did, then we fail the transaction and roll it back, and the processing is retried and hopefully succeeds.

So the flow is like this. This is what happens in a flowlet, the red stuff here. It starts a transaction and then it does its work. How does it start the transaction? It talks to what we call the oracle. The oracle is our central transaction management guru. It's an agent that sits somewhere in the data center, and its only purpose is to maintain what the active transactions are, what the last transaction ID it issued was so it can issue the next one, and whether there are conflicts between transactions. So every unit of work starts by talking to the oracle and asking it for a transaction ID. Then you do your work. In a flowlet, that means... yes, I can go into that later. The oracle: there are various ways of implementing it, giving you different levels of reliability. The most efficient way is simply keeping it in memory, but then, of course, if it crashes, it's a single point of failure. So you want to have at least a second replica, with the two of them synchronizing, to get high availability. There is also an implementation that persists every write to HBase; then, when it crashes, you can just bring up a new one, but it's less efficient. It depends on how many transactions you need per second. If you're talking about several hundred, no big deal. If you need 30,000 per second, better do the in-memory thing.

Okay, so now you do the work. Doing the work means: I get an event from the input queue, I do my aggregations, my processing, I persist whatever I want to persist, I write to data sets, and I emit events to my outputs. Then I commit the transaction. As part of committing the transaction, I find out whether there are conflicts or not. If there are conflicts, I abort the transaction, and part of aborting the transaction means I have to roll back the writes that I have made to HBase, to my storage engine. I could actually leave the writes there; these are just versions stored in HBase, and as I said, we use filters to filter out the transactions that are in progress so that nobody sees dirty writes. I could use the same mechanism to filter out transactions that have failed. But over time I would build up a really large backlog of failed transactions that I'd have to filter out at read time, and I would see performance problems because of that. So instead I always try to undo the writes. If I did an increment, I can undo it by doing a decrement. If I wrote a new version, I can undo it by deleting that particular version from HBase. We always attempt to do that.
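Putting that together, the flowlet-side loop might look roughly like the sketch below. The Oracle interface, its method names, and the undo-log idea are hypothetical stand-ins for what the talk describes, not the actual Reactor API.

```java
import java.util.Deque;

/** Hypothetical shape of one unit of work in a flowlet; all names are made up. */
public class TransactionLoopSketch {

  interface Oracle {
    long start();                    // hand out the next transaction id
    boolean canCommit(long txId);    // run the conflict check against other transactions
    void committed(long txId);       // "I'm done" after a clean commit
    void aborted(long txId);         // "I'm done" after rolling back
  }

  /** Inverse of one write we performed, e.g. a decrement for an increment,
   *  or a delete of the exact version we wrote. */
  interface UndoOp { void undo(); }

  static void processOneEvent(Oracle oracle, Deque<UndoOp> undoLog) {
    long txId = oracle.start();
    try {
      // ... dequeue the input event, write to data sets, enqueue output events;
      // every write pushes its inverse operation onto undoLog ...
      if (oracle.canCommit(txId)) {
        oracle.committed(txId);      // success: no further work on the client side
      } else {
        rollback(undoLog);           // conflict: undo our own writes first,
        oracle.aborted(txId);        // then tell the oracle we are done
        // the event will be reprocessed in a new transaction
      }
    } catch (RuntimeException e) {
      rollback(undoLog);             // crash or error halfway: same path as a conflict
      oracle.aborted(txId);
    }
  }

  static void rollback(Deque<UndoOp> undoLog) {
    while (!undoLog.isEmpty()) {
      undoLog.pop().undo();
    }
  }
}
```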
Sometimes that undo fails, though. For example, HBase is down and I simply can't do it. Then we need to keep track of the transactions that were aborted but weren't able to undo their writes; they become invalid and are excluded from all future reads.

So this is the general flow. As I said, optimistic concurrency control avoids the cost of locking up front, and the good thing we get is that we don't have idle processes; we're always busy. In particular, we never get deadlocks. Deadlocks are the worst thing that can happen to you, because they mean at least two of your agents sit there for seconds, if not minutes, until the deadlock is detected and both of them get killed or cancelled, and in real-time processing that really wastes precious time. Another thing you see in relational databases is lock escalation: if your transaction locks many rows of the same table, eventually the lock gets escalated to the table level, because it gets too expensive for the database to maintain all those individual locks. So when you have too many locks on the table, it simply locks the entire table and doesn't let anybody else access it. That's terrible too; it just means other agents are blocked from doing their work. None of this happens with optimistic concurrency control.

But of course we have to pay a price for that, and the price is conflict detection every time we commit; we need to make that check. And if we have to roll back, the cost is higher too, because those are now actual write operations, probably I/O operations in the storage engine, that we have to perform to undo the transaction. So why do we think this is good? Well, this is good if we know that our application is designed to have few conflicts. If conflicts are rare, then we can pay that after-the-fact cost for a conflict once in a while, as opposed to paying the smaller cost of locking every single time.

Why do we think conflicts are rare? Well, it depends on how well you write your code. But typically, when you write a big data solution, you're going to partition the work. A single instance of a flowlet cannot do 50,000 events per second; it's just not doable. So what do you do? You run multiple instances of that flowlet, and now you partition the work. If you do random partitioning, and the flowlet does writes based on keys in the events it receives, then each instance of the flowlet will receive random keys, and you have a chance of conflict. But if you partition by the key that will actually be the key of the write, then you know that writes for the same key always go to the same instance of the flowlet, and you will never have conflicts. So there are techniques and best practices and design patterns that help you avoid conflicts, and these are patterns you want to follow anyway, regardless of whether you have transactions, for the sake of locality and caching and just better throughput. If you follow those patterns, you will rarely have conflicts, and that's what we're taking advantage of here.
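As a tiny illustration of that partitioning pattern, assuming a hypothetical router that assigns events to flowlet instances by the key they will later write, stable hash partitioning is all it takes:

```java
/** Sketch: route each event to a flowlet instance by the key it will write. */
public class KeyPartitionerSketch {

  /** Stable partitioning: the same userId always lands on the same instance. */
  static int instanceFor(String userId, int numInstances) {
    // Math.floorMod guards against negative hashCode values.
    return Math.floorMod(userId.hashCode(), numInstances);
  }

  public static void main(String[] args) {
    int instances = 4;
    // Both events for "alice" go to the same instance, so the two writes to
    // alice's profile row can never run in parallel transactions.
    System.out.println(instanceFor("alice", instances));
    System.out.println(instanceFor("alice", instances));
    System.out.println(instanceFor("bob", instances));
  }
}
```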
OK, so in a little more detail, what does the transaction oracle do? This is a slightly different view of what we saw before: everything in red happens in the flowlet, on the processing side, and everything in blue happens in the oracle. The oracle maintains all the active transactions. So when I start working, the first thing I do is get a transaction from the oracle. The oracle gives me back a transaction ID, and it also gives me back a filter, what we call a read pointer, that tells me which other transactions I have to exclude from my reads. Now I do my work, and when I'm done, I try to commit the transaction. Committing the transaction actually happens through the oracle: we send the oracle the keys of the rows that we have modified in this transaction. The oracle keeps track of that; we call it the row set of a transaction.

So what the oracle has is an in-memory map of the row sets of all the transactions that may possibly have overlapped with any ongoing transaction. As long as transactions are short, that's only the transactions from a short window of time. With those row sets in memory, it can very quickly determine whether the intersection of two row sets is non-empty, and if so, the oracle comes back and says there's a conflict. If the response is that there's a conflict, the flowlet aborts the transaction by rolling back its writes, doing the reverse of what it originally did, and when it's done with that, it goes back to the oracle and says, I'm done. Now the oracle can remove that transaction from its list of active transactions. If the transaction is successful, if there's no conflict, there's no more work needed on the client side; it's only on the oracle side. The oracle responds that there are no conflicts, the client goes back to the oracle immediately and says, I'm done, and the only thing the oracle does is remove that transaction ID from the active transaction list. It remembers the row set of this transaction for a short time, as long as any transaction that started while it was running might still overlap, because when that next transaction commits, we need to do a conflict check against it.

So the oracle really becomes key to this. The oracle needs to perform well, needs to scale well, needs to be reliable; the moment the oracle is not there, nothing happens in this system. So, kind of tricky: how do we make this oracle fast, super fast, and persistent? We have been experimenting with various ways of doing this. The best way appears to be to write everything that happens in the oracle to a write-ahead log. We don't persist to HBase; it's too expensive. HBase under the covers also uses a write-ahead log, but every write to HBase will actually result in probably six writes to disk eventually, because HBase writes everything to its own tables, the tables get persisted to a file store and replicated three times by HDFS, and HBase also writes to a write-ahead log, which also gets replicated three times. There are faster implementations of write-ahead logs; for example, there's an open source one called BookKeeper. So in case my entire cluster has a power outage or whatever, that write-ahead log is persisted.
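To summarize the oracle's bookkeeping, here is a toy, in-memory-only sketch of that commit check: roughly first committer wins on overlapping row sets. Class and method names are invented, and a production version would additionally write every decision to the write-ahead log and prune row sets that can no longer overlap anything in flight.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy in-memory transaction oracle; a real one also persists to a write-ahead log. */
public class OracleSketch {

  private static class Committed {
    final long commitPoint;        // when this transaction committed
    final Set<String> rowSet;      // rows it wrote
    Committed(long commitPoint, Set<String> rowSet) {
      this.commitPoint = commitPoint;
      this.rowSet = rowSet;
    }
  }

  private long clock = 0;                                   // issues tx ids and commit points
  private final Set<Long> inFlight = new HashSet<>();
  private final List<Committed> recentlyCommitted = new ArrayList<>();

  /** Start a transaction: hand out the next id, which is also its snapshot. */
  synchronized long start() {
    long txId = ++clock;
    inFlight.add(txId);
    return txId;
  }

  /** Commit check: conflict if someone committed one of our rows after we started. */
  synchronized boolean tryCommit(long txId, Set<String> rowSet) {
    for (Committed other : recentlyCommitted) {
      if (other.commitPoint > txId && !Collections.disjoint(other.rowSet, rowSet)) {
        return false;                                       // caller rolls back, then calls abort()
      }
    }
    recentlyCommitted.add(new Committed(++clock, rowSet));
    inFlight.remove(txId);
    return true;
  }

  /** Abort: called once the client has undone its writes. */
  synchronized void abort(long txId) {
    inFlight.remove(txId);
  }
}
```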
Yes? No. No, no, it's not an Oracle database. It's an oracle in the sense that it can make predictions, like the oracle of Delphi. I'm not sure I understand the question. Yes, yes, that's right. Better scale? Yeah. And cheaper. We don't have benchmarks, because we're not an Oracle company, so we don't actually run relational databases. And the whole idea, it's actually a very interesting question, because when Hadoop started, people saw it as an alternative to the traditional relational databases and the traditional data warehouses, because the traditional databases hit a point where they couldn't scale further, or the cost of scaling out became so high that it wasn't affordable. Can you scale an Oracle database to five petabytes? Very hard. I don't even know if it can do that, but it's extremely expensive. You can use other databases that were designed from scratch for more scalability, like Teradata, but that is going to be very, very expensive. So Hadoop came up as open source, free, cheap software to do analytics over very large amounts of data on commodity hardware, with a much lower cost factor. And how could it be cheaper? It was cheaper because it gave up the ACID guarantees of relational databases.

Yeah. That is true, yeah. I see. No, yeah, I guess we can. So just to finish that previous thought: what we see is that even in big data, you need actual ACID guarantees. And HBase has evolved as one of the major databases used in the NoSQL world, so we implemented this on top of HBase. We also have an implementation over LevelDB, and we have one over a SQL database, and I'm sure we could implement the same thing on top of Oracle NoSQL; in the end, it's just code, right? Oracle NoSQL was not an option we had when we started this work. I mean, it may exist now. Still, I don't think Oracle is open source; I don't know that particular product. Okay, that's great. I'm sure you can do similar things with it; I would have to do a deep dive on Oracle NoSQL. What ACID properties do you get out of it? I don't know. It's an alternative approach. All I can say is that Oracle, to me, represents the old SQL world, the old relational database world, and this whole NoSQL world has evolved in parallel to the Oracle world. Now Oracle is bridging the gap and coming over and saying, we have a solution too. So yeah, I'm sure you can use Oracle if you want to.

All right, so how do we achieve performance and reliability in the oracle? We use a write-ahead log to persist, but we don't actually operate on it; we only use it for recovery. And we have a second replica of the oracle that replicates in real time everything the primary oracle does; if the primary crashes, the replica can take over right away.

Well, as I explained, the transactions give you ACID properties, but they give you those properties in a different way. In the traditional ACID world, the I means isolation, and relational databases have various levels of isolation. Here we implement only one, which is pretty much complete isolation. What is that called? We call it snapshot isolation; it's actually a flavor of isolation that's not implemented by most databases. I don't know if you want to run your financial operations on HBase; I think you probably don't, right? No, the motivation for this work is real-time stream processing. You want reliable processing, and you want the stream processing to be able to persist its state fast. For financial stuff, I don't know. I think eventually this technology will be mature enough for that, but today I don't think that is the domain. Okay, okay. All right, so the motivation for this was... but thank you for the question, because it gives me the introduction to queues.
So the motivation for this was that we want our flowlets to operate in transactions, and what flowlets do is read data from queues and write data to queues. They don't just interact with tables; they also interact with queues. And these queues are a little more complex than a typical queue. A typical queue is a FIFO: you have producers that push to the queue, you have consumers, and things come out in the same order they were enqueued, and everything comes out exactly once. That's the traditional kind of queue. In this case the queue is slightly more complex. Every consumer of the queue, because we're in a highly scalable system, can itself be a distributed agent. It may have 16 instances, it may have 50 instances, and they all concurrently try to dequeue, to read from that queue. Also, a flow is not just a pipeline; a flow can branch. A single flowlet can have an output that is consumed by two downstream flowlets, which means that when an event has been consumed from the queue, we can't just remove it. We have to wait until all of those flowlets have read it; only then can we evict it from the queue. You can see this here: the light green boxes are flowlets. The first one has two instances, and between the instances of one flowlet the queue is partitioned; every event on the queue goes to exactly one of the instances of that flowlet. But there may be a second flowlet that also reads this queue, and it also needs to see all the entries.

I'm actually running out of time, so here's a very quick overview of what the queues do. A queue in HBase is simply a table, and every entry translates into a row. There are two synchronization points for the queue. When I write to the queue, I need to make sure the entries are ordered, so there's a write pointer, basically a counter. Every time I write, I increment that counter and then I insert a row. If the transaction fails, I mark that row as invalid; I do not remove it from the table. The reason is that the write pointer is a counter, and we want the entry IDs in the queue to be consecutive; we don't want gaps. If there were gaps, we wouldn't know whether a gap is there because a transaction failed and removed its entry, or because a transaction is still ongoing: it has written, but because of read isolation we don't see it yet. So when we detect a gap, we wouldn't know whether we can skip it or not. Therefore, when we roll back a transaction, we don't remove the entry we've written, because that would cause a gap; we simply put a tombstone on it and mark it as invalid, and everybody who reads it will skip it. That way, when we do see a gap, we know it must be an entry that has been added to the queue but not committed yet, so we have to wait for that transaction to commit. We can't proceed, because otherwise we would process events out of order.
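Here's a rough sketch of that enqueue path against the HBase client API: one counter increment to claim the next consecutive entry ID, one row write for the entry, and a tombstone write instead of a delete when the transaction rolls back. The table layout and column names are invented for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of the enqueue path: one counter increment plus one row write per entry. */
public class EnqueueSketch {
  private static final byte[] FAM      = Bytes.toBytes("q");
  private static final byte[] META     = Bytes.toBytes("_meta");     // row holding the write pointer
  private static final byte[] WRITEPTR = Bytes.toBytes("writePtr");
  private static final byte[] DATA     = Bytes.toBytes("data");
  private static final byte[] INVALID  = Bytes.toBytes("invalid");

  /** Claim the next consecutive entry id and write the entry under it. */
  static long enqueue(Table queueTable, long txId, byte[] payload) throws IOException {
    long entryId = queueTable.incrementColumnValue(META, FAM, WRITEPTR, 1L);
    Put put = new Put(Bytes.toBytes(entryId));
    put.addColumn(FAM, DATA, txId, payload);     // versioned with the transaction id
    queueTable.put(put);
    return entryId;
  }

  /** Rollback: never delete the row (that would create a gap); tombstone it instead. */
  static void markInvalid(Table queueTable, long txId, long entryId) throws IOException {
    Put put = new Put(Bytes.toBytes(entryId));
    put.addColumn(FAM, INVALID, txId, new byte[0]);
    queueTable.put(put);
  }
}
```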
Consumers also have a counter, the read pointer. That's the head of the queue, where you read from. When I dequeue, I do the same thing: I increment that pointer, and because I know all the entries have consecutive IDs, I know I can read the next one. But the dequeue needs to be able to roll back too: if the transaction that processes the event fails, I have to undo the dequeue. The problem is that the read pointer is a counter optimized to do nothing but increment, and all the code is written assuming it only ever moves forward. So what we do is this: every consumer has a state, and in that state it remembers the entries it has claimed, and that state is persisted. When I do a dequeue, I just put the entry into that claimed list and then give it to the client. If the client fails, I simply leave it in the claimed entry list, and the next time it dequeues, it sees the same entry again, because we read it from the claimed entry list. But if the transaction commits, then there is an action to perform on commit, which is removing that entry from the claimed list, so that next time the consumer moves on and gets the next element. And I can use the same mechanism to do prefetching: I can fetch 100 entries at a time, put them in my claimed list, and then I don't need to go back to HBase for the next reads. It's like a cache for me.

So what we see is that for every entry, in order to enqueue, we do an increment and a write, and in order to dequeue, we do an increment and then a write, which is the consumer state. That's four writes for every entry. With that, my throughput is limited, because HBase can only do so many writes per second. When I do partitioning, I get concurrency, and concurrently HBase can do many more writes. But what really improves my throughput is prefetching: when I prefetch 100 or even 1,000 entries at a time, that's one I/O to get 1,000, and after that my state is in memory. If I crash, I fall back to the last state that was persisted, and everything gets undone. So operating in small batches helps a lot, and I can still preserve the ACID guarantees of my stream processing system.

What we see in our performance tests is roughly 10,000 enqueues per second per producer. With multiple producers we see more; there is a saturation point, and with a single queue we haven't seen more than 40,000 or 50,000 per second. But that's a pretty good number; there are few real-time processing systems that handle more than 50,000 events per second. For dequeue we're a little slower, about half of that, and that's because the dequeue consists of two steps: the first step is claiming the entry, then we do some processing, and then we come back and mark it as processed. It's a two-step process, so it translates into two RPC calls. We also have, and I don't want to go into this, a wish list of HBase improvements; if we could get those into HBase, our queues would become much faster.
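Going back to the consumer state for a moment, the claimed-entries idea plus prefetching could look roughly like this. The QueueStorage interface and all names are hypothetical, and in the real system the consumer state is of course persisted transactionally alongside the rest of the flowlet's writes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Sketch of the consumer side: a persisted "claimed entries" list doubles as a prefetch cache. */
public class ConsumerSketch {

  interface QueueStorage {
    /** Read up to batchSize committed entry ids after the read pointer. */
    List<Long> fetchBatch(long readPointer, int batchSize);
    /** Persist the consumer state (read pointer plus claimed entries). */
    void saveState(long readPointer, Deque<Long> claimed);
  }

  private final QueueStorage storage;
  private final int batchSize;
  private final Deque<Long> claimed = new ArrayDeque<>();
  private long readPointer = 0;

  ConsumerSketch(QueueStorage storage, int batchSize) {
    this.storage = storage;
    this.batchSize = batchSize;
  }

  /** Called inside a transaction: hand out the next claimed entry, prefetching a batch if empty. */
  Long dequeue() {
    if (claimed.isEmpty()) {
      List<Long> batch = storage.fetchBatch(readPointer, batchSize);  // one I/O for many entries
      claimed.addAll(batch);
      if (!batch.isEmpty()) {
        readPointer = batch.get(batch.size() - 1);
      }
      storage.saveState(readPointer, claimed);
    }
    return claimed.peek();            // stays claimed until the transaction commits; null if empty
  }

  /** On commit: drop the entry for good so the next dequeue moves on. */
  void onCommit() {
    claimed.poll();
    storage.saveState(readPointer, claimed);
  }

  /** On rollback: do nothing; the entry stays claimed and is handed out again. */
  void onRollback() { }
}
```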
I will have to skip the batch slides, but I can say one quick thing about batch. When you run batch jobs, they run for a very long time. MapReduce, in a way, already gives you ACID, because even though a job runs for a long time, it commits its output to the file system only if the whole job succeeds. If part of the job fails, that part gets retried, and whatever output the failed task produced gets discarded and replaced with whatever the retry produces. So Hadoop can do this, and we know the stream processing can do this. Where we're struggling a little is when you have a data set that is manipulated both by a real-time stream processor and by a batch job.

If you imposed a transaction on the batch job, you would now have a very, very long-running transaction. MapReduce jobs often run for 15 minutes, half an hour, sometimes hours, which means it is very likely the job will hit write conflicts, so the batch job would always be rolled back. That's kind of tricky. So we take a pragmatic approach. First of all, we don't recommend doing this at all; we recommend using two different tables and merging on the fly when you read, so that you segregate the writes and isolate them from each other. But if you have to do it, the idea is that the batch job runs over historic data, so it will not have the latest information, while a real-time job writing the same data probably has newer information. So instead of rolling back the batch job in case of a conflict, we simply let the short transaction, the one that came from the real-time system, prevail. Regardless of who commits first or in which order things happen, the short transaction prevails. This is just a pragmatic approach, and strictly speaking it violates ACID, but sometimes you have to take these kinds of shortcuts.

I don't know if we have time for questions; we started five minutes late. Well, thank you for your attention.