I'm a staff engineer at VMware, and in this presentation I'll talk about our experience implementing a configuration time-series store on FoundationDB and the things we learned in the process. But first, I want to thank the FoundationDB team and all the sponsors for organizing this event and having me here to give this talk.

To give a little bit of context, Network Insight is VMware's security and network management platform that helps large enterprises manage their data centers. It works by periodically collecting configuration state from entities inside the data center and then transforming it to create an entity-relationship graph of the data center. One can go back in time and look up a snapshot of this data-center entity graph at any arbitrary point in time, and then there are more use cases that work against this graph, like search or analytics. There are many different use cases that we support on this graph, and they access the data in very different ways. One thing to note is that the system is both hosted on the cloud as a managed service and given out as installs for customers to deploy in their own data centers. It's important to note that when it is installed within a customer's data center, we don't have any access to the system, so it has to deal with any failures that arise there and be able to recover from them on its own. It was previously modeled on Postgres as a set of tables with multiple indexes; we did not require foreign keys. We have thousands of such deployments, and each deployment is a cluster ranging from 1 node to 30 nodes so far, and we are growing the cluster size as we scale out more.

The layer that stores this configuration graph is called the object store, and that's what I'll be talking about today. It maintains a time series of state changes for every object in the data center, and there can be a large number of such objects: virtual machines, access rules, ports, routes, and network flows. Different use cases access this layer in different ways, some doing bulk scans keyed on specific attributes like object type and version start and end time, and others wanting fast point access to the data. As I was saying, it was modeled on Postgres using multiple tables and indexes. We did not require foreign-key constraints or joins between the tables, but we extensively used indexes, grouping, and ordering features.

We faced a lot of issues with this model in large cluster deployments. Firstly, there is no built-in HA or redundancy, so there is always a risk of losing data if the node goes down. Because the database is hosted on a single node, it becomes a bottleneck in terms of both I/O and disk size, as well as compute, in terms of the number of transactions we can run against it. And this load increases linearly with larger cluster sizes, while your database is still hosted on just one node. Periodic maintenance jobs like data deletion and vacuum worsen the problem by eating into already scarce resources.

So we explored different database replication options to resolve the problem of HA and redundancy. There are many tools that help with replicating the WAL or with replicating statements, in conjunction with orchestrator tools that detect failures and then promote a standby to master. We found these approaches to be very brittle and requiring a lot of manual supervision.
We ran into issues like the WAL filling up the disk because replication slots were not being consumed fast enough, diverged transaction timelines because of mistimed promotion of a standby to master, and the need for manual rollbacks, thereby incurring data loss. So we gave up on it after spending quite a bit of time and still not being able to make it robust enough for us.

We then evaluated some of the more popular natively distributed stores, and over time we found that they were not sufficiently mature yet. We hit a lot of edge cases around performance and stability. We worked with the community and the developers of these systems to discuss the issues we found, and we realized that these are non-trivial implementation problems that had not been encountered yet. With time, I'm sure they will get ironed out, but it gave us pause: if we were still finding such fundamental issues in the database or store layer, we could not risk deploying these systems in unmanaged environments. There's also a lot of confusion about what kind of guarantees these systems provide and how proven those guarantees are. The guarantees themselves come with a lot of caveats and knobs, which makes them difficult to reason about; you don't know what you're really getting until you are too far into using the system. Many of these stores have a lot of moving components and external dependencies that make them difficult to harden for unpredictable environments. Finally, almost all of these systems depend on the accuracy of system clocks, and there are serious consequences for correctness guarantees if that clock accuracy is not met. We cannot guarantee this accuracy because the environments are totally out of our control; NTP can go out of sync by a few hundred milliseconds or even more, and we can't control it.

Once FoundationDB was open-sourced, sometime last year, we had a discussion with the developers on the Wavefront team, who are in the same organization, specifically Clement, to get feedback on how the store would play out if we actually went ahead and tried implementing a layer on it. We took our time after that to understand the architecture and the transaction guarantees, to evaluate whether a non-trivial layer could be built on top of a key-value schema. We determined that to build our object store layer on a key-value store, we required three important traits: high-throughput ordered scans; low-latency point reads and writes, where by low latency I mean single-digit milliseconds (this is more of a short-term constraint, since our current layer users are used to getting this kind of latency from the Postgres system, and I'm sure with time we can bring in more concurrency and batching to offset some of the latency, but at least to begin with we could not regress too much on latency, otherwise it becomes a non-starter); and finally, multi-row transactions with strong serializable isolation to ensure the validity of the data at all times, because we cannot afford to corrupt the data by doing partial updates and leaving it in an inconsistent state.

FDB seemed to satisfy all of these. The transaction throughput it provides was already much higher than what we presently expect, so it looked good. Finally, we did some workload-specific performance tests and failure tests to ensure that the desired resiliency is met, specifically around automatic role migration, no data loss, and no manual intervention for recovery.
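To give a flavor of what that third requirement looks like in practice, here is a minimal sketch of a multi-row update with the FoundationDB Java binding. The key layout and values are purely illustrative, not our actual layer schema.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.tuple.Tuple;

public class MultiRowUpdate {
    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(610);
        try (Database db = fdb.open()) {
            // Illustrative keys: a "current state" row and a time-series history row.
            byte[] currentKey = Tuple.from("objstore", "current", "vm", "vm-42").pack();
            byte[] historyKey = Tuple.from("objstore", "history", "vm", "vm-42",
                                           System.currentTimeMillis()).pack();
            byte[] newState = Tuple.from("powered-on").pack();

            // db.run() wraps the lambda in FDB's retry loop. Everything inside commits
            // atomically with strict serializable isolation, or not at all, so the
            // current row and its history entry can never diverge.
            db.run(tr -> {
                byte[] previous = tr.get(currentKey).join();
                tr.set(currentKey, newState);
                tr.set(historyKey, previous == null ? new byte[0] : previous);
                return null;
            });
        }
    }
}
```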
These are important things for us because, as I mentioned, these systems run in dark environments, and it is very important that they stay healthy all the time, even in the face of failures.

The initial modeling work started with mapping our data access patterns and tables onto the key-value schema using the very flexible set of constructs that FDB provides. Some of these are key selectors, which can be used to efficiently locate keys at a certain offset from a reference key. Versionstamps are extremely useful: they can be used as pointers to connect rows, as a proxy for time to create ordered logs, to generate unique IDs, and so on. Conflict ranges and snapshot reads provide much finer control over conflict management. And there are a lot more features that I'll go into in more detail in my next talk later today, but this gives a flavor.

We had to address some challenges around change-log-induced hotspots. The change log is a feature in our store where every change needs to generate an event or a log entry that someone else can listen to, but this creates hotspots in the system. The second problem is that FDB, due to its transaction limits, cannot handle large records. Again, these are things we addressed, and I will go into more detail in my next talk today. We also had to create CLI tools for layer users to access and manipulate the records. This was not a problem with Postgres, because the table schema carried enough context and people could use the existing psql CLI to access the data; here we had to build it ourselves, which is something we only realized later on, and it took us some time.

Early on, we realized that there are a lot of important functions that were being repeated throughout the layer and should be centralized. The first of them was global key-value space management. Unlike an RDBMS, which gives you the concept of tables, where each new layer can start with new tables and work its way through, we had to carefully plan the global key-value space in order to support things like multi-tenancy or future extensions that introduce more layers on a common FDB cluster. We also route all transactions through some common shims that handle important aspects, the most important being health checks, backoff, and throttling. We have a lot of background activity that takes care of the store, and we constantly monitor the health of the FDB cluster and back off these periodic activities so as not to saturate the cluster too much. We have found that FDB runs pretty well if you're not saturating it; if you saturate it, things start to back up one after the other, and that gets difficult to come out of.

Secondly, we do our own retries in case of transaction errors. For instance, we use resumable transactions to support long scans, where the transaction repeatedly fails and restarts from the last point, in five-second batches, until it makes it all the way to the end of the key range we wanted to scan. Here the errors are by design, and we don't want to put additional sleep between transaction retries, so we don't add any for a transaction-too-old error, which we simply recover from. But if we get something more serious, like a future-version error, which indicates that the storage servers are not able to keep up with the load, then we put extra delay between the retries to give FDB time to recover.
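As a rough sketch of that retry pattern, here is what a resumable long scan can look like with the Java binding. The error codes are the standard transaction_too_old (1007) and future_version (1009) codes, but the structure and the one-second backoff are illustrative rather than our exact layer code.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDBException;
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.Range;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import java.util.function.Consumer;

public class ResumableScan {
    private static final int TRANSACTION_TOO_OLD = 1007;
    private static final int FUTURE_VERSION = 1009;

    // Scans an entire subspace, restarting a fresh transaction from the last key
    // seen whenever the ~5 second transaction limit is hit.
    public static void scanAll(Database db, Subspace subspace, Consumer<KeyValue> consumer)
            throws InterruptedException {
        Range full = subspace.range();
        byte[] lastKey = null;
        boolean done = false;
        while (!done) {
            try (Transaction tr = db.createTransaction()) {
                KeySelector begin = (lastKey == null)
                        ? KeySelector.firstGreaterOrEqual(full.begin)
                        : KeySelector.firstGreaterThan(lastKey);   // resume just past the last key
                try {
                    for (KeyValue kv : tr.getRange(begin, KeySelector.firstGreaterOrEqual(full.end))) {
                        consumer.accept(kv);
                        lastKey = kv.getKey();
                    }
                    done = true;                                   // reached the end of the range
                } catch (RuntimeException e) {
                    FDBException fdb = unwrap(e);
                    if (fdb != null && fdb.getCode() == TRANSACTION_TOO_OLD) {
                        // Expected by design on long scans: just resume with a new transaction.
                    } else if (fdb != null && fdb.getCode() == FUTURE_VERSION) {
                        // Storage servers are falling behind: back off before retrying.
                        Thread.sleep(1000);
                    } else {
                        throw e;
                    }
                }
            }
        }
    }

    private static FDBException unwrap(Throwable t) {
        while (t != null) {
            if (t instanceof FDBException) {
                return (FDBException) t;
            }
            t = t.getCause();
        }
        return null;
    }
}
```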
There's another interesting aspect: throttling, or limiting, the number of outstanding transactions we have on the client side. We noticed that if we don't control it, we can accidentally put millions of outstanding transactions into the FDB client library that runs alongside the layer code base, and that puts a lot of memory pressure on it. It allocates a lot of memory to hold those outstanding requests and to get the data back from the FDB servers before delivering it to the binding. This causes memory utilization in the FDB client library to go up, and it never returns the memory back to us. So we have to be very careful about how many outstanding transactions we keep in flight so that the amount of memory used stays manageable. I think there is a pending request on GitHub for the client library to periodically release memory, but that is where things stand right now.

This layer also provides some other functions, like version caching for reducing transaction latency; again, I'll talk more about that in the next talk. We took some inspiration from FoundationDB's testing, and this layer injects random failures into transactions so that the layer code can be sure it is handling the errors correctly. I think this is a feature already coming in 6.2, where the client library itself can introduce these errors, but we took it on early and have it in our code. Finally, we emit detailed metrics to capture transaction failure rates, latencies, counts, and sizes, and I'll go over that in a minute; it is extremely useful to us.

This is one of the screenshots from the Wavefront dashboards that we have. Here you can see that we are capturing a lot of statistics about individual transactions: how many of them are we doing, what is the total time they have taken, what is their size, and if we are seeing any errors, what kind of errors are being noticed on each transaction. This gives us a lot of information about what to prioritize; if we are seeing a lot of conflicts on particular kinds of transactions, we can go look into them and optimize them. This is, again, a sample dashboard that we have for FDB cluster monitoring itself. Wavefront gives us pre-built plugins with which you can integrate your FDB cluster, and it puts a lot of interesting things on screen. The ones highlighted here are the TLog and storage server queue sizes, which are probably a few of the most important things that we monitor. We also wrote some command-line tools for parsing the JSON status output that we get from FDB, and we put them in a watch utility, so it refreshes every second or so and gives you real-time information about what's going on. Together, using this and the Wavefront dashboards, we get both kinds of information: long-term patterns to correlate problems, and also the short-term, live state of the cluster as we are doing things to it.

As a quick example, I'll take one of the problems that we are still working through. The storage server is on the left side, and on the right side is one of our operations from the layer, which I'll talk about. We noticed that our storage servers periodically start to stall, where they are not able to retire the data that is coming into the queue, and thereby the system throttles and backs off.
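Going back to the client-side throttling for a moment, one way to cap in-flight transactions is a simple semaphore around the database handle. This is only a sketch of the general idea, not our actual shim, and the permit count is a made-up number that would need tuning per deployment.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.Transaction;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Caps the number of transactions handed to the native FDB client at any time,
// so the client library cannot accumulate an unbounded backlog of outstanding
// requests (and the memory that goes with them).
public class ThrottledDatabase {
    private final Database db;
    private final Semaphore inFlight;

    public ThrottledDatabase(Database db, int maxOutstanding) {
        this.db = db;
        this.inFlight = new Semaphore(maxOutstanding);
    }

    public <T> CompletableFuture<T> runAsync(
            Function<? super Transaction, ? extends CompletableFuture<T>> body)
            throws InterruptedException {
        inFlight.acquire();                       // block new work once the cap is reached
        return db.runAsync(body)
                 .whenComplete((result, err) -> inFlight.release());  // free the slot either way
    }
}
```

For instance, wrapping the handle with `new ThrottledDatabase(db, 1000)` would keep at most a thousand transactions outstanding against the native client at once.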
Coming back to that stall: we spent some time looking through these dashboards and noticed that one of our periodic operations, deleting older data, directly coincides with the growth in the storage server queues. We then tried to think through what could be causing it, and there were a lot of possibilities. It could be that the deletion is causing extra mutation overhead and just tipping over the amount of throughput that an FDB cluster of that size can handle. It could be that the I/O is stalling somewhere, causing the storage server to stall. It could be some problem in the FoundationDB delete path itself, or, lastly, some problem in the layer. We rejected one hypothesis after the other and were left with the conclusion that there had to be something wrong with the layer, because the other issues would have been noticed by a lot more people a lot quicker. We realized that the only thing we do differently in the delete path is that we pack a lot of individual row mutations into a single transaction. They are not large range clears; they are single-key clears, about 500 to 1,000 of them within a single transaction. We experimented with making that smaller, to about 10, and you can see that the green line is where the change was deployed: suddenly the queue sizes stabilized quite a bit. We still don't know what within FDB is causing this; I'm still in discussions with the FDB team on it, and I'll get back to it once I have a little more time. But it shows that there is certain non-obvious and ambiguous troubleshooting to be done when you are dealing with FDB clusters, and it takes time to work through it, but having these integrations at least gives you a good starting point from which to make informed branches in the debugging.

Finally, just to recap what worked for us. Efficient multi-row transactions with strong consistency are a very, very powerful guarantee; this is sufficient to solve a large class of problems that were traditionally solved using an RDBMS. FDB gives us a very rich set of constructs and low-level data placement controls, with which you can develop a lot of different kinds of layers. It is easy to operate, doesn't have any external dependencies, is very, very resilient, and has no dependency on time, which is quite a bonus. And it has a very strong community that is happy to help at any time.

What we found challenging: we spent a lot of time on modeling and fine-tuning the storage layer, so for more conventional tables I would still recommend going with the Record Layer or something that is maintained by a strong team, so that you don't have to repeat the whole exercise again. FDB gives us detailed status of the cluster in the form of JSON, but it took a lot of time to figure out what to monitor; we learned it through community interactions, and maybe we can start contributing back what we learned, but I think this is going to be a big hurdle for anyone onboarding onto FDB. Transaction limits are kind of a mixed bag: they bound your worst-case behavior, so things will never go very bad, but layer development on top of these limits becomes slightly more tricky. We found that FDB has slightly higher transaction latencies, about 25 to 30% higher than the Postgres store we were running, under whatever workload conditions we had, and it is more sensitive to slow I/O compared to, again, the Postgres we were using.
All filtering and transformations are done at the layer level, and if some of them could be pushed down into the key-value store itself, that would be a big help, especially some sort of user-defined functions, like HBase coprocessor kinds of things; that would open up a lot more things we could do. And finally, we realized as we hardened the system more and more that it is difficult to control and limit the memory used by the FDB client that runs alongside your layer. I mentioned how we do it, but you have to be careful; otherwise, if you're running a binding like Java, you really have no control over how much memory the native client has taken, even an -Xmx kind of limit won't bring it down, and eventually the Linux OOM killer will come and kill your process. So, yeah, we need to be careful with it. And that's all I have. Thank you.