The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real. Find out how best to keep it real at stevenmoyfoundation.org. Hi guys, thanks for coming. It's another vaccination database seminar talk. Today we're excited to have Brian Platz. He's the CEO and co-founder of Fluree, a blockchain graph database, and he's going to talk today about its internal architecture. We appreciate Brian being here. His wife just had a baby a few weeks ago, so he's actually outside giving this talk to make sure he has a quiet area to speak from. We appreciate him spending time with us instead of helping with his newborn, but he told us he'd rather be doing this, so that's why he's here. As always, if you have any questions for Brian as he's giving the talk, please unmute yourself, say who you are, and ask your question, and feel free to do this at any time. That way Brian doesn't feel like he's just talking to Zoom by himself for an hour. And with that, Brian, the floor is yours. Thank you so much for being here. Yeah, well, thanks, Andy. I really appreciate being invited to come. This is exciting. And yeah, I bundled up a little bit; it's a little chilly here in North Carolina, but it's the only quiet place in the house right now. So thanks, and by all means interrupt me. I just have my laptop out here, so it's full screen on the presentation and I can't really get into the chat messages. I'll of course get to those later, or Andy can interrupt me if something comes up. But yeah, just interrupt me, speak up. I can talk and talk and talk, especially about databases. So, why we're here today: there's just so much that we try to accomplish with Fluree, and a lot of different things we could talk about.
So we decided that, in order to fit this into one presentation, we would focus on one specific problem, which is how we scale, and some of the unique things we do to scale our query capability to the edge. Fluree has a pretty unique architecture that enables this to happen, but fundamentally, all the components that make up what you would think of as a typical database need to be structured in a slightly different way to make this happen at the edge. So I'll start out at a high level: what is Fluree, what do we do, what makes us unique. I'll be very brief with that; of course we have a website, and feel free to pipe up and ask questions. We'll talk very briefly about the architecture, because that high-level conceptual architecture then informs the edge-scaling capability. Then we have to talk a little bit about indexes, because indexes, and how we do indexing, end up being very important to the scaling component, which we'll get to; hopefully that will be at least half of the talk, but that's basically the destination we're headed to. So, what is Fluree? I'll start out with the high-level philosophy of the problem we're trying to solve. We are incredible believers in this concept of data centricity. And we think that is almost the opposite of how we've been doing things for decades in computing, which we consider to be application-centric. Most companies, most businesses, whatever the case may be, are asked to develop an app, and developers go and develop an app. They develop some front-end UI, usually web-based nowadays; they develop a bunch of code in an app server. And then in the course of that they say, okay, well, we have state, right? We have data we need to store somewhere. And then they try to pick a database that meets whatever needs they have, and of course there are a lot of options out there.
And it's really the same developers that are developing the app who end up designing the database. And they design it with that in mind: as a place to store state for the app. When we talk about data centricity, we're really talking about thinking about data first, not the app first, and making sure that data is strategically described and highly reusable across potentially many applications. And we think it's going to be very important for organizations to make this kind of fundamental shift in how they think about data over the coming years, especially as we move further into this data-driven economy. For them to survive, they're going to have to be better at understanding and leveraging the data they have. Most organizations have a lot of data; it's not the quantity of data that's the problem, it's really the quality, and how they're able to leverage it outside of that application-centric viewpoint. So when we move to this model, a few things have to change in how we think about databases. First, the interaction model, the first line in the grid here. From an app-centric standpoint, the interaction model is one-to-one: there's only one thing writing and only one thing reading from the database. Now, that doesn't mean you couldn't have thrown a data lake or a data warehouse on the back end to scrape data out of it, but the database was really designed to have only one thing talk to it, which is the app server sitting in front of it. In a data-centric approach, it's many-to-many: there are going to be many writers that could come from multiple apps and, excitingly, they could even come from many different organizations, and I think that's where the future ends up moving. And of course, there are many readers as well.
So the database has to exist in a model where it's just a given that there are going to be lots of writers and lots of readers of the data, not the one-to-one model that most databases today focus on. That brings us to the next line, which is how we secure the data. Well, if there's only one thing reading and one thing writing, we might as well write the security into our app server, so it sits in custom code, and that's fine, because that's the only thing reading and writing. But in a data-centric model, the security needs to be co-resident with the data itself, and we often say that the data has to defend itself. It has to understand how, and by whom, it can be written and read, and that can't live in an app server, because there is no app server. It needs to be co-resident with the data. And then capacity, and this is really where we're going to drive in quite a bit today. While there are still a lot of challenges in scaling your database in an app-centric model, at the same time you have a lot of control over it. If you have queries that are taking a long time, you can rewrite the queries, you can optimize for them, you can create views; you can do a lot of different things to try to make sure your data layer is scaling with your app. Because, again, you're the only thing talking to the data layer, so there's a lot of control. So capacity can be controlled to a degree. But in a data-centric architecture, again, many apps are reading, and many apps or even organizations are writing. You all of a sudden need to think about scale a lot differently, because you have a lot less control over how those queries are going to come in or who's running them. And you need to be able to perform and have horizontal, sort of linear, scaling capability, particularly on the query side. So Fluree is designed as a solution: a data platform for this data-centric world.
And when we think about the components that have to go into a data platform, or database if you will, to live in the world we just talked about, we think of these layers that have to be there. At the core is trust. Especially if data is getting natively shared across apps or even across organizations, how do they validate it? I often draw the comparison to the little green lock, or whatever color it is, in your web browser, the one that shows up in the address bar and shows cryptographically that you're on microsoft.com or on Carnegie Mellon's website. That little green lock tells you there's no way anyone could have manipulated that data. Well, when we have data traveling across the web or across organizations, it also needs that green lock. So trust and integrity around every piece of data becomes important; we think every piece of data should be traceable and provable that it hasn't been manipulated. And we should be able to give that little green lock to every piece of data that exists. Fluree is a semantic graph database, sometimes called a knowledge graph, sometimes called a triple store, sometimes called a named graph; these are all different names for the same thing. Semantics is how we describe data in globally unique terms, and it allows us to automatically integrate data together. So we think semantics is key, especially if data is getting shared in these multiple contexts. Again, if the only thing talking to it is your own app in an application-centric world, then semantics don't really matter, because you control everything. But in a world where you don't have that control, semantics become critical. Then security: who can write, who can read. In Fluree we have a capability we call smart functions. It is an entire application programming environment and a virtual machine that can run co-resident with the data itself. And we won't be talking a whole lot about that today.
But that's a core capability of Fluree, and then time. Quick question: is that a full VM, or is that a container? What is the actual concept you're using to run it? I'd say it's a sandboxed container, so it doesn't actually spin up a separate physical VM, but it parses and validates all the code and compiles it into reusable bytecode that it can execute quickly. Like eBPF? Well, I'm not sure what eBPF is, but it's a DSL, it's not arbitrary code; you compile it, verify it, and then instead of running in the kernel, like in Linux, you're running it inside a container that you control. Yeah. So then, of course, there's time. We think time is just critical, especially when you're sharing data. Typical databases were designed, probably for historical reasons, to destroy state, to destroy prior pieces of information. I mean, that data might exist in a log, but the database only knows about one moment in time, which is the current time. Well, when you're sharing data and trying to collaborate around data, especially across parties, how can you collaborate around something unless you have a consistent view of it? And you can never have a consistent view of information if it's constantly changing underneath your feet. How do you know what data Andy is seeing and what data I'm seeing unless we can coordinate and lock in time? Then, all of a sudden, machines know they have a consistent view. So sometimes we call this "GitHub for data," but it's inherent in the system; it's what we call bitemporal. You have the ability to issue and execute queries at any historical moment in time. And this is incredibly fast and efficient: if you issue a query for the data as of a year ago, it will respond in the same time as if you issued the same query for the current moment in time. Time in this environment becomes critical.
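The time-travel idea described here can be sketched in a few lines. This is a toy model, not Fluree's actual implementation: I'm assuming an immutable log of "flakes" shaped as (subject, predicate, object, transaction time, asserted-or-retracted), and a query that reconstructs the facts visible as of some transaction time.

```python
# Toy sketch of bitemporal time travel: facts are never overwritten;
# assertions and retractions accumulate in an immutable log, and an
# "as of" query replays the log up to the requested transaction time.

def as_of(flakes, t):
    """Return the set of (s, p, o) facts visible at transaction time t."""
    facts = set()
    for s, p, o, tx, asserted in sorted(flakes, key=lambda f: f[3]):
        if tx > t:
            break                      # ignore everything after time t
        if asserted:
            facts.add((s, p, o))
        else:
            facts.discard((s, p, o))   # a retraction removes the fact
    return facts

log = [
    ("alice", "likes", "picasso", 1, True),
    ("alice", "likes", "picasso", 2, False),  # retracted at t=2
    ("alice", "likes", "monet",   2, True),
]

assert as_of(log, 1) == {("alice", "likes", "picasso")}
assert as_of(log, 2) == {("alice", "likes", "monet")}
```

A real system would index the log rather than replay it, which is why, as mentioned above, a query as of a year ago can be just as fast as a query for the current moment.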
And then of course sharing, and this is where we start to get away from having to develop custom APIs, for example, to share data across organizations. Organizations sitting on the foundation we just talked about can now issue queries, because security is going to limit what they can and can't see at the data level. Time is going to allow them to get consistent results for queries, and now they're more in control of the data they're asking for and the shape of it, instead of us having to build a thousand APIs for a thousand different views of the same data. And then, at a high-level conceptual view, what does Fluree actually look like? This is where we'll start diving in and talk about how we scale this capability to the edge. There are really two main pieces to Fluree. There's what we call the ledger backplane, which you're going to see on the right, and then the query servers, which are right next to it. Now, in most databases you issue queries (reads) and updates (writes) to the same physical machine, or the same service. In Fluree we actually separate the two, and this becomes important to the scaling story. We don't think there's necessarily a need to have these as the same service, and in that combined model you're always battling whichever is your weak link. The ledgers are only responsible for the writes in Fluree, and the query servers are only responsible for the reads; they're completely separated services. Now, of course, they communicate and talk. If you, or your app, are connected to a query server and want to update data, the query server will just take that and forward it on to the ledger server for you. So it's not like you have to physically connect to a bunch of different services, but they actually operate independently. The ledger is the source of truth; the query servers are where all the reads are happening.
Every interaction, whether with a query server or a ledger server, happens through these layers we identify in the middle here. First, identity: you have to prove who you are to interact with the system. It is a zero-trust database; if you can't properly identify yourself so that it can determine what you have the rights to do, it won't even talk to you. And it starts to open up crazy-sounding but, I think, exciting ideas: why not have your database on the public Internet, if it can protect itself? Then it doesn't need to sit behind an app, behind a firewall, behind another firewall; it can actually exist out there, where anyone with the proper permission can access it. Smart functions are a way we can write code to programmatically enforce these permissions, and then there's provenance, this idea of tracing data changes through time. And all of this ends up getting represented in a pretty simple way, at least I think a simple way, to the consumers of that data: as a knowledge graph. It's a semantic graph database. We happen to support SQL as well, and GraphQL, but SPARQL and our own query language, FlureeQL, are the primary ways you would get the most power out of the system. Because it is a graph database, if you're querying it with a sort of rectangular query language like SQL, you're going to get rectangular results and, you know, some rectangular nuances. So it sits there as a knowledge graph, and that knowledge graph can consume ontologies as well, so it can do inferencing if that's of interest: things like schema.org. We do a good amount of government work; there's a bigger ontology that the Department of Defense and other groups are focused on. There are lots of ontologies out there describing lots of domains, and Fluree can natively support those. And the comment you made about how you can have your ledger backplane with the data just sitting out in the public: I mean, obviously it has to be encrypted.
So, is the data encrypted at rest, and do the query servers have the decryption keys? How do you protect those? How does that work? Yeah, the service can sit there on the Internet; that doesn't necessarily mean the data has to be encrypted. It means that you can issue SQL queries, for example, just using a relational mindset; you can issue SQL queries to the database, and the database has the ability to read and write the data directly. You don't necessarily have the ability to get direct access to the file system; you have the ability to issue SQL queries, and it is going to protect the data based on your identity, using public-private key cryptography. But there are other ways we can allow this to happen as well. There's another question; do you want to unmute yourself? Sorry. So, I think there is some overlap with what the Cambridge Semantics guys are doing, with the use of a triple store but also large amounts of data and OLAP. So what's your view of it? I'm sorry, I missed the first part of your question; I think you mentioned a specific vendor, a company called Cambridge Semantics. Okay, so what's your view of it? Yeah, well, there are a number of semantic graph database vendors: Cambridge Semantics, there's Stardog, there's of course the open-source Apache Jena Fuseki. So there's a good collection of vendors in that space that I would all consider competitors to each other, certainly, and somewhat competitors to Fluree. They are all basically triple stores. They don't incorporate any of these components we talked about here on this previous slide, so they don't include things like time travel, data defending itself, or trust and provenance around data; these are unique capabilities that Fluree brings into that space. But yeah, certainly from a pure semantic graph database or knowledge graph database standpoint, they would be considered competitors.
Yeah, okay, although they are OLAP databases, they have to deal with high-volume inserts and loads and all of that stuff. Yeah, sure. Well, and that's part of what we're talking about here: how do we scale this, and scale it in a horizontal way? Thank you. Okay, so now we're going to work towards how this scales, and when we talk about scaling, we also talk about this idea of scaling to the edge. There are many things we think a database should be, one of which is that it should be able to act sort of like a content delivery network, but a real-time one, where the database can exist anywhere in the world, and instances can be spun up and spun down on a whim. If I have a big analytical process I need to run, I can spin up a database, run my queries against it, and spin it down after 15 minutes if I don't need it anymore, and I have quite literally zero impact on any other service running against that database. So we'll work our way up to that. We're going to introduce this idea of separating the system of record, with the ledger servers, from these query servers. We'll talk a little bit about how you might physically deploy a Fluree service. There are four different ways you can deploy it, and they're listed here on the left. The first is just on a single server. Fluree is open source, by the way, and we build pre-built JAR files, so it runs on the JVM; you can just download them from our website, or you can go to GitHub, get the source, and build it yourself. When you're developing with Fluree, for example, it's very common to just run it on your local laptop. You could run it like this in production as well if you wanted to, but to run it on your laptop you're really just running one single server service. So this is the first way you can deploy Fluree.
So it's going to run on the JVM; you can containerize it, you can run it on our cloud service, or you can just run it on the JVM locally. And you have multiple choices for where the data for that ledger server is actually stored: things like object storage in the cloud, like S3, are an option. You can just run in memory, and that's a fine thing to do; obviously it's not going to persist state across downtime, but perhaps for testing or other reasons you might just run it in memory. Or you might use local disk and run it on file storage. And that single ledger server can have all these different things connect to it and issue queries, either over an HTTP API, or, and we actually have this ability, by embedding the entire database as a library inside of your app. So we really like this idea, I really like this idea, of the database as a variable, especially when your databases become immutable. We stop talking about a database as this thing that's mutating and changing with every update; we start talking about a database as a single immutable value as of a moment in time, and the next time there's an update, there's a new database. So you can get to this idea that a database ends up being a variable in your code, and you can pass it around between functions and do whatever you want with it. And just like any data structure sitting in your code, it's never going to change unless you mutate it or do something, but then you have basically a new version of that data. We even have a version of Fluree's query servers that runs in JavaScript; in fact, you can run the entire database engine, if you want, inside a web browser. And we do some cool things there, like with React, to make that all work in real time.
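The "database as a variable" idea can be sketched very simply. This is an illustrative toy, not Fluree's API: a frozenset of triples stands in for a persistent, immutable database value, and a transaction returns a new value instead of mutating the old one.

```python
# Sketch of immutable database values: "updating" produces a new value;
# any function holding the old value still sees exactly what it saw before.

def transact(db, assertions=(), retractions=()):
    """Return a NEW database value; the input value is untouched."""
    return (db - frozenset(retractions)) | frozenset(assertions)

db_v1 = frozenset({("bob", "is_a", "person")})
db_v2 = transact(db_v1, assertions=[("jane", "is_a", "person")])

# db_v1 is still exactly what it was -- safe to pass around like any
# other value in your code, or to query as of "its" moment in time.
assert ("jane", "is_a", "person") not in db_v1
assert ("jane", "is_a", "person") in db_v2
assert ("bob", "is_a", "person") in db_v1 and ("bob", "is_a", "person") in db_v2
```

This is why a query server, or even an embedded in-browser engine, can hold a database value without coordinating with writers: the value it holds simply never changes.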
So, again, this all gets into this idea of separating the ledger, or the source of truth, from the query services, which can exist in different ways: as a library, or, if you prefer, you can still use traditional ways of sending a request across the wire, having something else execute the query, and then getting back your results. For us, we typically do that with an HTTP API request. So that's Fluree as a single server. Now, this is where we start to scale, because the query servers are designed to exist as an independent service. So instead of all these things talking to your ledger server, which is really just responsible for doing the writes, we've now offloaded all the read capability. Of course, the ledger server is going to take every single transaction, process it, validate it, and then broadcast the results out to the query servers, and in fact you can have as many query servers as you want. Like I said, you can spin them up for 15 minutes and shut them down. The query servers are in-memory database servers. So, obviously, the longer they're running and the more memory they have, the more data they're going to have resident to answer queries on the fly. As soon as one starts up, it's not going to have any data in memory, and it's going to have to pull a lot of that data in to start answering queries. But once it has the data in memory that it needs for a particular query, it has the ability to run at in-memory speeds. And this is the part we'll get into a little bit more. The ledger servers themselves can also scale; in fact, they can even be run decentralized if you like, so multiple organizations can run ledger servers, and they can have voting and consensus around whether transactions are valid and obey the rules. The rules exist as data, alongside the data as well.
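The cold-start-then-warm behavior of a query server can be sketched as a lazy cache over the ledger's storage. The segment names and the fetch function here are illustrative assumptions, not Fluree's actual interfaces:

```python
# Sketch of a query server that lazily pulls index segments from the
# ledger's storage (e.g. an HTTP or S3 read) and keeps them in memory.
# Cold reads pay the fetch cost; warm reads run at in-memory speed.

class QueryServer:
    def __init__(self, fetch_segment):
        self.fetch_segment = fetch_segment  # how to reach ledger storage
        self.cache = {}                     # in-memory resident segments

    def segment(self, name):
        if name not in self.cache:          # cold: pull from the ledger
            self.cache[name] = self.fetch_segment(name)
        return self.cache[name]             # warm: no network round trip

ledger_storage = {"spot:0": [("bob", "is_a", "person")]}
qs = QueryServer(ledger_storage.__getitem__)

assert qs.segment("spot:0") == [("bob", "is_a", "person")]
assert "spot:0" in qs.cache  # subsequent reads are served from memory
```

Because each query server only caches what its queries touch, you can spin one up next to an analytical job, let it warm, and throw it away with no impact on any other server.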
So the writes can scale with the ledger group, and being a semantic graph database, one of the nice things is that you can connect graphs together. So you get away from this idea that a query has to talk to one database, and you can start to have a query that actually does joins across multiple databases. And this is a really great way of scaling, because if you don't need ACID semantics around a transaction, there's not necessarily a reason that data has to exist in the same database, because you can query across databases and do joins very easily. And of course you can get to the point where these different nodes can connect to other nodes and other ledgers, and then you're running more in the decentralized world. So I'll pause there, and then we'll talk about how we scale up these query servers, starting with our data structures and indexes, but let's see if we have any questions that have popped up. So I have a question, and I think someone in the chat is asking something similar about the main goal versus the overlap; they asked about whether you support transactions. I guess my question is: what's the ideal use case or application scenario for something like Fluree? Why does somebody want this sort of completely decentralized architecture? Are you targeting front-end applications, things that are powering websites or phone apps, or is it back-end analytics? Yeah, good question. A lot of the semantic graph world very much focuses on analytics; they do not operate as a system of record. Fluree does, though it can be used for analytics too; you can still create these knowledge graphs, you can still put in ontologies, for example, to do that. But it really wants to be a system of record, because it has time travel, because it has this provenance capability.
So it really wants to be the system of record where, as I said at the start, ideally multiple apps and multiple contexts can leverage the data, because it's semantically described, it's interoperably described, and it can protect itself. It doesn't just have to talk to a single app that controls everything about that data. But usually people are building an app, right? They're building some sort of product, and they're saying: I want my data to have integrity. I want it to be able to be leveraged in multiple contexts. I want to be able to share it securely without having to build an API for every table in my basically traditional database. I want to be able to secure it so people can ask whatever question they want, and if they have permission to see the data, they can see it. So we focus on what we call more data-centric, what we call network, use cases. It usually starts out with a single app that someone's building, except they have broader visions of who's going to leverage that data and how it's going to be leveraged across multiple apps and multiple orgs. Okay, any questions before we keep going? Okay, so this is somewhat technical, but I think that was the intention of the talk. We'll talk a little bit about data structures and indexing, because that's obviously very critical when we think about how we scale out in-memory databases on the fly anywhere in the world. So let's take a very simple set of data here. I don't know how familiar the group is; I guess I should ask the group, and maybe Andy: have you spent much time with graph databases, or is it more just relational databases? What's the familiarity? Well, on the research side we're focused on relational databases, but there are a bunch of people here that aren't at Carnegie Mellon: some former students, some people from IBM Research. So it's everyone, but everyone knows databases, and I think we're somewhat familiar with graph databases. Okay.
So, Fluree at its core is a graph database. It's sometimes easiest, when thinking about graph data, to start by looking at it in the form everyone's used to seeing data in, which is a spreadsheet, or rectangular form, which is the domain of traditional SQL databases. So this is some simple data that we'll be dealing with. Obviously we have rows on the left with different people, then we have columns, and then we have values sitting in the cells. The difference, what makes this data end up being a graph, is that everything you see in the little angle brackets is actually a pointer to another row. That row doesn't technically have to be in the same database; it can be in a different database. So in the "is interested in" column on the right, we have pointers to some different, in this case, artists. Those artists could exist somewhere else; they might exist in Wikidata, which happens to be a semantic graph database. And in the other case we have the "is friend of" column, in the middle there, and these all point to other rows that we see on the left. So a graph just allows any node to point to any other node; we don't have the level of indirection a relational database would have, where you have to go through a table to get to another piece of data. What we can do is represent all of this data as RDF, as what we call triples, and they're called triples because there are three items. All the data we had in the spreadsheet on the left is identical to all the data on the right; we've just turned it into our triple format. The first element of the triple, which we call the subject, is the row. So you see all the Alice data, then the Bob data, then the Jane data, from our Alice, Bob, and Jane rows. The second part of the triple is the column. So this is where we have "is a," "is friend of," "is born on."
And then the third part of the triple is the value you actually see in the cell on the left. So this is the graph; this is a triple store; this is a knowledge graph; this is how the data ends up being represented. A beautiful thing about a triple is that it's a very generic format for representing data. Any conceivable piece of data can be represented as a triple. And it's very flexible: you can take triples and, if they're in the right format, push them into a relational database, push them into a document database, or they can live in a knowledge graph database as well. Fluree technically extends the triple, and I'm not going to get into all of that in the presentation, although I'm happy to take questions about it. We extend the triples to incorporate time, this ledger concept. As we're incorporating time, and linking to transactions and the provenance of data, we also extend the triples to represent whether we're asserting data or retracting data. If you think of a typical database, there is no such thing as retracted data, because anything that's retracted or deleted is just gone, right? Here there is the concept of a piece of data that used to be there and that's now gone. So Fluree extends the triples to incorporate this information, and we also technically extend the triples by one additional component. So we have six components in our triples; we call them flakes. That last component is a set of metadata about those triples. This can incorporate things like language tags, enabling multilingual databases; it can enable things like expiration dates, like Cassandra. One of the features I always loved about Cassandra is that you can expire data, so the metadata can hold expirations. It could also include property-related data, sometimes called RDF*; this is basically what Neo4j has, which is a graph database but a completely different class of graph database, called a property graph. So we can incorporate property components in here as well.
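The row-to-triple mapping described above is mechanical: every cell in the spreadsheet becomes one (subject, predicate, object) triple. A minimal sketch, with made-up column and value names:

```python
# Sketch: the rectangular form and the triple form carry identical
# information. Each (row, column, cell) becomes one subject-predicate-
# object triple.

rows = {
    "alice": {"is_a": "person", "is_friend_of": "bob", "likes": "picasso"},
    "bob":   {"is_a": "person", "is_friend_of": "jane"},
}

triples = [(subj, pred, obj)
           for subj, cols in rows.items()
           for pred, obj in cols.items()]

assert ("alice", "is_friend_of", "bob") in triples
assert len(triples) == 5  # one triple per filled-in cell
```

A flake, as described above, would carry extra components on each of these tuples (transaction time, assert/retract flag, metadata), but the three-part core stays the same.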
But we're just going to focus on our simple three-tuple triple structure here. Now we need to think about queries, because that's what databases do. So how do we structure indexes so that we can be very efficient with queries? In this case, we just sorted everything by the row, then by the column, then by the value. So, on the left we have a SPARQL query. One of the cool things about SPARQL, I think, is that the query format also consists primarily of triples, and it's really a pattern match against the data on the right, which I think makes it very easy. People maybe aren't familiar with SPARQL and querying graphs, but I think the fact that your query format is identical to the data format can help a lot. Here we're trying to find all the people that Bob has said he's a friend of. In the WHERE clause, we see the first part in yellow here, Bob, is friend of, and then we just do a variable binding, and we want to see everyone Bob said he's a friend of. And you see in our SELECT statement, we're just pulling out that friend variable. And you can see from the data on the right that this would be very efficient, if this is an index, for finding this data. We find Bob, then we find the column, and it's exactly how our data is sorted; our data is really just a sorted set, and we're just scanning across those columns. So we don't really have to do anything except keep the data on the right sorted in this exact format to be extremely efficient with a query structured like this. This is the SPO index, and in fact in Fluree we call it the SPOT index, because we incorporate time in the index as well. But again, I'm just focusing on the triples. SPO stands for the sort order: subject, which is the row, or the first element of the tuple; predicate, or property, which is the column name, the second part of the tuple; and then of course the object, which is what they call the value, over on the right.
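The "sorted set" behavior just described can be sketched directly: a binary search jumps to the (Bob, is friend of) prefix, and a short walk collects the contiguous matches. The data values are illustrative:

```python
# Sketch of the SPO index: triples kept sorted by (subject, predicate,
# object), so "who did Bob say he's a friend of?" is a binary search
# to the prefix plus a scan of adjacent entries.
import bisect

spo = sorted([
    ("alice", "is_a", "person"),
    ("bob", "is_a", "person"),
    ("bob", "is_friend_of", "alice"),
    ("bob", "is_friend_of", "jane"),
    ("jane", "is_a", "person"),
])

def scan(index, *prefix):
    """Yield every entry in the sorted index matching the given prefix."""
    i = bisect.bisect_left(index, prefix)   # jump straight to the prefix
    while i < len(index) and index[i][:len(prefix)] == prefix:
        yield index[i]
        i += 1

friends = [o for s, p, o in scan(spo, "bob", "is_friend_of")]
assert friends == ["alice", "jane"]
```

The cost is one binary search plus the number of matches, regardless of database size, which is exactly why the sort order has to match the shape of the query.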
So, so far so good. But let's have another query. Here we want to find all people who are of class, or of type, Person. And you see the data on the right, and what we need to match, because we no longer have the first part of our triple; we have the second and the third. So we'd have to do a range scan across the entire database to answer this question. Everything highlighted on the right would be our matches, but we'd have to go row by row to do this, and that obviously is going to make our queries very slow. So we want to fix that, and that's where our next index comes in, which is POS, which is just swapping the sort order: now we're sorting first by the column, the predicate or property; then by the value, the object; and then third by the subject, so the row, basically. So now the same query obviously becomes extraordinarily efficient, because now I can look up "is a" and scan to that really quickly, then hit Person, and then I'm going to get all my values there on the right. So those are two indexes that now allow us to answer a lot of queries in a couple of different forms. There's really only one more index we need to be very efficient with almost any type of query. This one we call OPS, or in Fluree we'd actually call it OPST, to fold in time. It's the object first: we're going to be sorting on the value first, then the predicate, then the subject. And this helps us answer questions about what is connected to something else. So in this case we want to find everything connected to Picasso. As you can see from the query on the right, the only thing we have filled out is the third part of the tuple, the Picasso value. And once again, suppose we just use how we had this data sorted by default and begin to do this.
What we would have to do, as you can see, is again a range scan of basically the entire database to answer this question, because we'd have to find every place Picasso appears in that third position. So in this last index, what we do is actually flip the subject and the object, and we sort based on those. The only things we have to include in this index are links to other things, because scalar values don't make sense here. In this case, for example, we have a bunch of values in here like dates and these string values; those don't matter at all for this index. The only things that matter are things that point to other things, because those are the only types of questions we're trying to answer. So now we have three indexes, and between these three indexes we can answer virtually any question you could ask. Technically in Fluree we do a fourth index, which I won't get into, which you need if you're not going to index all the values that you end up putting in the database. But let's just say you index all the values you put in the database: then you can answer any question extremely efficiently with these three indexes, and we're just sorting those triples in three different ways. You're storing the entire triple with every entry in the index? Yes, although it's summarized. What we do in Fluree to make this very efficient is that everything you see, for example, in this particular index right here is represented by a long integer, and that long integer is actually an alias to the row identifier or the column identifier. In this particular one, where every row and every column and every value point to other rows, these actually just get stored as long integers. The long integers are obviously extraordinarily efficient to store, so if you actually looked at our index table, you'd just see three integers here. They're also very, very fast for comparators.
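Pulling the three sort orders and the integer aliasing together, here's a minimal toy sketch (my own illustration, with made-up terms): each term gets an integer alias, and POS and OPS are just the same integer triples re-sorted, so every query is a prefix scan over integers.

```python
import bisect

terms = {}                                   # term -> integer alias
def alias(term):
    return terms.setdefault(term, len(terms))

raw = [("bob", "friendOf", "jane"),
       ("bob", "isA", "person"),
       ("jane", "isA", "person"),
       ("jane", "likes", "picasso")]

spo = sorted((alias(s), alias(p), alias(o)) for s, p, o in raw)
pos = sorted((p, o, s) for s, p, o in spo)   # answers "who is of type X?"
ops = sorted((o, p, s) for s, p, o in spo)   # answers "what points at X?"

def prefix_scan(index, prefix):
    out, lo = [], bisect.bisect_left(index, prefix)
    for entry in index[lo:]:
        if entry[:len(prefix)] != prefix:    # comparisons are all on ints
            break
        out.append(entry)
    return out

inv = {v: k for k, v in terms.items()}       # alias -> term, for display
people = [inv[s] for _, _, s in prefix_scan(pos, (alias("isA"), alias("person")))]
to_picasso = [inv[s] for _, _, s in prefix_scan(ops, (alias("picasso"),))]
print(people, to_picasso)  # ['bob', 'jane'] ['jane']
```

Nothing about the scan changes between indexes; only the sort order does, which is why three (or four) re-sorts of the same flakes cover essentially every query shape.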
So as we're joining data, you know, we have very simple where clauses here, but obviously a lot of where clauses would be combining multiple sets. And as we're combining multiple statements in our where clause, we've got to do joins between these variables, or comparisons between these variables. Computers are fast at doing comparisons on integers, and most of them are pretty slow at doing comparisons between strings, for example. So the more we can keep things as integers, the more we're lightning fast. For the predicate, I feel like, unlike the object and subject, where you would potentially have a lot of unique values, the distribution of values would be narrow; a predicate is "is a friend of," "is an instance of." There are only so many predicates. I'd imagine in these databases you could do some pretty heavy compression, like RLE or other things, to reduce the index even further. Are you guys doing those kinds of things, or are you just doing 64-bit encoding? Yeah, so we do represent the predicate as a standard 32-bit integer instead of a 64-bit integer, just because we technically never need more space than that. But they're still integers, and keep in mind that in a true semantic graph database the predicates themselves are actually rows; they're actually subjects. That means everything is data, everything's a triple, and even the predicates themselves are represented as separate triples. So technically "is a friend of" here is actually another set of triples; it's a subject itself, with triples that might describe things about it. They might describe that it's multi-cardinality or single-cardinality; they might describe that it's required or unique. But the number of unique values for predicates, I think, would be much less than for objects and subjects. Like, in your example here you only have two different predicate types, "is friend of" and "instance of." So if I did run-length encoding, I could compress that even further.
Yeah, I mean, you could, except ultimately these are aliases for other rows. So if all of our rows are represented by an integer, then the only thing we can do to reduce the space for the frequently used ones is to use the small end of the integer range to represent them. So "is friend of" you might represent with the number one, but you have what you have. Right, but instead of storing 1, 1, 1, 1, 1, 1 over and over again, you store one followed by the length of the run: you store one, and then, say you have six of them, you just store the number six, so that gets you two integers instead of six. Yeah. Yeah. These are standard columnar compression techniques; I was just curious if there was something different because it's RDF. No, no, but one thing we do focus on is using standard serialized encodings, so we use Avro to do this. In this particular case we're using Avro, so we're just following whatever optimizations Avro allows, which is primarily, you know, in maps, around key values, etc. Thank you. Okay. And, okay, so I'll skip over those, because we're running low on time. So let's take our original index, right, that original list sorted by rows, then columns, then values, and turn it into something that looks like a B-tree. Here we have a B-tree representing all of those values we just looked at, spreading those values across a few leaves, and of course those leaves all hang off a root node or a branch node. So now we introduce this idea of a query server. The query server just started up. It has no data. It knows how to connect back to the ledger servers, or it technically can even point to another query server that has the ability to connect back to a ledger server; it really doesn't care. It just needs to connect to something that can relay the source of truth, the actual data, to satisfy queries as they come in.
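Before moving on, the run-length encoding idea from the compression exchange above would look something like this (a generic columnar technique the questioner is describing, not something the talk says Fluree does):

```python
# Run-length encoding of a sorted predicate column: six copies of
# predicate 1 collapse to the pair (value=1, count=6).
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

col = [1, 1, 1, 1, 1, 1, 2, 2]
print(rle_encode(col))  # [(1, 6), (2, 2)]
assert rle_decode(rle_encode(col)) == col    # lossless round trip
```

RLE only pays off on sorted, low-cardinality columns, which is exactly why the predicate column of a POS index is the natural candidate.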
And now there's an app sitting in front of that query server, and the app issues a query, here a very simple one: select star from Ken. Again, this query server is cold; it has no data in memory at this point, and it needs to satisfy this query. For one, we can look at the query and say we can solve it from our SPO index. Remember, we have three, technically in Fluree four, different indexes, but we can look at a query and immediately know which index, or indexes, because many queries may hit multiple indexes, will be most efficient to satisfy it. Which index is required to satisfy this query? The SPO index, so that's what we're looking at here. The first thing we need to do is load up the root node. Now in Fluree, every one of these nodes is stored as a separate file, so what we're doing is pulling entire files up to these query servers. In this particular case we have four files, right? Each leaf is a file, that's three files, and then of course we have the root node, which is itself another file, so we have four files that house all of this data. The first thing we need to do is go to the root to look up where the data for Ken is going to be. Once we pull up the root, it lets us know where we need to go next, and it basically tells us that Ken is in this particular leaf. We now pull up the leaf, and we're able to satisfy the query. So this is all we needed to do: we needed to load two of our four files to answer this query. The query server now has two of those files in memory. Resident indexes, like everything in Fluree, are immutable, so the query server has a guarantee that the data in this leaf will never, ever, ever change. Cassandra with its SSTables, for example, works much the same way: it will never update an SSTable; it only garbage collects, throws old ones out, and creates new ones.
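That cold-start lookup can be sketched like this (a toy simulation; the file names and contents are made up): the "network" is a dict of immutable files, and answering the query pulls exactly two of the three it could need.

```python
FILES = {  # immutable index files, keyed by name; stands in for ledger storage
    "root-1": [("alice", "leaf-1"), ("jane", "leaf-2")],  # first key -> leaf
    "leaf-1": {"alice": {"isA": "Person"}, "bob": {"isA": "Person"}},
    "leaf-2": {"jane": {"isA": "Person"}, "ken": {"isA": "Person"}},
}
cache, fetched = {}, []

def load(name):
    if name not in cache:            # only hit the "network" on a cache miss
        fetched.append(name)
        cache[name] = FILES[name]
    return cache[name]

def lookup(subject, root="root-1"):
    children = load(root)
    # right-most child whose first key is <= subject covers this subject
    leaf = next(name for first, name in reversed(children) if first <= subject)
    return load(leaf).get(subject)

print(lookup("ken"), fetched)   # {'isA': 'Person'} ['root-1', 'leaf-2']
print(lookup("jane"), fetched)  # same leaf as Ken: no new files fetched
```

Because the files are immutable, `cache` never needs invalidation logic: a cached file is correct forever, which is the whole edge-caching story in miniature.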
The query server has these two files, which contain this data, in memory, and it treats them like a stack. If the query server has lots of memory, it might be able to hold a million of these files. If it's running resident in the browser, it might only be able to hold ten of them. So it's very flexible: how much memory the query server has determines how many of these files, or pieces of the index, it can hold, and then it just uses an LRU cache to kick out whatever it can't hold as it needs to satisfy additional queries. So a new query comes in; I don't know if you just saw the change at the top: select star from Jane. Here the query server first goes to its root and asks where the data about Jane is. It turns out it's in the same leaf that had the data about Ken, and it's able to answer this query without going anywhere else; it already had everything it needed in memory to answer this particular query. And now I have a new query (my Zoom controls are in my way a little bit). Here again we went to the root, we asked where the data about Bob is, and it's over in this other leaf, so we had to pull that leaf in, and now our query server has three files in memory. Right. So this is all fine and good, except databases get updated. How do we handle keeping the state up to date? This is where some of the challenges come in. So let's spin up another query server, and some app over at that other query server inserts a new record about Tom. I think you can easily see that the Tom data is going to sit sorted in that third leaf, the one the original query server doesn't have. And what we do in Fluree, because we don't want to re-index data all the time and we want our indexes to be immutable so they're very cacheable at the edge, is we have this concept called novelty, where we put all of the changes since the last time we ran an index.
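A sketch of that novelty idea as I understand it (toy code; the threshold value is an assumption for illustration): the leaf files stay immutable, and a query merges the sorted leaf with the in-memory novelty buffer on the fly.

```python
import heapq

leaf = [("bob", "isA", "Person"), ("ken", "isA", "Person")]  # immutable index file
novelty = []                                                 # changes since last index

def transact(triple):
    novelty.append(triple)           # streamed change lands in the novelty bucket

def merged_view():
    # both sides are sorted, so one merge pass gives the up-to-date sorted view
    return list(heapq.merge(leaf, sorted(novelty)))

transact(("tom", "isA", "Person"))
print(merged_view())                 # bob, ken, and now tom, in sorted order

NOVELTY_MAX = 1  # assumed threshold: when novelty fills up, kick off reindexing
needs_reindex = len(novelty) >= NOVELTY_MAX
print(needs_reindex)  # True
```

When the background indexing job runs, it would write a new leaf containing the merged triples plus a new root pointing at it, then empty `novelty`; the old leaf and root files are simply abandoned, never mutated.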
So, as that transaction, you know, gets proven with identity, passes smart-function validation and permissions, and ends up getting transacted, it gets pushed out as a stream to every query server, and the query servers end up putting all these changes into their novelty bucket. What we can then very quickly do is merge all the data in novelty into the index leaves whenever we need to. We have two components to novelty: the base novelty threshold, and then novelty max. I can get into those; I know we're starting to run low on time, but this controls the indexer on the ledger server and tells it when it can run new indexing jobs, or when it should run them, and then when to stop the world, when basically you're saying all of our query servers have filled up with the maximum we ever want them to hold in memory, and we're not going to accept any additional updates until a new index is able to fold the novelty into the index. So say we've reached this novelty threshold with this one update, and we need to run a new indexing job. Again, you have control over this with Fluree, how much data we hold in this memory queue, but say this was enough to push it over that threshold: in the background we can now start running a new indexing job. The indexing job is going to create, in this case, two new files. One: we're going to create a new leaf, and we're actually going to throw out the old leaf, because again we never update a leaf, we never update anything; that would break all of the caching upstream and all the scaling characteristics we're trying to get. So we created a new leaf, and then of course we have to create a new root, because it now points to a different file. We created two new files by folding in that change, and we now have nothing in our novelty; our novelty is fresh, and it can accommodate new updates as they come in. And now we have another query that says, you know, select star from Bob, an identical query to the one we just had.
And now we had to load up a new root, because the old root was discarded, but the new root actually points to the original first leaf there, which was never updated, so it remains in cache, and we never had to retrieve that file again. So the idea is, the longer these query servers run, the more data they can hold in memory and cache. And as indexes are regenerated, they only affect the leaves whose data physically changed; all the leaves whose data hasn't changed remain in memory and fully cached. Lastly, I'll touch on what these B-trees effectively look like. What we have settled on is that a branch node holds about 500 children, and each leaf, which physically holds the data, we try to keep around 100 kilobytes, because again these are the chunks being requested by the query servers upstream, and they have to hold them in memory. 100 kilobytes we found is a pretty good spot; in a typical database, about 100 kilobytes of data represents about 3,000 triples, so about 3,000 triples fit in one of these leaves. Because each branch has 500 children, we can look at the size of the database on the left. If we only go one level deep, we have 500 children, and each leaf can hold 100 kilobytes, which means a 50-megabyte database can fit one level deep. Two levels deep we're up to 25 gigabytes. Three levels deep we're up to 12 and a half terabytes of data, and four levels deep we're at a little over six petabytes of data. So that's how we end up organizing these leaves, and again every branch and every leaf ends up being a file that gets cached upstream. And that's it; I know I have about four minutes left, but I'll end the talk. All right, so, Brian, I will clap on behalf of everyone else. Let's open it up to the audience: if you have any questions for Brian, go for it. So, do you want to ask your question? Oh, the chat works.
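As a quick check of the sizing arithmetic from the end of the talk (500 children per branch, roughly 100 kB per leaf):

```python
LEAF_BYTES = 100_000   # ~100 kB per leaf, roughly 3,000 triples
FANOUT = 500           # children per branch node

capacities = {depth: FANOUT**depth * LEAF_BYTES for depth in (1, 2, 3, 4)}
for depth, size in capacities.items():
    print(f"{depth} level(s) deep: {size:,} bytes")
# 1 -> 50,000,000             (50 MB)
# 2 -> 25,000,000,000         (25 GB)
# 3 -> 12,500,000,000,000     (12.5 TB)
# 4 -> 6,250,000,000,000,000  (6.25 PB)
```

So each extra level of the tree multiplies capacity by 500 while adding only one more file fetch to a cold lookup.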
So in the chat he asked: do you have benchmark data against Neo4j and TigerGraph? Yeah, so, the answer is no, and part of the reason for that is those are in a completely different class of graph database. I know it's a little bit confusing sometimes, because people think graphs are graphs, but there are really two categories of graph databases: property graphs, which those would fall into, and then the other type, which we fall into, which goes by many names: knowledge graph, triple store, semantic graph, named graph. So they're both graph databases, but two very different approaches to graph databases, and if anyone wanted, I'd be happy to get into some of those nuances. I'll be selfish and take all the time and ask questions. Just a quick question: I'm kind of interested, can you clarify briefly what the difference is between a triple store and a property-based graph? I guess, is the primary difference that the triple store stores the property as an individual edge? So the property graph allows you to put properties on edges. A semantic graph does not allow you to put properties on edges. That's one big difference. And in the property graph, that property is just going to be like a pointer, analogous to a pointer to an actual table in the relational or columnar case, right: a pointer to another node, or just to a scalar value, which might be a string or a date or an integer. So in the case of the triple, it's like an edge, like in-vertex, out-vertex, and then the third value is the property, right, is that correct? So, how they're technically described, which I don't think is a great description, is subject, predicate, and then object. The subject is basically your row; the predicate would be your column, or basically your edge.
So that's what your edge is, in the middle, and then the last value is either another node or an object. An object technically could be a scalar value, again, like a string, but it could also be a pointer to a different node. Okay, got it. Thank you, I appreciate it. The other big difference, of course, is that semantic graphs are designed to leverage global vocabularies, and even have global identifiers, unique IRIs, even for your rows, which allow you to combine across multiple data sets very dynamically and in a standards-based way. Property graphs were never really designed with that standards, open-data approach. They were designed more like, you know, here's a graph database version of MongoDB: designed to be the same sort of database you would end up running behind a single application, not an interoperable set of data that can be connected across sources. Thank you. What is the consensus protocol between the writers and the ledgers? Are you assuming non-adversarial nodes in a cluster, in a group, or are you running, you know, like a blockchain BFT kind of thing? Yeah, so we have pluggable consensus, but at this point the only thing we formally support, because for almost every use case we've found it's the only thing that's needed, is Raft, and we've developed our own Raft library that runs inside of Fluree. Part of the reason why Raft can be sufficient for a lot of use cases is that every single message is cryptographically signed as well. So, you know, one of the things is that PBFT, Byzantine fault tolerance, is going to reduce your redundancy, and it's going to increase your latency. The formula for redundancy in a Raft or Paxos network is that 2f plus 1 is how many servers you need to run, where f is the number of failures.
So if you want to support one failure, two times one plus one is three: you need to run three servers to support one failure. Two failures would be five servers. This is why it makes almost no sense to ever run an even number of servers in a Paxos or a Raft network. PBFT, or really any Byzantine fault tolerant algorithm, is, I think, always going to be 3f plus 1, which means that if you want to sustain one failure, you need to run four servers; two failures, you need to run seven servers. You need to run a lot more servers to sustain the same number of failures. There are also additional communication rounds that happen in those protocols. So those are some of the trade-offs you end up having to deal with, and when you're cryptographically signing all messages, so you can prove identity across all messages, there's not a lot of reason to absorb the extra overhead. And like, what's the point of being decentralized if you assume everyone's a good actor, right? Then why not just slap it on Amazon and be done with it as a centralized service? Well, I don't think you assume that everyone's a good actor. There are two different components to decentralization. One is the traceability, or the provenance, of all the changes: anyone can independently validate that the data and the transactions have integrity. The second part of consensus and decentralization is who gets the votes, who gets to vote on whether a transaction is valid as it's going through. That doesn't change whether anyone else can independently validate that it was accurate; the question is who gets the votes at the time. So Fluree is what is considered a private, or sometimes federated, blockchain, which means there's a set of parties that don't necessarily trust each other, but you have named parties who are the decision makers. Something like public blockchains, you know, Bitcoin, Ethereum, Cardano, etc.
Those are truly decentralized decision making, but then that also takes you completely out of the category of even the Byzantine fault tolerant algorithms like PBFT, and certainly out of the category of Raft, because all of those count on voting mechanisms, and you obviously can't vote if you don't know how many participants there are; what's 50 percent of the vote when you don't know how many people are there? And the possibility of bad actors means you couldn't use Raft for that. But again, so I understand the verifiability part, that when you want to expose an API, you expose the data in such a way that someone can prove that things were all done correctly. Merkle trees would do the same thing. And that's exactly, basically, what Fluree ends up constructing: a Merkle tree. Yeah, okay. It's the same with public blockchains; I mean, that's what they use, a version of a Merkle tree. But it's not completely trustless, right? Like, you have to trust that the Raft leader is making writes that are, you know, true to the things it's receiving. Well, if you're receiving writes, yes and no: yes from a voting standpoint, but as you're receiving the data you have the ability to independently validate it. If that data comes through and you say, no, I don't trust this information, or this identity's signature doesn't check out, you do not have to accept it. So, sorry, that's a denial of service, right? If I know there should be new updates, but someone has, so to speak, peed in the pool, I can't get the new updates because I can't verify them, whereas if a central authority had full control... That's right. Yeah, and that would be the point where you would have to say, okay, you know, we have a bad actor in the network, and we need to kick that person out and sort out what happened. And you have all the traceability there to be able to determine that.
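The cluster-size arithmetic from the Raft-versus-PBFT comparison a moment ago works out like this:

```python
# Crash fault tolerant protocols (Raft, Paxos) need 2f+1 servers to
# survive f failures; Byzantine protocols (PBFT) need 3f+1.
def raft_cluster(f):
    return 2 * f + 1

def pbft_cluster(f):
    return 3 * f + 1

for f in (1, 2):
    print(f"tolerate {f} failure(s): Raft needs {raft_cluster(f)}, "
          f"PBFT needs {pbft_cluster(f)}")
# tolerate 1 failure(s): Raft needs 3, PBFT needs 4
# tolerate 2 failure(s): Raft needs 5, PBFT needs 7
```

It also shows why even cluster sizes are wasteful: four Raft servers tolerate only the same single failure that three do.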
So what, for example, Byzantine fault tolerance will get you is the ability to prevent that to a certain degree, although there are still ways of doing denial-of-service attacks, but you're paying an immense overhead to have that; either way you have the traceability. From a software engineering standpoint, you're pushing this down to the end user of the application and saying, hey, it's up to you to figure out that this bad thing happened, and it's up to you to figure out how to reconcile it. It's almost like vector clocks, right? Those things are fantastic because they allow you to be scalable, but you're pushing it onto the application developer to have to reconcile these things and fix things up. And I think that's a trade-off, and I don't think it's the right way to go, in my opinion, because I think people don't have the mental bandwidth; they just want to write queries on their data, but they have to do much extra work because of this extra, you know, trust mechanism that you're forcing upon them. Well, I completely agree, which is why, for example, you, Andy, might run your own query server for a set of data at Carnegie Mellon, and me, as an app developer, I may trust you enough that I'm just going to trust you to act as a query server. Otherwise I can run my own server. But the identical thing happens in public blockchains: if I actually want to trust Bitcoin, I have to run my own node. I don't think your competitor is blockchains, like, you know, the Bitcoin deal; your competitors would be, you know, whatever, Neptune from Amazon or Cosmos DB or Neo4j, right? These big mega-corporations have a lot of money and can run these services, and, you know, people trust them. The core of what I'm getting at is, and I asked you this before, what's the sweet spot for you guys, that, like, you know, you would never want to run on Amazon,
you'd never want to run on another cloud service; you'd want to run it in the decentralized model you're proposing? People who are looking for interoperability. It's the data-centric positioning that I started out with: people who want interoperable data that's going to be shared across parties in a trusted manner.