Hello everyone, I'm Damien Tournoud — you might know me as DamZ — I'm the CTO of Commerce Guys, and I'm here to talk to you about an idea for refactoring how we store stuff in Drupal 8, which is what I call document-oriented storage. We don't really have a lot of time: I realized yesterday that we only have half an hour and not one hour, so I have way too many slides, but anyway, we are going to try to get through the gist of it. Just for the purpose of this conversation, you can read "document" here as "entity", if you are familiar with the Drupal 7 concept. At its core, a document is basically just a piece of unstructured data; you can think of it as a bit of JSON, for example. I have four points to make and a vision to present. I'm sure we are going to manage to get through the four points; I'm not sure we are going to get through the detail of the vision, but whatever. So, four points — follow me.

First, think about documents and about what Drupal is. Drupal is a CMS and we store documents; the point of a CMS is to manage content, and for this content a CMS needs fast retrieval. The gist of our load — the main load we have — is horizontal: we want to be able to load a whole document quickly. When you visit the node/1 page, we need to load that node entity, and we want basically all of it, because we want to display all the fields that the document carries. Also, most Drupal sites have a relatively low write-to-read ratio: most of the time you read something way more often than you write it. It definitely depends on the type of site — community sites tend to have a higher write-to-read ratio — but generally you read a ton more than you write.

The second thing we need is scalable querying. We need to be able to increase the number of documents we manage and still be able to query them quickly. Scalable means scalable to the size of the data, obviously, but also to the number of servers that participate in processing a request: if you run a really large site, you might want several database servers and several front-end servers, and we need to be able to scale to those new servers as well as to the size of the data.

The third thing we need from document storage is an abstract interface. Why? Because there are a ton of super fun, super great data stores out there that have been designed for exactly what we are looking for, for exactly the type of document we manage. I'm thinking about MongoDB, about CouchDB, about many of the NoSQL databases, especially those that can actually store documents. There are a ton of things out there that have been precisely tuned and precisely designed for exactly what we want to do. We should be able to use them, and we should not be stuck with SQL.

The second point I want to make — and it is very important — is that the way we actually store the data has absolutely no impact on the scalability or performance of our querying. Why? Because of this; let's call it the DamZ axiom: for every normalized way to store the data, an infinite number of queries will not possibly scale. And I'm going to give you an example. Who can tell me what this is? Yes, it's a SQL query — but what is it? Yes: it's basically the taxonomy term page.
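Roughly, the query on the slide looks like this — the table names are just for illustration (a node table plus a node–term relationship table), the exact schema does not matter for the argument:

```php
// Illustrative only: the taxonomy term page against a fully normalized
// schema. Conditions and sort criteria end up on both sides of the join.
$result = db_query("
  SELECT n.nid
  FROM {node} n
  INNER JOIN {term_node} tn ON tn.nid = n.nid
  WHERE tn.tid = :tid      -- condition on the term side
    AND n.status = 1       -- condition on the node side
  ORDER BY n.sticky DESC, n.created DESC
", array(':tid' => $tid));
```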
You want to display all the nodes that are tagged with a given taxonomy term, you only want the published ones, and you want to order them sticky first and then by creation date. That is basically the gist of a taxonomy term page. The problem with it is this: we have conditions on both sides of a relationship, on both sides of a join. The way we store this data — nodes, taxonomy terms, and the relationship between them — is purely normalized. It is the textbook way of storing that in SQL: a node table, a taxonomy term table, and in the middle a relationship table that links the two. That is the canonical way of storing it, perfectly normalized in every possible way. But it cannot possibly scale, even for a query as simple as this one — and this is basically the simplest query you can imagine on that kind of dataset. You can do way worse, and you have done way worse on your own projects.

The second problem is that we do not actually control how the data is going to be queried. We cannot keep trying to tune the way we store the data based on how we imagine it is going to be queried. That is best seen in this quote from Eaton. You remember the debate when we introduced the Field API in core, about per-bundle storage versus one table per field? Eaton wrote: "At present, the ability to keep things in a single table is the only thing that keeps a number of high performance sites I work with online." And I have to say Eaton is right. He is wrong in saying that we need one table per field, or that we need to keep the per-bundle storage thing; but he is right in saying that, at this point, this is the only thing that keeps us alive — this kind of structuring and tweaking of the structure is basically the only thing that keeps us alive.

So there are two consequences. The first one is that, because the storage layout has no impact on the performance of the querying, and because we want fast retrieval, we should store the data in the way that allows the fastest retrieval. There is another consequence we are going to talk about a little bit later.

The third point I want to make today is that we need to look at what is happening elsewhere — and I did look at what is happening elsewhere. It is not the end of the story, because there are plenty of things we can still improve ourselves, but we need to learn from the outside. So the first thing to look at is inspiration outside of Drupal, and I know it is scary. I looked at two types of tools. The first type is what we call object-relational mappers. That is the traditional way of storing documents everywhere else in the world: in PHP it is projects like Doctrine — and the Doctrine offshoot for MongoDB is especially interesting — in the Java world it is Hibernate, and in Ruby there is Active Record, part of Rails, which has been around for ages. Those things are the canonical way of storing data, but they do not put any focus on performance: they put way more focus on normalizing the data and mapping it to native objects than on the scalability of the querying.
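Just to show what I mean by "mapping to native objects", here is what that style looks like, in simplified Doctrine 2-style annotations — purely illustrative, not a complete or exact mapping:

```php
/**
 * @Entity
 * @Table(name="articles")
 */
class Article
{
  /** @Id @Column(type="integer") @GeneratedValue */
  private $id;

  /** @Column(type="string") */
  private $title;

  /**
   * @ManyToMany(targetEntity="Tag")
   * @JoinTable(name="article_tags")
   */
  private $tags;
}
```

The whole design is about the shape of the objects; nothing in it says anything about how a query like the taxonomy term page is supposed to scale.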
In a way, they are more object-oriented than document-oriented: they want to map data structures to objects, and they cannot store anything you throw at them if it is not mapped to an actual object structure. What is really interesting is that when you search for, for example, "Active Record and performance" or "Active Record and denormalization" or things like that, you find absolutely nothing. Apparently we are the only ones having scalability issues with SQL databases — I don't know, or maybe there is no big Ruby site anywhere.

The other source of inspiration is what the Java world has been calling content repositories. It is originally a Java specification, JSR 283, and there is currently a port of it to PHP (PHPCR) that was initiated by the FLOW3 project, as part of the refactoring of TYPO3. I looked at that very closely. I do not have any hard conclusions on it yet: it is a very interesting project, but it is also very complex — way more complex than what we actually need — and it has no specific focus on performance, which is a big concern to me, because it is already very complex and making it scale is probably going to be a bigger challenge than doing it ourselves. So there are discussions going on, and they need to continue, but it is a very interesting piece of inspiration.

There is also inspiration inside Drupal. Inside Drupal you can look at more than ten years of scaling Drupal. We have come up with ways to scale it: David Strauss came up with the Materialized View module; we have alternative field storages — I wrote the MongoDB field storage, there is a per-bundle storage, and I have an experimental Elasticsearch field storage; there is obviously the EntityCache project from catch; and there is a very interesting project, the Entity API's metadata, which basically adds a schema on top of our entity and field model so that you can introspect it. All of those are things we need to look at.

Based on those first three points, my fourth point is that we need to implement automatic denormalization of our data to guarantee querying performance. It is actually the only way to guarantee querying performance. And I emphasize automatic, because we do it manually all the time and that really needs to stop. The basic idea is to duplicate the data in a form that is easier to query: you store the data once in a canonical storage, and then you denormalize it — you duplicate it in a way that is easier to query, for example in a form the database engine can construct indexes on, mostly a single table. We have plenty of experience with that. Just on Drupal.org, for example, we have three or four materialized views that power basically everything: most of the queries on Drupal.org today go through materialized views. But as I was saying, we need to stop denormalizing manually; we need to stop implementing denormalization ourselves. So, a quick poll: do you know how many denormalization implementations we have in Drupal 7 core right now? We actually have five different ones. The first is the taxonomy index — the idea is sketched below.
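This is the kind of thing I mean; it is close to what the Drupal 7 taxonomy_index table does, and the exact names here are illustrative:

```php
// One denormalized copy of the data the taxonomy term page needs, kept up
// to date on node save, in a single table the database can index:
//
//   taxonomy_index(nid, tid, sticky, created)   -- published nodes only
//
// The taxonomy term page then becomes a single-table, index-friendly query:
$result = db_query("
  SELECT nid
  FROM {taxonomy_index}
  WHERE tid = :tid
  ORDER BY sticky DESC, created DESC
", array(':tid' => $tid));
```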
The second one is the forum index, a completely different beast that is also supposed to improve the scalability of taxonomy queries. Then, since Drupal 7, we have the Tracker 2 code in core, which has two different denormalizations: a global one and a per-user one. And it is very important to note that inside the field storage itself we also have a denormalization: we have two tables per field, a current table and a revision table, which is, by definition, a duplication of the data to improve query performance — so it is basically a denormalization. We need to stop doing that manually.

So that brings me to the vision — please don't kick me just yet. The current state of the world is this. It all starts with the entity controller we have. The entity controller queries the SQL database to get the base fields of an entity, then it invokes the entity hooks and calls the Field API to attach the fields. The Field API relies on the field storage to load the fields, and the field storage queries its own database. By default those two databases are the same thing, but they can be different: we have the MongoDB field storage, for example, in production on a couple of very high-end sites, and it is working very well. So we have this split-brain issue, where on one side we have the entity part and on the other side we have the field part.

But wait, it gets worse. We added to that EntityFieldQuery, which is a monster in the middle. EntityFieldQuery queries the SQL database directly if you only have a property-based query, but it relies on the field storage to do the query if there is a field part in it. And if you have both a field part and an entity part — for example, you want all the entities that have a field with a certain value, ordered by creation date — then the field storage has to know about the SQL database. It is basically an architectural monster right now, where everything depends on everything and SQL is still deeply rooted everywhere. The MongoDB project works for only one reason: it is not simply a field storage, it is both an entity storage and a field storage. It implements hook_entity_insert() and hook_entity_update() to be able to store the whole entity inside the MongoDB database, basically working around this architectural craziness.

So what I suggest is that we move to this. We still have the entity controller up there — oh, I forgot to mention that in core we only support load in the entity controller, and everything write-related, like update and insert, is actually very messy, but whatever, we are going to fix that at some point. So what I suggest is that we keep an entity controller that supports full CRUD for entities and relies on a document-oriented layer, which I am going to describe, to actually store the entities. The document-oriented layer relies on a document store: SQL by default, but it could be MongoDB, Elasticsearch, or whatever. And I want to make the Field API only a provider of meaning for the fields; a rough sketch of the storage layer I have in mind is below.
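Very roughly, the layer would look something like this — all the interface and method names here are made up, it is just to make the idea concrete:

```php
// Hypothetical sketch of the document-oriented layer (not an actual API).
interface DocumentStoreInterface {
  // Returns a handle on a named collection ('node', 'user', ...).
  public function collection($name);
}

interface DocumentCollectionInterface {
  public function insert(array $document);            // returns a new UUID
  public function update($uuid, array $document);     // appends a new revision
  public function remove($uuid);
  public function load($uuid, $revision_uuid = NULL);
  public function find(array $conditions, array $options = array());
}

// The entity controller does full CRUD against this layer. The default
// implementation sits on top of SQL; MongoDB, Elasticsearch or another
// document store could implement the same interfaces.
```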
So the fields are there in the document, in the entity document, and the Field API gives them meaning: it gives them a schema, widgets, formatters — everything the Field API is good for. I want to remove the whole field storage part of the Field API, because storage is not what the Field API is good for.

Here we come to the part where you are going to kick me: some details about what I see for the SQL storage. For the SQL storage, I see us using a two-part storage: a canonical storage and materialized views. The canonical storage would be one table per collection — that is, one table per entity type: one table for nodes, one table for users, and so on. In that table, I want to store the data as is, directly in serialized form, for example in an XML column; an XML column actually has a lot of advantages here. And I want to store each revision of the document as a different row, in an append-only way: each time you modify a document, you create a new revision. Querying this canonical storage directly is obviously not going to scale — it is definitely not going to scale — and that is not the point; the point is fast retrieval and having the canonical form of the data. Scaling the querying is the responsibility of the materialized views, and what I want there is one table per view — not views as in the Views module, but one table per query, if you want. Those tables could be built using the XML introspection capabilities of the SQL databases. Most of them have that: it is in MySQL starting in 5.1-something, it has been in PostgreSQL for ages, and it is not in SQLite, but we can implement it in user space there, so it does not matter. I did some testing: if you take a table of ten million rows with an XML document in each, doing a full scan and extracting data using XPath takes a few minutes — something like two to three minutes. So rebuilding a materialized view if you lose it is an operation that can be made very cheap. Not cheap as in transactionally cheap, but cheap as in: you can rebuild it when you need to, and it is not going to take you six hours. The materialized views we currently have on Drupal.org, which are PHP-based, take six to seven hours to rebuild for about 600k documents, so that is orders of magnitude slower. These materialized views could either be made permanent or be created on demand, depending on how we see it.

We still have ten minutes, so we could go through the preliminary API, but I can answer questions first. I am going to go through it quickly, not in detail. The way I see it, I designed an API based on what MongoDB is doing, which is probably the closest to what we want to do. Basically, you create a document — it is just an array of arrays — and you can insert it into a collection with a collection insert, and remove it with a collection remove. Everything is based on UUIDs, so you get a UUID back when you insert a document. You can update a full document or do a partial replacement. You can load a single revision or multiple revisions, in batch or one at a time. And you can do queries: a simple query like "give me all the documents of type article, ordered by created date, descending", where the result is an iterable object that gives you the full documents; but you can also do deep queries, like "give me all the documents that are tagged with a given taxonomy term". A rough sketch of what all of this could look like is below.
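To make both halves of that concrete, here is a rough sketch. All the table, column and method names are made up, the XML paths assume a hypothetical serialization, and the ExtractValue() part assumes MySQL 5.1 or later; it is an illustration of the idea, not a design.

```php
// 1. Canonical storage: one table per entity type, one row per revision,
//    the document serialized into an XML column. A materialized view can
//    then be rebuilt inside the database with its XML introspection
//    functions (ExtractValue() on MySQL 5.1+):
db_query("
  INSERT INTO {node_published_mv} (uuid, sticky, created)
  SELECT uuid,
         ExtractValue(document, '/node/sticky'),
         ExtractValue(document, '/node/created')
  FROM {node_documents}
  WHERE current = 1
    AND ExtractValue(document, '/node/status') = 1
");
// Multi-valued fields (tags, for example) need more work: either one
// extraction pass per delta, or a user-space rebuild, as we would do on SQLite.

// 2. The developer-facing API, modeled on the MongoDB driver style and
//    following the hypothetical interfaces sketched earlier.
$collection = $store->collection('node');

// A document is just a nested array; inserting it returns its UUID.
$uuid = $collection->insert(array(
  'type' => 'article',
  'title' => 'Hello world',
  'created' => REQUEST_TIME,
  'field_tags' => array(array('tid' => 12), array('tid' => 34)),
));

// Updating never overwrites: it appends a new revision.
$collection->update($uuid, array('title' => 'Hello again'));

// Simple query: all articles, newest first; the result is iterable and
// yields full documents.
foreach ($collection->find(array('type' => 'article'),
                           array('sort' => array('created' => -1))) as $document) {
  // ...
}

// "Deep" query into the document structure: everything tagged with term 12.
$tagged = $collection->find(array('field_tags.tid' => 12));
```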
You can materialize those queries: you can decide, "I want to create a base query that I am going to materialize, that has the data I need to be able to query per tag", and then use it for different things. Possibly we could also allow you to query across joins, across relationships — querying for the tag name instead of the tag ID, for example.

Versioning and synchronization: one of the main design constraints here is to allow easy synchronization, to kind of solve the staging issue. The idea is to use UUIDs everywhere, both as the identifier of the document and as the identifier of the revision, and to work in a purely append-only way: each time you make a modification, you basically create a new revision. And something I am completely convinced of is that we need to allow tagging revisions with different states, so that you can query the data store for only the latest documents, or for the published versions, or for the published versions for site X if you have a multi-site setup — so that you can query across different states. We can discuss the schemaless-or-schema-full aspect of this. We could also implement aggregation, like MongoDB-style map-reduce, with a couple of technical issues around that. And I'm done, so now I'm opening the floor to questions.

chx? Just to be clear, the first GA of 5.1 was something like 5.1.30. No, no — the first usable version of 5.1 was the .30-ish releases; before that there are even bugs in the query cache, so it is not really usable. Okay, so the question was: are you going to mandate a very recent version of MySQL 5.1? Yes. And I think Moshe actually has a talk about upping our hosting requirements, and I think it is really a good idea.

Crell — yeah, I am coming back to you just after that. Larry? No, that does not surprise me. So the question is: if you get rid of the field storage, are you also going to get rid of the feature of being able to store fields remotely — some fields remote and some fields local. The way I see it, this is basically an extension of the field storage, moved one level up to the entity level, and I would be happy to throw away the feature of storing some fields of an entity remotely and others locally. You will still be able to store full entities in different places; what you will not be able to do is store part of an entity locally and part of it remotely. Okay, we need to discuss that.

There was a question here — so the question is about the performance of rebuilding views. Yeah, I mentioned that on Drupal.org we are using three or four different materialized views and it takes something like seven or eight hours to rebuild them, so it is really a pain. The idea is to use the XML introspection features of the SQL databases to speed up the process, and in my testing, rebuilding a view over ten million rows should not take more than a few minutes. So it is a different approach from the Materialized View module, which pipes everything through PHP — that is really a different approach.

Yeah, Peter? Do you see an intermediate state where we go... Yeah, I actually gave some thought to the strategy for moving towards this. So the question is basically about the strategy for implementing this, and whether moving Entity Cache into core is an interesting intermediate step or not. I actually gave that some thought.
To me, the first step towards that is to move the entity system into a module of its own — there is already an issue for that — and add CRUD support to it. As soon as we have that, we basically already have a consistent API, and we can modify the underlying parts way more easily than we currently can. So yes, for me the first step is to actually implement full CRUD for entities. And I am going to give the next question to fago, because it is a perfect transition.

The way I presented it, the controller is using the document storage as its storage, so the controller is basically just a wrapper around the storage. I think it is probably a good idea to build the storage as a separate system, because my first idea when preparing this talk was to also use it for configuration: for the configuration management stuff, if we build a sufficiently abstract API, we could possibly use it for configuration too. I am not sure it is a good idea anymore, but that was the initial idea. I think it is good to keep this separate from the concept of entities in Drupal. The way I see it, the entities give meaning to the document; the document storage is only responsible for storing it and does not care about its meaning. So the entity part is responsible for giving a schema, in a way, to the storage: the storage is basically an XML blob, and the entity layer gives a schema, a meaning, to that. And by the way, I think it is really a good idea to have a metadata framework in core. Just saying.

Yes? Okay, the question was about what I mean by XML storage — and yes, I mean just a very simple serialization of a PHP array into XML. Yes? I am not sure I understand the question. Yeah — no, I think what we need is a system that does that for you. I am pretty convinced of that, and it probably means we also need to improve our job queuing system and our background processing stuff, because we might need something more advanced there. But yes, the API needs to do that for you; as a developer, you should not have to care about it. Basically, as a developer, what you are going to do is implement something like a hook_materialized_query_info(), give the API your queries, and it does everything for you.

Larry? Again? Yeah, definitely. So the question is about relationships and cross-entity queries. That is clearly the hard part, no doubt about it. I have given it some thought and it needs more, but I think we can build an API that lets you do that and that lets you materialize those queries transparently, because the most typical kind of query we have across relationships is one where you query on a property across the relationship. For example, you want to display all the nodes authored by a user named DamZ: you do not have a user ID, you want to query across the relationship. That is the most typical way of using a relationship, and I think I have a plan for that one — that one is actually simple. There are other kinds of issues across relationships that definitely need more thinking — for example, do we want to allow you to query on fields across a relationship, and things like that. There are open questions, no doubt about it, but I think that in most cases we can build something that works really well; a rough sketch of how this could look for a developer is below. No, no — we are not going to do that. The question was whether we are going to force you to deal with that yourself, and no, we are not.
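Purely as an illustration — none of this exists, the hook name comes from what I just said and everything else is made up — declaring and using a materialized query could look roughly like this:

```php
// Hypothetical: a module declares the queries it needs materialized,
// and the system keeps the materialized copies up to date for it.
function mymodule_materialized_query_info() {
  return array(
    'tagged_nodes' => array(
      'collection' => 'node',
      'conditions' => array('status' => 1),
      'fields'     => array('field_tags.tid', 'sticky', 'created'),
    ),
  );
}

// Later, the module queries against the materialized result instead of the
// raw data — a query based on another query:
$nodes = $store->materialized('tagged_nodes')->find(
  array('field_tags.tid' => $tid),
  array('sort' => array('sticky' => -1, 'created' => -1))
);
```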
Peter? So, is Drupal analyzing the queries transparently and queuing up materialized views, or do you as a developer — I think this is the question — have to know "I need a view on this query"? Is somebody going to have to look at the query and figure that out? So, I am using "views" and "queries" interchangeably here; for me they are exactly the same thing, a way to query data. And yes, the way I see it, materializing is a manual operation: you as a developer will have to deal with the concept, but not with the implementation. You will have to figure out that, for example, if you are the module responsible for displaying taxonomy pages, the best way to do that is to build a materialized view that has all the information you need and use it as the base for your querying. Yeah, it is a slight burden on the developer, but I think we can make it understandable, and the way I envision it in terms of API is basically sub-queries: the idea is to allow you, from an API perspective, to run a query based on another query. You already have a filtered list, and you sub-query that query — it is a natural concept, you already understand what it means — and we just add the concept of materialization on top of that, and you should be okay. Not convinced, but okay — it needs work. Any other question?

That is also true: in most cases you will not need to build any query yourself, so that is also true. Yes — don't believe Dries' keynote. Yes, this has zero UI implications, and for small sites the performance is probably going to be better: if you are on shared hosting, as long as it satisfies the requirements, your site should actually run better than it does right now. So it has zero impact on the casual site builder. It has an impact on developers, and one very important thing here is that we need to make sure that developers who start developing with Drupal can forget about SQL. If they can forget about SQL, we are okay; if they want to write SQL queries, they should go see another framework.

Yes? Actually no — and that is a very good point. The point, I think, is: are we trading select performance against an insert cost? And actually we are not. When you think about it, today we store every field in its own table, which means one insert query per field: if you have 50 fields on a node, that is 51 insert or update queries, and even worse, for updates we do deletes followed by inserts for the fields, which is probably the most terrible thing we can imagine. Here, storing the whole document is a single insert or update query into a single table. What we gain there, we lose again by materializing, but overall it is not a trade-off: we are actually improving everything.

Yes? No — so the question is: how much more writing are we going to do because of materialization? I think we are actually even going to decrease the write burden, because this idea of materializing everything goes together with the idea of storing the document as is: if you want to store a document, you only have one insert query to do. Sorry? Yeah, but it is not a cost until the number of materialized views reaches the number of fields you were using in the previous model. Oh — we are out of time. Yep, that is exactly what I am saying. Yes.
And it is actually a very good idea, because in today's world it is easier to scale web servers than database servers, so if you remove load from the database server, it is always for the best. Question? Yes? Yeah. Yes. Yes, you have more writes from the materialization, but you also have fewer writes because you no longer write one query per field. Yeah — no, probably not, actually; in most cases, probably not. But yes, it is a trade-off. What I was emphasizing is that it is not actually clear whether we are going to be more write-heavy or not; in my opinion, most sites will be less write-heavy than before. Okay, so just for the record, chx was saying that he agrees with me. Just for the record. Okay, you can stop there. Thank you, everyone.