Oh, hey, hey, did I do something with this? Is it turned on? Hello, hello? Is it on? Is there sound now? Okay, good.

Before we start, I'm going to say a couple of things about the numbers in this very, very long title. When I was writing this presentation, or the notes for it, on September 4th, there were 22,694,113 products. I just want to point out that if you go to the site and look at the search result, or combine some filters and try to count them, you're going to conclude that it's not 22 million. So, just so you don't have to do it, here it is: it's seven million in the search result. That's true, of course, but only because they have a catalog of 22 million products, and they don't have all of them for sale all of the time. But they could, so that's what we're talking about today, not the actual search result. There's also this additional fact you can keep in the back of your mind: a product in Drupal Commerce always consists of at least one variation entity, and usually also an image. So 22 million products is usually around 66 million entities.

With that out of the way, I want to elaborate a bit on the title, which is Pushing Drupal Development Limits. I've done some very, very rudimentary polls in the commerce channel on the Drupal Slack to get a sense of the numbers for other Drupal Commerce sites, and the numbers we're seeing there are not really something we are comparing ourselves against here. So that's an interesting limit to be pushing in and of itself, of course. But being at that scale presents some other interesting challenges and other limits to be pushed, and I think those are interesting to talk about. So that's what we're going to do today.

Pushing development limits will also touch on local environments and onboarding on a project. It will touch on QA, project management, in-house product development, knowing what code is in your contrib folder, and knowing what is in Drupal core. Some of these points are connected to each other in various ways. Those are at least some of the things I want to talk about today.

So, pushing limits. It's not strictly an exercise in making a website run, despite being in the category of big scale, whatever that means. Of course, it could be a short-term goal to create a website with 15 million products, go to the client and say: here you are, here it is, you have 15 million products, and then good luck, Kevin. That is not the case for us. We believe in long-term customer relationships. It's like a partnership, which means we also have to live with the solution we are creating, maintaining it and developing the code base.

At this point I'm a bit worried that some people in the audience are getting nervous, because they came for numbers and graphs and, I don't know, war stories with millions of entities. I just want to put that to rest: don't worry, there will also be that. But at the end of this talk, I hope the attendees here will at least have some inspiration to reflect on different aspects of pushing the development limits. I don't think we have any absolute truths here, or a recommendation for everyone. Some of the points I want to bring up are definitely not something we feel we have solved for our own case. But maybe there are some ideas you can take home to your own large Drupal installation.
Maybe you yourself have a large Drupal installation and some other good ideas that we can learn from. Either way, if that's the case, I'm happy to talk about it. Send me a message, or poke me on the shoulder at lunch or something. That's why I have this slide right here, with my contact details.

It's also the slide where I introduce myself. My name is Erik. I live in Trondheim, in Norway. I work at New Media. We are a digital agency specializing in tailor-made e-commerce solutions. We have more than 15 years of experience making e-commerce solutions, and we work exclusively with Drupal Commerce when delivering web shops to our clients.

So, we're going to talk about pushing Drupal development limits, and this is our agenda. First, the initial project and how it came to be. A bit of history from launch until today. Then a point on the agenda called nerdy numbers. My favorite is the point about development environments. Then scale and QA, scale and project management, and something I like to call being the contrib edge case and being the core edge case.

Our first point on the agenda is the initial project. We deliver tailor-made Drupal Commerce; that's how we ended up with this project. I don't want to go into much detail about the process before we started the project, but I have some fun facts and numbers about the initial delivery. First I should probably say the site is called Akademika. They sell books, a lot of books, very many books, in physical formats and in digital formats. We actually did a presentation about this at Drupal Europe in 2018, which is where these slides are from. Did any of you attend that, by any chance? Not many, okay, that's good, there's not much overlap. So this is the meta slide, slides within the slide.

Some facts: it was an upgrade from Drupal 6 to Drupal 8. It was 2,500 working hours, one project lead and four developers, six integrations with data sources in different formats, and 15 million plus products. The launch date was non-negotiable, right before their busiest period of the year, and the total timeframe was three months. And this slide, well, I think it's supposed to illustrate fairly terrifying numbers and deadlines, pretty much a recipe for a disastrous delivery and an unhappy client. But we made the deadline, the client met their revenue goals that year, and they have kept growing. Here we are, still developing their solution four years later, so that's something.

I also wanted to include this from those 2018 slides, because I found it quite funny when I opened it. It's another slide from that presentation. The first image is a count from their Drupal 6 site of the products they had there, just shy of 10 million. The other picture shows the folder sites/default, which was where the custom modules for that particular project were kept. It includes a fun relic called "Drupal 5 modules", and the number of custom modules was 50, apparently. So yes, luckily we are rid of that.

A bit more history since the launch. After launching the site, we actually went on to launch another site on the same code base. I don't know if you get this from time to time, but it's something clients tend to do. They come to you and say: you know that site you launched? Can we just have a copy of it? Just a different logo and a few changes. Just a copy, it's already done, right?
So it's an interesting exercise in specifications, and you have several options, of course. In my opinion, you really only have two options: either fork it, or use a common code base. How to weigh the pros and cons is, I guess, a presentation in itself. For us, it really came down to the development setup that we have: it would be the same team developing and maintaining the second site as the first site, so it only made sense to use a common code base. In other companies, maybe you can imagine a scenario where you have a handoff to team two, here is the site code base, and you would sync up from time to time with bug fixes or new features or something like that. For the client, a common code base is at least theoretically cheaper, because they pay for features only once, famous last words, instead of paying for them twice, which would be the case for the fork, minus some initial costs, since it's almost the same. So that's what we ended up with. And interestingly enough, that happened again, and we chose to launch a third site on the same code base. So there are now three sites running on this code base.

One way of putting it: in 2018 we had one code base with a bunch of products, roughly 45 million entities. In 2020 we had two sites and roughly 120 million entities. And today we're closing in on 200 million entities in total on that same code base. So that's pretty cool.

That's a good indication that we're about to enter the nerdy numbers part of the agenda. And that's great news, of course; I'm a nerd myself, so that's fun. Having millions of entities is, of course, something that requires some hardware, cloud computing and all of that. Unfortunately I can't go into all of the details within this nerdy numbers point, but here are some keywords on the stack and hosting. It's hosted in AWS on EC2 instances, and it uses Aurora Serverless for the database. I have no idea how to say that in English, so I'm going to say it in Norwegian: Aurora Serverless. There's the AWS Elasticsearch Service for search, and AWS ElastiCache for the caching. So that's the stack.

Then I have a couple of slides with some fun numbers. Here's one: it took one week to set up 10 million products. This number comes from before we had committed to the data model, the search backend and so on. We wanted to do a stress test: how would it even perform with such an amount? So we wrote a script to generate 10 million products, and it took one week. Again, keep in mind that one product is at least two entities, so that's 20 million entities.

We have up to one million product updates per day from many sources. This number varies wildly from day to day, but it represents a busy day for them. The queue auto-increment ID is at about 1.97 billion. That's just a number, I know, but the maximum auto-increment ID in the queue table in the database is about 4.3 billion. In other words, we have used 45% of the auto-increment IDs of the queue table, which I found interesting. As you probably understand, this site uses queues for basically everything, as this number shows; I'll show a rough sketch of what one of these queue workers could look like in a moment. And I guess we have to do something about that ID in about four years or so.

In August, we were processing 66 gigabytes of XML in 7,048 XML files of product import data. I'm not a big fan of XML myself, but it would be very exhausting to complain about XML at this scale. So there you have it.
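Since queues carry basically everything on this site, here is a minimal sketch of what a product-import queue worker could look like in Drupal. To be clear, this is not our actual project code: the module name, plugin ID, and data shape are hypothetical, and the real importer does a lot more.

```php
<?php

namespace Drupal\mymodule\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Processes one product record from the import feeds.
 *
 * @QueueWorker(
 *   id = "product_import",
 *   title = @Translation("Product import"),
 *   cron = {"time" = 60}
 * )
 */
class ProductImportWorker extends QueueWorkerBase {

  /**
   * {@inheritdoc}
   */
  public function processItem($data) {
    // $data is whatever the feed parser passed to createItem(),
    // for example one product record parsed out of an XML file.
    // Here you would look up the variation by SKU and create or
    // update it (more on that lookup later in the talk).
  }

}
```

Items get queued with `\Drupal::queue('product_import')->createItem($record);` and are processed on cron or with `drush queue:run product_import`, which is what makes it possible to chew through a million updates a day without blocking anything else.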
Then I have some numbers here, and some interesting things that happened because of them. Last year, in their busiest period, they had up to 30,000 select queries per second. That's something we identified: maybe this is a bit much, we should improve it. So this year, in the comparable period, they had 3,000 select queries per second. That's a big improvement, and the way we achieved it was by adding a lot more caching, using Redis for that.

But then other interesting patterns start to emerge when you optimize for more caching. During this busy period this year, they were seeing slowdowns in page speed for no apparent reason. The front-end servers were not overloaded in any way, there were no problems with the database, it was scaling like it should, the search backend was healthy and all of that. What actually happened was that we were reaching the bandwidth cap of the Amazon Redis cluster, which is interesting. You can see here a graph of the packet loss: up to 65,000 packets lost per second. And bandwidth is just not something you think about at this point. They were running an 8xlarge instance there, with 32 virtual CPUs and more than 200 gigabytes of memory, which is more than enough for caching this site. But if you zoom out on that, each instance size also comes with its own bandwidth tier, and it's capped at 12 gigabits per second on the 8xlarge. That's what we were hitting, and we had to double that to get to 25 gigabits per second; that's what you're seeing there, when we upgraded the cluster.

Basically, when you're increasing latency and losing packets in your cache setup: congratulations, you have a slower site because of your cache. So that was interesting, and as an extension of this, we're looking into how to not spend that bandwidth on large cache objects next year. Hopefully we can use separate clusters, or local Redis, for the largest cache objects like search results; the settings sketch below shows roughly where that wiring lives.
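For context, here is roughly what the cache wiring looks like in a Drupal settings.php with the contrib redis module. This is a sketch, not our production configuration: the host name is made up, and the per-bin line at the end only illustrates core's standard mechanism for pointing an individual bin at a different backend. Actually splitting the biggest objects onto a second cluster would additionally mean defining another backend service, which I'm leaving out here.

```php
// settings.php: a sketch, assuming the contrib redis module is installed.
$settings['redis.connection']['interface'] = 'PhpRedis';
$settings['redis.connection']['host'] = 'my-cluster.example.cache.amazonaws.com';

// Register the module's cache backend services.
$settings['container_yamls'][] = 'modules/contrib/redis/example.services.yml';

// Use Redis as the backend for all cache bins by default.
$settings['cache']['default'] = 'cache.backend.redis';

// Core resolves the backend per bin, so individual bins can be pointed
// elsewhere. This is the knob you would reach for to move the largest
// cache objects (search results, render cache) off the main cluster.
$settings['cache']['bins']['form'] = 'cache.backend.database';
```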
So those were the nerdy numbers slides. Now we're going to talk about development environments. I don't know how many of you work this way, but a very common pattern and workflow, unfortunately, is that you download the production database and then start hacking away, implementing features and so on. I searched the internet, and here are two kids saying what they think about that. It's a bad idea. Just generally a bad idea, and you shouldn't do it. So we don't do that for any of our sites.

But suppose we looked past that policy and decided that yes, on this project we should use that workflow to develop new features: download the production database and hack away. What kind of numbers are we looking at then? The database dump gzipped is just under five gigabytes, 26 gigabytes if you unzip it, and it takes up about 60 gigabytes on your computer as actual database files. And I wanted to check, for this presentation, how long it would take to import it on my computer, literally this one. It's not a very fancy supercomputer, but I tried it for this talk. First I tried to use `drush sql-cli`, and it crashed because, of course, the timeout is four hours. I should have thought of that. I can get around that by using the MySQL CLI instead. So I tried that, and it crashed after five and a half hours because my disk was full. That's also solvable, I guess; I can clean up some disk space. And in the end I managed to import all of the tables, and it took just over six hours.

What I'm trying to say is: I understand how practical it can be to download the production database and develop against that on a lot of projects, but for this project it's really, really impractical. And that comes with its own problem. The fact that it's impractical means we literally can't do it even if we wanted to, and that gives us different problems while developing new features.

So let's imagine for a moment that our default environment is not this database, but a clean install: no products, no taxonomy terms, no nothing. And we have a ticket to work on, for example implementing a new entity field on the product called first_published. Being a developer, you would open your site, create a couple of products, and then start to write an update hook. Maybe it looks something like the sketch below: load all of the products, set a field on each, and save. Then you run it, and it finishes in a couple of seconds for your couple of products. Ticket solved. Maybe a person reviews it with a similar setup; it runs in a couple of seconds, ticket approved, and you deploy it to production. And I don't know if you know this, but while you're running update hooks, the site is in maintenance mode, of course. I did some rough calculations: if you did this on this project, it would take down the site for about a week and a half, running this update hook. I mean, that's okay if you're shooting for 97% uptime, but then you can only have one of these deploys a year, I guess.
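To make this concrete, here is a sketch of what that naive update hook could look like. The module name and field name are just from this made-up ticket, and the details are illustrative:

```php
/**
 * Naive version: set the new first_published field on every product.
 */
function mymodule_update_9001() {
  $storage = \Drupal::entityTypeManager()->getStorage('commerce_product');
  // loadMultiple() with no IDs loads *every* product. Harmless with a
  // handful of test products; with 22 million of them, this loop (and
  // the memory it needs) is what keeps the site in maintenance mode
  // for a week and a half.
  foreach ($storage->loadMultiple() as $product) {
    $product->set('first_published', \Drupal::time()->getRequestTime());
    $product->save();
  }
}
```

A batched version using the update hook's $sandbox parameter would at least not exhaust memory, but it would still run inside the deploy; the queue-based approach we ended up using for a very similar problem comes up later in the talk.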
So our problem is having relevant development environments. I generated an image called "relevant development environments" for this slide. We have resorted to having a set of scripts that make it possible to import a couple of thousand products, but of course that's not even nearly enough to give you the full picture.

I think this also bleeds into my next point, which is scale and QA. Let's ponder a bit on the nature of QA: what does it actually consist of, and who are the ones doing it? Is QA really the job of identifying and pointing out that an update hook would run for a week and a half? Or is it to look at the deployed site and decide whether or not it passes a list of expectations for the new features and the existing features? If you go back to this theoretical feature with the new entity field, what would it look like for a QA person to test it? Probably one would spin up a new instance based on the code in the branch where you implemented it, and test that, maybe with the setup scripts, and even test the update hook. But if spinning up the QA instance is not done by the QA person themselves, and it does not include applying the code on top of the production database, there is no way to programmatically catch the fact that this update hook would last a week and a half. That makes it very hard to know what to test as a QA person, and it underlines the problem that our usual gateways, checks on feature branches, code reviews, all of it, might not actually catch such a huge problem as this. It might not catch taking down the site for a week and a half, basically. Now, I'm exaggerating a bit here, but I think the point is easy to understand.

Then I want to talk a bit about scale and project management. This problem I was describing, with the QA person not being able to identify certain things, is sort of fine; we can work on our internal processes and try to get better at identifying it, because that QA person would probably be employed at the same company that is doing the development. But a project manager at the client, or a project manager steering the product for the client, is probably employed somewhere else, and is not subject to the same processes and the same direction we use to catch these problems during development. And I just want to point out that this example is purely theoretical; I'm not talking about anyone specific at our client.

So let's say a project manager is focusing on SEO at the moment, and they find that our site doesn't have a sitemap, and we should have one. It's going to boost SEO, it's going to be awesome. They were working with another Drupal website earlier, and they know it's just: install a module, do a couple of clicks, and you get a sitemap, and maybe that actually does boost your SEO. But for this site it's a whole process of analysis, and an entirely different cost-benefit calculation, because you don't really know whether it will boost your SEO, but what you do know is that it's going to take a lot of work, because you can't use the default setup out of the box. So the symbiosis here is completely different. Where on other projects the client could create tickets and you would start working on them in their prioritized order, here we have to identify together what is going to be beneficial and smart and even possible. Maybe something is not possible.

It's also about just knowing the system. For the client, having a site with 20 million products, between five and ten external data sources, and a complex rule-based system determining prices per role, currency, regular sales, wholesale prices; it's an advanced machinery of matrices, actually, for which products should be for sale and in the search at which dates, visible and published, all kinds of things. It's a complex task for a programmer, and it requires a lot of dedication from an internal project manager, and it's specialized knowledge for this site only. So it's not really about knowing Drupal; it's about knowing this exact site. If you switch people here, you're losing a lot of knowledge and a lot of valuable resources for the project. It's a real challenge. What I'm trying to say is that at this scale there's an administrative overhead that is almost a technical administration, and we as consultants need to be extra careful in how we advise on their product development decisions. It's hard to just think about which feature you want and not care about the technical details, which would be fine on many normal-scale projects, I guess.

Then I have a point on the agenda called being the contrib edge case. On the slide it says "country edge case", which is supposed to be a typo, and I found it funny, so that's the agenda point, I guess. The first example I can think of in this category is from Drupal Commerce, which up until version 2.22 would not use SKU as a unique key in the product variations table.
This was a problem for us because, when importing products, we were using the SKU to look up whether we already had a given variation, so we could load and update it instead of creating duplicates. This here is the query that ends up being run when you try to load a variation by SKU, and this is a picture from before we added an index to that table: it took, as you can see here, almost seven seconds. This is from August 2018. So you can imagine having that overhead for every SKU item, on every single product import, every day, all day. There's potential for saving a huge amount of CPU cycles if we can get that down. So we added an index, for example, and now it's zero seconds, which is better. That query was run just last week, so those are actually real numbers right there. And as I said, we currently run one million product imports per day, so if you do the calculation here, per one million product imports this optimization saves us almost 1,700 hours, or 71 days if you will, for every one million products.
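The fix itself is small. Here is a hedged sketch of adding such an index from an update hook; the table and column names match the standard Commerce schema, but treat the details as illustrative, since Commerce 2.22 and later ship the unique key themselves, and entity schema updates are the proper channel for this kind of change.

```php
/**
 * Add an index on the SKU column so import lookups stop scanning the table.
 */
function mymodule_update_8002() {
  $schema = \Drupal::database()->schema();
  if (!$schema->indexExists('commerce_product_variation_field_data', 'sku')) {
    $schema->addIndex(
      'commerce_product_variation_field_data',
      'sku',
      ['sku'],
      // addIndex() requires a (partial) table spec to size the index.
      ['fields' => ['sku' => ['type' => 'varchar', 'length' => 255]]]
    );
  }
}
```

With a million lookups a day, shaving roughly six seconds off each one is where those 1,700 hours per million imports come from.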
Then I have a point here about being the core edge case, and a story about this update right here. It's about when path aliases were converted to entities. First of all, let's get this out of the way: it was a tremendous effort to get this in, and it's an awesome feature. So I'm not complaining about the work done here, by people often working for free and in their spare time. Awesome work, I just want to highlight that. But in this example, I would say we were a typical core edge case.

This is one of the update hooks included in the upgrade from Drupal 8.7 to 8.8, and it's the update hook in question here. In the middle we can see it's going to run in increments of 200, over and over, until it's finished. At the moment this update landed in Drupal core, we had around 15 million entities, I think. That means this update could run 75,000 times. Which, it's a computer, it can probably run fast, right? Well, let's assume a pretty normal shop with 2,000 products: this update hook would take a bit more than one minute, totally tolerable, I would say. But if we do a bit of math here and say it takes around one minute per 2,000, to be nice, it would again amount to about one week of running the database updates on production for this site. Being offline for a week is not really an option, so clearly this update hook would not be sensible to run on production.

So let's look at what we could do. Could it, for example, be an option to not run it? Well, let's look at what the other parts of the 8.7 to 8.8 upgrade did. It switched out the alias storage service; this table is meant to illustrate that core now uses a different table for looking up aliases. So if you don't run the update hooks, you're going to lose the URL aliases. If that was the URL before, it's now going to be this, and the old one is going to give a 404. That's ugly in the browser, of course, but it's pretty catastrophic for SEO, so this is not really an option either. Basically, we can't run the update hook, and we can't not run the update hook. So it's sort of a problem. We have two problems, and they seem mutually exclusive.

So let's look at what we could do with that. First of all, if we look at the start of the update hook, we can see it's going to bail out early if the storage is not an instance of the old AliasStorage class. So we switched that out. That's a good start: now we can run the update hook, and it finishes instantly. And then we replaced the path alias repository service, so we can look up an alias in its new entity table and, if we don't find it, fall back to the old logic that had been deleted from core. But of course, since the conversion hasn't run, the aliases are not getting converted, and we still need to do all of that work that would take one week, somehow. To do that, we added an update hook that puts all of the items into a queue with the Update Worker module. I'm going to dwell on this a bit, because this module was actually created for this project, and it's a very useful module for doing large amounts of work in an asynchronous way. Perfect for looping over a million node IDs or something, and that's what we have here: we're looping over 15 million database rows.

So basically, we had two problems that were mutually exclusive, and solving them created two new problems: we have 404 pages, and we have not converted the path aliases to entities. But these can be solved, and they are at least not mutually exclusive. They were solved by creating a new service with a fallback, and by queuing the work with Update Worker and running it while the site was online. Of course, the most naive way to queue it would be a custom update hook that just loops over the items and creates queue items. But that would also loop over 15 million items and take a long time to run, so that's not an option either. It's easily solvable, though, by creating just that one item, which in turn creates all of the other items. So we have one update hook that runs instantly, and the rest of the conversions are created dynamically; I'll sketch that pattern below.

This approach made it possible to do the migration at a speed of around two million URL aliases per day. I did not have an alert set for when it completed; I was looking at the Slack history, which is my only data point, unfortunately. It was done at the latest 12 days later, but I'm assuming around 10 days. And since that's the amount of time it took, I'm just going to say it's not really acceptable to have the site offline for 10 days. I'm sure there are plenty of other ways you could solve this, not doing two million URL aliases per day, for example. But our goal here was to keep the site online while we were doing these conversions asynchronously and safely. And that way, it didn't really matter whether it could be optimized to take 10 hours, or whether it took 10 days. The site could just operate as if nothing was going on.
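The pattern at the heart of this, one queue item that converts a chunk and then enqueues the next chunk, is easy to sketch. This is not the actual Update Worker module code, just a minimal illustration of the idea, with the plugin ID and chunking details made up:

```php
<?php

namespace Drupal\mymodule\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Converts a chunk of legacy url_alias rows, then queues the next chunk.
 *
 * @QueueWorker(
 *   id = "alias_convert",
 *   title = @Translation("Convert legacy URL aliases"),
 *   cron = {"time" = 60}
 * )
 */
class AliasConvertWorker extends QueueWorkerBase {

  const CHUNK = 200;

  public function processItem($data) {
    // Grab the next chunk of rows from the pre-8.8 url_alias table.
    $rows = \Drupal::database()->select('url_alias', 'u')
      ->fields('u')
      ->condition('u.pid', $data['last_pid'], '>')
      ->orderBy('u.pid')
      ->range(0, self::CHUNK)
      ->execute()
      ->fetchAll();

    foreach ($rows as $row) {
      // Create the corresponding path_alias entity.
      \Drupal::entityTypeManager()->getStorage('path_alias')->create([
        'path' => $row->source,
        'alias' => $row->alias,
        'langcode' => $row->langcode,
      ])->save();
    }

    // Fan out: a full chunk means there is probably more work left, so
    // the worker enqueues its own successor. The update hook only has
    // to create the single seed item, which is why it runs instantly.
    if (count($rows) === self::CHUNK) {
      \Drupal::queue('alias_convert')->createItem(['last_pid' => end($rows)->pid]);
    }
  }

}
```

The seed would be something like `\Drupal::queue('alias_convert')->createItem(['last_pid' => 0]);` in the update hook, and the queue runners chew through the rest while the site stays online.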
So, I have some conclusions and takeaways. What is really the conclusion? I guess the main point I'm trying to make is that scale is not only about hardware, or gigabytes, or requests per second, or something like that. I'm actually going to be so bold as to say that scale is less about that, and more about the human factors. About knowledge, yes, but also about processes, routines, and business decisions for the company operating the site. It bleeds into all parts of a project; everyone is touched by something that might seem like a purely technical aspect of scale, but isn't. So for us, this is a continuous process of learning, and we constantly see that we need to improve and be proactive to stay on top of it. We're not always managing to stay on top of it either, but that's half the fun, I guess. And if you also find that fun, you can send me a message or something. I'm always interested in talking to people; just get in touch, I shared my contact details earlier.

And yeah, thank you. Time for questions.

I will start. You say you import one million products per day. What do you use, how do you import them? Do you only use built-in Drupal modules like Migrate, or something else? And what about performance?

Good question. When we started developing the site, we tried to use Migrate for importing the products, but we were hitting limitations of the framework, basically. It's a complex framework with a lot of additional features that we really don't need in this case, and that slowed it down. With the first iteration of our product import, based on Migrate, we were doing 40,000 import items per day, so we rewrote it from scratch using Drupal queues, and now we are processing one million items. So that's why we're not using Migrate; it's a custom solution, but it's based on Drupal queues. Yes, from core.

A question from Federico: is the front end decoupled? I guess Views are pretty complex and can be slow. It's not decoupled; it's a full-stack Drupal site. The search part is, I guess, a hybrid approach, with a React app there, but we don't use a lot of Views now. We rely heavily on the search backend, which is Elasticsearch, for fetching related products, search results, categories, recommendations and so on. So we're mostly using the search backend for lists, and not Views.

Anyone? Can you come up here, please? Yeah, great presentation, thank you. Where do you save images, on the server, or on some external service like AWS S3 or something else? No, it's actually saved on the server, and it takes up a huge amount of space. Interesting, can you say how much space it takes? No, I can't; I guess I could find out, but I don't know off the top of my head. That's a fun number, I should have put that in, sorry about that. I can check it later. Did you consider migrating it somewhere? Yes, definitely, it's something we have considered, but it's a project in itself. You can't just move that folder to S3, for example, which is the option we are looking at. And I'm not running an update hook for that, for sure; that would take down the site for a couple of weeks. So it's a careful consideration, and it's also about cost, because it's cheaper to just have a huge disk. Okay, thank you.

Thank you for your presentation. On the production environment, when you have to process the queue items, can you decide on a period of time when there is less traffic, maybe? And have you thought of a way to log the entities that are processed by your script? Well, on the first question: it's not the same server that is processing the queues and serving the traffic. Of course it increases the load on the database, but it's an AWS service, and it's supposed to scale, so if traffic increases while we're processing, it should theoretically be okay. Luckily we don't have to think about that; we can process queues all day long, every day, and that's what we do. Your other question was about logging. For the specific update I was referring to earlier, we were not logging it, because it was very easy to see the number of items left in the url_alias table. But I guess you're asking because that could easily fill up the database log, and for our project we don't use the database log; we use a centralized logging system, so we can throw as many millions of log items in there as we want. It's not really a problem. I don't think we logged that specific one, though, no, I don't think we did.
But we could; it would not be a problem if we did, except it would be kind of noisy, I guess, but you can filter that out, yeah.

Hi. Hi, nice presentation. I don't know if you mentioned a solution for exporting the database, or maybe generating content locally for developers? I did not, no. Part of the reason is that I don't feel we have solved it properly, because it generates local content at a scale that is so different from the production environment. But what we're aiming for is to mimic the import process of the production server. We're kind of emulating the XML files coming in and being queued up, and we store the metadata from that in one Elasticsearch index, so you run the same import routine the production server does, only with a few selected XML files that give you so-and-so many products. That's the process; it's highly customized, and it has, I would say, loads of room for improvement. But we don't use any solutions like Devel Generate or something like that, because we want to ensure that all parts of the process are working all the time.

So you're using the database just to save the products, and then the products are retrieved from Elasticsearch for presentation, right? And having the database store so many entities is pretty heavy, and it fills up your database. Have you thought about implementing it in a way where all your entities live only inside Elasticsearch, and not as entities in Drupal? Well, we did consider different options, but it is sort of a hybrid approach. I don't know if it was in any of the pictures, but every book has a lot of things that look like fields: what type it is, who the authors are, who the translators are, who the illustrators are, all of this metadata that you would usually store in fields. But we only store in fields the things we strictly need to store in fields, like the SKU, because we need it for referencing the entity from the order items. All of the other seeming fields come from Elasticsearch, which saves us a lot of database storage. So we sort of have a pseudo-field, which is the metadata from Elasticsearch. That's the hybrid approach. We have actually not considered moving it all over entirely, but it's certainly an option, very exciting to try, I guess. Yeah, we've done it, and it works. Yeah, I'm assuming, yeah, cool.

Anyone else? I'll finish with another one of mine. We talked about products, but do you also have limitations due to orders, maybe with revision issues? No. That's an interesting question. The products are not revisioned at all, and there's an issue in the Drupal Commerce queue about adding revisions to products. I think if that suddenly landed and we ran that update, it would probably increase the database storage on a very steep curve, so we're trying to make the case that this should be opt-in. We're not looking for revision entities on the products, at least not now. If that was the question. And nothing about orders? Orders have revisions in their own way, I guess, but the orders are not at the same scale as the products, so in that way it's a normal site, I would say. We're not doing anything special with the order entities themselves, at least for now.

Thank you. Alright, thank you everyone.