Thank you so much for coming. It's a huge pleasure to be speaking at this Canadian event, and it's so exciting to be in Ottawa. Thank you so much to the organizers for having me here; I'm excited to teach you all about custom imports and custom migrations. We just saw a great presentation on how you do migrations with the Migrate framework. Today we'll cover a slightly more advanced topic: how to do it in a custom way, in case you need a completely custom import and you're facing some particular challenges.

But before we dive into that, let me introduce myself. My name is Anna Mihailova. I'm a web developer with Digital Echidna. We're a Drupal shop located in London, Ontario, and we're very passionate about Drupal, so we're always looking for new talent and new opportunities to connect; find us at echidna.ca.

So you start your job and you end up with a task where you have to do an import or a migration. First of all, you need to consider what type of import it is and what framework you want to use. Of course, there's the great Migrate framework we just heard a presentation on; it's now in core, which is very exciting. However, sometimes you face more challenging tasks. Part of the challenge could be that you need something that happens recurrently, sometimes as often as every 10 seconds, sometimes every night, or you need completely custom control over when it happens, how often it happens, and what exactly it does. If you use the Migrate framework for that, you might run into the challenge of bending the framework into something it's not meant to do.

If you want something very lightweight, like changing just one field on a node, you probably don't want to write an entire huge migration just to update nodes, because the framework right now is mostly focused on creating content rather than updating it or changing the status of a node. If you have multiple external sources that all need to be used together, that's also a good reason to consider something custom. With external sources, you might have part of the data coming from a third-party API, like a REST API or some JSON or XML feed, which then needs to be processed and combined with data coming from a different source before the result is saved into a node. You can split it into multiple sections, but each piece holds some critical information and cannot be saved somewhere outside. You might face the challenge of needing two source plugins, or your source plugin getting very complicated. That's a good case for a custom import, simply because you can't incorporate it in a flexible enough fashion otherwise. And of course, if you're processing a lot of information, say 500,000 nodes or more, a custom import gives you more control over building it in a scalable, architecturally correct way, so your site's performance and resources are only used when there isn't a huge load on your site and your service isn't affected.

So let's say you've considered all of these factors and identified your task as one that calls for a custom import. What are the next steps? That's a good question. Before you start the journey, let's do a quick survey. Who here has ever written a custom module? Okay, good. Who considers themselves a backend developer? Most of the room. Who has any experience with the Batch API in Drupal 7 or Drupal 8?
Okay, so let's do a quick introduction on that. Before we build our custom import, we need to think about how we're going to do it and what the initial requirements, or goals, for the import are.

The first question to ask: is it going to be a periodic or a one-time import? That's a very important question, because we sometimes have client cases where the client says, "We'd like to just migrate everything and forget about it." But is that really the case? Is it going to be a one-time migration? What happens if you have something like a huge newsletter or newspaper site with hundreds of publications throughout the day? You run one migration, and if you didn't plan for a periodic one, you don't get a second chance to re-migrate the data. In that case you might need to make it periodic, so you can run updates in an iterative fashion.

The second question: is it manual or completely automated? In other words, do you need somebody to actually pull the trigger and run the migration, or do you want some sort of robot to do it for you? The key here is the time when the migration or import needs to occur. Let's say you're a higher-education institution and you need to import continuing-education course data every night. You don't want to do it during your peak hours, when students go on the site and register for courses, because the site will slow down and students might get confused by mixed information about when courses are available and what the prices are. You'd like it to run at night, when preferably nobody is on the site. Obviously, nobody will be sitting at the office waiting to pull the trigger at 2 a.m. So to avoid any sort of problems and confusion, consider an automated import, where continuous-integration automation pulls the trigger for you; you just get an email, and in the morning when you're back in the office you can check whether the import ran correctly or not.

For this purpose you can use a cron job or continuous integration, and we'll cover that in more detail a little later. The prime difference is that cron doesn't only do your job; it does a lot of other jobs on a Drupal site. It maintains your Drupal installation, keeps it clean, and does a lot of housekeeping for you behind the scenes that we don't really pay attention to. So if your job is a huge, gigantic, time-sensitive import, chances are that cron just won't suffice and won't let you complete the task in the time you need. In that case, you need continuous integration.

So what's continuous integration? Has anybody heard about continuous integration? Yeah? What tools are you using? What else? Does anybody use Jenkins? No? Jenkins is a butler at your service. I like to call this little buddy "cron on steroids." What it can do for you: it can run any Drush command, or basically any bash script. You set the time, it schedules the job, and if your job fails, it will reattempt it multiple times.
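To make that concrete: the job Jenkins runs for you is typically just your import wrapped in a custom Drush command. Here's a minimal Drush 8-style sketch; the module name, command name, and importer service are illustrative assumptions, not from the talk:

```php
<?php

/**
 * Implements hook_drush_command().
 *
 * Lives in my_import.drush.inc; all names are made up for illustration.
 */
function my_import_drush_command() {
  $items['my-import-run'] = [
    'description' => 'Runs the custom import (triggered by Jenkins or cron).',
    'drupal dependencies' => ['my_import'],
    'aliases' => ['mir'],
  ];
  return $items;
}

/**
 * Command callback for my-import-run.
 */
function drush_my_import_run() {
  // Keep the command thin: hand off to a service that does the real work.
  \Drupal::service('my_import.importer')->run();
  drush_print('Import finished.');
}
```

Jenkins then just schedules "drush mir" on whatever interval you need and retries it if it fails.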
The good thing about it is that it's very easy to install on any server environment, and basically all you need to do is write a custom Drush command like the one above. Wrap your import in a custom Drush command and Jenkins will run it for you; you don't have to worry about learning any new language or integrating with Java or whatever else might be behind your server. You just say, "Hey, run this Drush command for me, and by the way, clear the cache every 10 seconds," and it will do it for you.

The second part is to think about the architecture. When it comes to architecture, it's not only the architecture of the content types, which is what we call Drupal architecture when we do the site building of our new D8 site or when we evaluate the site building of the site we're migrating (say a Drupal 6 or Drupal 7 website), but also the architecture of your backend classes. That's very important, and the reason is: don't reinvent the wheel. If you can extend and reuse something that's currently in core, do that; reuse the traits that are in core, because you get all the support of the community, since they maintain it, and you don't have to worry about reinventing something or writing it from scratch. That saves you time. And secondly, you can build your class hierarchy so that there's a base class you extend and reuse for multiple parts of your import.

The third part of the puzzle is sources. When it comes to sources, talk to the client, or define your own requirements, about which sources need to be used for the import. There can be so many different ones that it's hard to imagine: a JSON feed, XML, a third-party API integration, a legacy database integration, something completely custom. You need to identify these sources before you start, so you can plan your source plugins and source classes accordingly and figure out what parts you need, how you're going to query, and how you're going to get access.

Once you've gathered all of your requirements, you're ready to start the journey. So what are the options? The first option is a manual import, and it's the easiest one. All you need is your import code plus some sort of UI. This option is good when you're creating something for editors to reuse later on. Let's say you have a newspaper client who wants to import a bunch of new articles, or an almanac or archive of articles, once a year. That's a very easy option: they upload the articles into a certain folder, you have a CSV of all the articles with titles and descriptions, and you trigger it through the Drupal UI; you provide some sort of form. And that's it. You watch the progress bar of the Batch API running, the editor sees how it's all processed, and everything goes amazingly.

The second option is a cron job. There are some limitations. Depending on your hosting, the provider might limit a cron run to about three minutes. If your import takes longer, that's okay; it will pick up where it left off next time. However, your cron may run once an hour or even less often, depending on the load on your site and how popular it is; some hosts like to put environments to sleep if there aren't a lot of visitors.
So consider that a cron job does not guarantee timeliness. By that I mean it doesn't guarantee your import will be processed in a timely manner. It might process only half of the import, and the rest of the items will just sit there waiting for another cron run that might never happen. If you have time-sensitive information, say prices on something, you definitely want them updated promptly rather than lagging a few days behind. And the last, most complicated option is continuous integration, which is Jenkins, and we'll cover that last.

So let's talk about sources. The simplest scenario is when you can coordinate your actions with the client or IT and get your sources dropped into the Drupal files folder. This is the easiest way, because Drupal is aware of the files and you don't have any security implications. They're uploaded manually, or automatically over SFTP; you know where they are, so you just pick up the files and process them. The file formats can vary, but the easiest ones are JSON, the best supported, and CSV. XML is a bit of a legacy format, so if you have a lot of XML files they may be very difficult to process, depending on whether they use a custom schema there, and the native PHP parser isn't always sufficient.

But there are scenarios where files cannot be dropped into Drupal, and that's the most complicated part: when your file sources live outside of Drupal. You might have an external feed, which is fine when it's a public feed and you don't need special credentials to access it. It could be an external database, in the best-case scenario a MySQL database, but it could be MSSQL, or an external file server with SFTP or FTP access.

Once you've defined your sources, we can talk about architecture. Everything, as in a house, starts with structure, and structure starts with the foundation. So let's talk about which parts of the structure we need to identify before we run the import. First of all, the content type structure. And it's not only the content type structure of your new site, but also the structure of your source content types: how you map the fields, which fields go where, and what other entities might be related to your content types. Let's say you get a request to publish articles. It sounds very easy and simple: you have news articles that go into your Article content type in Drupal. But there are hidden obstacles that might not be visible immediately. Does this article have a tagging system? Are there article categories? Are there authors associated with the article? You see how it grows from one simple content type to related taxonomies, related user entities, and maybe sometimes a custom entity.

The second complication is translations. Identify whether your current site or your source is multilingual and how you're going to handle the translations. It's great when you have a Drupal-to-Drupal migration from Drupal 6 or Drupal 7, where you can map the tnid or pull from the entity translation tables. However, if you have two different feeds, say two XML feeds, one in English and one in French, and you have to merge them together, then you need custom processing to maintain the mapping and attach the sources to each other, so they end up as one translated entity rather than two separate entities. And sometimes it's not possible.
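Stepping back to the files-folder scenario for a moment: picking up a dropped CSV can be as simple as this minimal sketch (the imports subfolder, file name, and queue name are assumptions for illustration):

```php
<?php

// Resolve the real path of a CSV dropped into Drupal's public files folder.
$path = \Drupal::service('file_system')->realpath('public://imports/articles.csv');

if ($path && ($handle = fopen($path, 'r')) !== FALSE) {
  // Assume the first row holds the column headers.
  $header = fgetcsv($handle);
  while (($row = fgetcsv($handle)) !== FALSE) {
    // Key each record by column name and queue it instead of saving
    // nodes right away; more on the queue approach later in the talk.
    \Drupal::queue('my_import_save')->createItem(array_combine($header, $row));
  }
  fclose($handle);
}
```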
On that note about translations: I had a scenario where I had to migrate press releases, one feed in French and one in English, and both feeds had unique IDs that just did not map together. In a case like that, you may need to step away from translations and simply create separate English and French content types.

Also consider the amount of information you want to migrate or import on a daily basis. A good example would be prices again. Let's say you have a bookstore with a lot of prices, and for bookkeeping or other purposes you need to keep the history of the prices and how they change over ten years. Say the bookstore has 20,000 inventory records; over two years, that becomes 200,000 records. If you want to represent them in Drupal views or something like that, it will be a huge performance hit on your database, because that's a gigantic query. And if every book has, say, 20 to 30 fields (SKU, title, description, author, other bibliographic metadata fields), it could be impossible to handle all of those joins across that number of books at all.

In a scenario like that, when you know you'll have a very complicated content type and a lot of data, it may be good to opt for a custom entity. So what are the advantages of custom entities? You can design your own architecture in the database. You can build your own table, it can be a flat table, and you can add indexes wherever you want, so it's fast to query and doesn't need all the joins Drupal gives you by default. Also, if you don't need translation for it, you can make the entity completely untranslatable and kill the translation overhead. Another advantage: if people only access the book inventory occasionally, you don't really want to overload the node table with data that's used once a year, let's say. So consider this option if you have large volumes of data, you want to query it efficiently, and you can see your site basically choking on the amount of data.

The second part of architecture is the structure of your feed. There are a couple of challenges here, and the first challenge is that everything changes. We live in a changing world, and feeds change too. One head of the IT department created a feed; then they wanted to integrate a second feed with completely different fields, then another one and another one; it grows exponentially and becomes really hard to maintain. So when you create your import, communicate that the structure of the feed needs to be set in stone, or at least that if it changes, you have to be told about the changes so you can incorporate them into the import.

The second stone you can hit is the format and encoding of the feeds you're getting. By that I mean: encoding is very important for PHP. PHP is really only comfortable with UTF-8. However, the source file can come in multiple encodings. It can be Latin-1, it can be Unicode, it can have some weird characters, and as a result you'll get broken content. So talk about encoding and about how you can make the feed as easy for Drupal to consume as possible. There are a couple of libraries that can help you with encoding; I use an encoding library a lot, but it can only clean things up to an extent. Sometimes there is manual work that needs to be done for the import.
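As a rough example of that cleanup step: the talk doesn't name a specific package, so the neitanod/forceutf8 "Encoding" class used here is an assumption, but it's a common choice for exactly this job:

```php
<?php

use ForceUTF8\Encoding; // composer require neitanod/forceutf8 (assumed package)

// A minimal sketch: normalize an incoming feed to UTF-8 before parsing it.
$raw = file_get_contents('public://imports/feed.csv');

// toUTF8() converts Latin-1/UTF-8 mixtures to clean UTF-8; fixUTF8()
// repairs strings that were accidentally run through UTF-8 encoding twice.
$clean = Encoding::fixUTF8(Encoding::toUTF8($raw));

// A plain-PHP fallback using only the mbstring extension:
$detected = mb_detect_encoding($raw, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], TRUE);
$clean = mb_convert_encoding($raw, 'UTF-8', $detected ?: 'ISO-8859-1');
```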
However, if it's a recurring import, if it happens every night or even more often than that, like every ten minutes, then it's very important that you have a clean feed coming in.

The next thing is access credentials. This becomes very important when you're pulling from external sources: an external data server, an SFTP server, anything like that. If you don't have the right access credentials (you might have a login and password, but they don't give you the needed access level), that can be very difficult to debug and identify. The classic situation is that everything works in your local Drupal environment, and when it comes to the development server, nothing works, simply because the access credentials fail to log in.

So, as we just covered, access is very important. The questions to ask the IT people are: how am I going to get access, what protocols are you using, and what libraries will I need? If you have a very complicated XML feed, I recommend QueryPath. It's a very lightweight library that parses XML the way jQuery parses the DOM, so you can query it in exactly the jQuery fashion, and it makes it very easy to pull the data out of any XML feed. The encoding library, as I mentioned, converts pretty much any feed to UTF-8 for you, but like any library it has its limitations. If you're hosting the site yourself, make sure you have all the required PHP extensions installed, like cURL; sometimes it's not there on the server. And make sure you have all the permissions and credentials sorted, and that incoming files arrive under the right user, group, and owner, so that Apache knows about the files, if Apache is your web server.

Now, the foundations of the import. The import itself is very easy. There are only three steps: get data, parse data, and save data. What could be easier? You just get it, parse it, and save it into Drupal. However, there are problems. The first problem is the time the import takes. Of course, you don't want to provide a batch interface where your editor has to sit and stare at a progress bar for two days. That's just not going to happen. So you need to think about how to make it scalable, how to process it in multiple chunks, and how to make it efficient. The second problem is your server resources. Let's say you're processing thousands and thousands of records at the same time; your server takes a big hit from that. If at the same time you have a big registration day for your online programs, the server is not going to be happy with you, and neither will the end users.

The foundation I'm going to show you now will help you overcome these problems. We'll take the approach of the Batch API plus queues. The queue is essential to this idea. So, who has experience with the Queue API? Let me explain a little about what queues are. Queues in Drupal, to put it very simply, are just records in a database table, each holding some data and a timestamp. When Drupal is about to process an item, it grabs the record from the database, claims it (marking it with an expiry), parses it, and does whatever the queue processing callback tells it to do; once the item is done, it gets deleted from the database. So what are the advantages of a queue? The advantage of a queue is that you don't have to keep everything in memory.
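In code, that grab-process-delete cycle looks roughly like this (a minimal sketch; the queue name is made up):

```php
<?php

// Get (and create, if needed) a queue backed by the database.
$queue = \Drupal::queue('my_import_data');
$queue->createQueue();

// Producer side: an item is just a record with serialized data and a timestamp.
$queue->createItem(['uri' => 'https://example.com/my-awesome-file-1.csv']);

// Consumer side: claim an item (leasing it for 60 seconds), process it,
// and delete it on success, or release it so it can be retried later.
if ($item = $queue->claimItem(60)) {
  try {
    // ... process $item->data here ...
    $queue->deleteItem($item);
  }
  catch (\Exception $e) {
    $queue->releaseItem($item);
  }
}
```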
Everything is stored in your database, ready to be processed whenever you have the time and resources to process it. That makes it very lightweight, and if something fails, you can always come back and re-process that item. And you get good feedback and records of which items failed and why.

Batch is basically the same idea, and I like to marry the two together, so that batch processes my queue items. Batch is, let's say, a progress-bar wrapper on top of your queue, which lets you define a set of operations to be performed on certain items. So batch doesn't just handle the processing itself; for me, it handles queueing the items, processing the queue items, and then finishing the job. That's when we do a manual import. When you're not doing a manual import, you can hand everything off to Drush, and Drush becomes your batch.

So here's the approach: three queues. Queue number one just has all of the files queued. The idea behind this: let's say you have five CSV files and you'd like them all to be part of one source import. You queue all five files without reading any information from them, just the locations of the files. Say I have URLs like google.com/my-awesome-file-1 and google.com/my-awesome-file-2. All the first queue, which I call the data queue, will do is queue up my-awesome-file-1, my-awesome-file-2, my-awesome-file-3, only the paths to them, and then stop. Why? Because then, when the server is ready, I can move on and process the rest of the queue items.

The second step is to get the information out of each file. Let's say the server is ready to move on, it's not loaded, the resources look great; I grab my-awesome-file-1, curl out to it, get all the information from the file, and add it to the second queue. That queue just holds the big blob of information, and at this point I don't care whether the information is correct or not; if there is information, I just queue it. Then, when I'm processing this queue, it parses the blob and splits it into multiple rows. So let's say I have a blob of CSV data: I verify it and run a couple of validation callbacks. You can insert extra queues in the middle if you want; if you have a lot of complicated validation, you can have an interim step that queues the data after you've validated it. Then you split it row by row, and you still don't save anything into Drupal; you just save each row into a queue. Why? Again, if you have a complicated row, let's say hundreds of columns in your CSV, you don't really want to parse every single field and map it into the Drupal content type in one step, because that can take a lot of memory and resources. So all you do is split up the CSV, maybe validate the data, and if everything is great, you push it into the third queue, the save queue, where the data actually gets saved. When the save queue grabs an item, it knows it has just one row of the file, so that's where you do your processing: you process the data, parse it out so it goes into the correct fields in Drupal, and then you save the node.
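Put together, the three-queue hand-off looks something like this minimal sketch (queue names and the CSV handling are illustrative, and the separate blob queue from the talk is compressed into one step here for brevity):

```php
<?php

// Step 1 - data queue: queue only the file locations, nothing else.
$data_queue = \Drupal::queue('my_import_data');
foreach (['https://example.com/my-awesome-file-1.csv',
          'https://example.com/my-awesome-file-2.csv'] as $uri) {
  $data_queue->createItem(['uri' => $uri]);
}

// Step 2 - when the server has capacity, fetch one queued file, split
// the blob into rows, validate, and queue each row individually.
$save_queue = \Drupal::queue('my_import_save');
if ($item = $data_queue->claimItem()) {
  $rows = array_filter(explode("\n", file_get_contents($item->data['uri'])));
  foreach ($rows as $line) {
    $save_queue->createItem(str_getcsv($line));
  }
  $data_queue->deleteItem($item);
}

// Step 3 - save queue: each claimed item is one parsed row; map its
// columns to fields and save the node, one small operation per item.
```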
So what are the advantages of this approach? You can claim multiple queue items from multiple queues at the same time and process them in parallel. Even when you process them in parallel, every single node in Drupal is still saved sequentially, so you're not losing any data; it's still a lossless approach, but you can speed it up and have multiple threads processing your queue items and saving them in parallel if you need to. And you can vary the number of items you process in one step. Let's say you have a lot of resources and you want real speed: just increase the number of queue items you take per step. If your server is slow and you need it lightweight, just take one item at a time. You have that flexibility with this approach.

Now we can look into the triggers. As I said, if it's manual, the trigger can be a custom form. Don't make the mistake of putting everything into the batch. Your batch only needs to populate the first queue, and the queue worker for the first queue will populate all the subsequent queues. That way your batch is very fast, and all it does is populate the data queue for you. Or you can put it on cron; that will also be very lightweight. And if you're using Jenkins, you wrap it into a Drush command.

With that, let's look into the code. And I apologize, I definitely mashed together classes and procedural functions, but it's just for the slides. This is how you would do the manual one: you see the configuration form, and it's a very simple one. The advantage is that you can also expose a couple of config options, like, say, "run overnight," or server credentials that might change every three months, something like that. You set them up there and read them from configuration later on. If you're not doing it manually and you're doing a Drush command, you define it much as you would in Drupal 7: a command name, a description, your callback, and any dependencies. I put my module in as a dependency, because you definitely need your import module enabled. And I like aliases and shortcuts; that's why I put an alias in there.

I love that Drupal 8 is object-oriented and I can now extend everything, so I go with an approach of three classes. In my case you can extend it further, but three classes is usually where I start. You have a base class, and then your manual class and your cron class. The advantage of that is you write your code just once, in the base class. And if you want to change something, you don't have to wander around multiple files and figure out where the code lives; you know the base functionality is defined in your base class, and you just make some tweaks in the cron and manual classes. Here, I didn't add any code to the cron or manual classes. They just extend everything from the base, and in a simple case I don't have to put any code in them at all.

On top, you see the annotation. In Drupal 8 we moved to plugins, which is great; it's a very extensible system, and every plugin needs an annotation. QueueWorker is the annotation here. The only difference between the manual one and the cron one is that the cron one has a time. When you have the cron property in your annotation, you are telling Drupal that this particular queue needs to be processed on cron. So when cron runs, it will process your queue.
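A minimal sketch of that three-class setup (all names are illustrative, and in a real module each plugin class lives in its own file under src/Plugin/QueueWorker so Drupal's plugin discovery can find it):

```php
<?php

namespace Drupal\my_import\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Base class: all of the shared import logic is written once, in here.
 */
abstract class ImportQueueBase extends QueueWorkerBase {

  public function processItem($data) {
    // Parse $data and save it (a node, a custom entity, etc.).
  }

}

/**
 * Manual variant: no cron key, so you claim and process items yourself.
 *
 * @QueueWorker(
 *   id = "my_import_manual",
 *   title = @Translation("My import (manual)")
 * )
 */
class ImportQueueManual extends ImportQueueBase {}

/**
 * Cron variant: identical, except the cron property tells Drupal to
 * process this queue on every cron run, for at most 60 seconds per run.
 *
 * @QueueWorker(
 *   id = "my_import_cron",
 *   title = @Translation("My import (cron)"),
 *   cron = {"time" = 60}
 * )
 */
class ImportQueueCron extends ImportQueueBase {}
```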
If you don't have the cron property, then you have to handle claiming the items manually. The other properties are just simple ones: the ID, which is the machine name, and the title, which basically identify what the queue is and what it does.

So in our getter queue, we just parse the data blob. Remember, we queued all of the files; we got the data blob, and we parse it. This queue populates the setter queue: in the CSV example, it parses line by line and saves each line into the setter queue. And this is the setter queue. I didn't include its code, because it will be different from import to import: it depends on your content type, on how you're going to architect it, and on whether you're saving into custom entities, into nodes, or anything else. You can also just save out files; we had a project where we had to generate a bunch of JSON files on import, because a JavaScript library would read from them. So you do that in here, and it's the same approach: you have a cron class and a manual class that extend the base one. And that's it. After you save in here, this is the end of your import. Very easy: you get, you parse, and you save. And we're done. Any questions? Really? Yeah.

[Audience question.] So you have a REST API, and you want Drupal to reflect it every time the REST API's data changes? Okay, yeah, that's a different approach. You're asking how you reflect the REST API in Drupal, right? It doesn't have to be an import. If it is an import, you will have a defined time when you check on the REST API. If you need an immediate, real-time reflection, then you need a webhook or something like that: either something keeps checking the REST API, or the REST API sends you a webhook that gives you the trigger to update the data on your website. In a simple scenario, where you only need a small piece of the REST API (we had a countdown timer that read from a third-party site), you can just poll it from Jenkins on a timer, but that's only suitable if you're pulling just a little bit, like one small box, because you definitely don't want to be killing the cache and everything like that. So webhooks would be the ideal approach if your API can send a webhook. Or you can simply read from the REST API in the controller for that particular page, get it as a rendered feed, and parse it in your Twig template. Then you don't really store it in Drupal at all; you just render from the third party every time. Did I answer your question? Yeah. Any other questions? No? Well, thank you so much, it was a pleasure.