My name is David Turden. I work for the Ministry for Culture and Heritage in New Zealand, and we've been embarking on a project for the last little while: migrating from Drupal 7 to Drupal 10. We engaged Jonathan from Catalyst to upgrade our data model to something more appropriate. So this is a presentation on, basically, how that's gone. We're not quite live yet, but we're not far away.

Yeah, so I'm Jonathan Hunt. I work at Catalyst, based out of Christchurch, and I've been working with Drupal for 18 years or thereabouts; 4.6, I think, was the first version I started with. I have a particular interest in the cultural heritage and GLAM sector (galleries, libraries, archives, museums), so there was a pretty good fit with the kind of material that MCH were working with.

So obviously NZ History is a complex site. There are lots and lots of content types, lots and lots of information in various different formats. Some of the challenges are there on the screen: the data model was inconsistent. There wasn't really a data model to speak of; there were content types, and that was about it. We had a whole bunch of files and media that had been added to the site on an ad hoc basis, for various reasons, over a long, long period of time, again inconsistently, and we needed to get that under control.

NZ History is a government site that talks about New Zealand history in a bunch of different formats, primarily stories, events, biographies and places. There's a little screenshot of one of the events, the Hawke's Bay earthquake in 1931. We collate a whole bunch of information based on tags, which we call keywords, and this allows us to present information collectively based on what it has been tagged with: events, images, biographies, places, and so on and so forth.

Our new Drupal 10 site has approximately 41,600 nodes and 18 different content types. Just as an aside, NZ History started in 1999 as static HTML, moved to Joomla, then to Drupal 4.7, 5 and 7, and now finally to 10.

A few other points worth making. The information on NZ History has spawned several books, including one called Today in NZ History, based on all the events stored in the website. It contains several external databases of information: for example, all the suffragists who signed the Suffrage Petition in New Zealand in the 1890s, a database of war soldiers, and a database of all the people who signed the Treaty of Waitangi. There's also a large section of the website aimed at schools, teachers primarily but also students, and there is scope to expand that in the future because New Zealand history has now been added to the NZ curriculum, which is a new thing over the last year or two. Weirdly enough, we never used to teach New Zealand history in NZ schools.

So that's the homepage we have; obviously it hooks off to a whole bunch of little rabbit holes. The primary aims of the upgrade were to retain all the functionality, to simplify everything as much as we could, to align with Drupal standards, to reduce custom code (which is always a bit of a challenge), and to make it a bit more sustainable. We wanted to align it with other sites that we have in-house as well: things like consistent setup and use of modules and so forth.

Thanks, Evan. So, just to outline some of the challenges that we ran into. As you can see from the history that David provided, some of the content has been around for almost 24 years.
So the site's gone through a lot of evolution, and over time it had accumulated all sorts of dead ends: things that had been used temporarily or experimented with and were still sitting in the code base, various content types, and different vocabularies that may have been useful at some point but didn't necessarily continue to be useful.

So you spend a lot of time analysing the D7 code base and looking at the data structures, doing lots of exports of the database as CSV and often putting that into tools like OpenRefine, which lets you quickly facet by the values in a column. It's a great way to get a sense of the shape of the data you're potentially working with, especially where there may or may not be institutional memory. A lot of this was before my time, so you might not have people on tap, you might not have subject matter experts who can explain all of the ways a particular field might be being used, because sometimes there may be theme functions or whatever that rely on a certain value, and it's not always obvious that that's the case.

And then there were some data model challenges. For example, there was a D7 content type called Suffragist, but in practice that referred both to the signature by a person on a certain sheet of the petition and was also used as the foundation for a biography about that person, so that node was doing a lot of work. Treaty Signatory was treated in a similar way: there was both information about which sheet of the Treaty of Waitangi they signed, on what date and where, and also a biography of that person and their whānau and their whakapapa.

We had a lot of images in the system, but some of those images were treated specially: because they were images of New Zealand war memorials, they were showing up on a map of war memorials and so forth, so it was really stretching the definition of what an image describes.

There were also a lot of files, tens of thousands, and numerous unmanaged files. Those were files that had been put directly onto the file system and then referenced directly in markup; they weren't managed by Drupal and they weren't treated as Drupal media. Often they had complex markup around them: they might be hyperlinked, they might have divs around them that placed them at the edge of the content, all that kind of stuff. So there was a lot of, not ad hoc, but tricky markup, and that's a challenge for migrating into a Drupal 10 structure. There was also lots of embedded content and lots of rich media: panoramas, tiled images, audio, video, all sorts of material. There was PHP in node content, which is often problematic, and NZ History, because of its nature, has some quite complex navigation.

So, going into some of the details of how we dealt with those: first was the data model. The Drupal 7 site linked a lot of things together by just plain tagging. At least in some cases these tags were typed, so there was a distinct field that would say whether a tag was a place or a person, and we leveraged that in the migration to Drupal 10. Rather than all these things being in a single taxonomy, we could break them out into specific Drupal 10 vocabularies of person, geographic location, temporal subject, et cetera.
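That split can be sketched as one migration per tag type, each filtered on the D7 type field and pointed at its own destination vocabulary. This is a minimal sketch only, not the project's actual config: the keywords bundle and field_tag_type field names are hypothetical, and skip_on_value comes from the migrate_plus module.

```yaml
# Minimal sketch: one of several per-type migrations splitting a single
# D7 tag vocabulary into distinct D10 vocabularies.
id: nzhistory_taxonomy_person
label: 'Person terms from typed D7 tags'
source:
  plugin: d7_taxonomy_term
  bundle: keywords            # assumption: the D7 vocabulary machine name
process:
  # Skip any row whose type field doesn't mark the tag as a person.
  # skip_on_value is provided by migrate_plus.
  type_gate:
    plugin: skip_on_value
    source: field_tag_type    # assumption: the D7 "type" field
    method: row
    not_equals: true
    value: person
  name: name
  description: description
destination:
  plugin: entity:taxonomy_term
  default_bundle: person      # the new, distinct D10 vocabulary
```

Repeating the pattern with a different value and default_bundle gives the geographic location, temporal subject and other vocabularies.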
Most of these we simply borrowed: the existing configs for the taxonomies came from the Islandora project, which is a Drupal distribution for a digital repository, and it comes pre-built with a bunch of taxonomies that are useful for cultural heritage.

For something like the Drupal 7 Suffragist content type, we split that up in Drupal 10. It became a suffragist signature, a content type describing the signature, who made it and where it was on the petition, and then, separately to that, a mapping from that signature to a person entity in the person taxonomy. That person entity can have date of birth, date of death, birthplace, biography, all of that kind of information, or at least a link to a biography. Similarly with the biography content type in Drupal 7: we split that into a biography node, which is the narrative, plus a relationship to a person entity. That's because when you conflate those things you have a poor data model. You want to make assertions about the person, but they don't apply to the biography: the person might have a date of birth, but the biography has a date of creation and an author and that kind of stuff, and if you've conflated those things it gets very difficult to make useful statements about them.

An example is Kate Sheppard. The suffragist signature is a certain node, and the signer is the person Kate Sheppard, which is a different idea, a term in the taxonomy. Similarly, there's a biography node about Kate Sheppard, but it maps to the same taxonomy entity. We achieved that in the migration: we run a migration across the biography nodes and build the D10 taxonomy terms for people, and then we run another migration over the same material which generates nodes. That way we've just got multiple migration YAMLs that have different destinations.
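A minimal sketch of that same-source, different-destinations pattern, with hypothetical migration IDs and a hypothetical field_person reference field:

```yaml
# Sketch 1: build person taxonomy terms from the D7 biography nodes.
id: nzhistory_person_from_biography
source:
  plugin: d7_node
  node_type: biography
process:
  name: title
destination:
  plugin: entity:taxonomy_term
  default_bundle: person
---
# Sketch 2: same source, different destination. Build biography nodes
# and link each one to the person term created above.
id: nzhistory_biography
source:
  plugin: d7_node
  node_type: biography
process:
  title: title
  'body/value': 'body/0/value'
  field_person:                # assumption: the person reference field
    plugin: migration_lookup
    migration: nzhistory_person_from_biography
    source: nid
destination:
  plugin: entity:node
  default_bundle: biography
migration_dependencies:
  required:
    - nzhistory_person_from_biography
```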
After migrating treaty signatures and things, we've got over 31,000 person entities in there, and they're all distinct. One of the reasons for doing that is so that we can actually share schema.org data, so if I click on here we should... no, not that one, this one.

So just to give you some idea, all of this can be serialised out as JSON-LD, linked data, which means you've got a structured data format describing the person or the place or whatever, and that's really useful in the GLAM sector as a standard to share across other cultural sector organisations. But one of the challenges of having 31,000 person entities is that Drupal's taxonomy administration tends not to like that, and potentially won't even render, when you've got thousands of terms, because by default it treats things as hierarchical, and a person taxonomy isn't necessarily hierarchical. We worked around that by building our own admin view, in this case using a module we've got on GitHub at the moment called CCA Taxonomy Manager. That basically indexes all of the terms into Search API; by default it just goes into Search API DB, but you can just as easily index into Solr, and I find that interesting because Solr has some search algorithms that would be really performant for name matching, for example, so there are potential improvements we can go on to. And with a custom view we could add things like Views Bulk Operations: if you've got tens of thousands of terms, there's every chance you'll have duplicates, so one of the things we can do is term move or term merge based on Views Bulk Operations. Just to give you an idea, that's the kind of view that we have. We tried to emulate some of the features of the default taxonomy administration, so you can add a term from this page, but it also tells you how many you've got, there's a search, and you can use the bulk operations to select a few terms and then start running actions on those.

The migration sequence that we followed is pretty standard. We had to do some prep beforehand: because we had all these unmanaged files, our solution was to scan the file system and generate a JSON artifact out of that. We also had to bump things like file IDs and media IDs: we tried to preserve the IDs across D7 and D10 because it makes debugging a whole lot easier. Then we moved on to migrating all the taxonomy terms, migrating files and media, and migrating nodes, and the migration of nodes often included changing image references into media embeds. Finally we migrated comments, because they're attached to nodes, and then URL aliases. That's a fairly typical progression for most migrations.

For the unmanaged files, we wanted to model these as media. Part of the rationale is that we could have just copied all of the unmanaged files across, but that kind of persists the problem, right? It means you're bypassing the whole point of a content management system, because it's not actually managing half the material, and there's risk in having unmanaged files on the file system. The other thing is that once we started to embrace the media model in D10, we could start attaching attribution fields, things like source and copyright and various other attribution. We based the model for that on Creative Commons, which has some best practices for attribution, and that way we've got some consistency: whether you're dealing with an image or a video or a panorama or anything else, there's some commonality in the data model around it.

As part of the migration we ran a preprocess that would scan through the file system hierarchy, ignore certain file types that we didn't particularly care about, and generate some JSON. We would store that JSON, and it became the feed into the next migration, which read the JSON and started building files in D10.
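A minimal sketch of that step, assuming a hypothetical artifact location and JSON shape, using the url source and JSON data parser from migrate_plus:

```yaml
# Sketch only: turn the filesystem-scan JSON artifact into managed D10 files.
id: nzhistory_unmanaged_files
source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: json
  urls:
    - 'private://migrate/unmanaged_files.json'   # assumption: artifact path
  item_selector: files
  fields:
    - name: path
      label: 'Path relative to the public files directory'
      selector: path
    - name: mime
      label: 'MIME type recorded by the scanner'
      selector: mime
  ids:
    path:
      type: string
  constants:
    scheme: 'public://'
process:
  filename:
    plugin: callback
    callable: basename
    source: path
  uri:
    plugin: concat
    source:
      - constants/scheme
      - path
  filemime: mime
destination:
  plugin: entity:file
```

The MIME type carried through here is what lets the follow-on media migrations decide whether a given file becomes an image, video, audio or one of the custom media types.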
Then, using the same source and the MIME type, we started building the particular media types that we wanted: obviously image, video and audio, but also a bunch of other ones that we needed to deal with.

We used the Media Migration module, and within that there's a process plugin, the image-tag-to-embed filter, that will take an image tag from the markup of, say, the body field of a D7 node and replace it with a Drupal media embed that CKEditor 4 or 5 knows what to do with. That was a big part of it, and we extended it, because we then had to pass through a lot of the context: in the D7 markup there might be attribution, or alignment, or sizing, or a whole bunch of other attributes that we had to extract from the D7 markup and encode as data for CKEditor to look at (there's a rough sketch of that pipeline below). And in a few cases, and it's a bit ugly but it seemed to work, while we were migrating the nodes we were lifting data out of the markup of the node, data that had a better home back on the media. So even though we're doing a node migration, as a side effect of that node migration we'll potentially track down media in D10 and add data onto the media. That does rather abuse the migration separation of concerns, but on the other hand, when you've got this kind of really challenging source material, it was a way of trying to preserve the data in the best possible D10 data structures. Having said that, image alt data is still somewhat of a challenge.
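As a rough sketch of what that body-field pipeline can look like: the plugin name below follows the talk's description of Media Migration's image-tag-to-embed filter, so check the module for the exact plugin ID, and the text format name is hypothetical.

```yaml
# Sketch only: node body processing that rewrites D7 <img> tags into
# <drupal-media> embeds during the node migration.
process:
  'body/value':
    -
      plugin: get
      source: 'body/0/value'
    -
      # From the Media Migration module: rewrites <img src="..."> into
      # <drupal-media data-entity-uuid="..."> embeds. The project extended
      # this to carry attribution, alignment and sizing attributes across.
      plugin: img_tag_to_embed
  'body/format':
    plugin: default_value
    default_value: rich_text   # assumption: the D10 text format
```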
There were also a lot of iframes in the D7 site, a lot of YouTube and Vimeo embeds, and we could map those in D10 using the core remote video media type. For Brightcove we used Media Remote, which gave us a plug-in structure that we extended and used for a bunch of other external content that wasn't readily available in a contrib module. That included things like NZ On Screen: you can extend the Media Remote formatter base class and basically just describe a new formatter that knows how to render the kind of markup that an NZ On Screen embed needs. Similarly for Podbean, which is essentially remote audio, remote podcasts: we could just define a remote audio media type and start mapping to that.

So we ended up with a fairly common pattern where, in some cases, we would define the intended Drupal 10 media as a static migration, and that looks like this kind of stuff: you basically define some YAML describing the media that you want to create in Drupal 10 (a reconstruction is sketched below). This was based on doing some SQL queries on D7, identifying the handful, or maybe tens, of instances of a given kind of media, and creating a static migration for those to generate the D10 media that we wanted. They were indexed by URL, so when we were doing the migration of a node and parsing the body content, we could detect a URL, figure out which D10 media matched that URL, and then generate a media embed based on its UUID.
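A minimal reconstruction of that pattern, with hypothetical URLs, field and bundle names; embedded_data is a migrate_plus source plugin:

```yaml
# Sketch only: a "static" migration enumerating a known handful of external
# embeds found by querying D7, keyed by URL so the node migration can look
# each one up and emit a <drupal-media> embed for its UUID.
id: nzhistory_media_nzonscreen
source:
  plugin: embedded_data
  data_rows:
    - url: 'https://www.nzonscreen.com/title/example-title'   # hypothetical
      name: 'Example NZ On Screen title'
  ids:
    url:
      type: string
process:
  name: name
  field_media_remote_url: url    # assumption: the remote-URL field name
destination:
  plugin: entity:media
  default_bundle: remote_video   # or one of the custom Media Remote bundles
```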
NZ History in Drupal 7 also had a bunch of other interesting media types that didn't really have any existing equivalents we could track down for Drupal 10. I've already mentioned remote audio, but there were also things like panoramas using Pano2VR, and some custom interactives that had been built over the years using arbitrary JavaScript, showing slides and various other things. Again, we used a fairly similar approach and tried to map them all to different media types, so that whatever editors are embedding in a narrative, it still goes through the media library, they can use all the same tools, and we can attach the attribution fields we talked about. So we ended up with a fairly extensive set of media types: out of the box Drupal comes with image, video, audio and remote video, and we added things like panorama, remote audio, step-through slideshow, Zoomify (which is tiled image zooming) and so on.

Just a few examples of what those look like. Here's a Google map embedded through CKEditor 5, and this is to get away from the whole idea of allowing arbitrary iframes to be pasted in by editors, because that could be an iframe to a crypto bot or whatever, and again it's about consistency of approach across multiple users and multiple content environments. Memorials... sorry, in this case panoramas; we'll just see if I can make that work, click on here. In this case the files already existed, but we could look them up and... here we go, that one, thank you. So this is what it looks like in practice; we can go full screen on there. This is Gallipoli, and you can look around, but from an editor's point of view it's the same as embedding any other material once the media is created. This is an example of a step-through slide set, again to show what that looks like in practice, this one about the infamous 1981 Springbok tour. And this is an example of a custom interactive: there's basically a bit of CSS and JavaScript that lands when you're on this kind of content, and you can step through like that, but again, from the point of view of migration and editors, it's just another media type.

One of the other examples is the tiled images. They were generated in Drupal 7 through Zoomify, which is a proprietary system that isn't well supported in Drupal 10. I was already familiar, through Islandora, with OpenSeadragon, which is a tiled image viewer, and it turns out it's got a plugin that lets it use Zoomify tiles, so we didn't have to regenerate any tiles; we just shifted those across and left them in place for now. Again we had a bit of a preprocessor to run beforehand, because the D7 files had been sprinkled around with different naming conventions, so we ran a wee script to tidy all of that up and land the content in a consistent location, then migrated the media and pointed it at the new location of the files. This is what it looks like in practice: again you can go full screen and then zoom right down on the material that's in there (you press escape to get out). Depending on the content, if you've got high-resolution scans, whether photographs or scans of maps, it's a really nice way of presenting them. The other good thing is that once you've got OpenSeadragon in place, it can use the IIIF image format, an international standard for sharing large-format images, and not just tiling: there's a presentation API and so forth associated with it. So that's hopefully something MCH will make more use of in the future.

More of the challenges: again, because of the way the site was built, inconsistently, over a long period of time, by multiple different people and agencies, we ended up with things like views embedded through PHP blocks right through the site, and we basically just moved those using the Insert View module or by manually inserting blocks that were created from views.

One of the things we really aimed to do was reduce our custom code. Because of the way the site had been manhandled over the years, we ended up with an awful lot of custom code; some of it didn't really do a lot, and some of it was actually outdated and wasn't even needed. Drupal 7 had 38 custom modules, most of which were doing very simple things, and we managed to cut that down to 14 modules in Drupal 10. There were about 5 or 6 that were exact replicas from Drupal 7 to Drupal 10, for example the complex browse/menu system, and of those 14 custom modules there were 6 just to handle the interactive migrations.

Then CKEditor 4 to 5. One of the things we have allowed in the past, particularly on NZ History but also on some of our other sites, is full HTML access for all of our editors. We managed to cut that down to a much more limited, appropriate set of HTML tags in CKEditor 5, but that required a bit of forward planning: what could we do without, what did we need? There was a bit of analysis about what to allow and not allow in CKEditor 5, but that seemed to work out pretty well.
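The resulting text format is ordinary filter config. A purely illustrative sketch of the kind of limited allowed-tags list that replaces blanket Full HTML (the exact list here is hypothetical):

```yaml
# Sketch only: a deliberately restricted filter_html configuration for the
# editor text format, instead of granting everyone Full HTML.
filters:
  filter_html:
    id: filter_html
    status: true
    weight: -10
    settings:
      allowed_html: '<p> <br> <strong> <em> <a href> <ul type> <ol start type> <li> <h2 id> <h3 id> <blockquote cite> <drupal-media data-entity-type data-entity-uuid data-view-mode data-align alt title>'
```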
And lastly, where to from here? We have another site in the pipeline for migration: Te Ara, the Encyclopedia of New Zealand, believe it or not initially conceived as an online encyclopedia. It has 980 stories, approximately 5,700 pages, and 28,000 of what we call 'resources' in-house: images, sound, video, interactives. There are a few other things tagged onto the encyclopedia, like the Dictionary of New Zealand Biography, with approximately 3,100 live biographies and over 4,000 associated images and media. The good thing about Te Ara is that because it was designed right from the start as an encyclopedia, the content structure is very consistent and should be much easier; however, the actual data structure is in and of itself more complicated because of all the relations it has.

We want to look at things like multi-site federated search; we have a little bit of that in Drupal 7, but we want to make it more consistent across all of our sites. We would like to look at exposing some of our information through APIs and sharing data. For example, Te Ara also has a tag/keyword system, and we would like to share content from NZ History and Te Ara on each other's sites for the same keyword or entity. Potentially we would look at being an authoritative data source for some of this information, particularly, for example, the New Zealand biographies, and at more use of the IIIF standard for presentation.

One of the things I'd like to mention about Te Ara is that it has, or it did have, and I don't know whether it's ever been superseded, the largest body of online te reo Māori content in the world, and this has been provided over the years to multiple agencies to assist in machine learning of the language. That's something we're interested in pursuing a little bit further; we're not sure what that looks like in the future, but it's definitely in our plans, or in our thoughts.

Thank you very much for coming along. We could take a couple of questions.

Can you use AI integration to read...?

I guess in theory, potentially, we can. The issue we have in-house at the moment is that we're not allowed to use... we are allowed to use certain sources of AI, but not OpenAI, and that makes it slightly more complicated getting Drupal to talk to Microsoft AI systems. So probably, at this stage, no. But we're still actually working on that, and as of a couple of weeks ago we made a little bit of a breakthrough. We haven't migrated all of it, but we're certainly on the path to migrating all of the alt text we have existing. Having said that, we still have tens of thousands of images that have zero alt text, so something like using OpenAI, or AI in general, for that would be a good starting point.

We haven't considered upgrading the design at this point; we're just going to keep the existing design, mostly because we don't have the capacity at this stage to do that work, unfortunately.

It sounds like it's been a massive task, so congrats to you guys for getting so far on this amazing work. I'm interested, as I haven't dealt with legacy sites myself, and a migration like this is massive work: what were your keys to success for getting this done? What are the highlights of things that went well?

Thanks for the question. I think we had a really good working culture. We collaborated through GitLab, and we drafted a whole stack of issues in GitLab describing each content type and each taxonomy and the fields. We had weekly meetings between people like myself, bringing some technical capability, and the MCH staff, who have the background and the knowledge and the direction of where they want to go. So I'd probably say the key to success was just having a really good relationship and a way that we could talk through whatever challenges we were running into. It would be easy to run away and start throwing technology at stuff, but the trick was to know when that was appropriate and when it wasn't, and I think things worked out.

Yeah, I think one of the other major factors in our success was that the person who wrote the site in HTML in 1999 is still the chief editor of NZ History to this day, so the institutional knowledge was actually there, which helped a lot. Interestingly enough, he didn't remember every single little question we asked him, but definitely the big-ticket items were all there, so that was actually really useful.
So, did you make any particular considerations as you were dealing with people...? I've been going through this for months... a lot of people?

I didn't really, because in this case I've been working with Islandora for quite a while, and I knew that it came with a suite of taxonomies, including fields on those taxonomies, that were quite applicable, and at this stage we haven't really needed to customise those. We could add further fields, but by standing up controlled_access_terms, I think it was, from Islandora, we got about 10 different taxonomies, all relevant to the material. We also used, and the name's fallen out of my head, the headless content type that's available in Drupal 10: we used that for points of interest for maps, for example, so we could extend it with some fields, but it didn't need to be a full content type.

If people want to follow up afterwards, I'm always happy to talk about this material. In particular, I think Drupal struggles with large taxonomies and I'd love to make Drupal better in that space, so if you have an interest in that, hit us up afterwards. But thanks for coming.