I'm here to talk about data pipelines, specifically why and how we engineered them, but not the code. I'm just going to go over the top level of what it is and why we have it. So first, what we're going to cover today: a little bit about me, what the problem we had was, the concerns we had with the existing solutions, what we did to solve the problem, and what we're limited by in our current implementation.

My first Drupal project was a Drupal 5 to Drupal 6 upgrade. I have a fairly low Drupal.org number, and that's always something to be proud of. I'm currently the technical lead for the New South Wales Government project, and if you're not aware of that project, what it is, is we're trying to consolidate 700 government websites into one website. That means going through all their content, pulling it apart, rebuilding it, and putting it into just one website. That's the goal. I like camping, boating, and spending time with my two kids. You can find me on Drupal.org and Slack as interpot.

Disclaimer: as a technical lead, I don't generally build stuff; that's what the other people are doing. So some of what I'm going to talk about already exists, some of it will exist soon, and some of it is on our roadmap. I'm just going to talk about it generally, in terms of engineering the architecture.

Problem statement. The problem statement's easy: take data from a range of external sources and display it. It sounds easy. As I said, nsw.gov.au is trying to take 700 websites into one website, and that includes 700 times however many sets of data into one website. We've got health data, we've got environmental data, we've got NSW education data, we've got COVID testing clinics, we've got all sorts of stuff that needs to come into our website and be displayed to our customers. The data comes in a range of formats. We might get the data over an API. We might get the data by CSV. We might get JSON files, Excel spreadsheets. You name it, someone's going to send it to us. And yeah, that comes with its own set of problems.

So this is what we were doing before we engineered data pipelines. A couple of examples. The entity model: tick and flick, build your content type, put fields on it, go through all the testing and displays and that kind of stuff. Set up a migration: take the CSV file, put it into a migrate, run the migrate, put it into the entity model, push it into the database, put a view on top of it, display the view to the customer, make sure all that's working, put QA on top. It's a fairly standard model; it's what most sites do. As I said, it's tick and flick, you can do it out of the box. That's the entity model.

The file model. We had a couple of instances where we would have a custom block with a CSV file on the back end. You would run the block, the custom code would read the CSV file and display that data on the front end, all that kind of stuff. That comes with its own problems, as you can understand. The customer might upload a CSV file that's got bad data in it. We might need to switch from CSV to JSON. We might need to do all sorts of stuff. It's a lot of heavy development; every time there's a change, we have to do development on that. Something else in here: sometimes we did theme rendering, and sometimes we would do JavaScript rendering of those files.
For the JavaScript rendering, the developers were generally pulling the CSV file, putting it into a JavaScript array, using React to read the JavaScript array, all that kind of stuff. And yeah, all of these were in our system, by the way, plus many, many other concoctions. We had custom blocks doing a file read pushed into a JavaScript array, which I just mentioned. We had Drush commands that pushed data into a custom table, with a schema in Drupal. We had custom views pulling from custom tables. We had custom blocks with direct API pass-through. We had pretty much everything you can think of in the code base.

Concerns. I'm pretty sure you already understand the concerns from what I've said. But New South Wales is a really big site. We have lots of customers, lots of data and lots of things to display, and nobody cares where we got the data from; people only care whether it works. As an example, we get about two and a half million visitors a month to the website, doing three and a half million page views across the site. We built a COVID testing clinic application to display to the user, which had an address lookup. We rolled it out on Tuesday night. On Wednesday morning, about 10 o'clock, we had a couple of pretty serious issues. Nothing was working; none of the address lookups were working, things like that. Jump in, have a look, see what it is. The NSW Point API was telling us it couldn't respond: no more results. Call them up. What's going on? You've reached your limit. OK, what's our limit? Your limit is a million requests a month. But it's been three hours.

Other concerns? Trust nobody. One of the issues we have is that we're pulling data from, let's say, 60 agencies. There could be an attack vector in there, and where the data comes from is not necessarily secure. So when we were doing things like direct pass-through, someone in an agency could put data into their system that would flow through to our system, which would flow through to the customer, and we had no oversight of that.

Be a good consumer. As I said, a million requests in a couple of hours. Let's use New South Wales open data as an example, from the DAC, the Data Analytics Centre. They'll put up a JSON file that says: here are all of the toilets for all of the parks in these locations. If we use that data directly, OK, we'll put that on a map for you, and we're going to send you four million requests a month. Is that OK? Usually no.

The other concerns we had: building all those entity models, building all those blocks, building all those file things, they all took a lot of time. Lots of understanding how the system works, lots of working out how the other developer built it, because everyone had their own thoughts. Data might break. We're pulling from an API, and then the API goes away. Once that's broken, our website is broken. The customers see it as: nsw.gov.au broke, what are you going to do about it, I can't see whatever I need to see. And load on the server. One entity, 10 fields, 20 database tables, 700 data sources, 700 entities. You can see where that's going.

So what we did: we went through the current implementation, all the pros and cons of those different models I talked about, and decided a model-view architecture would be the right solution. We had a look at NoSQL; NoSQL lets us push data directly into storage without having to set up all the schema and models first. And we decided on React for the front end.
The reason for React is that we were already using it for various products, and it had direct integration with our design system.

So, the solution. The model: we wanted NoSQL so that we could reduce the overhead of storing data. We didn't want to write entity models, we didn't want to write fields, we didn't want to go through all the QA testing of that. We just wanted to store the data somewhere. We get a lot of requests, like a lot. Using NoSQL meant that we could go direct; in our case that's OpenSearch. We could go directly to OpenSearch, pull that data out, and bypass Drupal. Get Drupal out of the hot path. That doesn't sound like much, but the Drupal hot path is about 300 milliseconds on average, and the OpenSearch hot path is about 50 milliseconds on average. Just for the testing clinics example I gave you before, that's the difference between running a server 24 hours a day, seven days a week, versus not.

One storage solution. Using OpenSearch and NoSQL for everything meant that every single developer in our team knew exactly where to find the data and how to pull it out. They had done it before, they had built it for their own app, and if they had to repair someone else's app, they knew where that data was coming from and how to pull it out, because we're using the same components everywhere. And finally, a decision was made that our data would always be transient; that is, we can always recreate that data from source. The reason we decided that: no backups of Elasticsearch or OpenSearch needed. If we want to throw away the data and rebuild it, because we changed the model or we changed the way we want to import it, we just blast it and pull it in again. So that's the transient side.

Data pipelines. So we knew what we wanted to do, we had the model; the question was how we were going to put that data in. That's where we designed data pipelines. We needed to take data in various forms. We're talking a thousand different CSV files, all with different rows, different columns, different data. So we designed a YAML file that defines that data. A COVID testing clinic: eleven fields, the first is latitude, the second is longitude, the third is the title, things like that. And by defining that in the YAML file, we can say that field needs to be required, that field needs to be an integer, that field needs to be a date, and define it in one file really quickly and click import.

The plug-in system. We needed different inputs and we needed different outputs. As I said, we've got JSON inputs, JSON from a URL, JSON from a file, CSV, Excel spreadsheets, all of that kind of stuff. The input plug-in system handles that for us: it converts everything to an object that we can then pass around data pipelines and do what we need to do with it. We also have output plug-ins, so we can do things like plug in OpenSearch. That's fine, but sometimes we need to keep the data private. Say the education standards people need to say that these teachers have a score and that needs to be looked up, but we're not allowed to give everybody all the data. The only way you can get the data is to do a search, get a result out of the record, and return it. That means we can't push it into OpenSearch. So in this case we wrote an output plug-in that stores it as JSON, so that we could still use data pipelines to query that data and pull it back out, but in a consistent way.

Drupal, and what we're using it for. So far we've pushed everything into Elasticsearch.
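To make that "bypass Drupal" point concrete, here is a minimal sketch of what a direct front-end query against OpenSearch can look like. The proxy path, index name and field names are made up for illustration, not the real ones; the assumption is a read-only proxy sitting in front of the OpenSearch cluster.

```typescript
// Minimal sketch: query OpenSearch directly from the browser, skipping Drupal.
// Assumptions: "/search-proxy" is a hypothetical read-only proxy in front of
// OpenSearch, and "covid_clinics" / "suburb" are illustrative index and field names.

interface Clinic {
  title: string;
  suburb: string;
  lat: number;
  lon: number;
}

export async function findClinicsBySuburb(suburb: string): Promise<Clinic[]> {
  const response = await fetch('/search-proxy/covid_clinics/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      size: 50,
      query: { match: { suburb } }, // standard OpenSearch query DSL
    }),
  });

  if (!response.ok) {
    throw new Error(`Search failed: ${response.status}`);
  }

  const result = await response.json();
  // OpenSearch returns hits as { hits: { hits: [{ _source: {...} }, ...] } }
  return result.hits.hits.map((hit: { _source: Clinic }) => hit._source);
}
```

The point of the sketch is simply that the request never touches Drupal on the hot path; Drupal's job is everything that happens before the data lands in the index.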
We're using Elasticsearch directly, so we don't need Drupal. That's not quite the truth. Drupal does all the heavy lifting for us. It does all the UI for the data sets, it does all of the plug-in definitions and things like that, and the permissions around who can import, who can re-index, all that kind of stuff.

Validating the data. As I said, we need to validate the data so that we're both secure and we have valid data. Take a COVID testing clinic: if it doesn't tell you where it is, there's no point putting it on the website. That's all part of the YAML file. We have transforms, so, for example, three different Excel spreadsheets could send me a date in three different formats. The transform says: OK, take the date in the format this customer uses and convert it to UTC. Again, it's making it the same problem everywhere. We know the dates are in UTC, we know how to display UTC, so as long as we get them in that format, we're good. We run the data through security filters, so text fields go through cross-site scripting filters, integers are checked to make sure they're actually integers, that kind of stuff, before we push it into Elasticsearch. The reason for that is that Elasticsearch is open, and at some point it will be shared with other government agencies, so we know the data we're giving out is safe.

I've already mentioned we've tried to reduce the pipeline definition to a single YAML file, which gives us a consistent understanding across all devs about how things work. And finally, because we go direct to OpenSearch and that takes it out of the Drupal hot path, we can do things like cache the entire page for 24 hours but cache OpenSearch for five minutes. The data is as up to date as possible; the five minutes is there purely for denial-of-service protection.

Here's an example of the YAML file. I was going to say it's small, but it's huge. There are some things that aren't in this example, but there are the validations for the whole record, which say how many fields are expected, and then you've got the fields, and then you've got their constraints, which is NotBlank in this case, and the error message to show the person trying to import the file.

The bottom example down here shows one of the benefits, which we'll get to in a minute: if we have a React app that talks to a data set, and we know that data set is the same COVID clinics data in this case, we can have multiple data sets that use the one React app. All we have to do then is configure which data set the block is using on the front end. In this example we've got two: the COVID testing clinics URL, which is the one we use 99% of the time, but then 1% of the time they break it. So we switch the block over to use the COVID testing clinics backup JSON file, keep the site up, they repair it, and we switch it back to the URL.

The view. React is our framework of choice; I said that already. That's mostly because we can get access to developers for React, but we can also use our React design system library. Putting things on the front end is now easy, right? It's a React app built to talk to OpenSearch using a data pipeline. Build the React app, define the React app in the custom block, put a drop-down on the block for the data pipeline, done.
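To illustrate that "one React app, many data sets" idea, here is a minimal sketch of a block component that is told which data set to query through its block configuration. The component, prop and endpoint names are hypothetical, not our actual code; the point is that swapping the block from the live data set to the backup one is just configuration, not development.

```tsx
// Sketch only: a React block whose data set is chosen in the Drupal block form.
// "dataset" would be whatever the editor picked in the drop-down, for example
// "covid_clinics" or "covid_clinics_backup". All names here are illustrative.

import React, { useEffect, useState } from 'react';

interface ClinicListProps {
  dataset: string; // machine name of the data pipeline index to query
}

export function ClinicList({ dataset }: ClinicListProps) {
  const [rows, setRows] = useState<Array<{ title: string; suburb: string }>>([]);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    // Same read-only proxy idea as before; only the index name changes per block.
    fetch(`/search-proxy/${dataset}/_search`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ size: 100, query: { match_all: {} } }),
    })
      .then((res) => res.json())
      .then((result) => setRows(result.hits.hits.map((h: { _source: any }) => h._source)))
      .catch(() => setError('Unable to load this data right now.'));
  }, [dataset]);

  if (error) return <p>{error}</p>;

  return (
    <ul>
      {rows.map((row) => (
        <li key={row.title}>
          {row.title} ({row.suburb})
        </li>
      ))}
    </ul>
  );
}
```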
It's the same for all developers. Everyone knows what is happening and how it works, and yeah, there's the occasional little tweak in there, but that's down to the developers. I already mentioned this one, but we can write the app once and back it with unlimited data sources. As an example, we have a React app that will take a data source, any data source, and print a table with a search box at the top. What that means is we can have seven or eight different types of data from the education standards people and just say: display this as a table, these are the columns, this is the search column, done.

High availability. As I said before, the Drupal pages can be cached for 24 hours, and we can cache OpenSearch for as short a time as we want. We can re-index daily, we can re-index hourly; we know the cache is only five minutes, so the site's fairly up to date. For anyone who missed the high availability talk a couple of years ago, I would like to say we can improve it, but we're at a 98% cache hit rate. The goal is to make sure we can always respond with something for the customer. And there's a project going on at the moment around server-side rendering. We're now working on one of the issues that came out of the build, which is that you now have these React apps, which means there's a single div with nothing in it when the page loads. What we're planning on doing is replacing that with a server-side rendered version of the React app and then hydrating it with the data from OpenSearch once the page loads.

Limitations. Despite my rose-coloured glasses at the start, it's not all roses. We still have data that can't be public, and that data is currently being exported to JSON. JSON is not as good as OpenSearch: you can't say, give me all the data within a two-kilometre radius of this point, which OpenSearch can do. So for those data pipelines that need to be private, we still have React apps that need to do things like searching and filtering on that data, which means we have to write a GraphQL endpoint, or some other endpoint, that allows the React app to get that data in a controlled way so that it can't get all the data at once. That means there's still a fair chunk of development in there. Don't get me wrong, it's nowhere near as much as writing an entire entity model from scratch, but it's still development.

A lot of our team, when we started, weren't that familiar with React. We had to do a lot of work to train everybody up on React, make sure they knew how the process worked, go through code reviews, that kind of stuff, and it's an ongoing process. And occasionally we need to decorate data. The data that some agency sends us might not be all the data that our editors want, and decorating that data is difficult with the model that we have. We can't say: let's import all of the media releases from agency X and add our taxonomy terms A, B and C. So that's an issue we have at the moment. How am I going for time?

The outcome of all of this: just because it's Drupal doesn't mean you need to use Drupal in the standard way. As I said, we're using Drupal and all of its plug-in systems and all that stuff for the heavy lifting, but in terms of the actual display and the data storage, none of that is Drupal. Leverage Drupal where it makes sense, and go outside it where it doesn't. And our solution, while it works for us, doesn't mean it'll work for you. As I said, there are limitations, things like having to set OpenSearch up and set proxies up and all that kind of stuff.
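On that point about geo queries being something OpenSearch gives you and a flat JSON export doesn't: here is a rough sketch of a "within two kilometres of this point" query in the OpenSearch query DSL. It assumes the pipeline mapped a geo_point field; the field name "location" and the rest of the shape are illustrative, not our actual mapping.

```typescript
// Sketch of a "within two kilometres of this point" query in OpenSearch DSL.
// Assumes a geo_point field called "location"; names are illustrative only.

export function clinicsNearMeQuery(lat: number, lon: number) {
  return {
    size: 20,
    query: {
      bool: {
        filter: [
          {
            geo_distance: {
              distance: '2km',
              location: { lat, lon },
            },
          },
        ],
      },
    },
    // Return the closest results first.
    sort: [
      {
        _geo_distance: {
          location: { lat, lon },
          order: 'asc',
          unit: 'km',
        },
      },
    ],
  };
}
```

That query body gets posted to the same _search endpoint as any other query, which is exactly the kind of thing the private, JSON-backed pipelines can't do without extra development.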
This is for a massively huge website; it might not make sense for your cafe website. We're still migrating data, and it's taking a while. We're probably at 30 or 40 data sets now with React apps on top of them, and we probably have another 40% to go before everything is across and doing the same thing. As I said, I'm the technical lead for New South Wales and I don't write that much code, so a huge shout-out to all the people who actually helped data pipelines get here; it was a lot of effort to get to this point. And we're still working to improve our React architecture. As I said, we all started fresh, we were all new React developers, and I still can't really code in React. That means we have a little bit of tech debt: components to clean up and make reusable, GitHub warnings to sort out. So, shout-out for the code sprint tomorrow. And finally, any questions?

Yeah. Is there a contrib module? So the module is data_pipelines on Drupal.org.

I have a question. Did you look at External Entities, which is a contrib module? Yeah, we did have a look at both External Entities and using external storage for entities, and we actually looked at migration as well as an option for this. But it wasn't simple enough, quick enough, that kind of stuff, for what we needed.

And is it right that the majority of the React apps consume the data directly from Elasticsearch and not through the Drupal system, so Drupal doesn't really deal with a lot of the data handling through the entity and field systems you'd normally use? Yeah. As I said, the warm-up time for Drupal is about 300 milliseconds; the warm-up time for OpenSearch is 50. So it's a lot quicker to pull data directly. The other side is that we considered pushing the API through GraphQL. That's quicker, but it also means that whenever we want to use a new feature out of OpenSearch, we need to write all of the GraphQL to do it. Whereas at the moment, if someone says, hey, I want to do, I don't know, usually it's the difficult stuff using geolocation and mapping. You can do things like pass a geo shape into OpenSearch and say, find me all the objects that are inside this shape. If you want to do that through GraphQL, you have to write it all in GraphQL.

Are you still composing or authoring any content in Drupal? Yeah, so this is just for externally sourced data. We have a couple of rules in our content model. If the user needs to edit it in our CMS, then it comes down to whether it needs a page: if it needs a page, it's a node; if it doesn't need a page, it's micro content. And then there's a split off from that. But if the user edits it somewhere else, then by default we use this data pipelines approach. Okay, that makes sense.

Oh, we've got a question over here. So you store the original data in Drupal, but then process it and send it to OpenSearch, is that correct? Yeah, depending on the data pipeline. Let's use a CSV file as an example: the data pipeline itself just uses a file entity field. So you upload the file like a normal file in Drupal, and then the processing kicks in and takes it, massages it, transforms it, maps it, and pushes it into OpenSearch, and then the React app just pulls it back out. The editors just go in, put a new file in, and off it goes.
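For a sense of what that final "pushes it into OpenSearch" step amounts to, here is a rough sketch of a bulk index request. In reality this happens server side in the Drupal module, not in TypeScript; the snippet is only to show the shape of the _bulk call, and the index name and document fields are made up.

```typescript
// Sketch of indexing validated rows with the OpenSearch _bulk endpoint.
// The bulk body is newline-delimited JSON: an action line, then a document line.
// Index name, endpoint and document shape are illustrative only.

interface ClinicRow {
  id: string;
  title: string;
  lat: number;
  lon: number;
}

export async function indexClinics(rows: ClinicRow[]): Promise<void> {
  const body =
    rows
      .flatMap((row) => [
        JSON.stringify({ index: { _index: 'covid_clinics', _id: row.id } }),
        JSON.stringify({
          title: row.title,
          location: { lat: row.lat, lon: row.lon }, // mapped as a geo_point field
        }),
      ])
      .join('\n') + '\n'; // _bulk requires a trailing newline

  const response = await fetch('https://opensearch.internal.example/_bulk', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-ndjson' },
    body,
  });

  const result = await response.json();
  if (result.errors) {
    throw new Error('Some rows failed to index');
  }
}
```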
One of the things that does happen at the moment is that if we get a validation error, we don't index anything. We're currently working on an upgrade where, if we get a validation error, we just drop that row out of the index. The example for testing clinics is 150 testing clinics: we don't want it to be no results when one is broken, we want it to be 149 results when one is broken.

So the screenshot you shared, it showed that you have these data feeds, so you have a file that's hosted inside your Drupal code base, is that right? Yeah. And this is a config entity, it would be one of those rows in the screenshot, something like that, that defines a data source you're ingesting following that schema? Yeah, so, as I said, Drupal does the heavy lifting. We can do things like just use the editor experience to create new ones, set them up, that kind of stuff. But in theory we can revision control these, we can work out audit logs, that kind of stuff, for all of the entities. And there's a value to being able to do that stuff through Drupal as opposed to, say, moving that process into a CI/CD pipeline? Yeah, we did consider whether we'd put the indexing in a separate application. But 90% of the time it's the people who write content who also need to manage this data.

Like you said, you've got some roadmap problems. Did you say large files, or something like that? Yeah, so right at the moment people who are messaging me are trying to push a 40,000-line CSV file into OpenSearch, and they're not so happy: it can't finish in less than an hour, and we currently can't run the import for longer than an hour, so they can't import it. That's a problem we're going to have to sort out when we get home. No, that's not normal; usually we don't see that much data. Another roadmap item, which I mentioned, is the ability to just drop a row instead of dropping the whole index. And then, at the moment there's no revision history on our CMS version of it, so as a government employee I can't have New South Wales Police come to me and say, tell me what was on the site on the 5th of May 2019. I need to be able to do that as well.

And finally, I think we're ending. Finally, we are looking for a tech lead. Contact me. Thanks. Was that your resignation? No, I need two tech leads. Thank you very much. Thank you.