Hello, everybody. Welcome to "Drupal as a Data Warehouse: Everybody into the Data Lake." Today we're going to share our experience with data reporting in Drupal. For those who don't recognize the name on this quote, Edward Tufte is an American statistician and Professor Emeritus of Political Science, Statistics, and Computer Science at Yale University. He's best known for his writings on information design, and he's also a pioneer in the field of data visualization, and we thought we'd share this quote because it seemed appropriate to the talk.

So my name is Gail. This is a picture of me and my siblings in 1999, hanging out my dad's barn door during a very long family photo shoot on my parents' 50th wedding anniversary. I'm the one in the maroon shirt down in your right-hand corner. I'm the baby of the family, and I've decided that being the baby makes me one of those customers: the ones I'm talking about, always asking why or why not, researching what Drupal can do and then pushing for product changes that suit my needs and probably don't suit anybody else's, learning just enough along the way about Drupal, HTML, and CSS to be a little bit dangerous, and being pretty annoying about all of it. If you ask these guys when I'm out of earshot, they'll probably say the same thing. I'm okay with having that reputation with my developers, and I mostly think they're okay with me too, because over the past five years we've established that I pretty much care about EthosCE, which is my learning management system, as much as DLC, the developers, do.

We use our learning management system, or LMS, to pull together the various types of education we provide, making it easier to track progress and outcomes and to report annually for accreditation compliance purposes. In the past we would spend weeks prepping reports to upload into our accrediting body's reporting system, so having everything in one place, and having a prebuilt report for that, has just been life-changing for us. That sounds kind of dramatic, but you have no idea how much work it was to pull all of that together manually.

Part of our educational mandate is to prove, in the form of written needs assessments, that our educational content is of real value to our learners. So we've always relied very heavily on reporting of pre- and post-test outcomes, post-activity surveys, feedback and reflection forms, and so on. Needs assessment is really the backbone of why and how we create content in the first place. And being in the medical education world, I can't leave out grants reconciliation: we have to prove to the people who provide educational money for some of our content that we've accomplished what we set out to do, and we have to report from those activities, so that's a foundation of building an effective outcome submission for grantors.

The old version of EthosCE was based on an educationally oriented platform that really tied us to some very specific course formats and reporting capabilities, so when DLC told us we were going to upgrade to Drupal, we were pretty excited, because we saw some really fantastic formatting opportunities that we didn't have before. However, we were still pretty much seeing the same reporting capabilities we'd already seen.
We still had to do a lot of manual work to create a complete picture of our educational content, which included merging reports, sometimes up to three, to get a complete picture, and we nearly always had to delete columns, sometimes up to half of them depending on the report, that weren't relevant to us but were relevant to other customers. We also had to create visualizations, in the form of charts and graphs, to present to our educational oversight committees, because there's no way they would bother to even look at what you see up there, that densely packed page of data. So for us, cookie-cutter reporting really wasn't cutting it. While we're very similar to other EthosCE customers in what we use or do, we don't necessarily do it exactly the same way or need exactly the same reporting. So we did have some reports with custom fields on them, which wasn't cheap for us because it was custom work, and it really created extra work for the developers, especially when there was a new deployment and they had to remember to migrate that change across. And even with custom fields, we still weren't getting what we needed.

There's a group of EthosCE users that has a monthly conference call. We talk about what we're doing, we share ideas, we complain (because isn't that what users do, complain?), and we help one another find solutions based on how we're all using the product. On that call, over and over again, reporting would come up, and every time we talked about the problem we all agreed we should put in support tickets and ask for more or better or different reporting. DLC also formed an online community, which is really nice; people can post questions and help each other out there, and the same questions were coming up there too. The bottom line was we were all looking for a better way to show our outcomes, because we all, in one way or another, have to show those to an accrediting body. So to re-emphasize my organization's needs, and the needs of others in similar organizations: we're talking about professional medical education, so we need to be sure we're meeting knowledge and practice gaps to stay relevant to our learners, and our accredited status depends heavily on our ability to effectively report on all of our educational activities in our LMS. So after several years of asking and begging and nagging, and not just from me, from other users as well, DLC started looking into how they could better meet our reporting needs.

So my name is Ezra Wolff, and I serve as the product manager for EthosCE, which means I'm responsible for keeping customers like Gail happy. This is me at about age 7, and it's not just a picture of a little kid; it's a kid with a profoundly bad haircut, and an unfinished haircut if you look at the top of the hair there. I wanted to put this here not only because it's funny and embarrassing, but also because managing the product is kind of like a haircut: if we have a bad product, we have a problem, it's going to grow out, and we can groom it and improve it over time.

Before we go on, I just want to talk about EthosCE a little bit, because Gail mentioned it. It's a Drupal-based learning management system for continuing education, and at DLC, the company that produces it, we're a staff of about 12, with three developers, so we're a pretty small shop. But we have nearly a million health care providers who've used our system to maintain continuing education, which means some of you in this room have probably gone to doctors or nurses or pharmacists who have gotten training for continuing education through EthosCE, which is Drupal. So right on for that.
We're really proud of our contribution to improving patient outcomes, and we're also proud to maintain a few modules as part of that. So now that you know about me and the product, I want to describe the context for the problem we're going to show you how we solved. What we found in our application, as Gail said, is that most of the requirements are the same, and that's because we have an external regulatory agency in our world, in fact several of them, that set the rules we have to follow. But the reports are a bit different: everybody uses these reports for internal purposes as well, for business uses and internal stakeholders and bosses and grants, and they all want them in a different way. And as the number of customers we had on the product grew, this just became an onslaught of requests.

So the business problem, and how it applies to Drupal, is that it just doesn't scale very well, and I don't mean traffic scaling. I mean that each one of these reports is manual work, building a custom view that a customer has requested, and we put all our views in code, so there's that whole workflow as well. Or we'd have some customers ask for an entire dump of the database, because they have data scientists on staff (some of the bigger hospitals and such), but then they'd also want detailed modeling diagrams of Drupal core and all the contributed modules and all the custom modules, and then they'd want to know how it all worked, which, again, we're a staff of 12. We talked about it, and we tried just handing off Views to the users at first, but it didn't really work: we're not a training organization, that's a complicated interface to learn, and we didn't want to provide support for it. The few times we did this early on, some customers created non-performant reports that basically crashed their sites by locking up MySQL. So we learned pretty quickly that Views is not a customer-friendly tool; it's not a customer-facing tool unless you have very technical customers and you trust them.

So we ended up building a lot of custom views, and it keeps us busy, but it turns out that's not really a good thing for our organization, because the developers are developers, not site builders; they want to write code and do challenging stuff instead of just clicking around. And each one of these views becomes another report that we have to maintain, and when the schema changes we have to update them and change them. We ended up with dozens and dozens of custom views on top of our normal views, and there's really just not enough revenue from that, because it became an opportunity cost for our business: instead of building a product that worked better for everybody, we were spending all our time building these custom views. So with this problem, I turned to Devin, who's our systems architect, and asked him to fix it.

My name is Devin, and I'm a systems architect. I unfortunately started with PHP 4; it's been a long time since then, and I forgive myself for that. As we said, I needed an intro slide, so I assumed it's a business requirement that I needed to fulfill. Growing up, my family had ducks as pets, because everybody except me was somehow allergic to everything, so no cats, no dogs, just ducks.

Moving on to the problem: I thought, hey, this is definitely a big data question, so I'll just turn on all these cloud services, it all connects, it's an obvious problem to solve. So I split this up into three phases: phase one, big data; phase two, we'll figure that out; phase three, profit. And that's it. Don't be the underpants gnome; have a plan for what your data is going to do.
So let's talk about the real plan. Our previous solution for custom reporting, as mentioned, involved a huge suite of modules: Views, Homebox, Charts, Highcharts, Views Data Export, and about ten other Views-related modules that we had to maintain and upgrade. It worked, and we were building lots of custom views and charts for every customer, adding per-customer fields and calculations every single time, which took time away from really more important Drupal development that we wanted to do. As I mentioned, we didn't have the capacity to train the administrators using the site, who really are not interested in the Views cockpit, and building our own tool would have been a distraction from our core business; we're not data scientists. But the functionality we did need was charting; report saving and sharing; scheduling; CSV, Excel, and PDF downloads; embedded charts and reports; and, most importantly, handling the unknown: the reports and calculations that we didn't know about yet but that our customers would ask for in the future. So we needed a tool that we could give to customers to report on their own data, instead of us being the ones to do it.

Once we figured out that we needed a BI tool, we had to find the right one. This was really a much bigger task than I had anticipated, because the BI vendors don't know about serialized data in PHP, they don't know about Webform's EAV storage, they don't know about continuing education, and they don't know about Drupal. So if you're in this room and you're thinking about trying to find a BI tool, chances are the vendor is not going to understand what you're doing, and it's really up to you to find the vendor that's the best fit for your organization or your business. If there's one takeaway I'd want to leave you with as a business person, it's that you need to build a shortlist of vendors as soon as possible and just be brutal with your vetting, because there are dozens and dozens of them out there and they all want to talk to you. And then do the most vigilant and intense due diligence possible. Just a couple of examples of things that happened to us: we found out in due diligence that one vendor, which had shown us some really great web-based interfaces for viewing and building reports, required us to use a Windows application for the developer side of the work, and we're a Linux shop, so that was a big problem for us. And another one we asked about their API roadmap, and they told us they were working on a Python-based SDK and that's where all their energy was going, which makes sense because Python is widely used in data science, but again, we're a PHP shop and we wanted a generic REST-based API.

Then the next question was: where are we going to get all this data? That wasn't even something I was aware of; I thought we would just find a BI tool and they would bring the data warehouse to us, and many of them do. But the problem with that is that once you pick one, you're sort of locked in. It also affects your ability to use ETL (Extract, Transform, and Load) tools, and ETL tools aren't cheap either. Some of them charge by the database connection; with multiple databases (we have about 65 right now), one of them wanted something like $10,000 for every two or three connections, which was just way more than our budget allowed.
Our business model is software as a service, so we're not building one-off projects for our customers and negotiating a budget; we're basically charging a fixed fee every quarter for the services they want. So we couldn't really go back and charge additional pricing for this, and we certainly couldn't add it as an add-on.

And that's where I step back in. As I said before, I work for a professional medical association. We're a non-profit organization, and while we're not poor, we aren't rich either, and getting our board of directors to sign on the dotted line for (this is in caps in my script) EVEN MORE LMS COSTS is tough. In theory they really do get that if we have the tech, we have to maintain and upgrade it, but adding features, especially ones they don't necessarily understand and that don't sound all that important to them, is really hard. So we have to decide what we can justify and what we can't, and paying to retrieve our own data is not something they would sign off on.

So with all that in mind, we needed to pick our products, and to do so we really needed to dive into how we were going to pull this off. The first questions we had were: could we point a tool at Drupal and have it report out of the box, and did we even need a data warehouse? To answer the question about why reporting on Drupal out of the box is not optimal, we have to look at how the data is stored.

A refresher for some: you're in your first SQL 101 class, you're reading through the first tutorial, and you're asked to make a table of people and their employers. For most assignments (not really a real-world example), this is probably how you would have structured that table: a field for user ID, name, and employer. And it does work, but certain bits of data are duplicated across rows, so you have to update multiple records, or do some sort of textual find-and-replace, to update a company's name for all affected users. Take a name change, for example: you've got to make three updates, either manually in your system or with some query. It might not take very long, but at scale, however, when there's lots of creation and updating happening at the same time, you end up with table or row locking, and it can affect the performance of your site. You can also run into data drift if a find-and-replace goes bad. So we have to normalize the data.

The process of normalizing reduces data redundancy and improves data integrity: we eliminate columns with duplicate data by creating separate tables, we identify the data with a key, and we move out data that's not related to the primary key. For example, the person's employer is not directly relevant to the person themselves, which is why we didn't even put an employer ID relation column on the people table. Designing normalized databases like this means that removing the employer functionality from a system doesn't require modifying the base people table, and the same goes if you want to add new fields and functionality.

Drupal itself does this, along with many contributed modules: the core tables are pretty static, while add-on functionality uses relation tables for data storage. So if we look at an example of how we can efficiently store a user, a full name, some location information, and an employer, we start with the user table, we apply the normalization techniques we just talked about, and we end up with about seven tables necessary to store the data. And we can see why Drupal suggests managing fields and data this way: custom fields can be added to just about anything that uses the Field API, and Drupal allows us to expand upon the base tables without altering their structure. It also allows us to update only one row in one table, versus an unknown number of rows, if the company name changes.
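To make that concrete, here's a minimal sketch of the SQL 101 example above; the table and column names are ours, not from the slides.

```sql
-- The flat SQL 101 table: the employer name repeats on every row,
-- so a company rename means updating many rows.
CREATE TABLE people_flat (
  uid      INT PRIMARY KEY,
  name     VARCHAR(255),
  employer VARCHAR(255)
);

-- Normalized: the employer name lives in exactly one row, and a
-- relation table ties people to employers by key.
CREATE TABLE people (
  uid  INT PRIMARY KEY,
  name VARCHAR(255)
);

CREATE TABLE employer (
  eid  INT PRIMARY KEY,
  name VARCHAR(255)
);

CREATE TABLE people_employer (
  uid INT,
  eid INT,
  PRIMARY KEY (uid, eid)
);

-- A company rename is now a single-row update, no find-and-replace:
UPDATE employer SET name = 'New Company Name' WHERE eid = 42;
```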
So that's the basic concept of normalization and why we do it. Even though we're not talking about code here, I think it's kind of an example of not hacking core: we're not trying to alter the definition of the core tables; instead, we're safely extending them. So don't hack core, and that goes for the database too: create a new table and use keys to add your own functionality. It's good design.

We just talked about normalization, but this talk is about denormalization and why we need it, so let me introduce the next problem. The table structure we just looked at got pretty complicated really fast, and it's going to be really hard to write queries against it, or, as I said, to tell a BI vendor how to write queries against it. And doing 10 million entity loads to get a whole bunch of data is going to take too long no matter how optimized your site is; if you have to do this on a schedule, it's unlikely you'd be able to get updated data to your customers in near real time. So the structure, although very good for a production database, is not good when you need to pipe this data to an external system that isn't Drupal. The complexities in this design would also have to be understood by the reporting tool, and if your third-party system or reporting tool is doing all the necessary processing, you're then locked into whatever tool that is.

So to summarize: our operational database, which is Drupal, is optimized for fast writes and data integrity. It holds data, but the relational model is not informative, and that means, without assistance, you likely won't be able to answer a question about the data, like a count or an average or something split by another value. Our proposed reporting database is going to be optimized for fast reads, it's going to hold more information, and its relational model is going to be simple for users and also for machines.

So we needed a module that could work with our Drupal data and get it into a format that would be conducive to BI tools. Very early on I wrote a bit of code just as a proof of concept, and then I googled to make sure there wasn't anything else doing this, and my own sandbox was the first thing to show up. So there was really nothing out there that could do this, and I'll leave it to you to decipher what that means about there being nothing else that does this. The Denormalizer module pulls from the schema that Drupal's entities define, so you can get more information; you can also pass it external tables, and it will programmatically build denormalized tables or views for use in data warehousing.

A simple example: we have a node type, chinchilla. We provide Denormalizer with the denormalized table name, the entity type to denormalize, the primary key to use, and the change key to use. What this is going to do is create a new table, in a new database, which contains all of the nodes of type chinchilla. We also pull in the vote table here, which probably doesn't have to be denormalized, but it will be included in the processing.
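The module's configuration is code-based (there's no UI yet, as noted below), so the chinchilla example might look something like the following. This is a hypothetical sketch: the exact hook name and array keys aren't shown in the talk, so treat them as illustrative.

```php
<?php

/**
 * Implements hook_denormalizer_info() (hypothetical sketch).
 *
 * Tells Denormalizer to flatten every chinchilla node into one wide
 * table in the reporting database.
 */
function mymodule_denormalizer_info() {
  return array(
    // Name of the denormalized table to build.
    'node_chinchilla' => array(
      'entity type' => 'node',     // Entity type to denormalize.
      'bundle' => 'chinchilla',
      'primary key' => 'nid',      // Used for insert-or-update.
      'change key' => 'changed',   // Used to find new and changed rows.
    ),
    // An external table passed through mostly as-is, like the vote
    // table from the example.
    'votingapi_vote' => array(
      'table' => 'votingapi_vote',
      'primary key' => 'vote_id',
      'change key' => 'timestamp',
    ),
  );
}
```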
The queries generated by this module look a little bit like this; if you've ever had the SQL preview turned on in Views, it's very similar: lots of joins. But instead of taking this and rendering it to the user as content, we're dumping the raw data into a table for further processing. So after configuring and running Denormalizer, the resulting tables in our new database contain only this denormalized data. For example, data that involved a lot of joins is now stored in a denormalized format with no joins, which makes it much faster and easier to relate to other data in a data warehouse.

You don't want to run the denormalization process on the entire data set all the time, so on subsequent runs it will only insert brand-new records and update changed ones. We defined those keys back when we set up our hook info; currently there's no UI for that yet, but that will come later. Denormalizer records the water level, so the next cron run picks up from that point. This is really useful when you're working with data sets in the millions, where doing a full table load would be detrimental to performance, or would keep your reporting from being near real time because it takes too long.
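In SQL terms, the incremental "water level" pattern looks roughly like this sketch, assuming MySQL; the table name, the recorded water level value, and the upsert style are illustrative, not the module's actual generated SQL.

```sql
-- Only rows changed since the recorded "water level" are reprocessed.
-- nid is the primary key we gave Denormalizer; changed is the change key.
INSERT INTO reporting.node_chinchilla (nid, title, created, changed)
SELECT n.nid, n.title, n.created, n.changed
FROM node n
WHERE n.type = 'chinchilla'
  AND n.changed > 1514764800   -- water level recorded on the last cron run
ON DUPLICATE KEY UPDATE
  title   = VALUES(title),
  changed = VALUES(changed);
```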
So at this point we've created a new database in production with denormalized tables, based on Drupal's own knowledge combined with our own knowledge of how our data is supposed to be structured. Only the data you want exposed is in the new database, and the complicated structures, like fields, have been flattened. At this point, if you wanted to, you could point a tool at this database, although it's not really ideal; you'd probably want a read-only replica to handle queries so it doesn't take your site down. But let's say you have MySQL as your operational database and you want to use PostgreSQL or Athena or BigQuery or some other system as a reporting database. And maybe you want to bring in other sorts of data that aren't related to Drupal: for example, we might want SendGrid data tracking message opens or bounces based off the user's email, or Google Ads data to see how effective ads were. So we needed a plan to pull in this data, expose it, and then utilize it in other systems.

For this we have to change our approach a little bit and introduce something that resembles a data lake: an aggregation of multiple sources of data. The technology used in building your data lake depends on what you want to process; for us it's a lot of relational data, for others it may be document storage. But simply put, picture a data lake: streams from each source fill the lake, and the various users of the lake can come to examine it, dive in, or take samples.

Now that we're not touching our operational database during the denormalization process, we can do some alterations based on our own needs, things that really should not be done in an operational database. MySQL 5.7 and PostgreSQL have array and JSON type fields that can store multiple values without having to create sub-tables; we can un-nest them later in a reporting database. We can store parent relationships in a more reportable, hierarchical way, so you can report more easily on trees of data. And in data warehousing we really don't care about enforcing referential integrity the way our operational database would; we just care about the data getting into the system. Denormalizer will also give you the ability to transform data that would otherwise involve manual extraction. Some examples: dates, if you want to convert all your Unix timestamps to ISO times; key-to-value translations, for turning stored keys into user-facing values; and then everyone's favorite, PHP serialized data.
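As a rough illustration of the kind of transforms that belong on the reporting side rather than the operational one (column and table names are ours):

```sql
-- MySQL: turn a Unix timestamp into a date/time the BI tool can group by.
SELECT nid, FROM_UNIXTIME(created) AS created_at
FROM reporting.node_chinchilla;

-- MySQL: key-to-value translation, so reports show labels, not codes.
SELECT nid,
       CASE status WHEN 1 THEN 'Published' ELSE 'Unpublished' END AS status_label
FROM reporting.node_chinchilla;

-- PostgreSQL: un-nest a JSON array column into one row per value.
SELECT nid, jsonb_array_elements_text(tags) AS tag
FROM node_chinchilla;
```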
So while data lakes and warehouses can be similar, without getting too in-depth, there are some simple ideas that make them different. With a data lake, you're retaining all the source data in one place, because you don't know yet what you're going to use it for, and you discourage creating data silos, which can hold back data from those who need it due to technical or other barriers, and which can also let the data go stale; you don't want your lake turning into a data swamp. A warehouse is generally reserved for processed data with a specific purpose: you might create a data warehouse that contains user demographics and nothing else, and then have another data warehouse that has orders and products but not demographics. You could have multiple systems being populated from the same lake.

So you want to get all this data in one place to crunch it, but how do we get it there? How do we pull something from somewhere else and put it into this database? I mentioned a couple of acronyms before, ETL and ELT (not BLT). Simply put, ETL is extracting data from your source, transforming it via some process, and then loading it into a target; ELT, more recently, is extracting data from your source, loading it into a target, and then doing the transformations. People mix them around now. Either way, we needed a system to extract the information from a variety of sources and put it in one system.

So after evaluating a few proprietary, and sometimes expensive, ETL systems, we decided to try a relatively new project called Singer. Singer is an open-source standard for writing scripts that move data, which is exactly what we wanted to do. The Singer ecosystem consists of taps and targets: you could have a MySQL tap for an external database, a tap for Drupal's database, and then taps for SendGrid emails, Google Ads, and so on; there are a whole bunch of projects that support the Singer specification and will pull data. So let's say you want to pull in the pages with the most hits and then compare them to the clicks that were tracked in SendGrid, to see if the emails were the reason for the traffic. Each tap and target is its own mini application, and they communicate via a JSON specification, so your taps and targets can really be written in any language.

So here's the high-level overview. The tap extracts information from the source. There's a catalog.json file, generated with a Drush command using the Singer Drush module, which contains information about the structure of the data we want to replicate. The target takes in data from the tap and transforms it to whatever the target needs; it could be CSV, Google Sheets, PostgreSQL, whatever. The target then outputs the state that the tap sent over, and that marks the water level, the last replication point we talked about. The state is then passed back to the tap on the next run, and the process continues from the last replication point. So this is running constantly, continually pulling data into the data lake. The theme of this presentation is making sense now: dives, water levels, taps, streams, lakes. Yeah? Okay.

So here's the high-level overview of how this all comes together. We can use Singer streams in additional ways too: we're continually populating a central repository with Singer, so we can pull in all our sources, put them in the data lake, and decide where the data has to go later. Singer already knows the data format, because Drupal generated the Singer catalog, which describes the structure and all the field types of the denormalized tables, and once you have that catalog file you can replicate this data anywhere, anywhere that has a Singer target. The target doesn't know where the data is coming from; it only sees it in the Singer format.
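For a feel of what that format looks like on the wire, here's a minimal sketch of the three Singer message types flowing from a tap to a target; the stream and field names follow our chinchilla example and are illustrative.

```json
{"type": "SCHEMA", "stream": "node_chinchilla", "key_properties": ["nid"], "schema": {"properties": {"nid": {"type": "integer"}, "title": {"type": "string"}, "changed": {"type": "integer"}}}}
{"type": "RECORD", "stream": "node_chinchilla", "record": {"nid": 42, "title": "Dusty", "changed": 1557412800}}
{"type": "STATE", "value": {"node_chinchilla": {"changed": 1557412800}}}
```

The STATE message is the water level: the target emits it once the records are safely loaded, and it's handed back to the tap on the next run.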
To add another example: literally just an hour ago, in another presentation, about AI, we picked up a requirement to dump data into Amazon S3 so we could run AWS Macie on this giant data set. We can take what we're already pulling in and send it over to S3.

So, our solution in the end: Drupal is being used to denormalize data into another database on production, combined with Singer being used to pull in third-party data into the same location as the Drupal database. Our production database is on AWS Aurora, which is like a souped-up MySQL if you don't know it, and Aurora also provides a read-only replica of the lake, which includes all of our customers' data. Singer extracts and loads data from the read-only copy of MySQL to a PostgreSQL data warehouse. Looker was our selected business intelligence tool, used to transform and build reports on the data. The customer modeling is generated from Drupal based on each customer's custom fields and configurations, so we can provision custom dashboards based on their configured fields. We have about 100 million rows being processed; the majority of that data comes from Drupal courses, quizzes, and webform activity. It's about 1.5 million quiz attempts and 1.5 million webform submissions, and it's replicating data from about 35,000 webforms.

So while that technical work was happening in the background, I had to go interview our customers to understand the product requirements for how we were going to deliver all this data to them. We mocked up dashboards, we took them to our user group meeting, we posted them in the online community. We then did a proof of concept and beta tested it, beta tested the training, and we ended up delivering 18 new dashboards, all directly embedded in Drupal. We have single sign-on authentication with the business intelligence engine: when the user logs into Drupal, they're automatically logged into the BI tool. But the number of reports really isn't important here; it's kind of meaningless, because we also rolled out self-service reporting, meaning that customers like Gail can generate as many reports as they want, in whatever format they want. In addition, the integration between the BI tool and Drupal became a strong competitive advantage for us over our competitors. The sales team got really, really excited about this; it definitely moved the needle for them, and it eventually became one of the key competitive differentiators in terms of winning sales. Plus, building reports is really fun, if you've ever done this. I'll give an example: when we first got our data into the BI tool, I basically lost an entire weekend sitting in my living room just building report after report after report. We always had plenty of data, but we didn't have great information about that data, and this gave us the ability to do data discovery and turn that data into information, which is really, really valuable in understanding how our customers use the system and how the end-user learners use the system.

But finally, and this is my favorite part of this whole project: the developers don't build custom reports at all anymore. Our support desk manager has become deeply invested in this BI tool (she loves it), and she's able to build these reports now, usually in minutes, if a customer wants something custom, and usually we don't even charge them for it. So something that used to take weeks in terms of the development cycle now takes minutes, and reports that used to cost hundreds or low thousands of dollars to build, we now do for free. It's been a really great success story for us.

It was a ton of work on the development side, and they were really patient with all of us on the customer side who were pushing and pushing and pushing for something better. I can't tell you how much the customers appreciate the work these guys put in to give us a better reporting option. There are still things that need to be worked out on both the development and user sides; we had issues until about three weeks ago with pulling SCORM data out of SCORM files, and it's working well now, so that's good. We really appreciate the work they put in, and it's been a huge leap forward for all of us. Now, on those calls, we don't have complaints about reporting, which is nice too. So they took the plunge, and they convinced us to follow them, with really great success. (That's them on your left and the customers on the right, by the way.)

So now we're inviting you to join us in the data lake, because the water is fine. I'll post the slides online later, but here are some resources, all the projects that we used. Some of these are just approaching maturity; a lot of this stuff really isn't that stable yet, but it's a work in progress. Check out Singer, check out the taps; there's a nice blog about Singer and Looker, which is specifically what we used to integrate, but you don't have to use Looker, you can use whatever you want. Join us for contribution opportunities tomorrow, and thanks for coming. Fill out the survey too, please. Thanks, and sorry we had no time for questions.