Great, everybody. First, thanks to the amazing organizers of R/Medicine — this has been quite an experience. There have been few, if any, technical glitches; it's just awesome, so kudos to y'all.

I'm Travis Gerke. I'm the Director of Health Informatics and the Scientific Director of Collaborative Data Services at Moffitt Cancer Center in Tampa, Florida — I'll explain what those groups are in a few moments. This is a co-presented talk with Garrick Aden-Buie, so it will go like this: I'll talk through my slides, then hand off to him to wind things down.

Since I don't own the last slide, I'll inject a couple of formalities here that I would usually cover at the end. First, you're about to see some really cool slides, and I can personally take credit for effectively none of them. Those of you who know Garrick will know that he is a community wizard and visionary when it comes to making xaringan slides elegant, interactive, and awesome. All the cool things you're about to see, I kind of whiteboarded a little bit, and then he took it and ran from there — so it's been a learning experience for me and a lot of fun. Second, we're hiring multiple positions at Moffitt that I thought I would mention. We always have a specific eye out for R developers and data scientists, and will over the next few months, so feel free to reach out to me after this if you want to come work with us.

All right, here we go. To orient you to the origin of this talk and where Garrick and I are coming from, I'm going to introduce a little bit of organizational infrastructure, for those who are interested in building out these kinds of personnel and package resources, and explain how it came to be that we gain high value from maintaining and developing our package universe at our institution.

I'll begin with a story which is based on Moffitt's experience but could really represent any institution's data-related journey. Once upon a time, our organization conducted all data-related business in an amorphous cloud known as the IT department. This is a common paradigm for many healthcare organizations in the early stages of data maturity. The IT department had many roles. Hospital operations needed dashboards for planning purposes — IT had someone who could do that. Researchers needed patient or biospecimen data for their IRB-approved protocols — IT also had somebody for that. Somebody, of course, needed to know about data lineage as well as coding and metadata standards — basically, how the data got here and why it looks the way that it does — and that was in IT. And organizing databases into a warehouse and granting access is of course important; IT had somebody for that too.

But eventually some of these teams were operating at a scale where they would be better situated as independent entities outside the IT gravity field. One of these was Business Intelligence: that person who was making dashboards for operational (i.e., non-research) stakeholders is now part of a larger team that creates such products at scale. Next, our research-focused stakeholders had many of the same needs as the operational end users, such as reporting, dashboarding, and, importantly, data provisioning. The twist in the research space, of course, is that such activities must be conducted in accordance with IRB and ethical approval, and assessing study design feasibility as it relates to data availability and structure requires specialized training in data science, biostatistics, and epidemiology. Hence the Collaborative Data Services team was formed — this is one of the groups that Garrick and I are representing here today.
The Collaborative Data Services team can't operate at scale in a vacuum either. A critical and complementary team, Data and Quality Standards, formed around IT's data historian. They ensure that data dictionaries are robust and that data lineage is understood by the Business Intelligence and Collaborative Data Services teams for appropriate downstream data usage. As institutional data assets grew, warehousing and access rules became necessarily complex, and Data Engineering formed as a new continent within IT to meet the challenge.

And now, with so many teams completing data-related operations at a rapid pace, we needed a shuttlecraft to coordinate technology strategy, form general data governance, and mine valuable software ore from the asteroid belt. (That's the dad joke — I know you're laughing; this virtual thing is hard. Asteroid belt. Hit it.) That shuttlecraft is the Health Informatics team. I spend a lot of time in that shuttlecraft making sure that the packages developed by people like Garrick make their way through the appropriate groups and are shared across our institution. When these tools are ready for deployment and maintenance in the institution's supported production environment, the Applications Development landmass within IT can help out; for example, they maintain software such as our RStudio Server and GitHub Enterprise.

This whole story, admittedly with some shortcuts for clarity, mirrors the rise of the chief data officer role across the healthcare industry — indeed, all of these groups tend to roll up to, or be horizontally aligned in some way with, the CDO's vertical. Taken together, this is our first hint that scaling data provisioning isn't just about scaling data: it's about scaling the people who are doing the provisioning. And in part two, Garrick is going to tell you more about the how.

So, as Travis talked about scaling provisioning by scaling systems of people, I'm going to talk about how we scale those people and their access to data systems through our R packages. I'm going to start with an entirely hypothetical but probably familiar story. It starts with a question: I want to connect our tissue sample inventory to patients' clinical data. It's not something I've done before, so I'm not quite sure how to access the samples table or how to link a sample to a patient — but obviously it has to be possible, right? So how do I get started? Well, if you believe the big-data stock photos, I go to the self-service data wall and point at the numbers that I want. In reality, it probably starts with an email — or many emails. I start by reaching out to someone I know in data engineering who manages that particular data resource, and I see what they can tell me: "Dear friendly data ops person, how can I connect our sample inventory to patient-level clinical data? I've heard that you know the secret. Thanks, Garrick." I fire off the email, and a little while later I get a reply: "Hey Garrick, here's the query we use to populate the table. Good luck." And look — the email came with an attachment that I can open up, and I'm immediately hit with a wall of SQL code. Okay, this query doesn't look pretty, but in a couple of hours I'll probably get the gist of it. If I go looking, somewhere in here there are probably some tables being referenced: there's a samples table, there's a patients table, there's a sample indicator. And these lines here — after a while of puzzling, I realize they're about turning coded values into text labels. All right, fine. And hey, at least it's code, right?
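The slide itself isn't reproduced in this transcript, but the kind of hand-rolled connection-plus-SQL code such an email might contain could look roughly like the following sketch. The DSN, table, and column names here are hypothetical stand-ins invented for illustration, not Moffitt's actual schema.

```r
library(DBI)
library(odbc)

# Connection boilerplate that every analyst has to remember (or copy-paste)
con <- dbConnect(odbc::odbc(), dsn = "abc_warehouse")  # hypothetical DSN

# A small taste of the "wall of SQL": walk from samples to patients,
# drop soft-deleted rows, and decode a coded column into a text label
dbGetQuery(con, "
  SELECT s.sample_id,
         p.patient_id,
         c.choice_label AS sample_type
    FROM samples s
    LEFT JOIN sample_indicators si ON si.sample_id  = s.sample_id
    LEFT JOIN patients p           ON p.patient_id  = si.patient_id
    LEFT JOIN choice_labels c      ON c.choice_code = s.sample_type_code
   WHERE s.is_deleted = 0
")

dbDisconnect(con)
```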
Well, since we're emailing files around, sometimes you'll get a query like this in a slightly different format — like a Word document, where the query doesn't really fit on the screen or the page, and let's just say formatting choices are fluid. Putting aside emailing queries around in Word documents, SQL queries are not a great vehicle for knowledge transfer. They're good for precisely communicating data specifications in the robot language that databases understand, but we have other ways of working with data that have been specifically designed with humans in mind — for example, dplyr, whose API is very intentionally designed in line with the philosophy that code is written for people to read and only incidentally for machines to execute. This reminds me of a great quote from Jenny Bryan: "Of course someone has to write for loops" — I mean, SQL code — "it doesn't have to be you."

So let's take a look at what this query might look like in an alternate universe. Here's the same query rewritten using a blend of dplyr and custom functions that support our particular setup. Let's walk through the code step by step and see what it represents.

First of all, we call our universe the moffittverse — very much inspired by the tidyverse. A single library(moffittverse) call loads a common set of packages that we use for nearly every data request. Most of these packages come from the tidyverse, but we also include our own supporting package, moffittCDS, specifically tailored to my team's workflow. This creates a common starting point for everyone on the team. It also gives us a formal on-ramp to install and set up database dependencies that we can leverage in specific packages that interface with our many backend systems.

This makes connecting to a specific database straightforward: you call use_backend() with the name of the database or server that you need to connect to — in this case, the fictional ABC database. Behind the scenes, this loads database-specific packages, including a specific package for this resource called moffitt.abc. Each of these backend packages has two primary goals. The first is to simplify access: by default, moffitt.abc will not only remember the incantations required to connect to the ABC database, it will actually manage the connection for users internally. It also provides easy access to tables with functions like abc_tibble(), which hides a bunch of less inviting dbplyr code and connects to tables in the ABC schema directly. So in this step, we connect to the three tables we need: samples, patients, and the sample indicators table.

At this point we've set up our workspace and environment and we've connected to the tables that we need, so we can now focus on how these tables relate to each other — how we can get from samples to patients through a series of left joins. Finally, the last lines speak to the second goal of the backend-specific packages, which is to wrap common, tedious, or error-prone database moves into standard functions. Because we're working in R, we have a lot more flexibility to write functions, use tidyselect and tidyeval, and do things that would otherwise be very hard to do in SQL — like applying a "not deleted" filter to all of the tables used in the query, or automatically looking up the text labels of coded values.
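Putting the walkthrough together — and with the original slides not captured in the transcript — the rewritten query might look roughly like this sketch. The package and function names (moffittverse, use_backend(), abc_tibble(), abc_choice_replace()) come from the talk; the exact signatures, arguments, and join keys are assumptions made for illustration.

```r
# One call loads the tidyverse plus our own moffittCDS helpers
library(moffittverse)

# Load the backend package for the fictional ABC database (moffitt.abc)
# and let it manage the connection internally
use_backend("abc")

# Easy access to tables in the ABC schema; a "not deleted" filter is
# assumed to be applied to each table behind the scenes
samples    <- abc_tibble("samples")
patients   <- abc_tibble("patients")
sample_ind <- abc_tibble("sample_indicators")

# Express how the tables relate: from samples to patients via left joins,
# then replace coded values with their human-readable text labels
samples %>%
  left_join(sample_ind, by = "sample_id") %>%
  left_join(patients, by = "patient_id") %>%
  abc_choice_replace(sample_type_code)
```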
Okay, let's take a step back and reflect on this code as a whole. It's not that this is fewer lines of code, or less repetition, or a question of R versus SQL — it's that this code does a much better job of explaining to humans how the data is being collected and transformed. There are still plenty of assumptions here, but as we'll see, because these functions live in our packages, they bring a lot of context with them.

So let's take a look at the source of the abc_choice_replace() function. We've already seen that our naming conventions communicate the function's intent: we can read it as "...and then replace the choices." On top of that, the function name is chosen to aid discoverability — in other words, a user can easily find other functions that operate on choice columns by exploring autocomplete, typing abc_choice, and seeing what other options are available. Because this function lives in an R package, we can document what the function does — and why — right next to the code, and that documentation is comfortably available right inside the data analysis environment. The body of the function can itself be considered technical documentation, recording how the function works; it's more precise than just a description of what the best practices are. And we've learned that when interfacing with more technical teams, the function itself becomes a specification for how we accomplish tasks, which makes it easy to say to engineering, "this is what we do," or "this is what the new platform needs to support."
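The function's source shown on the slide isn't in the transcript; as a stand-in, here is a minimal sketch of what a roxygen2-documented abc_choice_replace() might look like, assuming the hypothetical choice_labels lookup table and column names from the earlier sketches. The implementation details are invented, not Moffitt's actual code.

```r
#' Replace coded values with their text labels
#'
#' Looks up a coded column in the ABC choice-label table and replaces
#' the codes with their human-readable text labels. This is the standard
#' way we decode choice columns from the ABC database.
#'
#' @param data A data frame or lazy table containing a coded column.
#' @param col The coded column to decode, e.g. `sample_type_code`.
#' @return `data` with the codes in `col` replaced by text labels.
#' @export
abc_choice_replace <- function(data, col) {
  # Hypothetical lookup table with columns: choice_code, choice_label
  labels <- abc_tibble("choice_labels")

  # Join the lookup table on the coded column...
  col_name <- rlang::as_name(rlang::enquo(col))
  out <- dplyr::left_join(data, labels,
                          by = stats::setNames("choice_code", col_name))

  # ...then overwrite the codes with their labels and drop the helper column
  out %>%
    dplyr::mutate({{ col }} := choice_label) %>%
    dplyr::select(-choice_label)
}
```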
Taking another step back, this function isn't just about making life easier for someone working with this data. We now see that it's a self-contained unit of knowledge. In this view, an R package isn't just a place to keep code: it's where we store best practices and lessons learned, and it's how you share that knowledge with others on your team — and in fancy websites. Seriously, R's tooling for package development is amazing. Tools like pkgdown don't just make your code pretty and browsable and shareable and discoverable; they make your package documentation a viable knowledge repository and a place to turn when you need to learn something new. On top of this, if you're using version-control front ends like GitHub or GitLab, you also have a public place for sharing knowledge, asking questions, or getting help when things break down. Rather than sending emails that are only seen by the people copied on the email, you can open an issue, where your question is seen by everybody else, answered publicly, available for future reference — and maybe becomes the basis of new functions and new functionality.

I'd like to close with a few practical tips about how to make this happen in your organization and on your teams. The first one is: start small. Start with one team and make their lives better. I guarantee you that if you look for it, you will find a painfully manual process just waiting for a hero like you. My second tip is to stay small. Rather than throwing everything into one big monolithic package that everybody uses, I've had success creating smaller, more focused packages. That gives me a little bit of freedom to experiment and also makes sure that I'm providing targeted solutions to the problem at hand. My next tip is to use vignettes. Vignettes are a great way to document and share processes that aren't easily captured in a single function, or even in R code — I've used vignettes to document database driver setup and configuration, or to show how you would accomplish a whole analysis from start to finish. And finally: be opinionated. Okay, wait, that came out wrong — provide a happy path. Consider that your users are likely used to a range of workflows, so help them fall into a pit of success by making sure the happy path is as smooth and as bump-free as possible.

None of this would be possible without a slew of packages and resources. Key among these are usethis and devtools, which are great for package building, and roxygen2 and pkgdown for package documentation. If you're new to package building, Hadley Wickham and Jenny Bryan's R Packages book is a great place to start learning about R packages, and it's an invaluable resource to turn to when you get stuck. We also use drat, by Dirk Eddelbuettel, to create an internal CRAN-like package repository, and it has made package installation so much nicer and easier for our users (another option there is RStudio's Package Manager). And finally, a big shout-out to Michael Kearney's pkgverse template, which made it really easy to pull all of our packages into a cohesive unit and to create something a fraction as cool as the tidyverse ourselves.
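To give a flavor of the drat piece just mentioned, here is a minimal sketch of publishing a package to an internal CRAN-like repository and installing from it. The shared path and URL are hypothetical; in practice the repository usually lives on a shared drive or an internal static web server.

```r
# Maintainer side: add a built package to the internal drat repository
# (the repodir here is a hypothetical shared location)
drat::insertPackage(
  "moffittverse_0.1.0.tar.gz",
  repodir = "//shared/drive/miniCRAN"
)

# User side: point R at the internal repo alongside CRAN...
options(repos = c(
  moffitt = "https://r-packages.example.org",  # hypothetical internal URL
  CRAN    = "https://cran.r-project.org"
))

# ...after which installation works like any CRAN package
install.packages("moffittverse")
```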
So with that, I'd like to say thank you for giving us the opportunity to talk about our experience building packages, and I'll leave it here — I could talk about this for a long time, but thank you for the opportunity. You can find Travis and me online if you'd like to talk more about this. Otherwise, I wish you the best of luck building your own universe of R packages, and happy R users.

I'm seeing questions — I could answer some; I thought a moderator would jump back in. There are lots of questions about the slides. Garrick, can you take that one: how do we make them? Yeah, so the xaringan package is what I used to make the slides, and then a lot of extra HTML and CSS — a lot of slide crafting, I call it. Thank you.

While I'm browsing the questions, if I could pick one more, from Peter: "Our data managers like to randomly change the names of tables and fields when the mood strikes." That's a great question, and that is why we have two teams that are complementary in this regard: the Data Engineering team that I mentioned, and then the Health Informatics team, which has a governance function where we make sure that we approve any table-name or field-name changes that happen and that they're validated appropriately.

Hey Travis, another question: how do you get over the learning curve to introduce people to functions? So, I'll take that. I think, as a package developer, you definitely have two or maybe three kinds of people in mind. First, the very new users who are going to use your functions. I see my role as doing a lot of watching over other people's shoulders, seeing how they approach a problem and how they tend to code. Often I start to see patterns emerge between how one person is doing one thing and another person is doing another, and by having those conversations I start to think, okay, we could build this into a function. So maybe I'm, in a way, a curator of these processes — but this is also something that you can eventually train your users to do, so they write functions themselves.

There is another question here — well, question or comment: "It would be interesting to see how you handle huge data sets in the packages." Yeah, so we write packages that connect to databases rather than putting large data sets in those packages. There's a lot of value in that, though, because we find that basically every database has its own quirks, its own weird way of storing the data — the one or two things that you need to know about that database to use it — and packages are an awesome way to document that knowledge.

Okay, great. Thanks, guys — that was a really great talk, thank you so much. I will end the session then and move people over to the next.