Hello. So, I'm Lori. Most people know me as Lori Shepherd. I did get married, so I'm also Lori Kern, so I go by either. I've been part of the Bioconductor core team for several years now, and as mentioned, I'm going to talk about AnnotationHub and ExperimentHub, some key infrastructure pieces available to the community. Throughout the presentation I hope to cover what AnnotationHub and ExperimentHub are, how you, as a user, would find resources to use, some common misconceptions regarding the hubs, and then, from the maintainer and package developer side, how you would contribute or add data to the hubs and the common pitfalls when submitting data.

So, just a poll here, because I like to be a little interactive — there's not a lot of it. For those of you in the audience here (sorry, I can't get to you virtual people): how many of you have used the hubs before? Okay, good. And what do you think the hubs are, just as a general concept?

At heart, AnnotationHub and ExperimentHub are basically a database of pointers: a database of resources and of metadata about those resources — most importantly, where the data can be downloaded from. So it's a database of information that you can query to find resources of interest to use in your packages or in your research.

When you use AnnotationHub and ExperimentHub, you load the library and call the constructor, and you get back a hub object. That hub object has an SQLite database in the back end with all of this metadata about particular resources. The metadata is given by a contributor at the time of inclusion into the database, and you can query against it. And most important, again, is where you can download the data from. Files are stored remotely instead of inside a package, and they're only downloaded as needed or as requested. If a contributor doesn't have a location to store the data, Bioconductor provides a default location, currently a Microsoft Azure Data Lake resource, but files can also be stored elsewhere, on institutional servers or another publicly accessible server. So they're stored remotely, downloaded as needed, and then cached locally. There's one expensive download, and if you call that resource again on your local machine it will be really fast, because it will find the locally cached copy.

The main functions you'll use to find resources are query() and subset(). Most people will find query() the most useful, and it's the recommended one; subset() is more of an exact search and requires a priori knowledge. And again, this all works against the metadata in the database. So what kind of metadata are we talking about? When a resource is submitted, we get information that must be provided. These are the required fields: a title, the rdataclass, species, taxonomy ID, the preparerclass — which is a little misleadingly named, but the preparerclass is essentially the package name, because when you submit data you have to have a package to submit it with — and other useful information that we'll get into in a little bit. So here's a code example.
Sorry it's not interactive, but we'll use AnnotationHub to start. We load AnnotationHub and create the AnnotationHub object. I created this slide as of July 6, and as of July 6 we had 65,055 records in AnnotationHub that could be queried, found, and used. The object gives you a snapshot of the different metadata fields and what is available. So again, to re-emphasize, there are different columns in the database and in that object: you can query against the title, the maintainer, the source type, and so on. For instance, if we looked at the unique species, we have 2,557 unique species with some form of data available in AnnotationHub, and we can get a snapshot of the different species. Similarly, there's the rdataclass. The rdataclass in this instance is what type of object you're going to get to work with when you load or read that remote file into R: are you going to get a TwoBit file, a GRanges object, or a SummarizedExperiment object? So you can also query for objects that you're used to or familiar with using.

As a further example of query — sorry, all you cat people, I'm a dog person, so we're going to use dog, and I'm not going to try to pronounce the Latin name. If we did a query against the hub for dog, we would get back another hub object, pared down, but you see we still have 223 records here, so still quite a lot. So maybe we like GRanges objects: if we did a query with dog and GRanges, we get a little better, we're down to 125. So you can see where we're going: the more specific you can get, the more specialized the results. If any of you were in James MacDonald's annotation workshop, I think he equates it to a Google search: it's kind of hit or miss, you might have to massage the query a little to find what you want, but you can pare it down and try to get objects of interest.

Once you have a more pared-down list, you can really dig into the metadata columns. With a single bracket and a given AnnotationHub ID, you can get that metadata and more specific information on a particular resource, to know whether you really want to download it or not. To download the resource you then use a double bracket, and it will actually download the resource locally and cache it. As a proof of principle, with my R code here: the first time, we get the long download — it might take a minute or two depending on the file size — and then it loads it into R and we have our GRanges object. But if we tried the double square bracket again, it would skip that download step and load it directly from your local cache. So it's a time saver.

Continuing on, because I came from the annotation workshop: maybe in the end we like GRanges objects, but we're really looking for a TxDb object. We can take that GRanges object we just loaded into R, load GenomicFeatures, and there's this makeTxDbFromGRanges() function, and we can get our TxDb object. And if we did that, we can see that we actually had a lot more options and some more recent builds — so again, kind of proving the point that you might want to play around with your queries.
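A minimal sketch of that AnnotationHub workflow (the record counts and IDs you get back will depend on the hub snapshot you're using, so the indexing here is just illustrative):

```r
library(AnnotationHub)
library(GenomicFeatures)

ah <- AnnotationHub()        # constructor; builds/updates the local metadata cache
length(ah)                   # number of records in this snapshot
unique(ah$species)           # snapshot of available species
unique(ah$rdataclass)        # types of R objects you would get back

## Pare down with query(); terms are matched against all metadata fields
dogs    <- query(ah, "Canis familiaris")
dogs_gr <- query(ah, c("Canis familiaris", "GRanges"))

dogs_gr[1]                   # single bracket: inspect the metadata for one record
                             # (you can also use the "AH..." ID shown in the metadata)
gr <- dogs_gr[[1]]           # double bracket: download the file (first time) and load it into R
## calling dogs_gr[[1]] again skips the download and uses the local cache

## Convert the GRanges into a TxDb if that's what you were really after
txdb <- makeTxDbFromGRanges(gr)
```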
And there may be some ways to get different objects depending on how well you form your queries. I don't know why I went backwards — ExperimentHub works the same way. The main distinction: AnnotationHub is for annotation data, some sort of mapping from one source to another, whereas ExperimentHub is for experiment data and experiment packages. Right now Bioconductor doesn't allow large data inside a package at submission, so if you wanted to include experiment data with your package, for examples or a more detailed vignette, we would ask you to put it in ExperimentHub and download it from there, which we'll get into in a little bit.

ExperimentHub works the exact same way, with the same kind of constructor. Again, this is the same date, July 6, but this time we have 6,332 records in ExperimentHub. And again we can perform queries the exact same way and pare down. I talked about the preparerclass: especially with ExperimentHub, maybe you know the package that had the data in it, so you could actually search for the package name to get all the data that was provided with a particular package of interest. And just as a different example with different metadata, we can search on the genome and see which particular genomes those objects came back with.

I did mention subset(). As a subset example: subset() is an exact match and again requires a priori knowledge. With subset() you give it the hub object and an exact match against a particular metadata column. So, again, that's why most people find query() a bit more useful, especially if you don't know exactly what you're looking for.

I guess before we get into misconceptions and everything, are there any questions about using and finding data so far?

[Audience question about whether there is a size limit on hub resources.] Not right now — we haven't defined an exact limit. I would say if your data is extremely large, reach out on the Bioc mailing list; we generally ask that you reach out there if there were to be some sort of size limit, but right now we haven't limited sizes.

[Question about which fields query() searches.] Query will search all fields. I think there is an additional argument where you could limit which columns it searches, but otherwise it goes across all columns by default, so you get a broader search that tries to catch all fields of interest. And it would be an "and" search, so multiple terms are combined with "and" rather than "or". We can get into that more when we talk about submitting data.

[Question about whether resources have to be R objects, for example images.] The real limiting factor would be loading it into R, but there's no reason why it couldn't be images, and for whoever submits those, we would probably recommend using the generic FilePath load method, which we'll get into in a second.
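Before moving on to misconceptions, a minimal sketch pulling together the ExperimentHub and subset() usage from above (the package name in the query is just a hypothetical example, and the genome value is only illustrative):

```r
library(ExperimentHub)

eh <- ExperimentHub()
length(eh)                       # total records in this snapshot

## Search by the contributing package's name (the preparerclass), if you know it
query(eh, "SomeDataPackage")     # hypothetical package name

## Or look at which genomes the resources are tied to
head(unique(eh$genome))

## subset() needs an exact match on a metadata column (a priori knowledge)
subset(eh, genome == "GRCh38")
```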
So, common misconceptions, before we get into contributing data. One is that Bioconductor provides all of these resources — that's actually not true. We rely on the community to provide resources. Bioconductor provides a few by default — a few annotations at release time and when there are new Ensembl releases — but other than that, everything in the hubs is user contributed. So we really rely on the community to help each other, to contribute and be able to distribute resources of interest to the community.

Another one is that we update the data. Along the same lines, once it's in there, we don't update it. That's why we ask for a maintainer and why we started associating data with a package: if people are interested in a more updated version, they can reach out to the maintainer — since they provided and curated it and are more familiar with generating it, they can probably provide an update and put it in the hub again.

And another one is that all data is hosted by Bioconductor. Again, you can host it on your institutional server or on a publicly accessible server like Zenodo. The only thing we ask is that it not be at a personal level, like a GitHub repo or a Dropbox — we want it to be more public and more stable. But again, if you don't have access to such a resource, we can provide the Bioconductor Microsoft Azure Data Lake that Microsoft has so graciously given us and helped us with, to host your resources.

So, contributing — getting a little more into contributing rather than using the hubs, since we just talked about how most of the resources are user contributed. We ask that resources be package based so that we can track them, and hopefully to help you too: if you have a script that did the curation or generation, and you want to regenerate it or do another version or another subset of resources, the code is there, easily reproducible and traceable. And again, if there's a problem with a resource, hopefully someone from the community can reach out to you with questions.

We do have a helper package: the Bioconductor package HubPub has been created. It takes advantage of Leo's biocthis package to create a template, and it also has a lot of helper functions that will help you create the necessary files for a hub package. A lot of that information is also in the "Create A Hub Package" vignette in the HubPub package.

So what really makes a hub-contributed package different from a normal package? Very few things. All Bioconductor packages require biocViews terms in the DESCRIPTION, but if you're going to use the hubs we ask that you put in the appropriate ExperimentHub, AnnotationHub, ExperimentHubSoftware, or AnnotationHubSoftware term so that we can track it as a hub package. And instead of the data being directly in your package — as I briefly mentioned, we don't allow large data files, especially with git, so they're not allowed in the package — we encourage you to distribute it through ExperimentHub. The data would not be in the package itself; instead, in the inst/extdata directory you would have a metadata file with the fields we just talked about: the titles, the descriptions, and so on. So there's a lot of information about the resource provided in a metadata CSV file that can then be put into the database, so it's there and queryable. Additionally, you would have an inst/scripts folder that gives information on how the resources were generated, kind of like a how-to, so that if a user did want to recreate your data objects, they would be able to. It can be code, pseudocode, or text — just some sort of indication — and, most importantly, any source or licensing information people should be aware of for the resources should be included there. Everything else is the same as a normal package submission: you would submit it to the new package submission tracker for review and it goes on as normal.
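As a rough sketch, a hub-backed data package might be laid out something like this (the package and file names are just examples, not required names):

```
MyDataPackage/
├── DESCRIPTION            # biocViews includes ExperimentHub (or AnnotationHub, etc.)
├── R/                     # optional accessor / documentation functions
├── inst/
│   ├── extdata/
│   │   └── metadata.csv   # one row per resource; the fields described below
│   └── scripts/
│       └── make-data.R    # how the resources were generated (code, pseudocode, or text)
├── man/
└── vignettes/
```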
Again, the metadata file is the biggest thing, so I wanted to take a little bit of time and break down the different metadata fields really briefly, since that's the main difference and the important part — that's what people will be querying against in the database.

So, the required columns of the metadata file. First is a Title. In ExperimentHub there is the option to create accessor functions for a specific resource: if you titled it "LoriV2" and use that accessor option, a function with the same name as the title gets created automatically in the package, and if a user calls it with parentheses it automatically loads the resource into R. If you use that, we do tell you to try to avoid spaces and punctuation, for obvious reasons, so it can convert into a nice function name the user can call.

Description is kind of like an abstract for your resource: a somewhat more thorough description of what the resource is. We ask you to avoid special characters, since it's going into an SQLite database, which doesn't always like special characters, so please avoid them if at all possible.

Then BiocVersion. The BiocVersion should be the Bioconductor version in which the resource will first be available. More often than not that's the current development version of Bioconductor, so if you submitted one before the next release it would probably be 3.16 right now.

For Genome, Species, TaxonomyId, and Coordinate_1_based, we realize these are more geared towards annotation data, but we do ask for them for experiment data resources too; they can all be NA if they're not appropriate for the data. We have some helper functions to make sure these are consistent as far as capitalization and punctuation, since R is case sensitive — helper functions like getSpeciesList() and validSpecies() to help keep the species consistent. We also have a validation function for the taxonomy ID, given a species, that matches against the GenomeInfoDb taxonomy database, so you can always use that for validation. Coordinate_1_based is just for annotation data, to know whether you're 0-based or 1-based, since some platforms index differently; that's important to know for annotation purposes.

The next three deal with the original source of the data. SourceType is the format of the original data, and we have a helper function, getValidSourceTypes(). If you don't find a valid source type that you think is appropriate for your data, you can always request that one be added through the bioc-devel mailing list or the hubs@bioconductor.org email. And if your data is truly simulated, we do have a "Simulated" type that is acceptable, which we'd ask you to use. SourceUrl is the original location of the data files; sometimes it combines multiple sources, and if you have multiple sources we ask that it be a single string separated by commas. Again, if the data is simulated, we recommend putting either your lab or the anticipated Bioconductor package short link as the SourceUrl — either would be an acceptable option. Then there's SourceVersion: if there's a given source version or date for the data, we recommend putting that there. And DataProvider is where the data came from — UCSC, Ensembl, your lab — some indication of where it originated.
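A quick sketch of checking some of those fields (Species, SourceType) with the helper functions mentioned above — my assumption here is that they're the ones exported from the AnnotationHubData package:

```r
library(AnnotationHubData)

## Species / capitalization helpers
head(getSpeciesList())          # recognized species names
validSpecies("Homo sapiens")    # check a species string before putting it in metadata.csv

## Valid values for the SourceType field
getValidSourceTypes()
```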
I briefly mentioned it already: the RDataClass is the type of object you get back to work with in R. So, am I getting back a GRanges object, a SummarizedExperiment, a RangedSummarizedExperiment, a SingleCellExperiment? It gives people an idea of what they would be expected to work with, or can transition to.

Then DispatchClass. This can be a little tricky for submitters, and this is where we get into reading and loading. The DispatchClass determines how AnnotationHub or ExperimentHub will load the remote file after it's downloaded: it uses predefined dispatch classes to automatically load the file into R for immediate use. For example, if you used the generic save(), the DispatchClass would be Rda, so that load() is used; if it was created with saveRDS(), you would use Rds, so it knows to use readRDS(). You get the idea. There are a number of available ones, and AnnotationHub has a helper, DispatchClassList(). When in doubt, if you don't know what to use, we recommend FilePath, which is kind of a catch-all: the file is downloaded, but instead of trying to load it back into R for you, it gives you the local path where it was cached on your system. Then you can use that path with whatever load or read method is appropriate — even a specialized read method, if it's some specialized object or something new you're creating — and load it in yourself.

And then probably the most important columns are the Location_Prefix column and the RDataPath column. These are the two that define where the remote file is and where it should be downloaded from. For Location_Prefix, if you're using the Bioconductor default location you actually skip this column, because we'll populate it for you; but if you're using your own server, it's the base URL to that server, and it should have the trailing slash. The RDataPath is then the remainder of the path to the resource. When you're using the default Bioconductor location, we ask that you upload into a directory with the same name as the package. You can have any number of subdirectories underneath it, but it should be the name of the package and then your file, or the name of the package, a subdirectory, and the file, including the resource name and the extension. You can think of these two like this: if you took the Location_Prefix and the RDataPath and concatenated them together, it should be the complete URL of the downloadable resource — as in, if you clicked on it, it would start an automatic download in your web browser. It should be that complete download URL.

I know this can be confusing, and we've gotten a lot of questions on it in the past, so just to emphasize, here are a couple of different examples. Say you want to upload a directory: you're uploading for your package, you're using the Bioconductor default storage location, and you have subdirectory one and subdirectory two, and each of those directories has two files — we'll say one of the files has a .csv extension and one has a .rda extension. Your metadata would then look something like the sketch below: there would be no Location_Prefix column, but you would have an RDataPath column, and it would be your package name, subdirectory one, File1.csv; your package name, subdirectory one, File2.rda; and so on.
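A minimal sketch of what those RDataPath entries might look like in the metadata CSV (the package, subdirectory, and file names are made up for illustration, and the other required columns are omitted here):

```
Title,RDataPath
"Resource one","MyDataPackage/Subdir1/File1.csv"
"Resource two","MyDataPackage/Subdir1/File2.rda"
"Resource three","MyDataPackage/Subdir2/File1.csv"
"Resource four","MyDataPackage/Subdir2/File2.rda"
```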
So here's a second example: say you're hosting it on an institutional-level server, and let's say you had two files whose complete download paths look something like the following. Your Location_Prefix then becomes the base URL of the path, and your RDataPath is the completion of it. So you see we have my institutional website and data server with the trailing slash as the Location_Prefix, and then the RDataPath becomes the rest of the URL path. In truth, the data-server part could probably be included in the RDataPath instead — this is a little bit at your discretion as you're contributing. I would say the Location_Prefix should be the smallest path that is repeatable: if you eventually submit other datasets, that base stays consistent even as you add more data, and the unique part lives in the RDataPath. Hopefully that makes sense.

The last two required fields in the metadata file are Maintainer and Tags. Again, we ask for a maintainer so that if someone reports an issue with a resource, they can either contact us and we can contact you, or they can contact you directly; it's a name and an email address. Tags can be anything you think is relevant to the resource, anything additional people could query against. For multiple tags, we ask that it be a single string separated by colons. We also grab any unique or specialized biocViews from your package and automatically apply them as tags — so it's not just the hub term we asked you to put in as a biocViews. If you look into biocViews, you can get more specific, down to single-cell type data or a specific platform, so you can put those in your biocViews and they'll be automatically applied to all the resources, whereas the Tags column is where you can get a little more specialized per resource, as far as tagging a resource for querying.

You can include any other columns in your metadata file; they'll just be ignored by us and won't be included in the database. But if you want to include additional columns for your own purposes, or if you're going to use that file in your code in some other way, you can include as many as you'd like — we just won't use them.

Any questions thus far? There's a question in the chat from Mandy Griswold: is there an upside to uploading data to the hub rather than hosting data on a separate server and then accessing it through BiocFileCache? I would say probably the visibility and exposure, because then you have it available in a multitude of ways. BiocFileCache is its own semi-self-contained thing, and the data would only be discoverable through your package, whereas with ExperimentHub and AnnotationHub — I didn't put it in this presentation — there are also Shiny and web app versions, so people can find your data and possibly use it in ways you didn't even think possible, or collaboratively, research-wise; it can be exposed and distributed in other ways.

Any other questions? So, thinking of places where people dump data — I don't know whether it's Zenodo or Dataverse or something like that — are there any of them that work better? Or any that you've had problems with, or that you'd definitely recommend because they work really well? I would say not offhand.
I would say I pointed that out as a main misconception because most people think they have to upload to Bioconductor, whereas we've had people say, "well, I have it on this server, can you upload it for us?" and we're like, you can keep it there, it's okay. So we haven't run into those sorts of issues. When in doubt you can always send the Bioc mailing list or hubs@bioconductor.org a question, and I guess I'd put it back out to the community to let us know if they're experiencing, or have experienced, slow download times from anywhere, because then we could be aware of it and tell people to avoid hosting there. But we haven't seen any large drawbacks from any of the hosts that have currently been used outside of us. I would say using the Bioconductor default one kind of guarantees it will work, and you know that we're probably going to secure it and keep it around, which I guess is one of the plus sides: we like to ensure data reproducibility, so we're going to try, within our power, to make sure it never goes away.

I have a question: what kinds of data objects do you typically find are most commonly uploaded, that you can download directly, like the SingleCellExperiments in your examples? I would say the recent push is probably SummarizedExperiments and SingleCellExperiments, just because they're kind of the hot thing and the most widely used right now. Awesome. In AnnotationHub, for every Ensembl release there's also Johannes Rainer — he provides, I can't remember exactly what type of object, but for every Ensembl release there are, you know, his versions of them — and we also include TwoBit and GRanges versions of all the Ensembl releases. So those would be other big, popular ones, in AnnotationHub at least.

Any other questions? So, I did harp on metadata files, so just a bit about the file itself, since that was all about the contents of the metadata file. It should be in the inst/extdata directory of the package, and it should have an accompanying file in inst/scripts that describes how those resources were generated. Again, it can be code, pseudocode, or text; we just want a little bit of a how-to on how it was generated, so that if people wanted to they could produce an object of the same kind, or at least know how it was formed, and most importantly so that any source or licensing information is included there and we know it's appropriate to distribute and include in the hubs. It has to be a CSV file with all the required columns that we just went over in detail. And we've been referring — well, I've been referring — to it as metadata.csv, but it can be called anything you want as long as it's a CSV file: it could be resources.csv, it could be subset_data_1.csv. And there can be more than one. A lot of data submitters who do repeat or additional submissions will often include a separate metadata file, so that they have some sort of distinction between versions. That's totally acceptable — when you submit or request additional data to be added, just let us know the name of that file — but it can be called anything, and you can have multiple files if that's easier for you to maintain, based on those differences.

As far as the submission process: basically, you would create your template package with the needed biocViews and that metadata file.
Then you would shoot off an email to hubs@bioconductor.org — I don't know why I put maintainer@bioconductor.org there; the main one is hubs@bioconductor.org — and give us the link to the package so that we know where to find it. If you're using the Bioconductor storage, you request access, because we have to give you specialized keys in order to upload the data; we give you temporary access credentials so that you can upload the data to a staging site, which we then move over into the final location. Once it's available, we'll let you know it's in the database, and then you can alter your package to use the hub interfaces to download the data, or to make people aware of it, and finish off your package documentation. Then you would submit it to the new package submission tracker just like a regular package. So it's fairly straightforward as far as submitting; there can be a little back and forth with getting the data into the database.

Oh, okay — I guess before I go into that, there's the HubPub package I mentioned before. Kayla from the core team has developed it; it utilizes some of Leo's biocthis to create a template package, and it has a lot of helper functions to help add the data to the metadata file, get it into the right format, and do a lot of validation, so that hopefully submission is more straightforward, with fewer hiccups, when sending the data to us to get it into the database. She has a great package vignette for using those helper functions, and the "Create A Hub Package" vignette is available whether you're converting an existing package to a hub package or creating a hub package from scratch.

Some common pitfalls — just some of the things we sometimes go back and forth on when getting the data into the database. R is case sensitive. If you didn't know it's case sensitive, it is. It does make a difference, especially with that download location: if it doesn't match, and match exactly with case, there will be download issues. Along with that, just general typos: be careful that what you upload to your server is the actual location path in your metadata file; we get a lot of just random typos. Another one is that the number of uploads should match the metadata. We normally give you permission, especially if you're on a Bioconductor server, to check your uploads, so in general they should match. I say in general because there are exceptions — there are specialized cases with annotations, like a BAM file and its BAM index file, where they're associated together, so then your uploads aren't going to match one-to-one — but in general your uploads should match the number of rows in your metadata file. Also, there need to be at least two unique tags per resource: either through biocViews or using that Tags column, there have to be two unique, valid tags per resource. And again, as I mentioned with the Description column, try to avoid special characters, because the database doesn't always like them. So that's it — just some of the things we normally go back and forth on.

And we do have validation functions, in AnnotationHubData and ExperimentHubData, called makeAnnotationHubMetadata() and makeExperimentHubMetadata(). We run these to get it into the database.
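A minimal sketch of running that check yourself before emailing us (the path and file name here are placeholders for your own package checkout):

```r
## makeExperimentHubMetadata() is in ExperimentHubData;
## the AnnotationHub equivalent, makeAnnotationHubMetadata(), is in AnnotationHubData
library(ExperimentHubData)

## Point it at your package directory; it reads the metadata CSV in inst/extdata
## and validates every field against the rules described above.
meta <- makeExperimentHubMetadata("/path/to/MyDataPackage",
                                  fileName = "metadata.csv")
meta   # an error here means it won't go into the database as-is
```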
So if you run it and you get an error, we're going to run it and get an error too, and it's not going to get into the database. So we really recommend running these before trying to send it to us. And if you're getting an error, or can't understand why there's an error, reach out to us — we're more than happy to help figure out why, or what needs to change to make it valid.

So, I'd like to acknowledge our grant funding for Bioconductor, and the ITCR grant that funds all of our work with Bioconductor and specifically the hubs; the core team throughout the years — I didn't start the hubs, they were created by other core team members, and we've had many contributors with package code and contributions and recipes to get in the data that we provide by default, as well as the maintenance and the infrastructure behind it. And again, since it's user contributed, it wouldn't really be popular without you all contributing resources to the hubs and using them. So, all of you. And if there are any more questions — that's all the slides I had — I'd be happy to take them.

[Question from Leo:] Yeah, so I feel like I have a lot of edge questions — I mean, edge cases, or cases that can break it — and maybe we need to talk about them more. One that comes to mind is recount3, which, now that I know that data can be hosted elsewhere — that's more than, like, maybe a million files; maybe that's too much. And then the other one is a family of questions related to spatial transcriptomics, about what resolution we should be uploading. Should we upload the data that is easy to read into R? Should we also upload the raw data that then gets processed into something easily read into R? Should we upload the actual R data objects? Because that package — for example SpatialExperiment — has had a lot of breaking changes, even within a major version of Bioconductor, so the data can become unusable, because the package changes. So, I don't know — I don't know if there's one you want to get more into, or which ones we could...

We definitely should talk, because we definitely should try to expand it, make it more robust, make it more usable. I would say millions is probably a bit of an overload for the current implementation of the hubs. I didn't really get into the future and what we plan for revamping the hubs, but we do want to look at ways to move forward. Right now, when you use AnnotationHub or ExperimentHub, everyone who calls the constructor downloads a copy of the database and it gets queried locally. We want to revamp that over the next year to move to a cloud-hosted database, only downloading as necessary, while still having some workarounds to be able to download parts of it and use it locally if needed. Moving towards that seems like more of a solution: as long as it can get into the database and have those pointer references, that would make something like that closer to being possible.

As far as raw data versus processed data and where to go, I would say we're probably not the experts there. You'd probably want to get a feel from other researchers in the field, to know whether, if they had that raw data to work from, they would have other use cases for it. If it could be expanded out in ways beyond your specific R objects, then more of the raw data should probably go in.
If it's a more specialized use case within a package, or it increases the functionality or productivity of your individual package, then I would say getting to the more processed R object is probably the better way to go. But that's my own personal thought; I'm sure other people, and probably other technical advisory people, would argue one way or the other, so I would say feel it out for your audience.

[Question:] And would you prefer to have those discussions on the mailing list, or as issues? Either, but it might be good to have them on the mailing list, because that gets more of the community involved, and since it would move towards a community feel and knowing where people want to go with it, I would probably encourage it there. I'm probably going to get a lot of heat for that, but yeah, I would say probably the mailing list, because it would be good to get other people's opinions. Awesome, thank you.

[Comment:] My next input to you, Leonardo, is that your package should have it go from the raw data. Because if someone's learning how to use a package and they would have the raw data, and then all of a sudden you say, okay, we'll just upload this already-processed data object that's already in the R format you need — that's not helpful for the user. If that's part of your package, it belongs there, not somewhere else. I think in general we recommend going from as raw as possible, unless there are efficiency benefits to doing otherwise.

Thank you. For you, Lori: are these slides going to be available somewhere, and can we link them from the website? Yes, I believe all of the conference material will eventually be up on the website somewhere; if not, I can always send you a copy, but almost all the conference material gets posted somewhere. Also, to the people in person: if you say your name before you ask a question, the online people can participate better.

This is from Marvin Vansler: for annotation data, should variants be submitted as separate packages — for example, an annotation pipeline that generates different outputs depending on GENCODE versions? My gut says yes. Again, I'm not the expert in the field, but I would say getting a feel for what people use more often would probably be the more appropriate way to go, and again, asking on the mailing list what people think should be included — asking the people who would be using it which is more useful — would probably be my best answer.

[Comment:] It's not really a question, more of a comment in response to Leo's question, based on my experience using the hubs. I developed curatedTCGAData, and my advice would be to use something like a data frame and then construct your class based on those basic data types, because that allows new versions of the classes to be constructed on the fly, rather than having them break if there is a major change to those classes.

Thank you so much for joining me. Grab me for a cup of coffee, Slack me — I'm always around. Thank you.