Hi, I'm Toph Allen. I'm an epidemiologist and director of data science on EcoHealth Alliance's technology and data team, and I'm excited to be here talking to you all. I'm going to try to go pretty quickly, because I ran over time in my rehearsals.

EcoHealth Alliance is a New York-based nonprofit, and we do research at the intersection of public health and conservation, mostly studying emerging infectious diseases. We're a fairly big multi-disciplinary organization, with field ecologists, public health specialists, a modeling and analytics team, and the software team that I co-steer. Our team works largely on machine learning applications for biosurveillance, but we're interested in data integration too. So I'm going to touch on data sharing and data integration and what that is, talk a bit about tabular metadata and ontologies, and then talk about what the proposal we're submitting is about.

Data sharing is going way up, and this is a good thing. I went and looked at a 2014 post on the rOpenSci blog showing the number of data packages on the Dryad repository, meaning datasets with associated papers, and they're going way up. That post was from 2014, so yesterday I went back and looked, and there are a lot more now. It's increasingly becoming the norm to share your scientific data publicly, driven by a bunch of different factors: good infrastructure built up around hosting data in repositories, the growth of data sharing mandates from publishers and funders, and so on.

In a scientific context, open data is a really good thing. A lot of its value derives from the ability to join multiple datasets together, meaning you can run comparisons, meta-analyses, or studies with greater power, because you can combine datasets or add other variables. On a more philosophical level, it would be nice if all this data were going into one big shared effort, but that's not really the case as it stands, because just because data is available doesn't mean it's integrable.

I saw a really good breakdown once of the barriers to data integration. You need to have your data in the same place, which is a problem that's largely solved now that data sharing is such a big thing. You need to have your information in the same structure: file format, whether it's aggregated to the same level, and so on. And you need to have the same thing described in the same language. That last one, semantic incompatibility, is a big sticking point. Semantic incompatibility basically means you have something, say this island here, and a ton of different ways you could refer to it: an official or a short name in the same language, names in different languages, one of many types of standard codes, proprietary codes, a set of coordinates, or even just a reference to a shapefile on a hard drive.

This leads to some anti-patterns in data analysis. Say you have a bunch of country-level indicators from different organizations that you need to analyze together. They might look similar, but you wind up with a fair amount of fairly messy, fairly ugly spaghetti code to get them to work together, and then you wind up writing functions to score how good a match different things are. And then when you have another project with, say, indicators, you tell your research assistants, "No, don't go download these, just use the script I wrote to merge them before, because it works."
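To make that concrete, here's a minimal sketch in R of the kind of one-off harmonization script I mean. The tables, column names, and the lookup vector are all invented for illustration; the point is that the knowledge about which names mean the same thing lives in somebody's script rather than travelling with either dataset.

```r
# A caricature of the usual ad-hoc fix: a hand-maintained name mapping plus a join.
# All names and values here are made up for illustration.
library(dplyr)

gdp <- tibble(
  country = c("Viet Nam", "Korea, Rep.", "Russian Federation"),
  gdp     = c(1, 2, 3)   # toy values
)
cases <- tibble(
  country = c("Vietnam", "South Korea", "Russia"),
  cases   = c(10, 20, 30)  # toy values
)

# The "spaghetti" part: a lookup that only covers the mismatches noticed so far.
name_fixes <- c(
  "Viet Nam"           = "Vietnam",
  "Korea, Rep."        = "South Korea",
  "Russian Federation" = "Russia"
)

gdp %>%
  mutate(country = unname(coalesce(name_fixes[country], country))) %>%
  left_join(cases, by = "country")
```

It works, until the next dataset spells things a third way.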
So, thinking about the barriers to data integration and their solutions: the availability problem is solved by data-sharing agreements and ways to move files around, and the structure problem is largely solved by the growth of software ecosystems and practices for tidy data, which make that much easier to reconcile. The way you get around semantic incompatibility is to annotate your tabular data with links to ontologies, because then you can integrate your datasets through the ontologies.

Just so we're on the same page, when I say ontology I'm referring to something like a structured dictionary that documents meaning in a certain domain. Say I'm an ornithologist collecting sightings of swans in a CSV file, a tidy table of sightings by day, and I want to integrate a dataset that another researcher created. As you can see, in these two datasets the swan species are referred to differently, and you don't want to just rename them, because that's not going to generalize. But if you went and found an ontology of swans (you might want to call it a "swantology") and annotated your data, linking it to that ontology, then after you've done that (and that's the hard part), conceptually at least it would be relatively simple to get your data to work together.

To actually make this happen, a lot of the pieces you'd need are in place. There are standards published by the World Wide Web Consortium for representing tabular data and metadata; in fact, for this particular standard, CSV on the Web, one of the editors of the spec, Jeni Tennison, gave a presentation at last year's csv,conf. One of the ways it defines to encode your metadata is to have your CSV file and a sidecar metadata file alongside it, and this degrades nicely: if you don't have the software to read the metadata file, you still at least have a CSV. The metadata file is in JSON-LD, JSON for linked data. Where a plain JSON file would just have a bunch of data in it, JSON-LD gives you that data plus links to a schema that says what the bits of data mean and what they should look like.
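As a rough illustration of what that sidecar file might look like for the swan example, here's a minimal sketch written from R with jsonlite. The file name, column names, and the "swantology" URIs are all invented; a real annotation would point at published term IRIs.

```r
# A minimal, hypothetical CSV on the Web sidecar for swan_sightings.csv.
# The swantology IRIs are placeholders, not a real ontology.
library(jsonlite)

metadata <- list(
  "@context"  = "http://www.w3.org/ns/csvw",
  url         = "swan_sightings.csv",
  tableSchema = list(
    columns = list(
      list(name = "date",  datatype = "date"),
      list(name = "species",
           propertyUrl = "http://example.org/swantology#species",
           valueUrl    = "http://example.org/swantology#{species}"),
      list(name = "count", datatype = "integer")
    )
  )
)

# CSVW's convention is to name the sidecar after the CSV it describes.
write_json(metadata, "swan_sightings.csv-metadata.json",
           auto_unbox = TRUE, pretty = TRUE)
```

Software that doesn't understand the metadata keeps reading the CSV as before; software that does can resolve each species string to the same ontology term as the other researcher's file.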
JSON-LD and CSV on the Web are part of the semantic web family of standards. There are a bunch of these; RDF Schema is another, and the family also includes a way to publish ontologies, which are key: OWL, the Web Ontology Language. OWL is already used by groups like BioPortal, a project of the National Center for Biomedical Ontology at Stanford University. BioPortal is a centralized ontology repository, and it focuses on very broadly defined biomedical ontologies, so that includes cancer terminology but also things like place names. A sister project, the Center for Expanded Data Annotation and Retrieval (CEDAR), is working on tools including metadata recommenders, and they're mostly focused on laboratory data. Data repositories like DataONE are also part of the data sharing boom; they support a number of these ontology and linked-data standards, and they have some interfaces that can help create them. Both CEDAR and DataONE are researching ways to predict structured metadata from unstructured metadata: say you have a textual description of a file, or a paper accompanying a file, they're working on ways to predict what ontology terms you might need to describe different parts of the file, and they have some preliminary interfaces available on their websites and on GitHub.

But despite all this, despite the proliferation of standards and tools, there's something missing, because very few datasets have structured metadata. These are two papers I recommend if you're interested in this; they're both from 2011, and one is a survey of scientists while the other is an ethnographic study of a group of research sites implementing Ecological Metadata Language. I would sum up their takeaways by saying that many if not most scientists, at least as of six years ago, were unaware of metadata and metadata standards, of whether there are tools to apply them, and of what the potential benefits of metadata would be. Even for metadata-savvy scientists, annotation is time-consuming at best and impossible at worst, because there aren't many tools available to perform it. And even when metadata is properly applied, it doesn't provide any immediate payoff to that workflow, because again you're missing the tooling: its most likely uses are in future projects and by other people. I'll let you read the rest of this slide.

A compounding factor is that a lot of annotation, when it does happen, happens when you're getting ready to archive your data: at the end of a project, when you're approaching a repository and publishing a paper. That means scientists aren't likely to use metadata-driven tools during the course of their research. It also means that decisions about how you structure your data, and errors you make, accumulate over the course of a project, so when you actually go to apply metadata using the tools provided by repositories, it's harder, and it happens while you're scrambling to get things ready or shifting resources to your next project. At that point, even a mandate isn't going to help.

So that's where we are right now, and this proposal, which we're in the process of submitting to the Sloan Foundation, is about a set of tools we want to develop to bring annotation forward in a project's lifespan. We figure that with the right set of tools you can reduce the effort and increase the incentive for scientists to annotate their data, and at the same time reduce workload and deliver a bunch of the benefits that open data promises. There are two major parts to our plan. The first is that we want to take our own shot at creating a metadata recommendation service, using machine learning and heuristic approaches that we've used in other projects.
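To give a flavor of what I mean by "heuristic approaches" (this is just a toy sketch, not the design in the proposal), you can get a surprisingly useful baseline by matching the strings in a column against the labels of ontology terms. The terms table and the swantology IRIs below are invented.

```r
# Toy heuristic: for a value from a column, suggest the ontology term whose
# label is closest by edit distance. A stand-in for a real recommender, which
# would weigh many more signals.
library(dplyr)

terms <- tibble(
  label = c("Mute Swan", "Whooper Swan", "Tundra Swan"),
  iri   = paste0("http://example.org/swantology#", c("mute", "whooper", "tundra"))
)

suggest_term <- function(value, terms) {
  distances <- adist(tolower(value), tolower(terms$label))  # 1 x nrow(terms)
  terms %>%
    mutate(distance = as.integer(distances)) %>%
    arrange(distance) %>%
    slice(1)
}

suggest_term("mute swan", terms)
# Column-level suggestions could then be made by asking which ontology's terms
# cover most of a column's distinct values.
```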
We also want to create interfaces, graphical and command-line, that make it easier to apply metadata early in a dataset's life cycle and easier to get some utility out of that metadata, so you get this virtuous cycle: bring annotation forward, lower the barrier, and increase the incentives.

With the metadata recommender service, what we want to do is predict the metadata for a CSV file in a two-step process: first predict the column-level ontology, and then predict the cell-level values. We're going to try to do this based only on the contents of the table itself, because we don't want to rely on a paper having been written; otherwise it's not going to be useful for a random CSV file sitting on someone's hard drive. Essentially we want something that can prioritize amongst different ontologies and at least produce recommendations, packaged with a usable API, so that it can be useful even if the rest of the project sputters. We want to train it on existing data, and there's a bunch out in the wild that might be useful for this. DataONE is already doing research into this sort of thing and already working on a recommender, and someone from DataONE (I forget who) pointed us to a repo, linked here, with about a thousand datasets, I think, that were manually annotated specifically for training these sorts of algorithms.

A lot of these approaches are getting pretty poor predictive power right now, and there's no reason to think we'll have a silver bullet or do a better job than they are, which is why I think the interface part is really key. Good interfaces can make a poorly predictive algorithm useful if they streamline the process of triaging recommendations, and if you hook it up right you can set up, again, a virtuous cycle, where you use less-good algorithms to create training data that makes better algorithms. Our tech team's main software project has done this, in a different field. We're working on a piece of software called EIDR-Connect, the Emerging Infectious Disease Repository, which does knowledge-base population backed by machine learning and natural language processing. It takes text, identifies case and death counts, and uses those, plus identified place names and dates, to populate a schema for a spatiotemporal disease event. Since we don't have the training data to make the algorithm spot-on 100% of the time, we worked really hard on an interface where a user can triage the suggested annotations, and then we save that data in a way that can be used by classifiers for the next round of updates. I think the interface is also important for giving the user, especially a non-technical user, a really clear mental model of what's going on. And here's that virtuous cycle I was talking about.

The graphical interface and the R package are also a good place for us to surface some of the useful functionality that properly annotated metadata would provide. Writing out the sidecar metadata files is a great thing, because other tools that read and write data in these standards can then make use of them. I imagine you'd also want to be able to write out a canonicalized version of a CSV file, which would make it much easier to submit your data to a repository, so there's not that mad dash at the end of a project. And we could provide verbs that operate on tables based on the semantic annotations in the metadata: something like a left join based on the ontological value of your country column, as opposed to its string value, as in the sketch below.
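Here's a rough sketch of what such a verb could look like, assuming each table comes with an annotation mapping its country strings to term IRIs. The semantic_left_join() name, the annotation vectors, and the example.org IRIs are all hypothetical; the real package would read these mappings from the sidecar metadata.

```r
# Hypothetical "semantic" join: match rows on the ontology IRIs their
# annotations point to, rather than on the raw strings.
library(dplyr)

indicators <- tibble(country = c("Viet Nam", "Korea, Rep."), gdp = c(1, 2))
outbreaks  <- tibble(country = c("Vietnam", "South Korea"), cases = c(10, 20))

# In practice these mappings would come from each file's sidecar metadata.
terms_x <- c("Viet Nam"    = "http://example.org/place#VNM",
             "Korea, Rep." = "http://example.org/place#KOR")
terms_y <- c("Vietnam"     = "http://example.org/place#VNM",
             "South Korea" = "http://example.org/place#KOR")

semantic_left_join <- function(x, y, by, x_terms, y_terms) {
  x %>%
    mutate(.iri = unname(x_terms[.data[[by]]])) %>%
    left_join(mutate(y, .iri = unname(y_terms[.data[[by]]])), by = ".iri") %>%
    select(-.iri)
}

semantic_left_join(indicators, outbreaks, by = "country",
                   x_terms = terms_x, y_terms = terms_y)
```

The join succeeds even though neither table spells the country names the same way, because the match happens on the shared IRIs.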
Then we have our continuous integration setup, which Noam has been a big proponent of; I think this is the part he's most excited about. The continuous integration idea is essentially that you point the service at a certain folder, maybe on GitHub or Dropbox, and when the web service sees a change in a data file there, it builds a report. Maybe it emails you, or it could do a number of other things: it could automatically write back metadata files with recommendations above a certain confidence threshold, alert a user, or give you nice little badges to put on a GitHub repository.

Essentially, where I think we want to see metadata go in the next five years is to take it from a mad-dash, mandated task at the end of a project to part of the procrastination and yak-shaving setup phase at the beginning of a project.

To caveat this: it's just a proposal, we haven't done it yet, but I think this is where the field is going, there are definitely other people working on similar projects, and we're super interested in collaborating on this sort of thing. Another caveat: in this talk and in our proposal we're intentionally punting on some of the details. I'm being a bit glib, and we're deliberately defining things as out of scope, because this gets really technical really quickly and it's a very layered problem, so we're trying to narrow our scope and tackle a little bit at a time.

I want to thank Noam, who has been a great help in crystallizing a lot of my thoughts around this; it's great to have a multi-disciplinary place to work. The Sloan Foundation hasn't funded us, but they fund a bunch of great things, and it's nice to have an organization to pitch this sort of thing to. And it's really lovely to have a conference like this, and an audience who's willing to come listen to me nerd out about stuff like this. So thanks, everyone.