Cool. Thank you, Joe. So today I'm going to talk to you about frictionless data for wheat. This is a project that we worked on at the Earlham Institute; it's myself and another developer called Xingdong Bian, and we work for Rob Davey. I'll talk you through it now.

So to begin with, we are funded as part of an infrastructure programme, Designing Future Wheat (DFW), that covers eight different research institutes and universities and over 25 groups of scientists. The idea is basically to try and increase the yield of wheat: by 2050, we need to produce 60% more wheat to meet global demand, so any research we can do to make the underlying data easier for people to get at helps.

So what do we do under DFW? Well, we produce lots of scientific data, and it comes in various different types: we have field trial experiments, we have datasets, and we have genomic sequences, amongst many other types of data. On top of that, we have different groups of users: breeders with particular needs, academics, data scientists, and people in industry, all requiring different things from the data. So the challenge we have is: how can we make this data accessible and usable for everyone?

Our solution to this is a software platform called the Grassroots Infrastructure. This is a suite of tools that wraps up industry-standard software along with our own custom open source tools. It works on the principle of communication between different systems all using a JSON API, so it's programming-language and platform agnostic. The idea is that Grassroots instances can be federated with each other, so data and services can be shared. And we also try to make sure everything's done in a FAIR way.

If you don't know about FAIR, I'll just quickly run through the four principles of FAIR data. The first one is findability: can you find what you're looking for in a system? That means making sure the data is well described and searchable by both humans and machines. It needs to be accessible: once people can find the data, can they access it? Do they need authentication, or is it openly available, and so forth? Then we have interoperability: the idea that even though you've got open data, you need to make sure it can be integrated into other systems, whether that's different analysis workflows or so forth, so it needs to be well described. And in terms of reusability, we want to make sure that all the data is well described, marked up, and with detailed provenance.

So now that I've given you a quick overview of FAIR data, I'll go on to the Grassroots Infrastructure itself. We have a set of common core libraries which take care of things like how we describe our parameters and our web service APIs, how you call different systems, reading and writing files, and so forth. That's attached to an Apache web server via our own Apache module that we wrote, which connects all of the plumbing together. The bits that actually do the scientific analysis are what we call services. These are either existing programs that we've taken and wrapped in our own API, or bespoke tools that we write ourselves. As long as a service conforms to our API, which is a well-defined set of standards and a JSON schema, it can run inside the system. So we have various different types of tools.
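To give a flavour of the conformance idea, here's a minimal sketch in Python using the jsonschema package. The schema and field names below are purely hypothetical illustrations, not the actual Grassroots schema; it just shows how a service description can be checked against a well-defined JSON schema before being accepted.

```python
import jsonschema  # pip install jsonschema

# Hypothetical schema for a service description -- NOT the real
# Grassroots schema, just an illustration of the conformance idea.
SERVICE_SCHEMA = {
    "type": "object",
    "required": ["name", "description", "parameters"],
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "parameters": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "type"],
                "properties": {
                    "name": {"type": "string"},
                    "type": {"enum": ["string", "number", "boolean"]},
                },
            },
        },
    },
}

service = {
    "name": "blast_search",
    "description": "Find regions of similarity between sequences",
    "parameters": [{"name": "query_sequence", "type": "string"}],
}

# Raises jsonschema.ValidationError if the description doesn't conform.
jsonschema.validate(instance=service, schema=SERVICE_SCHEMA)
print("service description conforms to the schema")
```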
So we've taken existing tools such as BLAST, which is a really common one in bioinformatics, where you find areas of similarity between different biological sequences, as well as our own custom services like field trials, field pathogenomics, searching, and so on. Let me talk about these in a bit more detail.

Typically, when you do a web service interaction, say you want to do a search across two different systems for a particular sequence, you'd normally have two servers: one here at the Earlham Institute and one at the University of Bristol. You log into one system, do your search, then you go across to the other one and perform the same search to get your results. The problem there is that you have to access each one individually. And then once you've done that, you have to collate the results together, and if you're anything like me, you're prone to human error and not cutting and pasting things together properly. You might mistakenly run the services with different parameters, so you might not be able to compare the results correctly. And it's time consuming.

So one of the things we tried to solve was the idea of federating services. If two different places are running a Grassroots instance, they can be hooked up together and all the services shared. The idea being that it doesn't matter which one you log into, you see the same list of services and functionality. On top of that, you can also amalgamate the services themselves, so a service can actually see the databases on one system and on the other, but to a user they appear like they're in one place. From the user's point of view, they just log into a system, see a list of databases to search against, run their analysis, and get their results back. Under the hood, obviously, there's communication between the two different systems, and all the results get collated together, as the little sketch below illustrates.

So that's a short introduction to Grassroots, and I'm now going to talk about a couple of systems that we've built. The first: as part of the DFW programme, we create lots and lots of data, and all of it needs to be made available. So we have a data portal for this, and everything on that system is based upon the Toronto agreement, which is an excellent agreement that allows pre-publication data sharing. The idea is that if you do some scientific analysis, you can make your data available for everyone else and they can access it straight away, but they can't publish on it until you have. So you still get first rights, if you like, but it makes sure that everyone else can run their analyses and pipelines with it much earlier than they normally would.

In terms of where we store this, we use a system called iRODS, which, if you haven't come across it, is open source data management software used by a variety of research and commercial organisations worldwide, and it has a number of cool features. One is that you can virtualise the data: from the user's point of view, it doesn't matter where the data is physically stored, so you can shuffle it around on your servers and, from their point of view, it doesn't move. It also has data discovery: as well as the standard searching on file names, you can attach metadata to the files.
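To make the federation idea concrete, here's a minimal sketch, assuming entirely hypothetical server URLs and a hypothetical /search endpoint; the real Grassroots JSON API will differ. The point is just that one client-side loop can query every federated instance with identical parameters and collate the results, removing the error-prone copy-and-paste step.

```python
import requests

# Hypothetical federated Grassroots instances -- placeholder URLs.
SERVERS = [
    "https://grassroots.example-earlham.org/api",
    "https://grassroots.example-bristol.org/api",
]

def federated_search(query_sequence):
    """Run the same search, with the same parameters, on every server."""
    payload = {"service": "blast_search", "query": query_sequence}
    results = []
    for server in SERVERS:
        response = requests.post(f"{server}/search", json=payload, timeout=30)
        response.raise_for_status()
        # Collate the hits from each instance into one list.
        results.extend(response.json().get("hits", []))
    return results

hits = federated_search("ATGGCGTACGT")
print(f"{len(hits)} hits collated from {len(SERVERS)} servers")
```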
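And here's the metadata side: a small sketch using the python-irodsclient package (one of several iRODS clients; the portal itself talks to iRODS differently). Connection details, paths, and key names are placeholders. It shows the feature I just described: attaching attribute-value pairs to a data object so the metadata travels with the file.

```python
from irods.session import iRODSSession  # pip install python-irodsclient

# Placeholder connection details for a hypothetical iRODS zone.
with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret",
                  zone="exampleZone") as session:
    # Fetch a results file stored in iRODS.
    obj = session.data_objects.get("/exampleZone/home/alice/trial_results.csv")

    # Attach the experiment's parameters as metadata; these AVUs stay
    # with the object even if it is moved around within the zone.
    obj.metadata.add("nitrogen_treatment", "200", "kg/ha")
    obj.metadata.add("sowing_date", "2019-10-14")
```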
So for instance, if you run an experiment and you want to store the parameters you ran it with, you can attach those as metadata to the results file, and if the results file gets moved around, the metadata stays with it. And in the same way that Grassroots instances can be federated, so too can iRODS instances, which is where we got the idea from.

So what we did is we took an existing Apache module and forked it to make this data available on the web. The idea is that we can make the data appear in a user-friendly environment but add lots of extra features: we expose the metadata so it can be displayed and edited, and we give the whole system a full RESTful web service API so people can program against it if they want to.

In terms of the metadata we store, we use MIAPPE, the Minimum Information About a Plant Phenotyping Experiment, which specifies a list of metadata attributes to store and keep, so we stick with that. On top of that, we have some extra fields: things like the title of a project, the list of authors, a description, and the licence details. The actual data itself gets indexed with our own search engine, which is based upon Lucene. Again, everything we do is open source; it's all available.

Now, last year we were lucky enough to receive a Tool Fund grant from the Open Knowledge Foundation, with the goal of exposing these datasets as frictionless data packages, and, as you can see in the little red box, each one gets a datapackage.json file. So what is frictionless data? I'm sure you've heard lots about it over the last day and a half, but just in case you haven't: it's a simple container format used to describe and package any type of data, and it has many advantages. It's simple to use, it can handle anything, it's easily extensible, you can package your metadata in there, and it's human-editable and machine-usable. I like the fact that it reuses existing standards, and because it's JSON, it's language-, technology- and infrastructure-agnostic. It's great; love it.

So we've put all of our data into that. Each of the projects in the data portal now has a frictionless data package, automatically generated, so it's not like the user has to go and create it, and it contains all of the files that are actually sitting on the website, along with things like the licence, the name, the description, and so forth. We've done this by adding some extra functionality to the Apache module I mentioned that connects to iRODS, and it has a number of features. The first is that, as well as dynamically creating these files, if a particularly large one will take a long time to generate, it can be cached and written back to iRODS, so next time the file can just be served, which is really good. And because it's in Apache, you can use all the standard Apache configuration directives, so you can say that, at a particular level in your file directory hierarchy, these things can have frictionless data packages generated.
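As an illustration of what one of those automatically generated datapackage.json files can contain, here's a minimal sketch; the field values are invented and the real portal may emit more, or differently named, properties, but the shape follows the standard Data Package descriptor: a name, a title, licence details, and one resource entry per file.

```python
import json

# Illustrative descriptor only -- values are invented, but the
# structure follows the Frictionless Data Package specification.
data_package = {
    "name": "dfw-example-field-trial-2019",
    "title": "Example DFW field trial, 2019 season",
    "description": "Phenotype measurements from a hypothetical wheat trial.",
    "licenses": [{"name": "toronto-prepublication",
                  "title": "Toronto-style pre-publication data sharing"}],
    "resources": [
        {
            "name": "plot-measurements",
            "path": "plot_measurements.csv",
            "format": "csv",
            "profile": "tabular-data-resource",
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(data_package, f, indent=2)
```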
Obviously, we can't control what metadata keys people use to describe their data within iRODS, so all of this is completely configurable. In your configuration you can say: use these iRODS keys for these frictionless data metadata values. And iRODS metadata values can be combined together, so you can take the results from multiple different metadata values, concatenate them, and that becomes the frictionless data value that you see. As well as basic data resources, we also support tabular data resources, which are typically things like spreadsheets or CSV files, and in much the same way, all of the column headings and variable types are, again, completely configurable and can be pulled from your iRODS system. We've tried to get away with as little hard coding as possible, so that anyone can configure it for their own systems.

The second thing I'm going to talk about is field trials. Field trials are experiments where scientists plant lots of different crops within a field, apply different treatments to them, which could be fertiliser and so forth, and then measure particular traits: it could be how quickly the crop grows, or how high it grows, and so on. We worked with the data producers within DFW and came up with a standardised template for submitting the genotype, which is the genetic material of the crop, i.e. what type of seed it was, and the phenotypes you want to store, which are typically the characteristics you want to measure, like plant height, as I mentioned. We're trying to facilitate all of this with the FAIR principles in mind for everything we produce.

In terms of findability, we have a Google Maps-type approach: you can go in on the map and search around to find any trials in a particular part of the world, or alternatively you can use a normal text-based search web page, but eventually you come down to a particular set of studies. Everything we do is openly accessible, and the studies all have unique identifiers that are fixed URLs, so they're accessible, and we will keep those addresses so people can reference them and know they still exist. In terms of interoperability, in the Grassroots API we have also adopted something called BrAPI, the Breeding API, which is a community-driven standard for a web service API enabling interoperability between different plant breeding databases. So that's another API that we use.

In terms of what we actually store in the system, we have GPS data for where the plots are within a particular field. As well as storing lots of different metadata, such as a description of a particular study, what it's trying to measure, any design notes and so on, we also have this at the individual plot level, and we've added automatic map updating, so someone can take their phone or tablet, walk through the field, and know which plot they're on; they can then tap on it and see the details of that plot. We're trying to make it as useful as possible. So what do we store for the plots? We have lots of generic data, such as the plot's dimensions, its length and its width, when the crop was sown, when it was harvested, and any comments from the person working in the field. You can also attach images, which can be taken with your camera, or we can have drone-based images, along with what crops have been sown there and the accession of the seed, so you can actually go and order the seed for the particular variety sown in that particular plot.
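Here's a sketch of what a plot-level tabular resource schema along these lines might look like, using Frictionless Table Schema conventions. The field names are my own hypothetical choices, not the project's actual standardised template; the generic plot fields come first, and study-specific phenotype columns would be appended after them.

```python
# Hypothetical Table Schema for plot-level data -- field names are
# illustrative, not the actual DFW template.
plot_schema = {
    "fields": [
        {"name": "plot_id", "type": "string"},
        {"name": "length_m", "type": "number"},
        {"name": "width_m", "type": "number"},
        {"name": "sowing_date", "type": "date"},
        {"name": "harvest_date", "type": "date"},
        {"name": "accession", "type": "string"},
        {"name": "comments", "type": "string"},
        # Example study-specific phenotype column; its trait, method,
        # and unit would be defined against the Crop Ontology.
        {"name": "plant_height_cm", "type": "number"},
    ],
    "primaryKey": "plot_id",
}
```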
And then, obviously, what people are interested in is measuring things. The phenotypes, or measurements, that people care about are typically comprised of a core triple of values. The first is a trait, as in what to measure: this could be the plant height, or how long the crop takes to grow. The method, how it is measured, is the next part. And finally the unit: if you're measuring height, it could be in centimetres or metres or so forth, along with the value itself and potentially a date. For all of this, and all the data we store, we try to make sure we use well-defined terms, so everything comes from the Crop Ontology; everything is marked up and has proper provenance from there.

So as I mentioned, we've got a couple of web service APIs, Grassroots and the BrAPI wrapping. But another thing requested by the people we work with, who are wheat experts, way better in that field than me, but not necessarily comfortable scripting or programming, is that they just want to get at the basic data. The ideal solution for this, and I'm not being paid to plug it, is frictionless data. So what we've done is come up with a load of schemas for the different classes of data we store, whether it be the studies, the data, the plots, and so forth, and we package those all into a frictionless data package. And we're not the only people on DFW working on frictionless data; we have collaborators such as Richard Ostler at Rothamsted Research, who is working in a similar field and doing Data Carpentry lessons on frictionless data.

So what do we store? Within the frictionless data packages, and again these are all generated, the user doesn't have to do anything, we have the various different resources: one for the programme, then the field trial within it, then the study inside that, and then the plot data. All of the fields we have in there are standardised. At a study level, as I mentioned, there are standard attributes such as length and width and position and so on, while specific to each individual study you might have different treatments or be measuring different phenotypes. All of that, again, is calculated by the software, and it automatically creates a spreadsheet including these values. It might not be easy to see on the screen, but there's an example of one of the sets of tabular data we've got, and it shows the particular phenotypes within that study. [Joe: Simon, just to say you've got about four minutes left.] That's awesome, thank you, Joe.

So as well as exposing this as frictionless data packages, we thought: can we make this even easier for users? I know frictionless data has great APIs and great tools in Python and so on, but could we make it even simpler to unpack the data? So another thing we've been working on is a tool to extract resources from within a frictionless data package. It's a standalone tool that you can download; again, it's open source, with a link at the bottom. The way it works is that it reads the contents of a frictionless data package, and for each data resource within it, it looks at the schema the resource uses via the profile attribute, goes over the internet and downloads it, and then writes out all of the data resource's fields using that schema. Currently we support data resources being written out to either Markdown or HTML, but we're looking to add more types, and any tabular data resources can be extracted as CSV.
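As a rough idea of how such an extractor works, here's a minimal re-implementation sketch in Python, not the actual tool. It assumes each resource's profile attribute is a URL pointing to a JSON schema, and it writes one Markdown file per resource listing the fields that the schema describes.

```python
import json
import pathlib
import urllib.request

def extract_package(package_path, out_dir):
    """Sketch of the extraction idea: for each resource, fetch the JSON
    schema named by its 'profile' attribute and write the resource's
    fields out as Markdown (one file per resource)."""
    package = json.loads(pathlib.Path(package_path).read_text())
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)

    for resource in package.get("resources", []):
        profile = resource.get("profile", "")
        schema_props = {}
        if profile.startswith("http"):
            # Download the profile's JSON schema from the internet.
            with urllib.request.urlopen(profile) as response:
                schema_props = json.load(response).get("properties", {})

        # Write out each field of the resource that the schema describes.
        name = resource.get("name", "resource")
        lines = [f"# {name}", ""]
        for key in schema_props:
            if key in resource:
                lines.append(f"- **{key}**: {resource[key]}")
        (out / f"{name}.md").write_text("\n".join(lines))

extract_package("datapackage.json", "extracted")
```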
This tool is cross-platform, as we plan to make it available for as many different platforms as possible, and there's absolutely nothing wheat-specific in it, so it will work with any frictionless data package you like. The URL is there for the pre-release version; hopefully we'll have proper releases soon.

In terms of further work, we want to write frictionless data in as many different places as possible within our systems, add machine learning to try and detect scientific values from media such as photos, and we're looking to standardise frictionless data schemas between different types of field trial experiments. We typically work with trials that run over a single year, but in some institutions they might run over 10 to 15 years, and so forth.

So finally, I've got lots of acknowledgements for the people we work with: various people across DFW who have helped us enormously, at the Earlham Institute, where I work, at the John Innes Centre, and at the University of Bristol. And finally, a big thank you to the csv,conf organisers for inviting us. That's it; thank you very much.

And thank you from us for presenting to us. We do have a question, I'm just going to see. We're going to ask you for a very short answer to this one, and maybe you can answer it in a bit more detail later, because we've got the keynote starting in two minutes. Can you tell us a little bit about your users? It's a question that I had as well, actually.

Yes. So, from the institutes themselves, we have lots of research scientists that are trying to get at this data and work out, say, which lines have particular phenotypic values. We also have breeders that are trying to work out whether there are particular seed lines they can cross, as in take two parents and come up with a variety that's particularly hardy against, say, a particular disease, or has a particular characteristic. You also have people working in industry doing similar things, and we have academics. So we have a variety of different users, and that's part of the challenge: you can't have a one-size-fits-all, you've got to make it so that everyone can get what they need out of it.