My name is Eric Busboom; I'm the founder and director of the San Diego Regional Data Library, and I also run a company called Civic Knowledge. They both do basically the same thing, depending on whether it's a volunteer project or a paid project, which is help journalists, governments, and nonprofits get answers to data-driven questions. Obviously, metadata is an important component of data, so thank you for coming to hear about one of our plans for making metadata a lot easier to create. We're trying to solve the question raised earlier about what's in it for the person who creates metadata. We may not have an answer to that, but we can hopefully make it less expensive and onerous. This project started when we were working on a grant-funded application called Health Reporter, very similar to Census Reporter, if you're familiar with that tool, except this one is designed to help journalists find patterns in data using pre-attentive visualization techniques. It worked great. Everything about it was fabulous, except it didn't have much data in the repository, because wrangling data and loading it into a repository like this is really expensive. We talked to the Census Reporter folks, and they gave us numbers that involved many hundreds of thousands of dollars, I think $600,000 for their project. We said, no, we're not getting funded for that. So how do we make it easier to import data? One of the answers is to get better metadata from the data creators so we can import their data programmatically. That was the idea. We found some really wonderful ways to do this. If you were at Adrienne's talk, you saw a little bit about data packages, and the data packages from the Open Knowledge Foundation are just about the perfect solution. Tabular data packages are simple and comprehensive, and the set of terms they include is evidence that the people behind them really understand how to work with public data.
One problem, though: I have a bunch of epidemiologists who are creating data, and they do not speak this. I love JSON, you love JSON; most people think JSON is great because we used to work with XML. But epidemiologists are like, I don't know what that is. So we needed to figure out how to make this something that normal people could write. Part of the reason is that metadata is really important for all sorts of stuff. Here's a good example from data.gov, and this has been true for years, by the way. If you do a search for diabetes in California, you would expect to get datasets about diabetes in California, probably something from the CDC. No, you actually get diabetes data from Allegheny County in Pittsburgh. There's a hit in there, all on the first page, from the State of Oklahoma, and something from the City of Chicago. That's not the worst of it. You also get a comic book from the Geological Survey about California geysers, and I have absolutely no idea what geysers have to do with diabetes. There's clearly a problem with the metadata, because you can't search for diabetes in California. So the solution we decided on was that, to work with epidemiologists, we had to use a tool that lets them create structured data, but it has to be something they've already got. We can't expect them to buy new stuff, because then you go through a government procurement process that takes forever. The tool they already have is Excel. If we can get structured metadata, all those complicated lists and dictionaries, into Excel, we can make the metadata easier to read and easier to write, and it can be included in the Excel spreadsheet. That's why it's called Metatab: the first tab in your Excel spreadsheet is the meta tab. And if we can do that, we can improve workflow, because there's one document that goes from reviewer to reviewer before it gets published on Socrata. You don't have to separate these things into two paths.
The folks at various California departments really like that. So this is what it looks like. This is in Google Sheets; you can see our Metatab add-on up there. It's pretty simple, and it's pretty close to what you would get if you were to write this all yourself. Let me get my stuff out here. If you were to give an epidemiologist the job, saying sit down, go into Excel, write your metadata, they would probably produce something relatively close to this. Like I said, it's good because it's easier to write, and it's much easier to read. You can give somebody metadata in this form and they can read it. They don't have to parse XML in their head. They don't have to convert it to HTML with a tool. They just read the document. You don't have to spend time with PDF files that have data dictionaries in them in a tabular format. Embedding the metadata in the spreadsheet makes the metadata and the data coincident. And because we're using terms that are already in all the other metadata schemas, or schemes rather (we're borrowing terms primarily from Dublin Core, and we've harmonized with tabular data packages), you can take a file written in Metatab and generate your tabular data package from it. So now you can use all the Open Knowledge tools. We're looking at using this for low-cost federation. Federation is not quite the right word, but I think you know what I mean: we want to get a bunch of data, bring it together, and put it in a single database. We want to be able to improve open-data workflow, particularly in the health departments and county departments that publish this data. And one of the features I really like for our own work is being able to publish to multiple places. Once I've created the metadata in Metatab, I can use our tooling to send it to data.world, to CKAN, and to wherever else I want, but I manage it locally on my own drive and in my own GitHub account. So here's what it looks like.
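As a rough, hand-written sketch, the first few rows of a Metatab tab can look something like this, one term per row with its value in the next cell, shown here as CSV. The term names are the ones used in this talk; the values, and the `metatab-latest` declare target, are illustrative placeholders rather than output from the real tool.

```
Declare,metatab-latest
Title,California Accountability Model
Title.Language,en
Description,Accountability measures for California health programs
```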
It's a little slow here. It looks pretty much like you would expect: it's got terms and values. Now, if you just do this, you get a property list, which doesn't help you much in managing the full complexity of JSON. We have to add a few things, but we can't get too far away from it looking like a table. The first thing you'll notice is this Declare term. I've got a pointer here. Declare is basically your include: it tells you a little bit about the profile of the data you're using and defines the terms. It points to another Metatab document, but we'll just ignore that for now. Let's go to the next part. So we've got a column of terms and a column of values. Like I said, this alone doesn't give you all the complexity of JSON, but these terms have an implicit structure to them. If you don't put a parent in front of a term, the assumption is that the parent of the term is root. And if you do put one there, you can describe a different parent. That allows us to generate lists and dictionaries. I've added a Title term here; the parser sees it as root.title, so I've created a title for my document with the value California Accountability Model. When it sees Title.Language, it makes that a child of the most recently created Title, so now we've got a child of root's title that's a language. If I put two Title.Language terms in there, the JSON emitter will emit a list. If I put in another child, say Title.Note or whatever, it'll turn the title into a dictionary. And depending on how you structure the declared document, you can force a term to always be a list or a dictionary; sometimes it's always a scalar, sometimes it's always a dict. But you don't need to worry about that too much. If you just emit this, you will get a data structure. Now, the one thing about this is that we've only used the first two columns.
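That parent/child rule can be sketched in a few lines of Python. This is a toy illustration, not the real metatab parser: a bare term attaches to the root, a dotted term like Title.Language attaches to the most recently created instance of its parent term, and a repeated child gets promoted to a list, as the JSON emitter does.

```python
# Toy illustration (not the real metatab parser) of the parent/child rule:
# a bare term attaches to the root, a dotted term attaches to the most
# recently created instance of its parent, and a repeated child becomes a list.

def parse_rows(rows):
    """Turn (term, value) rows into a nested dict, Metatab-style."""
    root = {}
    last = {}  # most recently created node for each term name
    for term, value in rows:
        term = term.lower()
        if "." in term:
            parent, child = term.split(".", 1)
            target = last.get(parent, root)  # attach to latest parent seen
        else:
            child, target = term, root       # no parent given: attach to root
        entry = {"@value": value}
        if child in target:                  # repeated child: promote to list
            existing = target[child]
            if isinstance(existing, list):
                existing.append(entry)
            else:
                target[child] = [existing, entry]
        else:
            target[child] = entry
        last[child] = entry
    return root
```

So a `Title` row followed by a `Title.Language` row yields a title node with a nested language child, and a second `Title.Language` row turns that child into a list.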
There are all these other columns with nothing in them. Those should be useful, right? We should be able to put stuff in there. And we can, by specifying a property child in the third column. The first thing we do is add a section. There's always an implicit one called Root; in this case we'll make it explicit. The Section term declares a section, and this one is called the Root section. It's actually different from the root parent, a little bit of a bug there, but this is the Root section. Everything from the C column on is a list of argument names that tell the parser what the subsequent terms will produce if you have a value in that column. So in this case, we create a root.title term, give it a value, and then create a title.language property with a value of en. When we do that, let's see, sometimes this is not working. Here's what this turns into. This is a small example: we've got a section for resources with a data file, and we've got a section for contacts. There we go, right there, and that's the JSON it turns into. Here's a full example. This is actually a working example for a data project we're running right now. We've got a set of terms here that are the basic stuff you'd get out of Dublin Core. We've got a set of resources; these are modeled on the resources in Open Knowledge's tabular data package. Each one's got a URL, but also a name, a description, and whatever else you want to put on there. Those are property children. Then contacts, and over in the schema section, we have both of those ways of working. Here's a table, the children of the table are columns, and the columns describe a data type and descriptions. This is what I mean by making it readable. We can take what would normally be a complicated XML, JSON, or RDF file, and you can hand somebody at least the schema section. It's nice if you put the colors in there.
And they can read that without having to know anything. They can open it up, and it's in a human-readable form. It's nice to convert it to other formats like HTML, but even if you're not doing that, if you just hand them the file, they know what to do with it, more or less. Currently, you use the system by installing a Python module. It's called metatab, and it installs a metapack program. You can also build packages in Google Sheets. That's a little crufty right now, but it's going to get some love soon. And we're working on finding some funding to build an Excel add-on that lets you build packages in Excel. Oops, let's play this again. Once you've built the Metatab file, that's the CSV file with the metadata in it, you can tell metapack to build packages. You can build a zip file that downloads all of the URLs you've referenced in the metadata and sticks the data in the zip file. You can build an Excel file that has all the data embedded as tabs. There are a couple of other forms. You can put packages on the web and use them directly from the web. If you upload something to data.world, you can download it directly. But you can also use the Python module to get access to the package and then get data frames: pandas data frames, geopandas data frames, and a couple of other forms, because the data package understands the schemas. I mentioned all this. Here's a picture of data.world. One of the important things is that when you build a zip or filesystem package, it automatically creates a datapackage.json file, so it also works as a tabular data package. You can use Open Knowledge's tabular data package tooling on that zip file. Here's an example of what this looks like when you're actually running it. This is not working out too well. metapack -c creates a new file.
You open it up in Excel, or OpenOffice, or however you open it. Then you can go in and add your title, your description, whatever else you want. metapack -a will load in a URL. If it's a URL to a CSV file, it'll stick it in as a resource. If it's a URL to a zip file, it'll open the zip file, look for CSV and Excel files, and stick all of those in. If it's a URL to a web page, it'll scrape the web page looking for stuff. And if it's a URL to a web page with zip files that have Excel files in them, you'll get hundreds of URLs, because it'll go all the way down to the level of the individual tabs and create URLs that extract the data. From there, you can generate schemas with -s. It'll try to intuit the data types and the column names and values, and it'll produce your data dictionary for you. You can upload all the packages to S3 with Metasync; then Metakan sticks them onto CKAN, and Metaworld sticks them onto data.world. With this, in about 15 minutes you've created multiple data packages on multiple repositories, with complete schemas, and you can use those in a variety of tooling. There are a couple of neat features in the system. One is that you can use a program, or a Jupyter notebook, as a source of data. Geocoding is a bit of a trick sometimes, so we write geocoder programs and reference them from Metatab. When I build a package, it does all the geocoding, sticks the resulting row data into a CSV file, and I upload that for my analysts. You can also define column transforms, which are bits of Python code attached to an individual column. They allow you to parse the column and do work as the file is being built. For example, if you have a geo ID from the census and you know you're going to be doing aggregations on county, you can tell the Metatab file to parse the geo ID and spit out a column for the county or the state.
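As a concrete illustration of that last example: a census tract GEOID packs the state, county, and tract FIPS codes into one 11-digit string, so the transform only needs a small parsing function along these lines. The function name and return shape here are made up for illustration; only the GEOID layout (2-digit state, 3-digit county, 6-digit tract) is the standard Census convention.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID (2-digit state FIPS,
    3-digit county, 6-digit tract) into its components."""
    geoid = geoid.strip()
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"not an 11-digit tract geoid: {geoid!r}")
    return {
        "state": geoid[:2],    # e.g. "06" for California
        "county": geoid[:5],   # state + county, e.g. "06073" for San Diego
        "tract": geoid,        # the full tract id
    }
```

A column transform wired to the geoid column could call something like this and emit the county piece, so aggregations on county become a simple group-by downstream.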
This is what it looks like when you upload a package to a CKAN instance. This is one of our projects, a volunteer project we're running for a group that deals with residential elder care facilities, so this is actually for a real client. Once you upload, you get all the same documentation you saw earlier. You also get the Excel file, a zip file, and what's called a CSV package, which is the Metatab file that references URLs elsewhere. If you load the URL to that Metatab file into the Python module and print out the doc (this is in a Jupyter notebook), you get all the documentation, including the names of the resources. If you ask for one of those resources and print it out, you get the data dictionary, and if you ask for a data frame, you get a pandas data frame. Now, my analysts go through this process: they load multiple packages, do all their analysis work, and then we take the resulting pandas data frame, reference it from another Metatab package, build that, upload it to data.world, and give it to my client. That's our workflow. So at this point we'll either do a demo or just take questions. Do you have a preference? Anybody got any questions?
All right, we'll do a demo then. Let's see. OK, so metapack -c creates the Metatab file, and let's go figure out where I put it. Here it is, so if I reload this, this is the default template. It's got a UID and a name that we need to change, and these items here allow you to set the parts of a name; our names are structured. We'll call this one food data. It has an origin, which is where it came from. We'll just make it example.com, but usually we put the agency it came from. We've got a version, and let's say it's from California. When I save that, it'll update, and when we go back and reload (oops, let's see, we want to go to this), we'll now have a new name. It's updated the name. Let me get rid of this one, because that's an example, and let me make this a little bigger. OK, from there we can run metapack -a. I think I've got... so this is an Excel file with multiple tabs in it. It'll go through, inspect them all, and load them in. Reload again, and now we have URLs for all of the files. You can see their fragments; the fragment name in the URL is the tab it came from. We've got a fairly complicated URL scheme, so that just about any file that might have data in it will have a corresponding URL. It's also given us names that are a little too long, so we're going to change those. And you can see a start line and a header line: it tries to figure out where the rows are. If you have multiple headers, it'll skip over them and try to figure out where the header line is and where the data starts. It doesn't always get it right, particularly if there's a lot of text. In this case it doesn't know, and we're just going to have to change those, but oftentimes it'll do a great job of skipping the comments. Now if I save this and rerun, it's going to generate a schema. Ooh, what did I do? I need to save it. Oh, because I need to get rid of this one; sorry, I forgot to delete that. So now it's processing those files, and it has added things in the schema section: we now
have tables and columns. You can also see that it has detected that these names are crummy names for columns, so it created new ones for me. And there's a bunch of other stuff; this is also where you'd add the Python code references to modify things as the package gets built. Once you've done that, this is basically complete metadata. You would probably want more stuff in there, references to documentation, and we probably should fill in the contacts for who was the wrangler and who created it and all that other stuff. But from there, if you run -f, it'll make a filesystem package. Oops, let's do -e. -e creates a zip package and an Excel package. So now if we look in the packages directory, we've got a zip package and an Excel package. And if we open up the Excel package, it's not terribly nicely formatted, but you can see it's pretty much the same data, with a couple of changes: our URLs have been turned into simple names, because those are the names of the tabs, and it has loaded in all of our data as tabs using those names. If we define any cleanup routines in the schema, this data will now be cleaned with whatever parsing we declared. My analysts use this a lot, because we get lots of crummy data and we can programmatically declare what we're doing to it. I demand that, because I don't want my analysts going and mucking with CSV files when we're publishing; I want to know, what did you do? So I make them write Python code for it. From here, you can upload it to Amazon S3 and then deploy it into your various repositories. So, let's see, let's go back to... if you want more information about the project, it's at metatab.org. We're at a stage where I've been using this in our own work, and I'm looking forward to deploying it more widely. We're talking with a couple of organizations; we've been talking to the data.world folks, who are awesome, and to Open Knowledge about making this more or less a front end to datapackage.json. It's
harmonized with datapackage.json: it will generate those files, but it's easier to create. And I'm looking toward having a more reliable expectation that data producers, particularly ones who produce public data in health departments, have metadata available, and that it's available in some sort of format that I can read. If it's JSON, great; but if it takes Metatab to get them to do that, well, that's why we're promoting this. We're very happy to have people help with the project, and if you have any use cases for it and would like to deploy something like this in your environment, please let me know, particularly if you have a need for the Excel version. We're looking for a core user so we can justify asking for funding to build a tool that does this in Excel. Thank you very much. We've got a couple of minutes for questions.