I'm Jeni Tennison, I'm technical director at the Open Data Institute and I was here a couple of years ago talking about CSV on the web, so this is a kind of update on where we are, but with all the background. So, I'm going to be talking about how we can make CSV a first class citizen on the web. I'm going to go through the kind of dream that I have about how CSV might work on the web. Then I'm going to talk through some of the experience of trying to create a standard around that, and where we found there were difficult bits, highlighting what those were and the kinds of ways in which we've addressed them, and then a little bit of what's next, because you always need to have the next step. So, first of all I'm going to talk about how CSV might look if it was a first class citizen on the web, the kind of experience we might have with CSV. At the moment we're in the run up to some elections in the UK, so on Thursday we'll be voting, of course. Here is some data from the last time we had some local elections. This is data from Wikipedia about how many seats and how many councils different political parties won, and you can see that there's a nice table of data there, with political parties, number of councils, number of seats, and there's also a map on the right that shows a summary of that information. Now, as you know, if you've gone and edited Wikipedia, or if you've done any HTML whatsoever, then all of that will be kind of hard coded behind the scenes with a table, table rows, all in nice HTML format; somebody has entered that data. If you ever want to get hold of that data and do something else with it, then you need to be able to parse that HTML, parse out that table. And that's a real pain, because election results are often entered on Wikipedia really quickly. It would be really cool to be able to get hold of those election results very quickly out of Wikipedia, but they're all embedded in this horrible HTML.
And it's a real pain because you have the data actually being used twice: once for a table that shows you that data, and once for a map that summarises it. So what I would really like to see is HTML that actually references the CSV file directly and says: bring this into a table. The analogy in my head for this was: nowadays, if you want to have an image in your web page, then you can write it all up as SVG, as tags within your HTML, or you can reference out to an image that is stored in some other, more efficient format for an image. Can't we have that also with tables and CSV? Can't we have, as well as the ability to embed nice tags for tables, the ability to just reference out to a CSV file, which is a much more efficient format for tables? And if you're doing that, why can't we have native tags in HTML that just say: I want a choropleth map here; here's the data for it; here are the boundaries that I want to show? Obviously you can put in a few other kinds of configuration things or whatever, and just display that. Wouldn't it be cool if you could have that within your HTML pages? Wouldn't it be cool if that could just be driven off the same CSV, so that, for example, I could have my CSV here, which is a summary table that I want to have within my HTML page, but obviously I can process that in lots of different ways? And I don't know if you saw, but they have these nice colours for political parties down on the left there, and those colours are used again in the map on the right. Wouldn't it be cool if there was just one reference source for party-to-colour, and you could reference that CSV file and bring it into your maps, bring it into your tables, etc. etc.? Wouldn't it be nice if we could have these things being driven off data?
It would help people who are presenting data, because it's much easier to embed one of these tags than to fiddle around generating lots of HTML yourself or using D3. But for those of us who like to have access to data and process it ourselves, it would also improve the quality of the data that's available to us. If people were generating presentations straight off CSV files, then it would motivate them to ensure that that data was machine readable. It would motivate them to fix errors, because there's nothing that shows you errors in your tabular data so quickly as a visualisation of it. You can see when things are coloured wrong, or go completely off the chart, or are positioned in the completely wrong place because your data is poor. It would also motivate publishers to provide accurate metadata about those CSV files, because if that metadata was being picked up and actually used automatically in visualisations, then it would be really important to get it right. That was, for me, why I was really interested in trying to get that working through some kind of standard mechanism, which browser implementers, and also library writers etc., could use to create those kinds of visualisations and tables. We started this CSV on the web working group a couple of years ago; it was completed earlier this year, and I was the co-chair of it. We built on and learned from, obviously, Open Knowledge's data packages and tabular data format, but also Google's dataset publishing language, the validation tools developed by the UK National Archives, and a whole set of existing CSV parsers and the way in which those worked, and we went through and created a whole bunch of really good use cases and requirements out of looking at how people are actually using tabular data on the web. Now I'm not going to go into what I'd say are the bits that are well known and easy, particularly the stuff that is already covered in the data package format.
So the idea that you can have a schema that determines what type different columns are, that's all kind of well understood. I'm going to go into the bits that, as we were going through this process, were hard; the bits that were a bit more challenging; the bits that I personally found surprising. And there are three of these difficult bits. How do you discover metadata about CSV? How do you link between CSV files, where you have ones that reference each other? And how do we get the balance right between the need for machine readability, which is what we really want out of data, and also human readability, which is, I think, one of the things that makes CSV kind of unique as a data format: we expect it to be human readable as well as machine readable. So first of all, about discovery. In order to properly present CSV in a table, to get sorting right, to get any pulling in of extra information working properly, you need to have metadata associated with the CSV file. So in those two CSVs that I showed earlier, you need to be able to say: right, these columns contain numbers; that's a bit of metadata. You need to be able to say, when you're displaying this as a table, then display it as '±', not 'seats plus minus' or whatever, so changing the way in which it's displayed. And you need to be able to point out that this column here, the party column on the left, is a pointer to the party column on the right, and therefore you can pull in that extra information about colour. Okay, so you need metadata to add to your CSV file so that you understand it properly. And because CSV is the kind of terrible data format that it is, it really has no mechanism for adding comments or any kind of metadata in the file, so you have to do that as a separate file.
And when I say it doesn't have a mechanism for doing that, obviously there are variants on CSV that do add that kind of metadata, but according to the RFC you can't do that. So if you assume that you're doing CSV according to the RFC, you have to put that in a separate file, and in the same way as tabular data packages do, we do that as an adjacent file. So this is the kind of adjacent format that we use in the CSV on the web standard. You can see it says: this is the CSV on the web metadata for the file at UK Local Election Summary 2015.csv, and the schema for it is that it's got these columns: party; councils, which is a number; councils plus minus, which is a number; seats, which is a number; seats plus minus, which is a number. Really simple kind of metadata. Okay, so you've got a CSV file and you have some metadata. The bit that is problematic is how you provide these two paired files, the CSV and the metadata as well. One way of providing those two paired files would be to package them up in a zip. And you often find that happening: you often have packages which are a CSV file plus some metadata about it in a zip file. Another way which people might use is that they can actually point to the JSON metadata directly. So here, for example, if you pointed to the JSON, then you could find the CSV from the JSON itself. But when you embed a link on a web page and say you can download some CSV here, what people expect is that they can download some CSV here. Normal people, and we have to focus on normal people rather than just geeks, will not expect there to be a zipped package, and they won't know what to do with it if there is. They'll really not know what to do if there's a JSON file at the end of that link. So when you're generally providing links to CSV files on the web, it has to actually be to the CSV file itself.
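As a concrete sketch, the adjacent metadata file being described might look roughly like this (the exact file name, column names and layout here are illustrative, not quoted from the slide):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "UK-Local-Election-Summary-2015.csv",
  "tableSchema": {
    "columns": [
      { "name": "party", "titles": "Party" },
      { "name": "councils", "titles": "Councils", "datatype": "integer" },
      { "name": "councils_plus_minus", "titles": "Councils +/-", "datatype": "integer" },
      { "name": "seats", "titles": "Seats", "datatype": "integer" },
      { "name": "seats_plus_minus", "titles": "Seats +/-", "datatype": "integer" }
    ]
  }
}
```

A CSV on the web processor pairs this description with the CSV file named in `url`, and uses it to type and label each column.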
So given that you've got a link to a CSV file, how do you, from that CSV file, locate the metadata that you need in order to interpret it? The right way of doing this is using an HTTP header, the Link header. The Link header can say: go and look at this other file, metadata.json; that file describes this one, using a 'describedby' relation, and it's in this format, the CSV on the web JSON format. So the right way is that you serve up the CSV file, you have a Link header in the HTTP headers of the response, and that points to the metadata, and then you can get those together. And obviously only machines are going to follow that link and do something with it, but that's okay. That's the right way of doing it. Now, unfortunately, many people, many organisations, when they're serving up CSV files, when they're publishing data on the web, don't have any kind of control over the Link header or any HTTP headers whatsoever. If you're just publishing through GitHub Pages, for example, which lots of people do, then you have no control over those HTTP headers. And anyway, they're quite fiddly to do things with. So one of the things that we did within the CSV on the web work was specify some default file names, some default file paths, that will be looked at. If you receive a CSV file, you can find the metadata by adding '-metadata.json' on the end of the file name for the CSV file. So that just makes it easy, if you're a normal publisher, to publish some CSV and some metadata about that CSV just by using associated file names. And if you have multiple CSV files all in the same directory, then you can just use metadata.json and that will be the descriptor for all of them. There's a bit of really geeky stuff about .well-known in here; I'm not going to go into it, but if you want to ask me about .well-known later, then I'll tell you about it. So that was about discovering metadata.
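Sketched as an HTTP exchange (the host and file paths are made up for illustration), the Link header approach looks like this:

```http
GET /elections/results.csv HTTP/1.1
Host: example.org

HTTP/1.1 200 OK
Content-Type: text/csv
Link: <results.csv-metadata.json>; rel="describedby"; type="application/csvm+json"
```

If no such header is present, a processor falls back to the default locations: first `results.csv-metadata.json` alongside the CSV, then a directory-wide metadata file (and the `.well-known` mechanism mentioned above).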
You can use the Link header, and you can use a default file name. The second thing is about linking between CSVs. You can think of CSVs on the web as being like a relational database. So I can specify, for example, in my nice CSV file here, where I've got parties and their colours, that the party name is a primary key through which I can reference that particular line within the CSV. And then, when I'm describing the data that's about the election results, I can use a foreign key to point from the party column in those results to the party column in the reference data about party colours. So that's a hard reference between those two files. When you set up a key relationship like that, then from a validator's point of view, any value that you find in the referencing file has to exist in the referenced file. And if it doesn't, then it throws an error. But there's another way of doing it as well: you can also use kind of soft links. And I think this is one of the places where, when you're thinking about designing for the web, it's important to recognise that if you're relying on somebody else's data over there, and we want to get to reuse of data, so we do want to rely on somebody else's data, they might change it, some of the values might disappear; we can't necessarily rely on there being these hard relationships between CSV files. So one of the other things that was introduced in CSV on the web was this idea of soft links between files. You can say that in the party colours here, each row is about a particular party, and give it a URL that is an identifier for that party, just by generating the URL based on the data within that file. And then, when you want to refer to that party, you can generate a URL for that reference. And if the URLs match, then you've got a kind of soft link between those two sets of data. And if they don't, then nothing breaks, right?
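In the metadata format, that hard primary key / foreign key relationship between the two files can be expressed along these lines (file and column names are illustrative):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    {
      "url": "party-colours.csv",
      "tableSchema": {
        "columns": [ { "name": "party" }, { "name": "colour" } ],
        "primaryKey": "party"
      }
    },
    {
      "url": "uk-local-election-summary-2015.csv",
      "tableSchema": {
        "columns": [ { "name": "party" }, { "name": "seats", "datatype": "integer" } ],
        "foreignKeys": [
          {
            "columnReference": "party",
            "reference": { "resource": "party-colours.csv", "columnReference": "party" }
          }
        ]
      }
    }
  ]
}
```

A validator then checks that every value in the results file's party column appears in the party-colours file, and reports an error if one doesn't.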
It's not invalid; it's just that you've got something that's referring to something that doesn't exist, or that you don't have any extra information about. So as well as hard links, we've got soft links, to deal with the fact that the web is not a distributed relational database. And then the final thing that I wanted to go into that was kind of hard was this duality between machine readability and human readability. Most data formats are built for machines to understand, and we wouldn't put them in front of normal people. But CSV is different. We do expect CSVs to be opened up within Excel; we do expect people to do that. And when you allow for the fact that there's human variability, then you have to start supporting things like the fact that people will use different headers, even when they mean the same column. So we have in the CSV on the web standard an ability to say: the first column is the country, but it might be called 'country' with a lower case C or with an upper case C. This column is called unemployment, but it might be called 'unemployment' or it might be called 'unemployment rate', or people might put a percentage sign in it, I don't know, right? So we allow for some flexibility in the way in which people provide the titles in their CSV file. That becomes particularly useful when you have some CSV format that you actually want to be used in different locales, different locations. So, for example, if you want to be collecting unemployment statistics from multiple countries, then you might want to enable people who don't speak English as their first language to use their first language in the titles of the columns, because it's a human format as well as a machine format. So within CSV on the web you can say that the title of this particular column will be 'country' in English and it will also be 'Land' in German, for example.
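The soft-link mechanism and the multilingual titles are both metadata properties too. A sketch, with example.org URLs standing in for real identifiers: the party-colours file mints a URL for each row from a URI template, the results file generates the same URL for each party value, and matching URLs give you the soft link.

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    {
      "url": "party-colours.csv",
      "tableSchema": {
        "aboutUrl": "http://example.org/party/{party}",
        "columns": [
          { "name": "party", "titles": { "en": "Party", "de": "Partei" } },
          { "name": "colour" }
        ]
      }
    },
    {
      "url": "election-results.csv",
      "tableSchema": {
        "columns": [
          { "name": "party", "valueUrl": "http://example.org/party/{party}" },
          { "name": "seats" }
        ]
      }
    }
  ]
}
```

If a party value in election-results.csv generates a URL with no matching row in party-colours.csv, nothing is invalid; the reference just resolves to nothing.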
Where it gets really sticky with supporting this human and machine duality is when it comes to formats for dates and numbers, where of course people like to enter stuff with commas in when they're doing numbers, or they like to do dates in their own particular date format. What we've done in this case is that we've tried to hook onto the only kind of standard that there actually is for specifying any of these formats, which is Unicode technical standard number 35. That basically specifies the formats that you can use in order to describe numbers or describe date formats. What we've said in the spec is that there's a minimal set that any implementation must implement, but we also allow implementations who want to go that extra mile to implement even more. In particular, we don't expect all implementations to understand all languages, all the different versions of the names of the months, because there are so many languages, and having all implementations implement all of those different languages would be too much. But implementations can support different languages if they want to. So when you have a date, then you can specify that the format of this date might be the usual kind of British standard, or the German one, or even the American one if you really want. And so you can use day, month and year in the standard kinds of ways for those. Similarly for numbers, we can have just plain old numbers, but if you want to enable people to put in commas, to put in decimal places, then you can do that. And if they are using groups which are separated by spaces, and commas for decimal points, then you can specify that as well. So there's flexibility that enables people to write the CSV file in the way in which they like to see it presented to themselves.
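In the metadata, those UTS 35 patterns hang off each column's datatype; a sketch with illustrative column names:

```json
{
  "tableSchema": {
    "columns": [
      {
        "name": "polling_day",
        "datatype": { "base": "date", "format": "dd/MM/yyyy" }
      },
      {
        "name": "electorate",
        "datatype": {
          "base": "decimal",
          "format": { "groupChar": ",", "decimalChar": "." }
        }
      }
    ]
  }
}
```

So '07/05/2015' parses as a date and '1,234,567' as a number, while the CSV stays readable for the humans who typed it; a German publisher could instead declare 'dd.MM.yyyy' and swap the group and decimal characters.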
And then the metadata can be used in order to interpret that correctly, so that you don't just interpret it as text, and you don't have to guess at what that format might be. So those are the kind of hard, difficult parts of making CSV part of the web. What next? Well, we've got implementations for doing validation of CSV files against those schemas. There are specs and implementations for converting CSV plus that metadata into JSON, and also CSV plus that metadata into RDF. There are some implementations around for authoring metadata out of CSV files, but there's nothing yet which goes to my very first point about how it would be really, really cool if we could display this stuff really easily. So what I think would be really cool is if we could have some libraries that package this up into a kind of web component, so that you can include those kinds of table tags and choropleth tags, referencing just out to the CSV plus that metadata, automatically getting hold of the metadata and then presenting it. You could create tables, you could create maps, you could create charts. Other stuff: there's also scope within the spec for specifying annotations on individual cells, or for example on rows. I think it would be really cool to have displays that actually highlighted those, and you could click on them and display them. A bit like comments in a Google spreadsheet, sort of thing. I also think that an area that is really interesting is navigation. What is the right model for navigating between tables of data on the web? We're nowhere near that at the moment, but I think that as CSV becomes a more native format for the web, then how we navigate between CSV files will start to become a really interesting problem. If you want to learn more, that's the link to the primer, which basically goes through how the spec works, and that's it, thank you very much. I think I've got time for one or two questions.
The only time that that becomes a problem is when you're trying to convert that into some other kind of data format that you actually want to manipulate and do something with. There is a facility within the CSV on the web spec to have what are called virtual columns: columns that don't actually exist in the data, but that you can use for static information, like 'this is a value from 2014'. In the mechanism for converting from CSV to JSON using this metadata, those kinds of values can be pulled out and added into the JSON, which you can then use to create something quite generic. Does that make sense? That was very quick. Any more questions? Thank you very much.
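A virtual column like that '2014' example might be declared along these lines (file and column names are illustrative):

```json
{
  "url": "local-election-results-2014.csv",
  "tableSchema": {
    "columns": [
      { "name": "party" },
      { "name": "seats", "datatype": "integer" },
      {
        "name": "year",
        "virtual": true,
        "default": "2014",
        "datatype": "gYear"
      }
    ]
  }
}
```

The year column never appears in the CSV itself, but the CSV-to-JSON conversion emits its default value on every row, so downstream code can treat the output generically.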