Okay. I think we can start now. So, hi everyone. My name is Antonin Delpeuch and I work on OpenRefine. So, do I need to change anything in the setup now? Okay. Great. So, I'm going to have a very unoriginal show of hands of who has ever used OpenRefine here before. Wow. Pretty good. But I was expecting people not to be completely familiar with the tool. So, I'm going to do a quick demo just to make sure you have an idea of what the tool looks like. Then, I wanted to tell you a bit about what we're trying to do on this tool: how we want to improve it, revamp it, and hopefully you'll have some ideas about how we can make that better. So, let me start with a quick demo. Can you see all right? Yeah, I think it's not too bad. So, OpenRefine is what people tend to call an extract-transform-load system. Basically, the idea is you have data in some format in some data store, and you want to load it into your system, transform it, massage it into a different format, very often fix some issues in the data, and then push that to another format, to another database that has its own constraints on the data. It's a web-based tool, but it runs locally, so you need to install it, and all your data stays on your computer. It accepts all sorts of input formats. I'm just going to use a CSV here to show you how it works. So, I just put my CSV in the tool. Okay, it's not really meant to be used with that much zoom, so I might try to zoom out a bit. I can create a project with this data, and this is what it looks like. So, this dataset is about filming locations in Paris. Basically, every time you want to shoot a film in Paris, you have to ask for permission, and they keep a record of that, so you can then get this information, and this is what it is about. So, you have the title of the movie, the director, and then all sorts of other things about this film. I'll just show you a few things you can do in the tool to clean up this data, to transfer it to another data store.
So, for instance, a popular feature of the tool is what we call clustering. If you take this director column here, you can say cluster and edit, and the idea is that it looks at all the values in this column and looks for things that could be duplicates, and you have various ways to do that. You have a fingerprinting method, which is the default one here, but you can also look for things that are near each other in terms of edit distance. So, if you play a bit with the settings, you can discover values in your dataset that might mean the same thing, and which could be errors in the dataset. Here you can review these, so you can say, okay, these are probably the same thing, so I want to merge them to this value; and then maybe this one is not actually a true duplicate, maybe these are two different people, so I want to leave that out and not merge it. And then you can click here, and it just does the replacement in the entire table. The idea is that all operations you apply are applied uniformly to all rows in your dataset. So, it's a bit more principled than Excel, which is very often the tool people come from before using OpenRefine. So, this is one thing you can do. Another feature that I really like is called reconciliation. You have this column of titles here, and if you say that you want to reconcile it, you can select another database, generally online, that you want to match this column against. So, you have names, and you want to get unique identifiers for these films. For instance, you could want IMDb IDs, or all sorts of other identifiers for these films. Here you can just pick the database you want to match against. I could use some of the ones I have here, or I could add another service: if a database implements the API that OpenRefine expects to do this matching, you can use it, and you just need to add the address of the database here. So, it's really user-friendly.
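To make the fingerprint clustering idea from a moment ago concrete, here is a minimal sketch in Python. The normalization steps (lowercase, strip punctuation, deduplicate and sort tokens, rejoin) follow the commonly documented fingerprint method, but this is an illustration, not OpenRefine's actual Java implementation.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Normalize a value to a clustering key, fingerprint-style."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)   # drop punctuation
    tokens = sorted(set(value.split()))     # deduplicate and sort tokens
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprints collide: likely duplicates."""
    buckets = defaultdict(list)
    for v in values:
        buckets[fingerprint(v)].append(v)
    # only buckets with more than one spelling are candidate clusters
    return [vs for vs in buckets.values() if len(vs) > 1]

directors = ["Luc Besson", "Besson, Luc", "luc besson", "Agnès Varda"]
print(cluster(directors))  # the three Besson spellings share one fingerprint
```

Reviewing and merging such a bucket to a single canonical value is then the mass edit applied uniformly to every row, as described above.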
You just need to know one URL, and then you can do data matching against that database. So, I'm just going to use Wikidata here. Wikidata is this knowledge base created by the Wikimedia Foundation. It has data about all sorts of topics, including films. So, this is what it looks like. You can configure the matching process in many ways. You can select a type that you want to restrict the search to. You know that these are films, so you only want to match them with entities about films. Then, you can also use other columns from the table to refine the matching. For instance, you can say that the director of the movie is actually a very good signal that you want to keep. So, you can say, I want to match that to the director property in Wikidata. That's going to do fuzzy matching not just on names, not just on the titles of movies, but also including the director name. I'm not going to run that here because it would take quite a while for these 3,000 rows. I've already done it before, so I'm just going to show you what it looks like afterwards. For each cell, you have candidates, which are entities from the target database that the cell could correspond to. You can review these manually, or you can also use some heuristics to make the reviewing a bit more principled and also a bit more time-efficient. If I click on this, for instance, I get to the Wikidata item for this film. So, these are two interesting features that you can use to do data cleaning. One thing I haven't shown you so far is the facets on the left-hand side. These are a bit like in a search engine, where you have summaries of the values in some columns, and you can use them to filter down the rows you see to a particular subset that you are interested in. So, by clicking on this matched value here, I selected all the rows where the matching was reasonably confident.
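The reconciliation setup just described (a name, a type restriction, extra columns bound to properties) boils down to a small JSON query batch on the wire. This sketch builds one following the query shape of the reconciliation API drafts; the identifiers used (Q11424 = "film", P57 = "director") are real Wikidata IDs, but the batch is illustrative and not sent anywhere.

```python
import json

def make_query(name, type_id=None, properties=None):
    """Build one reconciliation query: a name plus optional constraints."""
    query = {"query": name}
    if type_id:
        query["type"] = type_id            # restrict candidates to this type
    if properties:
        # bind extra column values to properties of the target database
        query["properties"] = [{"pid": p, "v": v} for p, v in properties]
    return query

# A batch of queries is keyed q0, q1, ... and sent to the service endpoint.
batch = {
    "q0": make_query("Amélie", type_id="Q11424",
                     properties=[("P57", "Jean-Pierre Jeunet")]),
}
print(json.dumps(batch, indent=2, ensure_ascii=False))
```

The service replies, per query, with a ranked list of candidate entities and match scores, which is what you see under each cell after reconciliation runs.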
And so, we already have valid links to data in these cells; I don't have to review the candidates here. And now, with this filter applied on the left-hand side, every operation I do on the table is only going to be applied to these rows. That lets you build conditional workflows, which can be quite advanced. You can combine facets together to say: okay, if this value is in this range and if that string is equal to that, then I want to do this operation. And all this is completely visual; it's just a very simple UI that people can get accustomed to quite easily. So, for instance, one thing you could do here is take this column and fetch more information from the database and add it to my local dataset. Wikidata stores all sorts of information about these films, and I can fetch it just by clicking on the attributes I like: for instance, the genre of the film, or the IMDb ID. This is just a preview of what you would get in the table. I can, again, press OK, and this is going to do it for the entire table, which would take a little while. So, that's a very, very broad overview of what the tool does. It's really popular in many communities. We have a lot of journalists using it, librarians, researchers in a lot of different fields, digital humanities quite a lot, the Wikimedia movement, and of course many people we don't know about, because the tool is just open source and you can use it wherever you want. So, let me tell you a bit now about the work we're trying to do on the tool. It's a project that started 10 years ago now. It was initially called Freebase Gridworks, because it was made by the company who ran Freebase, which was a sort of crowdsourced knowledge graph, a sort of structured Wikipedia, in a sense. Very quickly, the company got bought by Google, so the tool got renamed to Google Refine.
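The facet mechanic described in the demo can be sketched in a few lines: a facet is just a predicate over rows, and an operation is applied only to the rows that every active facet selects. The column names and the "judgment" field below are made up for the example; OpenRefine's internal model differs.

```python
rows = [
    {"title": "Amélie",       "judgment": "matched", "genre": ""},
    {"title": "Unknown film", "judgment": "none",    "genre": ""},
    {"title": "La Haine",     "judgment": "matched", "genre": ""},
]

# Like clicking the "matched" value in a facet on the left-hand side:
facets = [lambda r: r["judgment"] == "matched"]

def apply_operation(rows, facets, op):
    """Apply op only to rows selected by all active facets."""
    for row in rows:
        if all(f(row) for f in facets):
            op(row)

# With the filter active, only the two matched rows are touched.
apply_operation(rows, facets, lambda r: r.update(genre="drama"))
print([r["genre"] for r in rows])  # → ['drama', '', 'drama']
```

Combining several predicates in the `facets` list gives exactly the conditional workflows mentioned above, built entirely through the UI in the real tool.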
After a while, Google decided to stop running Freebase, so they also stopped supporting the tool, and they converted it into a GitHub project. It was already open source, and then they sort of gave the project to the community and renamed it to remove the Google branding. And since then, the project ran like that, as a GitHub project without much structure around it. Just last year, we finally joined a fiscal sponsor to provide some structure around the project and also to manage funding for it. What I want to stress about this is that the success of the tool is mostly inherited from the first few years of investment, when it was supported by really big software houses with professional software engineers and really clever people building it. Not that much happened afterwards. So, we still have the challenge of taking this tool and turning it into a viable open-source project that can run on its own. Quite a lot of things were done since 2013, of course, but comparatively, we're still resting on the success of the initial tool, really. So, what can we do to attract contributors? What can we do to make this a sustainable project? These are very standard recipes that apply to a lot of projects, of course. We tried to reach out to neighboring communities. Because the tool was originally built for Freebase, we migrated it to Wikidata to be able to tap into the community of the entire Wikimedia movement. This is just a workshop in Amsterdam where we trained people to use OpenRefine for that. We also had a grant from the Google News Initiative to improve OpenRefine for data journalism, and that was also very useful. We had meetings with people from newsrooms in the U.S. to understand their needs better. We improved the localization of the tool quite a lot, also by making it easier for people to contribute translations. And for that, we used a tool called Weblate.
It's a web interface to contribute translations, and it's really, really good. It creates this sort of engage page where you can showcase the translation effort to contributors. In our experience, it really brought a lot of contributors to the project, and people who really feel like they own the project: it's not just that they contribute a few translations and don't feel like they are part of the core team; they really get involved quite heavily. So, I can't recommend it highly enough. We also started a W3C community group to standardize the API that underpins the reconciliation feature I showed you. To have OpenRefine talk to databases to do this matching, we used an API which was designed by the initial designers of the tool. It was not well documented; you basically had to reverse-engineer the tool every time you wanted to implement this matching for a database. So, we're trying to bring that to a better level and document it properly. I can just show you a quick demo of what the specs look like now. It looks like a proper, nice W3C spec. And a lot of people who got involved in the group also started contributing to the tool itself, so it's great. We created a steering committee of high-profile people around the project who know the ecosystem very well and can help us find funders and projects to partner with, and other things like that. It's at really early stages, but it's already very useful. And this year we want to apply to Google Summer of Code and Outreachy. The deadlines are just about now, so if you're also thinking about that, it's time to rush a bit. We're not really sure what to expect from it, but hopefully it's going to bring us nice contributors. Also on the technical side, we're trying to revamp the architecture of the tool, because it's quite old, and sometimes the age of the tool is felt in the development processes.
So, we had to migrate the build system, get rid of non-free dependencies, as was mentioned earlier, and we had to deal with the data package integration in the same go, and some other things as well. We still have a very old web framework that we rely on, which is completely unmaintained; we're probably the only users alive. It's really, really crazy. So, we still have to migrate away from that. And in 2020, we have pretty exciting plans. We want to migrate the data processing back-end to Spark. At the moment, all the data in a project is held in memory, which makes it really hard to scale the tool to larger datasets, and that's a big blocker for a lot of users. We run regular user surveys to check on the needs of our users, and the main issue people have is scalability, so we want to improve that. We also thought it would be good to have proper documentation, because for now we only have a GitHub wiki, which is better than nothing, but really, really not up to the standard it should be. So, we're trying to work on that. It's also at an early stage, and we're still not sure what framework we should use. If you have any idea about what sort of documentation platform we could use, get in touch with us. That work is supported by a grant from CZI, from the Essential Open Source Software for Science program, which started a few months ago. They have another funding cycle open at the moment, so if you think your project could be eligible, there are still a few days to apply to the second funding cycle of this grant program. Do give it a go, because it's really worth it. And I have quite a lot of open questions about how we can better take care of the really, really nice heritage we have with this really interesting project.
Because we're trying to do all these changes to revamp the architecture, we have to break quite a lot of integrations with extensions, because OpenRefine has an extension mechanism where you can add new features on your own, and revamping the stack means breaking compatibility with these extensions. How can we do that in a better way? How can we make sure we're not putting too much strain on the developers of these extensions? If you have any idea, let us know. And also, a very, very important question, I think, is: how do you decide which issues you tackle in the core team and which ones you leave open as potential hooks to bring new contributors in? Because, of course, we can't do everything, and sometimes issues are strategic in the sense that if you don't work on them, no one else will come and do it. But sometimes it's quite hard to tell which ones are which. And if you're familiar with the tool and have any ideas about features that are absolute blockers, or things we should do better, do let us know, because we're really trying to get that right. And that's it for me. So I think we probably have time for questions now. So, the question is about adding new fuzzy matching algorithms to the clustering feature that I showed. Recently we made that part of the tool extensible, so from now on, extensions can define new algorithms like this; you don't have to patch the tool itself directly. We're really keen to have more in the tool itself, too. I mean, there's no problem with that; it's just not clear to us which ones are needed. We don't have much feedback about what's lacking. So yeah, if you have any particular example in mind, I'm keen to hear about it. Thinking about how to get more contributors, my question is: do you have in your organization something like a community manager or a developer evangelist? Someone whose job would be to help people get into the code, or to turn issues into pull requests?
So, the question is: do we have someone in charge in the team to do developer evangelism and bring in new contributors? Right now, not really, although some people do a lot of work in this regard. We're still a very small team, and no one is actually working permanently on the tool so far, because we've had some grants but nothing permanent. So I sort of see it as my duty as maintainer to ease the onboarding of new contributors. It would be great to have someone dedicated to that. A feature that I would like to see: when I'm using OpenRefine, I'm constantly exporting data to CSV and then checking it into Git to preserve the history. I know OpenRefine has its own history, so there's some overlap between the two, and also with the data itself. So integrating these workflows would be very interesting: being able, in a single operation, to commit your history and your data to Git. Yes. So the question is, can we... You track the changes, because data is precious, so you can easily make mistakes here, and having that safety net is always better. Right. So if I understand correctly, the question is: how can we integrate the embedded history that we have in OpenRefine with external tools like Git or other pipeline systems? We're really keen to work on this too; actually, some work has started this year already to make that easier. Just to make sure people understand: this history, which is basically the list of operations you applied, can be represented as a declarative JSON blob, which can then be applied to other projects that have the same structure. So it already gives you some reproducibility for your workflows. It's really popular with people who come from Excel, who don't have any simple way to do that. Basically, you're programming without knowing it, because you're just doing things graphically, and this gives you the workflow at the end.
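To illustrate how a declarative operation history can be replayed, here is a sketch: each step names an operation and its parameters, and applying the list to a project with the same columns reproduces the transformation. The JSON schema below is a simplified stand-in, not OpenRefine's real operation format.

```python
import json

# A toy two-step history: a mass edit (like accepting a cluster merge)
# followed by a column rename.
history_json = """[
  {"op": "mass-edit", "column": "director",
   "edits": [{"from": ["Besson, Luc"], "to": "Luc Besson"}]},
  {"op": "rename-column", "from": "director", "to": "Director"}
]"""

def replay(rows, history):
    """Apply each recorded operation, in order, to every row."""
    for step in history:
        if step["op"] == "mass-edit":
            mapping = {f: e["to"] for e in step["edits"] for f in e["from"]}
            for row in rows:
                value = row.get(step["column"])
                if value in mapping:
                    row[step["column"]] = mapping[value]
        elif step["op"] == "rename-column":
            for row in rows:
                row[step["to"]] = row.pop(step["from"])
    return rows

rows = [{"director": "Besson, Luc"}, {"director": "Agnès Varda"}]
print(replay(rows, json.loads(history_json)))
```

Because the history is plain JSON, it is exactly the kind of artifact one could version in Git alongside the exported data, which is what the questioner is after.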
So, what we're trying to do is expose the operations in OpenRefine as a Java library that you could reuse in other settings. That would be a start, because at the moment it's really hard to do even that. And then we'd like to have integrations with as many other tools as possible. We've been thinking about adding support for other expression languages, so you could drop to R or to another language in the middle of your transformation, and that would give you some of that. But it's still not quite clear to me what it would look like in practice. Martin? So, do we see users turning into contributors? Yes, I mean, I'm one of them, really. I started by just needing the tool and got roped into working on it gradually. So yes, I think that's a natural route, although it's quite hard, because the tool appeals mostly to non-technical users, people who don't feel like they're developers and don't feel like they could contribute to it. But still, it happens. It's all about trying to get the message out that you can contribute even if you're not a Java programmer, and that you have many ways to do that. So yes, it's happening, slowly. I'm curious about why you think the wiki is not proper documentation, and what advantage you see in another system. So, the question is: why are GitHub wikis not appropriate for documentation? One thing I'd love to have is localization for the documentation, because we have a very diverse user base and very often English is a hurdle. Also versioning: I'd like to be able to say, okay, this is the documentation for 3.0, and 4.0 has different documentation. And also, the layout is a bit weird and not very easy to deal with; it's not super easy to read. Things like that. There are a lot of features in documentation systems that you don't have in GitHub wikis. My question is about licensing, tracking license information. It also relates to provenance. So, you've got a dataset.
Maybe it's under a Creative Commons BY license, and you modify it. It would be really, really useful to track that license information. Right. Could you propagate this so that downstream users can then cite it properly? Right, that would be great. So the question is: can we add provenance tracking for licensing in the tool? I'm sort of working on this this year, by first making it possible to check which columns a particular column was derived from. That should hopefully make that sort of thing possible in the future. But it's still very much out there.
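The column-level provenance just described could look something like the sketch below: record, for each derived column, which original columns it was computed from, so that license or attribution information could later be propagated along those edges. The `Project` class and its API are entirely hypothetical, made up for this illustration.

```python
class Project:
    def __init__(self, columns):
        # column name -> set of original columns it derives from
        # (empty set means the column is itself an original)
        self.lineage = {c: set() for c in columns}

    def derive(self, new_column, from_columns):
        """Register a derived column; provenance is inherited transitively."""
        sources = set()
        for c in from_columns:
            # an original column contributes itself; a derived one
            # contributes its own recorded sources
            sources |= self.lineage[c] or {c}
        self.lineage[new_column] = sources

p = Project(["title", "director"])
p.derive("label", ["title", "director"])   # e.g. a concatenation
p.derive("slug", ["label"])                # derived from a derived column
print(p.lineage["slug"])                   # traces back to the originals
```

With such a graph in place, a license attached to `title` could be surfaced on every column downstream of it, which is the use case the questioner raises.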