Hello everyone. My name is Mathieu Jacomy. I come from Paris, in France, and I work at the biggest school of political science there, named Sciences Po. There is a laboratory there named the Media Lab, which is a hybrid laboratory where researchers, engineers and designers work together to, let's say, study the social. I don't want to dig too much into the details, but let's say I'm a sort of social data scientist, if you want. I work with researchers: they come with research questions and data, and I help them deal with, for instance, the technicalities of digital data, but also with the question of matching the specificities of the data with their research interests.

The tool I will present is quite simple and quite straightforward, but what's important is the fact that you have an exploration step in what we do. And exploring data is not about the classic statistical metrics, because most of the time you don't know what you are searching for, or you don't know exactly. The perspective we are in is very well explained by John Tukey, who was a statistician, in case you don't know him. He wrote a famous book named Exploratory Data Analysis, and the first quote is very telling on that matter: "The greatest value of a picture is when it forces us to notice what we never expected to see." It's about serendipity, unexpected things in the data, which is very important for social scientists. And I like this other quote a lot: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." So the whole point of the tool is helping you find what the right question is.

You probably already know that very famous image. It's from Ben Fry, who invented the Processing language, and he explains that the chain from data mining to visualization has different steps, but they don't only go left to right: sometimes you have to go back. Okay, so it's about iterating on facets of the data.
In particular, in the middle of this image you have filter, mine, represent, and then you go back to filtering. This central loop is very relevant to us, because when you have a CSV, which is quite a simple format to read, to parse and to put into the computer, you are mostly somewhere here: you already have the data, and then you want to dig into it, you want to filter and you want to mine the data. To represent it you have many ways, but then most of the time you realize that you don't have the right question, so you have to go back to filtering, and you have to tinker and search.

So with CSV we have a first problem: most of the tools require painful coding. LibreOffice, Excel, OpenRefine, they all allow you to code, but the coding features are secondary: they are embedded somewhere in the graphical user interface, and they are designed for non-coders, as if with a graphical interface you would not be obliged to deal with coding. But once you have complicated data and a complicated research question, you have complicated filtering, and only coding allows you to do that. So you still have to code. My observation is that, as a non-coder, you lose your time learning how to code in LibreOffice, because it's complicated, it's non-standard, it's broken, it's limited. So why not do it with a real language?

Let me give you an example. This is exactly the kind of example that was presented before. I just downloaded it from the internet: it's a ranking of movies, and it has many different quirks. First, the rating comes with this "divided by 10" string that is not relevant; we just want the rating. But more bizarre, you have the years encoded inside the titles of the movies, which somehow makes sense, but it's not easy to deal with. So you have to parse the data.
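As a rough illustration of the parsing problem just described, here is a minimal JavaScript sketch. It is not the code shown in the talk: the column names `title` and `rating`, the sample rows, and the exact shape of the values (a four-digit year in parentheses at the end of the title, a rating followed by "/10") are assumptions made for the example.

```javascript
// Hypothetical sample rows; the real file's column names and values may differ.
const rows = [
  { title: "The Godfather (1972)", rating: "9.2/10" },
  { title: "Se7en (1995)", rating: "8.6/10" },
];

// Split "Title (YYYY)" into a clean title and a numeric year,
// and drop the "/10" suffix from the rating.
function cleanRow(row) {
  const match = row.title.match(/\((\d{4})\)\s*$/);
  return {
    title: row.title.replace(/\s*\(\d{4}\)\s*$/, ""),
    year: match ? Number(match[1]) : null,
    // parseFloat reads the leading number and stops at the "/".
    rating: parseFloat(row.rating),
  };
}

const cleaned = rows.map(cleanRow);

// Once the year is out of the title, "titles containing a digit" becomes a meaningful filter.
const withNumber = cleaned.filter((row) => /\d/.test(row.title));

console.log(cleaned);
console.log(withNumber.map((row) => row.title));
```

The regular expression is anchored at the end of the title so that digits elsewhere in the title are left alone.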
Let's do that in LibreOffice. You can't do it in one formula, or at least I don't know how to do it, so you have to use a first column to find where the year begins and then a second column to compute the substring. This means you have an additional column that is not really relevant, but you can't get rid of it. And you also have very strange things: because I'm French, my computer is in French, so LibreOffice translates the keywords of the language into French. Instead of having SEARCH you have CHERCHE, which is the French word for search. So this is like, what the fuck? If I'm working with students and they copy-paste my code, it doesn't work on their computers because they're Italian or English or whatever. These are the quirks of having a secondary language embedded inside something. So the GUI, the graphical perspective of the spreadsheet, is a limit here; it's a problem.

If you had JavaScript, you would just write a line of code. It looks complicated if you don't know how to code, because it's a regular expression, but if you really look at the LibreOffice version, you still have a regular expression, because you can't do it another way. So it's still complicated.

This works well as soon as you can put your CSV inside the coding environment. Let's say you know Python: you load your CSV, you do your filtering, and then you save another CSV, which is clean. But then you want to visualize it, so you have to put it inside, let's say, Excel or LibreOffice or Tableau Software. And then you realize, remember the loop, that your filtering was not adapted. So you edit your filtering, you save it again and you reopen it inside the software, and this is complicated. This is problem number two: there is a gap between filtering and editing the CSV on one side and visualizing it on the other. For instance, Tableau Public is amazing if you want to craft complex, sophisticated visualizations and you really know what you want, but if you just want to filter
your simple CSV because you want to look at a specific facet, you can filter, but you have to open a modal with a drag and drop, then go to a tab, and then in a form you can paste your formula. Once you close and save your filter, you see the visualization, but each time you want to do it again it's very complicated. One of the issues is that the modal, the panel, is hiding the visualization, so you have many steps to open it, edit your filter and then go back to the visualization. So iterations are painful as well.

That's why CSV Rinse Repeat is just a way to shorten that gap. You have a panel for coding and a panel for visualization. It's not as good as Tableau Public for visualizing, but you have just enough so that you can code and see live what happens to your data. It's an open source tool, a single web app, so you just load it and then you can work with it. You have a standard JavaScript coding panel, you can import and export your CSV, and you have a basic preview of the CSV; it's very basic, but you can still use it. And you have a layout designed to get rid of the filter-vis gap, that is, you have the two panels at the same time, coding and visualizing.

So let me show that to you. Unfortunately I think I don't really have internet... ah yes, nice. Let's load it. This is the web page; you can look at it by yourself. You have examples, and the first example, movies, is the same one I have shown before. Is it readable? Not really. Can I do something better? Well, you can read it. I'm sorry, I'll sit down.

So you see here, basically you have a panel that you can resize when you need to, and you have this preview: you have three random rows that you can reroll, or you can choose more rows, and you have here 1,000 rows in your original CSV. Let's code something trivial, like output is just the input, and then you have exactly the same thing. And you can add some visualizations, like, I don't know, the top
words inside the movie titles. You see that the top words are "dead", "harry", "star", "life", whatever. Okay. What's interesting is to... let's do it again. What we can do is clean up the data. Here I'm writing a column named year, which holds the extraction of the year from the movie title. Then I'm cleaning the movie title by removing the space, the parentheses and the year, and I'm also removing the "divided by 10" from the rating. Let's forget about that for now. Then I have a clean CSV, and I could just use this as a quick way to clean my CSV and download it. In the interface you will see that the columns are clean; you can visualize the year here, whatever.

In the example you can also filter: you can clean your data and then you can filter it. For instance, here we are just focusing on movies which have a number in their title, which you could not do before, because you had the year, so you always had a number. And you can look at which kinds of films have a number. Okay, so this is a very trivial example, but I want to do something better.

So I have, from this morning, a very big CSV. It's tweets containing "Shakespeare" that we have been indexing for a while. So now it's slower, because the dataset is quite big: you have 170,000 tweets, so 170,000 rows, and in the columns you have a lot of different things, including the text of the tweet, the date, the user and more. Now let's imagine you want to see what's inside the data and you want to do a lot of iterations. The first thing you want to do is look at the time, and I know it's interesting. So I choose the daily volume, where the date comes from the created_at column, and then you see the evolution of the number of tweets every day since the beginning of the dataset. You see that there's a very big peak around the 23rd of April. So I propose to look at that peak. Let's write the
code. We have two dates. Dates are complicated to deal with, you know; look at the format here. But you have JavaScript, so it's quite easy: you define two dates in a normal format, JavaScript knows how to parse these dates, and you can just look for the tweets where the date of the tweet is after date one and before date two. Then you press Ctrl+Enter, it filters your data, and you can look at how many tweets remain: 48,000. Then you look at the peak. Let's look at the top words inside these tweets. Here they are, and you have things like: number 7 is "death", number 12 is "died", number 13 is "anniversary". So this peak corresponds to the 400th anniversary of the death of Shakespeare.

Now we'd like to know, for instance, what people were saying about Shakespeare before. We can just edit the filter: now I just want the tweets before the beginning of the peak, to look at whether they already talk about the death of Shakespeare. You now have the curve before the peak, and you can see that "death" is now number 20. So there is a shift here, and we can iterate over the filtering for as many research questions as we want.

For instance, there is a "Cervantes" thing here. Who is speaking about Cervantes? Maybe we can just filter on that. It is done this way: item, that's my tweet, dot text, dot search for "cervantes", and whether it's found. So I'm searching for the tweets before the date which have "Cervantes" in their text. If you don't know what it's about, you can just look at some of the tweets to get an idea. This is exploration, so it's about looking at things. For instance, you have these tweets, and I don't understand them, because most of them are in Spanish. So here I found a hypothesis during my exploration: maybe Cervantes is about Spanish, and we can test that.
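The filters described above (the peak window, the period before the peak, and the Cervantes search) can be sketched in plain JavaScript, together with a language count to check the Spanish hypothesis. This is only an illustration under assumptions: the sample tweets, the `lang` column name and the two boundary dates are invented here, and the tool's own conventions for reading the input and writing the output are not shown.

```javascript
// Hypothetical sample rows standing in for the real tweet dataset.
const tweets = [
  { created_at: "2016-04-10T09:00:00Z", text: "Reading Shakespeare tonight", lang: "en" },
  { created_at: "2016-04-15T14:00:00Z", text: "Cervantes y Shakespeare, dos genios", lang: "es" },
  { created_at: "2016-04-16T08:30:00Z", text: "Homenaje a Cervantes y Shakespeare", lang: "es" },
  { created_at: "2016-04-18T20:00:00Z", text: "Cervantes vs Shakespeare", lang: "en" },
  { created_at: "2016-04-23T12:00:00Z", text: "400th anniversary of Shakespeare's death", lang: "en" },
  { created_at: "2016-04-23T13:00:00Z", text: "Cervantes y Shakespeare, 400 años", lang: "es" },
  { created_at: "2016-05-02T08:15:00Z", text: "Shakespeare in the park", lang: "en" },
];

// Assumed boundaries of the peak; the talk does not give the exact dates.
const date1 = new Date("2016-04-20T00:00:00Z");
const date2 = new Date("2016-04-27T00:00:00Z");

// Tweets strictly after date1 and strictly before date2.
const isInPeak = (item) => {
  const d = new Date(item.created_at);
  return d > date1 && d < date2;
};

// Tweets before the beginning of the peak.
const isBeforePeak = (item) => new Date(item.created_at) < date1;

// String.prototype.search returns -1 when the pattern is not found.
const mentionsCervantes = (item) => item.text.search(/cervantes/i) >= 0;

const peak = tweets.filter(isInPeak);
const cervantesBefore = tweets.filter(
  (item) => isBeforePeak(item) && mentionsCervantes(item)
);

// Count rows per distinct value of a column, e.g. tweets per language.
function countBy(rows, column) {
  const counts = {};
  for (const row of rows) {
    counts[row[column]] = (counts[row[column]] || 0) + 1;
  }
  return counts;
}

console.log(peak.length);
console.log(countBy(cervantesBefore, "lang"));
```

Keeping each condition as a small named function makes it cheap to add or remove one condition between iterations, which is the kind of constant filter rewriting the talk is about.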
Let's look at the languages, because I have the languages in the tweets. Let's look at the main languages; the language is encoded here. Spanish has more than 2,000 tweets with Cervantes, and English is second with about ten times fewer. Let's say I remove the Cervantes condition from my filter, to see whether this difference holds for the whole dataset or not. Then you realize that English is actually seven or eight times more present than Spanish. So it's as if only the Spanish speakers are interested in the fact that Cervantes has a link with Shakespeare.

So what is the link? This is the final, very good story for today: they both died on the same day, which is supposedly the 23rd of April, except that for Shakespeare it's in the Julian calendar, so the real date is the 3rd of May. It's today. But what's weird is that today we don't have any peak. I expected to see a peak today as well, like for the year 2000, but it's not there. Just let me check that I'm not mistaken... yeah. So we are not tweeting, or maybe not yet; maybe later in the day we will see a peak about Shakespeare.

So these were my backup slides. Let's wrap up this presentation. The most important point to me is that exploration requires iterating, and the shorter the iteration process, the more you can try to match research questions with the features of your data. That's why CSV Rinse Repeat is about constantly rewriting your filters. It requires being able to code in JavaScript, but it's not a complicated language to learn, you have a lot of help if you want to learn it, and it's useful. The visualizations are basic, you can't do everything you want, and the preview is not comfortable. That's okay, because you have other tools to deal with that; here it's just about shortening the gap. So it's a simple tool, and once you have your hypothesis you go out of your exploration phase. Like, I know that there is something happening with Spanish and Cervantes, and I know I want to dig
into the Spanish tweets versus the English tweets. Then you can go into Tableau Software and, once you have your hypothesis, visualize exactly that aspect of the data and craft a better visualization with adapted tools, or do metrics, like what is the p-value of this, whatever. That's it: you just reach for another tool. Thank you for your attention. If you have some questions...

It's open source; it's on GitHub already. You can contribute, and I encourage you to, because adding a visualization is quite easy. The global structure is in Angular, so you have components called cards, and you just add cards; that adds visualization cards. The existing cards are commented in the code, so you can just look at them. The visualizations are just in D3, the standard for visualization in JavaScript. And you can post issues and so on. But I want to keep it about exploration. You have many people wanting to produce visualizations, and it's fine, but it's not the... I mean, if at some point one piece of software becomes the standard, that would be awesome, but I can't compete with Tableau, for instance, with my small workforce. So it's more reasonable to focus on one aspect. And CSV is a good vessel to transfer data from one place to another, so let's multiply different tools for different uses and think of that as a toolbox. I think it's better. We don't have a lot of energy in research; engineering time in research is scarce, so that's how we do it.

I also forgot to say that no data is uploaded, except maybe if you use the map card, which uses OpenLayers and may send some information. Here the CSV is about 100 megabytes and it's just local on your computer. No data is sent; you can just keep it at home. There is no server, it's just pure client-side JavaScript, so your data is safe. The limit depends on how many rows versus how many columns, but half a gigabyte still works, and I
think that the more efficient browsers become, the more the limit improves by itself, just because they get better with memory.

Are there any other tools that the Sciences Po Media Lab offers?

Oh yes, sure. Do I have internet? So on tools.medialab.sciences-po.fr you have a list of our tools. One of the other tools that is relevant to people who work with CSV and want to extract networks is Table2Net. It's another of my tools; all of them are on GitHub, open source. I don't have the time to present it, and it would require talking about networks, but basically you upload a CSV, I don't know, let's upload the same CSV, and then you choose which columns you want to use as nodes and edges to produce a network, download the network, and then visualize it in other software, like Gephi, and so on. So that's the toolbox; feel free to look at them. Yes?

You showed us exporting the filtered CSV. Can you also export the visualizations?

I did not finish implementing that in all of them, but in most of them. This one is just text, so you can copy-paste it. But for instance this one, which is quite complicated, I think you can download... yes, so you download the CSV. Once I have more time I will put that feature in all the visualizations as a standard. Other questions? Yes?

What kind of data formats does this system support for the data source?

That's just CSV.

Yeah, because we have some problems with CSV files. I mean, they have quirks.

I would love to have... because I work with data scientists and, as you said, they don't realize what they talk about, actually. So you get a file, and most of the time you have to clean it before making it compatible with that, like putting it in UTF-8 or whatever. I would love to have a small tool, and one day I will make it if no one else does, that says in very clear words whether the CSV is okay or has issues with, you know, new lines or separators, or, like, is your
CSV using a semicolon separator, for example because it comes from Excel. Just having feedback for the user, so that I can say: okay, first put your file in that tool and then tell me what it says, and if it's bad, then redo it, and if it's good, fine. It would be really useful, because most of the time you lose is not about writing that; it's about accompanying people, teaching them the procedures of the digital.

That's fine. Thank you very much.
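The small CSV-checking tool wished for in the last answer is described as not existing yet. Purely as a hypothetical sketch of what such feedback could look like, the following JavaScript guesses the separator from the header line and reports lines whose column count differs. The function names, the list of candidate separators and the wording of the messages are all assumptions, and the sketch does not handle line breaks inside quoted fields.

```javascript
// Count occurrences of a delimiter in a line, ignoring those inside double quotes.
function countOutsideQuotes(line, delimiter) {
  let inQuotes = false;
  let count = 0;
  for (const ch of line) {
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === delimiter && !inQuotes) count++;
  }
  return count;
}

// Hypothetical checker: picks the candidate separator that appears most often
// in the header line, then verifies every line has the same number of separators.
function checkCsv(text, candidates = [",", ";", "\t"]) {
  const lines = text.split(/\r\n|\n|\r/).filter((line) => line.length > 0);
  if (lines.length === 0) {
    return { ok: false, delimiter: null, issues: ["the file is empty"] };
  }

  let delimiter = candidates[0];
  let expected = -1;
  for (const candidate of candidates) {
    const n = countOutsideQuotes(lines[0], candidate);
    if (n > expected) {
      expected = n;
      delimiter = candidate;
    }
  }

  const issues = [];
  if (expected === 0) {
    issues.push("no known separator found in the header line");
  }
  lines.forEach((line, index) => {
    if (countOutsideQuotes(line, delimiter) !== expected) {
      issues.push(`line ${index + 1} does not have the same number of columns as the header`);
    }
  });

  return { ok: issues.length === 0, delimiter, issues };
}

console.log(checkCsv("title;year\nHamlet;1603\n"));
console.log(checkCsv("title,year\nHamlet,1603,extra\n"));
```

A report like this is meant to be read by the person who produced the file, so the messages name the line and the problem in plain words.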