Hello, I am Sébastien Rochette and I am going to talk about transforming scattered analyses into a documented, reproducible and shareable workflow. This is feedback from an event I took part in a few months ago, so I will explain what we did during this hackathon. I work at ThinkR as a data scientist, R expert and trainer; you can follow us if you want, everything is there. What I did not say is that the presentation is already on my GitHub, so if you want to follow along with the PDF, you can find it on my GitHub account.

This collaborative event was about the R software. I am not sure everyone knows R, but it is software for doing data science. It took place in Concarneau, in Brittany, France, which was a nice place, and it brought together different researchers in ecology. The aim of this event was to talk about R; we were all there because of this software. We had three days together, and the first part consisted of presentations on good practices with R: how to build packages, how to use Git to work together, and how to put all your packages and your work in a Docker container to share it in a reproducible way. Finally, you could also learn how to work with the Galaxy user interface, which is a graphical interface for sharing your research work with the world, if you want. So we went step by step from reproducible scripts to shareable packages, and then to sharing them through a UI, with Docker and Galaxy. What we learned was how to share your work efficiently.

The FAIR principles were created for data. We already heard about them a few hours ago from Matheus: Findable, Accessible, Interoperable and Reusable. They apply to data. Findable means that you can find where the data are.
Accessible means that once you have found the data, you know how you can access them. And if you need a password, that should be written somewhere. Interoperable means that you can use the data in different workflows, so the data format can be used in other software or something else. And Reusable is closely linked to that: you can use the data to combine them with other data, but you can also reproduce the data because you know how they were built. And if you want to add new data to do a meta-analysis, you have the clues about how to pick up the thread and add new data to keep working with these data. So these are the FAIR principles for data.

I think that to share the code and the software we have, we can use the same ideas as FAIR. I divided this into 5 categories, not only 4: accessibility, which is the same thing here, reproducibility, documentation, readability and communication. I will explain in the next slides what these 5 steps are. We tried to apply them in our group at the hackathon to go from code to sharing the work you do. For researchers who work with only two people in the research lab, there is only code. You have plenty of code on your computer, and how do you share this code with the community, and how can it be used in one way or another? So it is a big step. For that, I recommend that you do not stay stuck in front of this big step and that you try to go with friends, because the road is quite hard. So in this hackathon, we decided to work together on a common project. One of the researchers was there with a project called Vigie-Chiro.
It is a project that looks at the distribution of bats in France. We have data on the distribution and observation of bats. We have environmental data, temperature and so on. And with these data, we build models to try to model the distribution in France. So they have a workflow that they want to share and keep working on, and for that they have a lot of R scripts. We decided to work together with the different developers who were there to figure out how to share this kind of work and how to apply the good practices we had presented the day before.

Looking at these five steps, starting with accessibility, the work was already available on GitHub. The researchers had put all the scripts on GitHub. You have here part of the R scripts that were available at the root of the repository. Most of it is hidden in folders that we preferred to keep hidden because it was too much for us. But at least we had something to work on, and everyone could go to the repository to access these scripts and work together on this project. So the first point was already fine.

The main part of this work was perhaps the management of the project. That means how to teach people who had never used this kind of tool to work collaboratively, to go from these scripts to a usable piece of software. So, first, we needed to make the work accessible. Maybe you cannot see it, but here we drew up a big plan laying out all the functions, all the different parts of the project: what this script does, what that script does. We stuck them together. There is something that comes first, something that comes at the end, and what we have between these different steps.
So here we identified that between the different steps, we can have input data and output data that can be used for the next step. The researcher had to show us all the scripts and be able to draw a picture like this one of his work. It seemed easy for him to do, but at the end, when we asked him how it was for him, he told me: "I felt totally naked presenting all my work to strangers, to people I do not know. It was easy for me to share the code on GitHub because you do not know whether anyone will look at it, or whether they will look at it from the other side of the planet. So you do not worry. But here you have 10 people telling you: OK, tell me what you did and why you put it there."

It is very difficult, but I think the people who were there were really welcoming and friendly, which is typical of our community. A lot of work has been done in the R community to be able to welcome anyone, whatever your origin, whatever your gender. Maybe you have heard of R-Ladies; these are groups that have helped a lot with this, particularly in R. People are not only white men with a beard, you know. We are not only white men writing code. The R community is very inclusive in that respect, and I think that having this in mind also helped us to be friendly with this researcher, to share the work and to work together.

So we separated the work into small pieces and we opened issues on the GitHub repository to say: this is the part of the code that does that, and I would like you or somebody to work on it to make it nicer, or to work on this part.
We dispatched the different issues among one or two developers each, so that anyone could work on their own little part. Here the workflow was quite easy: as soon as we had some datasets available for the different parts, it was quite easy to say, you can work on this small part because you have some input data and some output data, and you know how to go from this point to the other point. So that is the dissection part. Then of course you have to manage a repository, and there is a lot to present about how you collaborate inside a GitHub repository, how you deal with pull requests and everything like this, but I took on that part this time.

So the recipe for this work, and for how to share it, is first to carefully peel back the code. We had to identify inside the code whether there were some user-specific pieces. I mean, if inside your code you say that this data is on my computer under C:, in My Documents, under my name, nobody will be able to use it again. So you have to go inside and find all the parts that could be removed or at least parameterised: put them at the top of your script, saying that in this part you let the user define where the data is on their own computer. You can also cut the big pieces of code into small pieces so that it is easier to maintain and to see the goal of the different parts. With R software, everything is about functions.
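As a minimal sketch of this first step, the user-specific path can move to a single setting at the top of the script, and the processing can become one small function. The file name, the column names (`site`, `count`) and the function `mean_count_by_site()` are hypothetical and are not taken from the actual project.

```r
# Hypothetical example: none of these names come from the actual project.

# User-specific setting, defined once at the top of the script.
# Each user points this at the data folder on their own computer.
data_dir <- tempdir()

# Create a tiny CSV so the sketch runs anywhere (it stands in for real data).
obs_file <- file.path(data_dir, "observations.csv")
write.csv(
  data.frame(site = c("A", "A", "B"), count = c(2, 4, 6)),
  obs_file,
  row.names = FALSE
)

# One small function: one input (a file path), one output (a data frame).
mean_count_by_site <- function(path) {
  obs <- read.csv(path)
  aggregate(count ~ site, data = obs, FUN = mean)
}

result <- mean_count_by_site(obs_file)
print(result)
```

Because the function only depends on its argument, anyone can run it on a small file of their own without editing the body of the script.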
So a function is: you take something as input and you have one thing as output, and the parameterisation is like the parameters of the function. So we had to turn the different scripts into functions, or smaller functions, and we used what we call a reprex, a reproducible example. This is the word we use in R. You have one function, so you have to show a small dataset that can go into it and show what the output is, and anyone who uses this small dataset with the same function should get the same output, without needing the big dataset that you use for the entire analysis. If you are comfortable enough, of course, you can add some unit tests on these functions, but maybe that can come later if you are just starting to learn how to build a function. So that is the first part of the recipe.

The second part is to document generously. We already spoke about notebooks. In R we have notebooks that we call vignettes. They are written with Markdown too; it is like the Jupyter notebook but for R, so R Markdown, and it allows you to mix plain-text documentation with R code that will be executed while the package is built. So it is also reproducible: if you give the same notebook to somebody else to compile, it should be self-contained, and anyone can reproduce the different examples in it. In this notebook you also put some reproducible examples, and because you already wrote reproducible examples for the functions, you can reuse them to show how the functions work and why they work this way. And in R, as soon as you build a function you have to document it too: say what the function does, what the parameters are, what you put in each parameter (is it numeric, is it text, whatever) and what dependencies are needed to use the function. So indeed, when we do that, we transform the loads of scripts into an R package, because the structure of having vignettes, of having documentation for functions, and
putting the functions in a specific directory inside your big directory is called a package, and the R community forces you to document this package completely. So from there we had built a small package, and of course each of us had to continue developing the different parts.

After that, you can add the right amount of readability. For me, R is not a minified language, it is not like JavaScript: you can put a lot of air inside, and when you put in some air you can breathe when you read your code. You have to think about future you, who in six months will have totally forgotten what you put inside your code, and if you can read your code like you read a book, it is easier to go back into it. There are in the R community some packages, like those in the tidyverse, that also provide functions that are readable, and the code can be read by anyone. I mean, even if you have never developed in R you can understand the code, because the names of the functions and the way you write it are readable by anyone, like an English text. Of course you need to speak English. So think about future you, and think also about the other developers who will help you, because if you want to share with the community, of course people will give you feedback on it.

The last part is to communicate abundantly about your work. There are many different ways to communicate about it. Social media: I am sure all of you have a smartphone, or almost. You can write a blog post about what you did, and indeed at the end of this hackathon I wrote a blog post. This one is in French, but the blog post was there to present what I present here today, so what we did to go from these small scripts to a shareable package. It was also a way for me to add a little more information for the developers, because at the end of these days they are left alone with their code and nobody is there to help them anymore. So at least I added some clues inside this blog post to
help them continue the work alone.

You can also build a website, and something that is interesting with R and R packages is this package called pkgdown. With pkgdown, with one line of code you can build a website from your R package. It uses all the different documentation parts you have put inside the package and shows them as web pages, and if you combine your GitHub repository with a CI, with continuous integration, this website is rebuilt each time you push a commit to GitHub. So this is the website that is built from the repository we built. We have the first page, which is the README that you have in the GitHub repository. You have a second page, the reference, which is the list of all the functions available in the package, with the documentation of each function and how to use it, with the examples inside. And you have the articles part, which is where you have the vignettes, so all the notebooks.

So the feedback I have from this experience is that for this kind of researcher, who had only worked alone in the lab, the mentoring on this project was a good start, because they would never have gone into this project alone. I mean, at first, when I presented what they could do, from the scripts to the package, and showed the website, the reaction was: "OK, but where do I start? I will never do that alone." That is why we decided to choose one of their projects and say: OK, let's do it together, let's see how it works and how you could do it.

As a researcher who presents the work, you have to accept the exposure. As I said, it is not easy to present your work to people you do not know, and you have to explain in front of them: yes, here I did this because I thought it was interesting, but yes, you have another way of doing it, so let's go. You also need a welcoming community, because you cannot say: "I do not code the same way you code, so let's break everything you did and I will recode it my way", because then the
researcher will be alone in front of this script and will have to keep maintaining it. So as helpers you also have to be welcoming, to accept that you can help, but help in a way that the other researchers understand. This is important in this kind of group.

At the end, the researcher will change his practices, and this is for good, because as soon as you know the process and you have done it together, you can apply it every time you write new code. You say: OK, when I write my code I have to think that I will do it this way, because the documentation is important and at the end I would like this nice website to share with everybody. So this is important. But the thing that was almost missing is the follow-up, because I only wrote this small blog post and went back home. I continue to talk with them by email, but it is not the same as being there and helping them on their own code. So maybe we can find some other way to collaborate.

So transforming scattered analyses into a shareable workflow is accessible to people with a little help to start, and as soon as you know how it works you can do it alone and continue as you wish. And I would recommend to anyone, whether it is for R or any other language: start with the documentation, because you have the thing in mind, you know what this function, what this part will do, so write it down, do not keep it in your mind. Once it is written you can say: yes, my function is written and I can share it directly, because I already wrote the documentation. So it will be easier for you, for your future. Thank you very much.

So, two questions which are maybe linked. Can you say a few more words about reprex, which I do not know? And the following question: in your guidelines for reproducibility you do not mention test-driven documentation or development.

So the first question, about what a reprex is: it is a contraction of "reproducible example". There is a package in R called reprex which helps you make reproducible examples. So the reproducible example is that you
have a dataset and you fit, I don't know, a model on it, and you get some outputs, but the dataset is very big and you need to spend two hours to get the outputs. If you want somebody to help you debug this function, you will not give them the big dataset. So the reproducible example is the smallest example you can prepare so that people can reproduce the output from the small input, and can also reproduce the errors and the warnings that you have, so that they can help you: yes, I can reproduce it in a few seconds; give me this small piece of code and then I can help you debug it. That is a reproducible example.

I did not speak about test-driven development. I think test-driven development can be problematic; you have to be very careful about it. You can think: OK, before writing my function I write what the output should look like. This part is interesting. But if you put in a lot of tests, you will write your function so that the tests pass, but maybe nothing more. I am not sure I can really explain it in a few words, but it is quite dangerous to do that. It is like going to school and saying: I would like my students to be very good at addition because they will be tested on addition. So you do not teach multiplication and you do not teach division, you spend a lot of time on addition. At the end, yes, they can add, but you did not think about the rest. Test-driven development can be perilous.

Yes, you say that test-driven development is an iterative process. When I imagine test-driven development, it is like you write the tests you want to have, and then you write the function so that it succeeds on these tests. But the iterative process of course exists, because you use the function and at some point you will have some new datasets and the function will not work. So you say: OK, this should work. So you write the test, like "I would like the output of this process to work", and then you add the new test to your code, so you
correct the code, of course, and the next time you use this function you make sure that it will work. So the tests are here to verify that when you change the version of your code, when you modify different parts in other places of your software, the tests still pass, because you verify that all the different kinds of data you used continue to work. This is the iterative part, but for me it is not what I call test-driven development. Maybe it is just a semantic thing. Thank you.
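The iterative process described in this last answer can be sketched as follows. This is only an illustration under stated assumptions: the function `mean_count()` and its data are hypothetical, and base R `stopifnot()` stands in for a testing package such as testthat so that the example runs without any dependency.

```r
# Hypothetical function and data, for illustration only.

# Corrected version of the function. The assumed first version called
# mean(counts) directly and returned NA as soon as a new dataset
# contained a missing value.
mean_count <- function(counts) {
  mean(counts, na.rm = TRUE)
}

# Test written for the original data: it must keep passing after the fix.
stopifnot(mean_count(c(2, 4, 6)) == 4)

# Test added when the new dataset revealed the problem. It stays in the
# code so that later changes cannot silently break this case again.
stopifnot(mean_count(c(2, NA, 6)) == 4)
```

Each new kind of data that once broke the function leaves a test behind, so the set of tests grows with the history of the analysis.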