Hello everybody, it's a pleasure to be here today. I'm Roberto Di Cosmo, and this is Stefano Zacchiroli. You probably know us from our long history in free software, but we are here today extremely excited to tell you about a new project that is basically taking all of our energy. It is called Software Heritage, and basically we are building the Library of Alexandria of source code. But before going into the details, let me give you the motivation. If we look around us, we see software everywhere. It powers the digital transformation, it drives our innovation, it is an essential component of modern research, it is an essential mediator for accessing digital information in any form. If you think about it, we use it everywhere: communication, entertainment, politics, the way our societies are organized depends on software. So, in some sense, software embodies a significant part of our knowledge and of our cultural heritage. But, as we know very well in our community, when we talk about software, there is one thing that really is the knowledge, and that thing is the source code. The source code of software is a very special object in human history. If you think about it, it is executable human knowledge, and it is the first object of this kind in human history. Harold Abelson, a fantastic scholar who wrote a book that I used when I was young at university, and who is also one of the founders of Creative Commons,
used to say that computer programs should be written in the first place for people to read, and only incidentally for machines to execute. That was a recommendation for students, of course. And now, if you look at the huge amount of software that is around, you have access to pieces of code that are really beautiful. For example, look at this source code that was released as free software many years ago. Well, it's not very readable here on the slide, so I suggest you go and read it on the internet: there are fantastic pieces of hackery about how to make 3D animation work on machines that were not powerful. There are a great number of tricks in the Linux kernel, for example; to understand what is really going on, you have to go and read the source code and read the comments that are in there. As the founder of the Computer History Museum said, having access to the source code of our software is a real open door into the mind of the designer. Think about this a little more: this is really about our free software, really about the software commons. But what are the commons? The commons are basically resources that belong to nobody in particular, but that everybody can use freely. We use this notion in economics: air, water, things like that. And you can think of the software commons in the same way: this is the software that everybody can use and adapt, and basically all free and open source software is part of the software commons. But, unfortunately, as you know from economics, when you have a commons you risk what is called the tragedy of the commons: if you do not take care of your commons, if nobody feels responsible for it, you end up destroying it.
We know this, and that is the reason why, as a community, we are very active in trying to maintain our software, making sure there is a community taking care of it and developing it, and that is very good. But if you look at our source code as a precious part of our knowledge, are we really doing everything necessary to make sure it is preserved over time? Well, not really. There are a few problems we have to face. One thing is that, of course, software is everywhere, which is positive, but software is also scattered, because we are using many, many different platforms to develop software. We are strongly opinionated people, so you use your favourite development platform, and everybody uses their own. In this cloud I have put some names that will ring a bell for people like me, who used platforms that have more or less disappeared today, and other names that will ring a bell for modern people, who use very modern platforms that are very useful, and who knows where they will be in ten years. But not only do we have many development platforms, we also have many places where we distribute the resulting software: some is distributed from the development platforms themselves, some through archives, and many, many other places. And to make things worse, over time projects move from one place to another. If you are well organized, when you move you close the old repository and put a pointer to the new one, but not everybody does that. So it is not easy to follow the evolution of software. In a word, what we are missing is a place where you can have a global view of all the software that is available. What we need is a place for all the software, no matter where and how it has been developed, and no matter where and how it has been distributed. So that is issue number one. But then there is issue number two, which is that our software is actually fragile.
Maybe we don't notice it, maybe many people did not notice what happened a few years ago, but you can have accidents: your server can break, you can have an earthquake, you can have a fire. You can make mistakes. Who here has never run rm -rf in the wrong directory? You can raise your hand if you never did. Right, OK, this kind of thing. This is well known. But more recently we have seen a different kind of problem, which is malicious attacks. For example, somebody breaks into your system and removes everything unless you pay 150,000 euros in bitcoins to some specific account. This is really fascinating. But more recently still we have discovered that there is an even greater threat, which is basically business decisions. In 2015, when Gitorious shut down, a hundred thousand repositories had to find a new home. When Google Code was closed, more or less at the same time, more than a million and a half repositories had to migrate. So this is new; we were not used to this situation. So what we are really, really missing today is an archive, a real archive with a preservation mission. It is a place where you can go if a repository goes away, if the repository was on Gitorious or Google Code, or if the platform you use for development goes away or loses its backups. And the second issue is that, if you look at the software we are developing, there is now a huge amount of software available for everybody to study, to understand, to look at. There are many things to do: automated analysis, provenance analysis, vulnerability detection, and things like that. Well, our friends in physics know very well that for these kinds of grand challenges you need serious instruments to look at these kinds of objects. Maybe we should copy what people do in physics: they build a big telescope to look at the stars; we need a big telescope for our source code, to study it and understand how it evolves.
So these are just three of the problems we are facing, and we need to take action. To take action, Stefano and I started, quite some time ago by now, a project called Software Heritage, with exactly this mission. Our mission is not to become rich, it is not to become famous; our mission is to collect, organize, preserve and share all the source code that is publicly available. And we want to do it because it is important to preserve the source code of the past, of course, but it is even more important to have a single place where you can study and understand the software we are building today, to prepare for a better future. Now, how do you do something like this? That was the motivation; let's look at what these guys actually do. But first of all, sit down and try to think about the best way of organizing a project like this. One idea is to take a bit of the Unix philosophy: do one thing, and do it well. So what we want is a common platform that only cares about the source code, and then, on top of this, we want to enable other people to develop a wealth of applications that can be related to cultural heritage, to industry, to research, to education. Just to give a couple of examples, think about what you can do if you have a universal reference catalog of all the software that has been built, with all the history of its development: the kind of analysis, the kind of studies you can do; the new ways of building software that you can develop. For research, think of having a single place where research software can actually be deposited and reused by other people who care about reproducibility; for people doing research on software, a single place where, in a uniform way, everything is available to perform your research in a repeatable way. And there are many, many other applications. We don't want to do it all ourselves; we want to enable it. I mean, it's an infrastructure for doing all this. But that's not an easy task: how can you sustain such a project, how can you organize
such a project? So we have two basic principles. First of all, of course, coming from the community which is ours, it's evident: we want to have everything as open source software. 100% of the development of the infrastructure we are building is available as free and open source software. But we also highly value transparency: we do not believe we are the smartest people in the world and that any decision we take will be the best one; we want other people coming and working with us to find the best way of making progress. And on the other side, we are really here for the long haul. I mean, it will take time to build such an infrastructure, it will take time to make it sustainable, and the only way for this project to last over time is to apply the basic principle of reliability engineering, which is replication. So, replication at all levels: we should not just keep a copy of our data ourselves; we should create a network of mirrors that maintain copies in other places, in other countries, under different legislations, possibly using different technologies, to make sure the archive survives. But it also means replication in terms of contribution: not just a single company or startup running the show, but a nonprofit, multi-stakeholder organization where many people, each with their own reasons for being there, contribute to make it viable. And you see, we have people from cultural heritage, from industry, from research, from education; bringing around the table many people coming from different places for different reasons means they will never all go away at the same time. I never again want to see a single manager who says "what is this project I have here? let's scrap it", after which you have to move to yet another repository somewhere else. So, all of this is nice and fine; it is the high-level vision of what we are doing. Now, I think you are curious about what we are doing technically inside, how far we went: is this just slideware and vaporware, or is this actually
software and hardware running? And here enters Stefano, who will give you a very nice view of what is going on. So, let's dive into the specifics of what we are doing and how we are doing it. First of all, what do we actually archive, and what is in the scope of our archival mission? We go after places where we know we will find source code, and specifically publicly available source code. So what we go after, practically, are publicly available version control systems, as well as source code distribution artifacts such as tarballs or distribution source packages. And when we find those places, what we retrieve from them are the actual file contents, also known as blobs, which we call content objects. We retrieve commits: we think that source code itself is very important, but the history of its development is important as well, as the knowledge that tells you why specific changes have been made to a specific piece of code. So we archive all the commits that we find, which we call revisions, with all their metadata: commit messages, author information, information about which were the previous commits, and so on and so forth. We archive releases, that is, specially annotated commits that the authors consider to be worth a specific software release, with a given name and a given version number. And, for all this information, we archive where we found it, which we call software origins, and when we found it. So, for each object we add to our archive, we note down that we have seen it in that specific repository, or in that specific distribution package, at that specific time. All this is stored in a canonical model, which I'll describe in a bit more detail shortly, and which is independent of the specific version control system technology or packaging representation: different VCSs, different distribution packages, all stored in the same model. What we do not archive, for now at least, trying to focus on doing one thing well: we don't store home pages, websites, or project wikis,
we don't store bug tracking information, issues, or the history of code reviews, for instance, and we don't store mailing lists. So for now the idea is to focus on a single mission. Our idea is that we should basically play one role in a more general "semantic Wikipedia of software", where the information we store will be linked to information stored by other people, by other archiving initiatives, by saying: at the time this source code was archived, the website of the corresponding project looked like this. And you can imagine the same for issues, for bugs, for code reviews, and so on and so forth. So this is what we actually archive. And how does it work? It's a pretty standard, search-engine-like architecture, implemented as a two-tier process. On the left here you have all the places where source code can be found: you can imagine the major centralized hosting platforms, you can imagine specific instances of hosting platforms that you can deploy yourself on premise, you can imagine distributions with their source code packages, and there are the specific repositories that exist behind language-specific package managers. All those conceptual places are processed by components that we call listers, and what they do is go after a specific forge and do either a complete, full listing of all the repositories available there, or incremental listings, asking something like "which new repositories have been created since the last time?". The results of those listers are stored in our database as points which we call software origins, and each point is actually a single repository or a single package. So you can imagine that for every repository on GitHub or Bitbucket we have created a data point with a canonical URL from which we can retrieve code, and we keep that in mind for future processing by the second tier of this process, which happens periodically. What we develop, in terms of code, is a specific lister for every specific hosting place: so we have a lister for GitHub, one for
Bitbucket, one for GitLab, and for GitLab this lister will list the different GitLab instances available out there. So this is step one. It is for now a pull model, so we periodically redo it, but in theory it could also be push, if we receive notifications from partnering hosting platforms about new events. And step two is the actual loading: we go after every specific software origin we have stored, and we use a loader component to retrieve all the content that is available from that repository. So we retrieve all the commits, all the revisions, all the blobs, and we add all that to our archive, deduplicating everything globally. So when we find a new commit in a specific repository, we check whether we have already archived that commit, for instance from another repository, and we add it to the archive only if this is the first time we have seen it. This model, the canonical model in which we store everything, where everything is stored only once, deduplicated, is implemented as two conceptual parts: a Merkle DAG and a blob storage. For those of you who might not be familiar with what a Merkle structure is: it is a data structure, a tree or a DAG, which is essentially hashed, so each node in the structure has an identifier which is not chosen by you when you add stuff, but is intrinsic: the identifier of a given node is computed from the content of the node itself and from the identifiers of the nodes pointed to by that node. It is a very specific, very common cryptographic construction that you find in a lot of technology these days: you find it at the basis of Git, you find it in blockchains, you find it in IPFS; it's very popular. And it has some very nice properties: for instance, it allows you to compare two huge structures very quickly and see whether they differ only in some specific places, because if the root identifiers of two huge graphs, of two huge trees, are the same, you know that the entire contents of the trees are the same, assuming that the data structure is consistent. And also,
of course, deduplication is built in, because when you add something it has a predetermined identifier based on its content, and so, if the node is already there, you just don't store an additional copy of it. We use this in the DAG variant because in commit histories you can actually have non-tree substructures. As an example of the kind of nodes we have in our archive, this is a revision node; it's a commit which was done, actually, on our own forge by one of our fellow engineers, and it carries all the information of the commit: the tree, that is the directory that existed at the time it was committed, the parent commit, the author, the committer (which might differ, if someone else pushes code you have committed yourself), and the commit message. All of this is used to compute the ID of the node, and that ID is the ID we have in our archive for this specific node. In passing: our IDs, for now, are compatible with Git itself. It's not a long-term design decision, but given the popularity of Git it's a good property to have today: if you have object IDs coming from Git, you can use them directly to check whether something is present in our archive or not. So, that was a specific node. The conceptual structure of our archive as a whole is then a huge graph in which you have blobs, all the contents of files, without their names, identified just by the hash of their content; you have directories, pointing to blobs, and these are the directory structures; you have commits, so the revisions, the commit history, usually chained together in a long history going from the oldest commit to the most recent one, with branches and merges in there; you have releases, which are the specially annotated commits declared by the developer as important commits corresponding to software releases; and you have snapshots. Snapshots are essentially the pictures we take of a given repository when we visit it, noting down where each branch and each tag was pointing at the time of the visit.
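The intrinsic, Git-compatible identifiers just described are easy to reproduce. Here is a minimal sketch of the idea; the blob hashing below matches Git's actual blob object ID, while the directory hashing is a simplified, hypothetical encoding (the archive's real serialization is not shown in the talk), just to illustrate how IDs derived from content and child IDs give you deduplication for free:

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Git-compatible intrinsic identifier for a file content (a 'blob')."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def directory_id(entries) -> str:
    """Toy intrinsic identifier for a directory node: a hash over its sorted
    (name, child-id) pairs. Real Git uses a binary 'tree' encoding, but the
    Merkle principle is the same: the ID depends on the children's IDs."""
    payload = "".join("%s:%s\n" % (name, oid) for name, oid in sorted(entries.items()))
    return hashlib.sha1(payload.encode()).hexdigest()

class DedupStore:
    """Content-addressed store: adding the same object twice stores it once."""
    def __init__(self):
        self.objects = {}
    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        if oid not in self.objects:   # deduplication comes for free
            self.objects[oid] = content
        return oid

store = DedupStore()
a = store.add(b"hello world\n")
b = store.add(b"hello world\n")   # same content, same ID, stored only once
assert a == b and len(store.objects) == 1
print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The printed value is exactly what `git hash-object` reports for the same bytes, which is what makes Git object IDs directly usable to query the archive.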
This is the conceptual view. In terms of actual technology, it is implemented as a very simple object storage, which we have developed in-house; it's all Python 3, and the reason we developed one in-house, for now, is that we don't want to impose any specific technology on people who might want to mirror our archive, but there is no reason why there couldn't be backends for your favourite distributed object storage. And everything else, essentially the graph structure without the final payload of the files, is implemented as a Postgres database. It's been working very well for us up to now, and there is no reason why it could not migrate in the future to different structures that are well suited to storing graphs, though this one is pretty big, so we are skeptical that many technologies for storing smaller graphs would be applicable here. So, that was the conceptual view of the archive; what is actually in there today? This is implemented, and it has been mirroring a lot of source code and source-code-related artifacts in the past months. In the archive you will find a full mirror of the public part of GitHub, which is kept up to date periodically. You will find all the history of source packages uploaded to Debian between 2005 and 2015, which we have injected as a one-off dry run, but which we will put in production, with periodic updates, in the next few months. Similarly, we have an ingestion of all the tarballs released by the GNU project up to August 2015, again a one-off run which we will soon be running regularly in production. And we have locally retrieved a full copy of all the repositories from Gitorious, thanks to a collaboration with the Archive Team, and a full copy of all the repositories that were available on Google Code, thanks to a collaboration with Google themselves. These are not yet fully injected into the archive itself; we have local copies and we are in the process of injecting them. For instance, most of the Subversion repositories we retrieved from Google Code have
already been ingested into our archive, and the rest will follow in the next few months. In terms of files, all the numbers you see here are unique objects, because we deduplicate everything: we have 3 billion different unique files, and I mean unique file contents; with that we have something like 700 million commits, and all of this is coming from a bit more than 50 million software origins. You can think of a software origin, for now, as either a distribution package or a Git or SVN repository that we clone and retrieve code from. On disk this is something like 150 terabytes for the blob storage, which of course completely dominates the disk occupation, and a 6 terabyte database for the graph. If you think of it as a graph, it's a pretty significant structure: about 5 billion nodes and 50 billion edges among the nodes, so a pretty significant graph in itself. We have reason to believe this is already the richest source code archive in existence, and as you can see from the chart, it's growing daily. So, what can you do with all that? Starting today, as a token of gratitude for all the communities being here at FOSDEM and for all you do for free software, we are opening up our public API. You can find all the details about our public API at that URL. I should apologize in advance: we are not yet reachable via IPv6. We are in discussion with our institutional network provider to give us IPv6 blocks, so this will hopefully be fixed in the next few weeks; please bear with us until then, and use the legacy FOSDEM network here if you want to access it. What can you do with that API? You can essentially browse the content of the archive, point-wise, as a graph structure: you can jump from one object to another following the links in the graph. So, for instance, you can jump from releases to the pointed revisions, to the pointed directories, down to the file contents, and you have full access to the entire metadata of any object you are visiting. Additionally, you have access to all the crawling
information, so you can do queries like: when did you last visit this specific Git repository that I care about? Or: given a specific visit you made to a Git repository, where were all its branches pointing at the time? So it's a sort of wayback machine for your Git repositories and your source code releases. You have a full endpoint index available and linked from the main API documentation, but just to give you an overview, here is a full story you can follow with our API. Let's assume you are a developer of Hy, which is a Lispy dialect implemented in Python. You can ask: what is the origin identifier for that repository, given the canonical URL of the repository we cloned from? So you have this endpoint here, origin: you give it the type of the origin, you give it the clone URL, and we answer with the ID of the origin; in this case it is origin ID number 1. Equipped with the origin number, you can jump to the list of visits, and we return you a list: visit number 13 was done at this timestamp, a few months ago, I believe it is something like September 2016, and it gives you the URL where you can find more details about that specific visit. You jump to that, and you find a huge mapping of all the branches and tags, or refs, in Git terminology, and where they were pointing at the time. So, for instance, the master branch was pointing to this target commit, which is actually a revision, and you get the URL to learn more about that object; a different tag, the release 0.10.0, was pointing to a release object, and you get the URL of that release object. So you drill down: for instance, I followed this release and found the commit it was pointing to, and you arrive at this revision, which tells you all the information about that specific commit. So: that commit was done by our friend Paul, and it tells you the date of the commit and the date of the committer, which might differ in general.
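The origin, visits and snapshot walk just described can be scripted against the public API. This is a minimal sketch only: the base URL, the exact endpoint paths, the JSON field names and the Hy clone URL below are assumptions based on what the talk describes, not a documented contract, and the 120 requests/hour rate limit applies:

```python
import json
from urllib.request import urlopen

API = "https://archive.softwareheritage.org/api/1"  # assumed base URL

def origin_lookup_url(vcs_type, clone_url):
    # Look up an origin by type and canonical clone URL, as described in
    # the talk ("you give it the type of origin, you give it the clone URL").
    return "%s/origin/%s/url/%s/" % (API, vcs_type, clone_url)

def get_json(url):
    # One API call; rate limits (currently ~120 requests/hour) apply.
    with urlopen(url) as resp:
        return json.load(resp)

# The walk described above, e.g. for the Hy repository (not executed here;
# field names such as 'id' and 'snapshot_url' are illustrative guesses):
#   origin   = get_json(origin_lookup_url("git", "https://github.com/hylang/hy"))
#   visits   = get_json("%s/origin/%d/visits/" % (API, origin["id"]))
#   snapshot = get_json(visits[0]["snapshot_url"])  # refs -> revisions/releases
```

From the snapshot you would then follow the returned URLs down to revisions, directories and content objects, exactly as in the story above.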
It also points you to the directory, the root directory associated with that commit, it gives you the commit message for that specific revision, and you can continue, following the directory structure down to specific content objects. When you arrive at a content object, you have the information about all the checksums we compute for the content, which are the SHA-1, the SHA-1 Git, and the SHA-256, and you get additional information, which is still work in progress, like detected properties: the file type, the language, the license. And you also get a link to the content of the file itself, for download. A couple of caveats apply here: in addition to not having IPv6, there are rate limits that apply throughout the API, and they are pretty severe for now; you can do 120 requests per hour, and blob download, so the actual access to the file contents themselves, is not available yet. This is because we are focusing our resources on developing and adding new features to the archive, rather than putting energy into scaling up our infrastructure to offer this as a public service for everyone. But we are open to help, in case you have resources to offer to actually lift those restrictions. Zooming out a little bit: the specific thing we are releasing today is navigation of the archive via an API. In the past we had already released a very simple feature, accessible from our main website, which allows you to just drag and drop tarballs, or a set of files, to see which of those files we have already archived. And in the future we will be working on a proper web UI, the equivalent of the API for browser users; we will look, of course, into allowing you to download code as tarballs or as Git bundles, for instance, which will hopefully be available in the next few months, and into adding provenance information, telling you all the places where we have seen a specific object or a specific commit. And, of course, full-text search.
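The three checksums reported for a content object can be recomputed locally. In this sketch, "SHA-1 Git" is taken to mean Git's blob hash, a SHA-1 over a "blob <length>\0" header followed by the data, which is what makes the archive's identifiers Git-compatible; that reading is an assumption based on the talk:

```python
import hashlib

def content_checksums(data: bytes) -> dict:
    """The three identifiers the archive reports for a content object."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        # Git's blob hash: SHA-1 over a "blob <len>\0" header plus the data,
        # so it matches the object ID Git itself would assign to this file.
        "sha1_git": hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

cks = content_checksums(b"hello world\n")
assert cks["sha1"] != cks["sha1_git"]  # plain SHA-1 differs from Git's variant
```

Having all three means you can look up a file in the archive no matter which identifier you happen to have at hand.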
Once you have this archive, it is super interesting to be able to do full-text search over all of it, and much more; really, the sky is the limit of what you can imagine. And you can help. Most of the people here are free software enthusiasts and coders, and this is run as a standard free software project: we have our own forge, we have development documentation, we have a mailing list, we have an IRC channel, and all our code is available as free software on our forge, which is a Phabricator instance, which we love, in case you are interested in contributing. These are some of our top development priorities: of course, we need to add more listers, since there are forges out there that we don't track yet, and each type of forge needs a specific lister; we need loaders for VCSs and package formats that we don't support yet; we need the web UI itself, for which we have a prototype, but it's not ready to be released as a product yet; and of course there is ample opportunity for working on indexing the content that is in the archive, with whatever indexer. But all contributions are welcome: just follow our development channels, show up on our forge, talk with us; there is definitely something you can do that will help us in our mission. And of course you can also join us: we have opportunities for students, both on development topics and on research topics, and we have an open position right now for people who want to work on our web API, our web UI, sorry, at the address that you see on the slide. So, thanks a lot, Stefano, for this in-depth presentation of what we are doing. Back to the old guy who brings, again, the institutional point of view. So, you see, there is interesting technology going on here, and it's an extremely exciting technical project; doing this is really not easy, it's a lot of work, and here you see a picture of four people, Roberto, Stefano, Nicolas, Antoine, and these four guys have been working like crazy for
the last two years, I would say, or less, to try to bring together this first, initial part of the project. Clearly, we do not pretend to do this alone; we will never succeed doing this alone. So what is very important for us is to make sure there is a community around it. The first step in building this community was to get support from a national research institution, which is Inria, in France. I don't know how many of you know it, but it's a fantastic place: a research center fully dedicated to computer science, which has already shown in the past that it has a strong culture in free and open source software, with a lot of contributions coming from there, and which is also willing, and this is special, to help create new institutions: they were one of the founding partners of the W3C, over 20 years ago, and they are ready to do this again. They have already put on the table a lot of resources and energy: we are in their office space, we are using their infrastructure, we are using their services for a lot of things, and this gives us security for the next few years. But it's not enough; we need to bring many more people around. So we have been talking to partners around the world, telling them what we are doing, getting mindshare, getting support. All the people in this field are actually supporting the project; I'm sure you will recognize their names, and the ones in red here, in the first lines, are the people who are actually providing real resources. These resources can be money, which we need to hire people, to offer you a job, for example; they can be infrastructure; they can be connections with the research networks, et cetera. So you discover some usual suspects from the IT industry, but you also have people from research organizations and from other places in society, like banks, for example. Please, if you have connections with these people and they can come and help: it's very important to have them around the table, as we have said. So, to conclude: we
do hope we managed to convince you that Software Heritage is a very exciting new project with a clear mission, not just another project, but something which is really at the service of our community and of society. We are very open to collaboration, it's at the core of our values, and we try to do everything possible to make it easy for other people to come. You have connection points to the project that can be institutional, if you want to give money or resources, and others that are more traditionally community-based, if you want to collaborate with us, if you want to provide time, brains, knowledge. It is really open to collaboration. You know, it's not easy; creating a community around a project takes time. But, as I always say, it's not just the destination, it's the journey that is important, and so I believe we will see more people coming and trying to work together on this project, which is not just our project: it is our project as a community. So, I think we stop here, and we thank you all very much for your attention. I have a bunch of questions. The first one is about software that was written back in the day that wasn't open source, where the company that wrote the software died; I'm thinking about something like VisiCalc: they wrote the first spreadsheet, and you don't have the source code. Are you open to doing the kind of thing the archive.org people are doing, which is reverse engineering these things to get the source code back? Or do you have issues with the legal part of it, and consider these pieces of software dead and long gone? That's my first question. And my second question is: you have 150 terabytes of data; how do you maintain that over the long term?
Take NASA, for instance: for their image system they have cartridges, and it takes them seven years to copy the complete archive to a new system; by the time they finish copying, the new system is obsolete already. What's your plan for that?

Thank you, two very interesting questions. One thing is: is the archive open only to free and open source software, or also to other source code, and what about software for which we do not have the source code? Again, Unix philosophy: we focus exactly and only on collecting, storing and preserving source code. So it's not our business to do reverse engineering; but if you do reverse engineering and you have the legal right to deposit what comes out, that's okay, we'll take it. (He said that there is a part of archive.org doing reverse engineering and posting the result of it; I'm repeating the question because you didn't have a mic.) Again, the point is to provide an infrastructure that can store this kind of source code, and all initiatives that can provide new content are welcome, but we are bound by legal issues, as you said, so we can only accept software that we are allowed to copy. Of course you will find in the archive software which is not free software; our basic requirement is that we can copy it, and that somebody can then do something useful with it. The minimal useful thing you can do with software, as you see very well in source code, is to read it; even just reading it, that's enough for us. That was question number one. Question number two is the long-term reliability of all this. Again, as I said, the destination is important, but we are in the middle of the journey, we are not there yet. For now the size of the archive is not so impressive after all; compared to the movies of cats which are uploaded by my kids now and then, it's not YouTube size: it's big, but it's not YouTube size. And so there is this issue of really, really long-term preservation. I do believe much more in a network of
mirrors, with these mirrors using different technologies. But one thing we are doing is talking to libraries, for example; these libraries have a mission to archive digital content, and they have a structured process to actually make copies to tapes and make sure these tapes are migrated when the technology changes, and so on. Again, not our main mission right here, but collaboration with other entities that try to do this; we are starting to make these connections.

I have a question here, can you raise your hand? Yes, I'm standing up here. It's a two-fold question. It's related to the right to be forgotten, and also to the problems you can create when people don't want their code to be public anymore. Like what happened a few months ago: some community somewhere, in Asia, India, I don't know, decided that they weren't happy with a site, and they attacked the site and put it down. So do you have any thoughts about people not wanting it to be free anymore, or about the fact that it is free offending someone or some community? I don't know, something in the sense of a way out of your solution.

Right, so of course we will abide by the laws and regulations that apply where the software is hosted, which right now is France, and in Europe I believe we have the strictest regulation on the right to be forgotten and whatnot. We already have a legal information page on our website, and if there is any applicable law, either for what you mention or for copyright infringement, we will just hide the content, so the content will no longer be available for download. We will keep a copy of it if the law allows us to do so; so for instance, if the rights expire one day, we will be able to distribute it again. That doesn't apply to your case, of course, but might apply under other laws and regulations.

So, I made a couple of notes about your data model. They're details, but you're going to have to change the data model occasionally; you're obviously going to want to
change to a different hash function. Do you have a transition plan for how to upgrade from old to new, with more data in the commit objects or whatever?

Right, so we are already resistant to SHA-1 collisions because, although I didn't show the details, for blobs we are computing three kinds of hashes and we have unique indexes on each of them, so we will spot it if there is a SHA-1 collision. But of course it will evolve, and the idea is that we will have versioning on the scheme we use for our identifiers. For now SHA-1 is doing well; there are no well-known SHA-1 collisions in the wild, but they will arrive at some point, and the idea is that we will have new releases of our identifier scheme and, you know, switch to a bigger or better hash function. As a fun technological fact, the only reason we don't have SHA-3 yet is that when we started there wasn't yet a decent SHA-3 implementation in the version of the Python standard library our stable releases are using, but we are ready to actually add that into the mix as well, and maybe, you know, adopt it as the main identifier. For the time being, we are waiting for the first SHA-1 collision, so we can publish a paper as soon as possible in a security conference: "we found it!"

I have two questions. The first one is: if I am a user of this archive, how do I find the better source code in there? Will there be some ranking system, or some distinction that I can take as a sign to find the better source code that I am actually interested in? And the second question is: how do you make money, how will you make money to support all this infrastructure and your work? Will there be donations, like Wikipedia, or something else? Thank you.

Two questions. Your first question points to a potentially very interesting use case: now that you have this fantastic place where all the source code is available in a uniform way (let me stress it again, the value added here is that you have a uniform view of the source code, no
matter where it has been developed, and so on), "where could I find a piece of code" is a potential application that can be built on top of the archive. It is not our mission to build this application, but it is our mission to make sure you have everything necessary to build it, and there are people doing work on this; if you are interested, I can give you pointers to research work in that particular direction. The challenge there is to scale: the scale is much bigger than what you usually see.

Time is up, so very quickly: the last question had a second part I should answer, how we find the money. You are right, we all need to eat at the end of the month. But you see, our strategy is replication, if you remember: we are building something which I believe is valuable to many, many different parties, so the idea is to have a common infrastructure where many different stakeholders put in energy to keep it going. It is not just for industry, not just for science, not just for education, not just for culture: it is for everything. So we are trying to talk to everybody, and that should be sufficient to make it work. So, thanks.

Ok, last question. My last question would be: when you find source code in this archive, do you give options for historical annotation, some commenting ability, maybe in the future? Because what you do is a historical view on source code, so maybe it would be an idea to also allow an additional historical comment of some sort. What are your plans?

That's definitely a thing we want to see. I mean, in this semantic Wikipedia of software that Stefano mentioned, the point is to make sure we connect with other initiatives like Wikidata, Wikimedia, other people who are doing serious studies on this, the Computer History Museum, the Internet Archive, and so on, to make sure that you find extra information on the source
code and you connect it to what we are doing. But again, our main goal is to focus on the source code and make sure we connect with the other initiatives: don't try to reinvent the wheel, just connect with other people. Ok, thank you. Ok, thanks a lot for your interest, thanks for your attention.

Alright, thank you Roberto and Stefano, we have a little something for you. So thank you, thanks a lot, enjoy, and thanks for your talk.

But what is actually subject to the hash function you use to compute the node id? That's awkward, because ideally you would like the node id of a blob and the hash of the corresponding git object to be the same. The problem is that it doesn't...
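The two technical points above, computing several kinds of hashes per blob with a unique index on each so that a SHA-1 collision is spotted, and making a blob's node id coincide with git's object id, can be sketched together in a few lines of Python. This is an illustrative sketch, not the archive's actual schema: the function and field names are assumptions, but the git-style hash (SHA-1 over a `blob <length>\0` header followed by the content) is exactly what `git hash-object` computes for a blob.

```python
import hashlib

def content_hashes(data: bytes) -> dict:
    """Compute several checksums for one blob (illustrative sketch).

    The "sha1_git" entry hashes a git-style header plus the content,
    so it equals the object id `git hash-object` would print for the
    same bytes -- the property discussed in the closing question.
    """
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha1_git": hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest(),
    }

def insert_blob(index: dict, data: bytes) -> str:
    """Toy store with one unique index per hash function: if the sha1
    of a new blob matches a stored row but the other hashes differ,
    two different contents collided on sha1 and we detect it."""
    h = content_hashes(data)
    stored = index.get(h["sha1"])
    if stored is not None and stored != (h["sha256"], h["sha1_git"]):
        raise ValueError("sha1 collision detected: %s" % h["sha1"])
    index[h["sha1"]] = (h["sha256"], h["sha1_git"])
    return h["sha1"]
```

As a sanity check, the git-style id of the empty blob computed this way is the well-known `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, identical to what git reports for an empty file.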