We will start, so I would like to welcome you all to this SIB Virtual Computational Biology Seminar Series. Today we have the pleasure to host Rostislav Kuziakiv, who is a data scientist at the Service and Support for Science IT unit of the University of Zurich, a unit which also cooperates with the SIB. Rostislav obtained his MD in Ukraine in 2003 and his PhD in bioinformatics at the University of Geneva in 2013. In between those two diplomas he held different positions: he worked as a medical doctor in Ukraine, as a bioinformatician at the University of Toronto, and as a clinical project manager at the Toronto East General Hospital. He then decided to come back to Europe, where he became a researcher at the Swiss Institute of Bioinformatics in Geneva, where he did his PhD, and he also held the position of clinical data manager at the Geneva University Hospital. Since 2014, at the S3IT unit in Zurich where he works now, Rostislav has participated in and supervised local and international projects with a clear focus on data management and bioinformatics applications for life science and medicine. Just briefly, S3IT supports Zurich researchers and research groups in using IT to empower their research, from consultancy to application support and access to cutting-edge cloud, cluster, and supercomputing systems. Today Rostislav will tell us more about iPortal, if I'm not wrong, a reproducible workflow management system for bioinformatics and health data. So thank you again, Rostislav, for accepting our invitation, and the floor is yours.

Thank you very much, Diana, and thank you all for coming. It's a big pleasure to be in Lausanne, and it's a pleasure to see familiar faces. I hope you will find something in this talk that is interesting and applicable to the work you've been doing. Before jumping into the hot waters of data reproducibility and data management, I would like to go back in time and give you a bit of history on how the unit was created, because that will be the basis for understanding how we ended up in data reproducibility.

Back in 2014, the University of Zurich, following the growing demands of its research units, and taking into account the successful existence of Vital-IT in Lausanne and a similar unit in Basel, decided to create a unit that would be equipped with different computational resources and would help scientists with their computational needs. Don't ask me why, but due to administrative discussions and so on, they placed the unit, together with the Central Informatics Department, under the Faculty of Law and Economics. This is something to discuss with the university leadership, and changes are coming, but for now it is like this; from this place we are called to serve all the other faculties, from life sciences and economics to veterinary medicine, mathematics, and physics. By creating this unit, the mission was to unite experts in bioinformatics, data management, and software development, and to be a unit that would support researchers with their growing demand for computational resources, as well as provide services to those same research groups in an easy, user-friendly way.
We were also called, depending on the people we had in the group, to be a sort of innovation platform for new developments. For example, one of the people we have been collaborating with is Professor Bodenmiller, who works on 3D imaging and cancer. One of his software developers was working for him but, for financial and other reasons, was relocated into S3IT; we funded his position and he continued developing the tool and the methodology, turning us into a unit where research and development can continue. Along with this, the university leadership gave us a list of services we had to provide: first of all, access to research infrastructure, software development, specialized data analysis, and consultancy on everything mentioned before. In addition, we would also have to run courses, similar to the courses the SIB runs, on Unix/Linux, sequencing, proteomics, and so on.

We were not alone. We are glad that the university leadership understood that we could not do this on our own, so they equipped us with certain expensive toys. First of all, they invested almost 15 million Swiss francs into an OpenStack cloud computing system. You will see some numbers on what it is all about and what characteristics it has. This OpenStack cloud was established in the University of Zurich data center, and each researcher who contributes a certain amount can have access and generate the computational power he needs. If the cloud solution does not satisfy you, another University of Zurich unit also offers a GPU cluster, as well as Hydra, a high-memory machine for extensive and really powerful computation; these two are mostly used by the departments of mathematics, astronomy, and physics. If that is not enough for you, we also collaborate with the national computing centre in Lugano, and upon your request, with the administrative agreements in place, we can pre-book time and resources on the Piz Daint supercomputer there. With this, researchers at the University of Zurich got possibilities that did not exist before, and they really gave them a good spin.

The team was growing. As of now, you see the other team members: we have cloud computing IT administrators, and we have people doing front-end and back-end development and running Hydra, high-performance computing, and parallelized computing. The first three belong to the so-called Life Science team, whose major responsibilities are to provide consultancy and support on data management, bioinformatic analysis for genomics and proteomics, as well as customized workflow development and data visualization. To do all of those things, we use what we call iPortal, which is a central, open-source solution for storage, tracking, analysis, and data visualization. iPortal consists of four major building blocks: openBIS, Singularity, the GC3Pie Python classes, and Jupyter. I will talk about these in more detail later. So now you have heard the story of where it all began and where we are now. But the story continues from 2013, which was a big start for us.
Most of us were hired that year, and we were accelerating. Our management team was bringing in new projects on a weekly basis; the trust our management had within the scientific community at that moment allowed us to engage and to get funds for those projects. By 2015 we were engaged in many interesting projects covering genomics, proteomics, and image processing, plus we were doing biomedical data management. We were getting more data, because people were using the available computational resources, and more calculations, and we were quite happy because the first papers appeared and the first reviewers came. And that was the breaking point for us, because the engagement in so many projects pushed us to the edge where reviewers started asking really tough and nasty questions which we had a hard time answering. First of all: where do we have our data, raw results, analyses, et cetera? But the hardest ones: how did you get these results? What software did you use? Which version? Under what parameters? What code? Where is the code? And so on. And to finalize everything: can we now reproduce these results? We were not happy campers at that moment.

2015 and 2016 were the breaking point. We sat and had our endless meetings on how it could be done, and basically we were spending sleepless nights, because you have to understand the way we had been working. We were given a project we had to finish within three or six months, and we were pushed to the limit where we had to generate the results as soon as possible, because now you have tools to do it much faster: you can parallelize, you can scale your code, and you can generate more results in a faster and more efficient way. But you have to remember that we were freshly out of our studies. Before, each of us had done one big project where we ran everything locally and had everything stored on Dropbox or something similar, so this was a completely different scale for us. And we struggled, we struggled a lot.

So, as I said, we started analyzing and decided: let's take a look and try to figure out what is going on, why do we have these problems? We started reading papers. One in particular shows that the appearance of big data sets, as well as small and medium-sized data sets, in papers is growing exponentially over the coming years; so you have more data visualized and more data represented in your papers. Along with that, we started asking: what about the usage of bioinformatics resources? Here is a publication from the EMBL on the usage of bioinformatics resources, correlated with the sizes of the data sets, and you also see quite a big growth. Another paper describes how bioinformatics resources and databases are used across different domains, like bioinformatics, biology, and others, and again we see growth. The explanation for this is that we have quite fast development of the technologies, we have people educating themselves in bioinformatics, plus we have additional courses and trainings giving them knowledge. So they are not afraid to try new things; they are not willing to sit and wait until someone does it for them.
Plus, the reduction in prices for the usage of the resources made them available, and people started using them. So that somehow summed up all of those big data sets and all of the data we had been receiving. On the other hand, we still did not have the answer to why we had those problems, why we could not simply recreate what we had done, and I was spending sleepless nights trying to figure out where I had put the note on how I ran a calculation, and blaming myself: maybe it's my task management skills, my people management skills, I don't know. But a relief came when this publication and post showed up, where the big guys from the company Amgen tried to recreate 53 landmark studies, already published with all the additional material and explanations, and kind of failed, because they were only able to recreate 11% of the studies at that time, stating that we have a data reproducibility problem.

We could have sat there and forgotten about this; we could have had our kitchen talks about data reproducibility. But then came the SNF. Those guys want to cover themselves: they are a funding agency and they have been asked really tough questions about this. That is why they came up with the data management plan for your SNF grant. It is still vague, still biased, but a lot of emphasis was put on things like data reproducibility, data storage, the calculations you have done, and so on and so forth. We had to get ready for this, because, as you saw, one of our services is support for grant proposal writing. People were coming to us and asking: what about the data management plan, guys, how can we cover it, could you help us with this? And we had to come up with some sort of solution. I am not saying this is the panacea or the recipe. This is our story, the way we, with limited resources, the three of us, together with the people we know and work with from ETH, Basel, Geneva, and Lausanne, sat together and tried to at least cover our backs when the SNF or the reviewers come.

And we were not alone; this is definitely not a new topic, and some research groups have been trying to tackle it, slowly publishing work describing their vision of what data reproducibility should be and how it can be achieved, financially or through data management and so forth. There are a few really interesting papers in really top journals describing how those research groups see data reproducibility. But this is a personal vision, once again; you can pick it up, you can trust it, but on the other hand, is it the final authority you go to? Are all of those guidelines written by someone who knows how science is properly done? This publication was kind of a saviour for us, because these professors know how to do proper science, how to annotate, how to prepare, so that you are good in terms of at least some kind of data reproducibility. They came up with a sort of guidelines you can use when describing to people what data reproducibility is.
We definitely cannot cover all of them, like assigning DOIs when you push a publication and a data set into a public repository; no, we don't do that. But at least we can be more or less sure that we cover the data analysis side, from data generation to plot creation. These are the highlights. They say data should be shareable, software should be shareable, workflows should be somehow shareable, and all details of your computations should be stored along with the environment in which they were done, and potentially also put into a shareable repository. Then persistent links should appear in the published articles and so on, so the additional materials are linked to all the calculations that were done in the database. We used this as a ground truth and tried to come up with something.

That something is now called iPortal; the name has a long story starting in the Ruedi Aebersold lab. We started talking to the people from ETH, from Basel, from EPFL, asking them questions: how do you deal with this, what exactly do you do? First of all, one of the most important suggestions they gave us: guys, you have to get off the needle of commercial tools. User-friendliness is pleasant, it is tempting, but it will bite you like hell. Whenever you need to update, whenever you need to develop, you will spend time talking to the support people and you will pay, you will pay a lot. So yeah, we bit the bullet and decided to come up with a list of open-source solutions for the different tasks we apply to the data.

The first one is openBIS, a data management platform developed at ETH Zurich. It has been around for ten years now, it has solid funding, and openBIS has become the central data management platform at ETH. Moreover, openBIS was taken as the data management platform for a new ETH development called Leonhard Med. I think some people here know what Leonhard Med is all about and how you can get access to it; so now openBIS will be your interface to Leonhard Med. Since we had worked with the ETH and openBIS people a lot in the past, we decided to give it a try. It is used to manage and annotate data and to take data directly from either the users or the instruments.

Then we needed a solution for our software: when you develop a pipeline, you need a solution for your tools. For this we went with Singularity. Singularity is not a new topic; it is all about containerization. Yes, it recalls the name Docker, and Docker containers and Singularity containers mostly do the same thing. But for us there was a really important difference, a technical detail. Correct me if I'm wrong, but when you use Docker, Docker makes you root, and in order to use Docker you need to be root. That means you cannot bring Docker onto a central system used and shared by many users, because as root you can do real damage to a system; if you are root, you can do everything. So not many teams running shared local infrastructure would like to go with Docker for the containerization of workflows. That is why the guys developed Singularity.
A Singularity container gives you the opportunity to be you, to be yourself; you don't have to go in as root, which means permissions and user management are much better and really simplified with Singularity. What we pack into Singularity are the tools we use, along with the scripts we have for our workflow. I personally call these Singularity or Docker containers time capsules: you pack the tools, you know exactly which versions you have been using, you know what you apply, you test every tool, and you put your script inside the Singularity container. So you have your time capsule of what was analyzed on that particular day by that particular person.

This is the workflow part, but since we have the opportunity to use those computational resources, we also needed a tool which would allow us to go with high-performance, parallelized, scalable computing. That is why we use GC3Pie. GC3Pie is a suite of Python classes developed by the S3IT guys, so at least we did not have to write it ourselves; it is used for submitting and controlling batch jobs when you run them on cluster or grid resources, and we have both. This is what we use to control our jobs. And the final point: when you analyze your data, you would like to visualize the results and do additional analyses. For this we use Jupyter notebooks, an open-source web application allowing you to interact with your code live, update your plots live, run additional simulations live, save it, and then share it with others via the Jupyter notebook file.

These are the four components we have in iPortal. To represent it graphically: you have the openBIS Datamover, the plugin which pulls data off the machinery, whether microscopes, mass spectrometers, or NGS instruments, puts it on your storage (S3IT also provides storage solutions), and makes it available to the biologists, bioinformaticians, and modellers for data management, data processing, and data visualization. This is a sort of simplified view of the whole platform.

Now let's go into the details. While developing iPortal, we also had our own goals in addition to all of those guidelines. As you know, the user base, the customers we have, is quite varied in terms of knowledge of bioinformatics, computational biology, and programming in general. You have totally self-sufficient people to whom you just give access to the cloud and they build a SLURM cluster and run high-performance, parallelized computing on Hydra by themselves, even in biology. And we have those for whom switching the computer on and off is already an achievement. So we definitely needed to provide that sweet word, user-friendliness: we needed a web-based platform which would be easy to use. What was even more important for us was to retain data in the long term, in an unchangeable way: once pushed directly from the machinery or uploaded by the user, the data should be labelled and given a unique identifier, so I know precisely when it was generated, by whom, and under what conditions.
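To make the GC3Pie part a bit more concrete, here is a rough sketch of how a single quality-check job could be wrapped and submitted; it is modelled on the public GC3Pie documentation, and the file names, resource numbers, and output directory are illustrative assumptions rather than our actual production code.

```python
# Sketch only: submitting one batch job (here a FastQC run) through GC3Pie.
# File names, resource requests and output paths are hypothetical.
import time
import gc3libs
from gc3libs import Application
from gc3libs.quantity import GB

class QualityCheck(Application):
    """Run FastQC on a single FASTQ file on a cluster or cloud resource."""
    def __init__(self, fastq_file, output_dir):
        Application.__init__(
            self,
            arguments=["fastqc", "--outdir", ".", fastq_file],
            inputs=[fastq_file],          # staged to the execution node
            outputs=gc3libs.ANY_OUTPUT,   # bring back whatever the job writes
            output_dir=output_dir,
            stdout="fastqc.log",
            join=True,
            requested_cores=1,
            requested_memory=2 * GB)

app = QualityCheck("sample1.fastq.gz", "results/sample1_qc")
engine = gc3libs.create_engine()          # resources come from the GC3Pie configuration file
engine.add(app)
while app.execution.state != gc3libs.Run.State.TERMINATED:
    engine.progress()                     # submit, monitor and retrieve output
    time.sleep(10)
```

The same pattern scales to hundreds of such jobs, which is essentially what lets one workflow request fan out over the cluster or the cloud.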
The workflows should be reproducible and shareable, the reports and results should be linked back to the input data, and the reports of the analysis should be downloadable, updatable, and basically pushed back and linked to the original data and the results generated before. These were the guidelines we received by talking directly to the users: we put out an online form for one month, asking how they would envision it, what they would like to have, and so on. This is a sort of digest of the answers and how they see it. Starting from that, we got the development going and split those guidelines into four steps.

First of all, and the most important thing for the user: I want to get rid of the data. I have data sitting under my desk, or data sitting at the functional genomics core facility and they are calling me up, I need to get rid of it. Do I have a place? Do I have a quick way to just upload the data? This came up quite recently; we call it the simplified data upload, an application which allows you to simply copy and paste your data into a window, with the minimum annotations you find helpful for further usage, and it will be uploaded into openBIS. It goes to a special place called staging, dedicated to the user, so the user only sees the data he uploaded before. I cannot think of a simpler way: you just copy and paste. For heavier loads we also use an in-browser application which allows you to upload data sets up to 12 gigabytes in size; it has been tested, so I could push a data set of that size. But for big data sets we recommend going with the openBIS Datamover, a Python plugin, also open source, developed by the ETH guys, which you can tune so that it either gets data from the user, who uploads data to the Datamover server, or gets data from the functional genomics core facility or directly from the machinery in the lab.

When data is uploaded into openBIS, the next thing we ask, and we also train people how to do this, is: let's annotate your data, don't leave it like this, let's represent your study, so that in six months you know what you have been doing. A PhD student, every single postdoc, runs one or multiple projects, so let's bring them all together, represent the study, describe the experiments, describe the samples, and link the data sets you have just uploaded under those samples. At the end, with a simple metadata model which uses only five levels, space, project, experiment, sample, and data set, you have your study represented inside openBIS. This is a screenshot of the experiment view mode: the ID of the experiment with some meta-information on what type of cells, when they were generated, the number of cells; along with this you have all the samples belonging to this experiment, biological and technical ones, to which you apply a certain SOP, a standard operating procedure, to be analyzed with a certain methodology or technology, NGS, mass spec, or microscopy, and you have the data sets you uploaded before. So this is step number two: you have your data uploaded and your data represented. You don't have to go with a really complex metadata model; a few experiments, a few samples, just enough to let you know what you have been doing.
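To illustrate how little is needed to represent a study with this five-level model, here is a rough sketch using the pybis Python client for openBIS; the server URL, credentials, space, type names, and property codes are placeholders and not our actual configuration.

```python
# Sketch only: registering a minimal study in openBIS via the pybis client.
# URL, credentials, space, type names and property codes are placeholders.
from pybis import Openbis

o = Openbis("https://openbis.example.org")
o.login("username", "password", save_token=True)

# Experiment under an existing space/project: where the study lives.
exp = o.new_experiment(type="DEFAULT_EXPERIMENT",
                       project="/MY_SPACE/LIVER_REGENERATION",
                       code="RNASEQ_TIMECOURSE")
exp.save()

# A biological sample hanging under that experiment, with a bit of metadata.
sample = o.new_sample(type="BIOLOGICAL_SAMPLE",
                      space="MY_SPACE",
                      experiment="/MY_SPACE/LIVER_REGENERATION/RNASEQ_TIMECOURSE",
                      props={"cell_type": "hepatocyte", "timepoint": "24h"})
sample.save()

# The raw data set uploaded before, linked to the sample.
ds = o.new_dataset(type="RAW_DATA",
                   sample=sample,
                   files=["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"])
ds.save()
```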
So when you come running into our office six months later saying, I remember I sent you that Excel file in an email: sorry, I don't want to remember that. I want to remember the names of my kids, I already swap those; I don't want to remember this. This is a platform: tell me what the ID of the sample is, tell me what the ID of the data set is, and I will do my analysis for you. Or you don't want me to do it for you? Then let's build a workflow for you. So what kind of workflow would you like?

Why was openBIS the major decision? You can definitely come up with other tools, LabKey and the like, but none of them gives you the possibility and the flexibility to be changed and extended the way openBIS does; plus you have the openBIS developers sitting 200 metres away, so why bother. Within openBIS we provide customized workflows for our users. We have a series of meetings where we sit together and they try to describe to me what they have been doing, what kind of data they have been getting, what they do with the data, what tools they have been running, under what parameters, and what they want at the end. But usually what they want at the end is not what the functional genomics core facility gives them, because the core facility gives you an Excel file and then it is up to you to figure out what is downregulated, upregulated, or not regulated. What the researcher wants is an MA plot, a volcano plot; he wants to see the outliers, not in a PDF format like the functional genomics core facility sends, he wants to have a mouse-over and get the gene ID, something that actually makes sense to him. This interactivity comes later with Jupyter, but before that we have to run the workflow. Within openBIS we use JavaScript to develop small apps which take you through a few steps in order to analyze your data: we ask you which data set you want to analyze, against what we call a database, let's say the genome you want to align to, and then what parameters do you want to apply to the tools?
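To give an idea of what such a wizard hands over to the back end, here is a small sketch of the kind of configuration file it could produce, written with the standard Python configparser; the section and key names are invented for illustration and are not the actual iPortal format.

```python
# Sketch only: the kind of INI/configuration file a workflow wizard could emit.
# Section and key names are invented for illustration.
import configparser

cfg = configparser.ConfigParser()
cfg["workflow"] = {
    "name": "rnaseq_alignment",
    "singularity_image": "rnaseq_2018-05.simg",   # the tested 'time capsule'
}
cfg["input"] = {
    "dataset_permid": "20180101010101-42",        # openBIS identifier of the raw data
    "reference_genome": "GRCh38",
}
cfg["parameters"] = {
    "threads": "8",
    "trim_adapters": "true",
    "output_formats": "bam,bigwig,counts",
}
with open("job.ini", "w") as fh:
    cfg.write(fh)
```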
The tools have already been discussed during our consultancy, together with their preferences; there are many people whose preferences are built not on their own experience but on papers they read, because those were cited in top journals, but that's another story. Then the parameters, and the final page, which is a summary of everything collected before, with the opportunity to just submit it to the back end. For you to understand: the information the user provides is stored in an INI file, a configuration file, and what does openBIS do? We never run anything on the openBIS back end itself, no, never. openBIS sends this INI file to the cluster we have on the cloud system. The INI file arrives and is read by GC3Pie, which starts downloading the data sets, starts downloading the genome, and starts applying the parameters to the tools, which are stored and run within the Singularity container, and then initiates the workflow script stored in the Singularity container. I know many people might ask me: so you store data sets in the Singularity container, your genomes in the Singularity container? No, we don't do that; we store the data sets separately, and they are downloaded when the Singularity container needs to be run. If we store anything, it is some small data sets, but never the big ones, never the input files and the genomes. GC3Pie then decides what should be done and orchestrates the scaling of the code and the workflow on the cloud, on Hydra, or on the supercomputer; when the workflow is finished, the results are uploaded automatically back into openBIS for further analysis and visualization with Jupyter notebooks. This is the description of the workflow.

A bit of information on the results. Usually the result is fully coordinated with the request coming from the user. He says: for NGS I don't need the BAM files, but I would like to have bigWig files; or, along with the bigWig files I would also like the BAM files and the count matrices written in TXT format; can you deliver it to me as one big data set, along with all the plots, the Jupyter notebooks so I can update and create additional plots, and the Singularity containers so I can send it to my colleague somewhere in Basel and he can rerun it? So we try to come up with this kind of final output. Here you see the report written in HTML for the visualization, so they can see if the workflow worked at all; maybe it was complete rubbish and they don't need it, so they can rerun it or call us to figure out what is going on. They have a Jupyter notebook, and they have the bunch of files they asked for, BAM, TXT, count tables, whatever they want, plus the code.

Taking a look at the HTML report: with a little bit of imagination and some JavaScript tricks, we build things into the HTML report. In my case specifically, users are asking: can you rerun the quality check every single time, 100% of the time? Because the quality check from the functional genomics core facility is kind of interesting. The tool of choice for NGS data is FastQC; you can rerun FastQC and it generates a bunch of output. Some people took this problem and created a really excellent tool, I love it, called MultiQC, which is a sort of aggregator: it consolidates every HTML report generated by FastQC and gives you interactivity, it is fully mouse-over, starting from the sequencing scores and ending with the k-mers, GC content, and so on. And it is already inside openBIS.
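On the cluster side, the essential point is that the data and the reference genome stay outside the image and are only bind-mounted in at run time; a minimal sketch of such a launch step could look like the following, again with hypothetical paths, image name, and entry script.

```python
# Sketch only: launching one containerized workflow step on the cluster.
# The data and genome live outside the image and are bind-mounted in at run time.
import configparser
import subprocess

cfg = configparser.ConfigParser()
cfg.read("job.ini")                                  # configuration written by the wizard (see above)

dataset_dir = "/scratch/datasets/" + cfg["input"]["dataset_permid"]
genome_dir = "/scratch/genomes/" + cfg["input"]["reference_genome"]

cmd = [
    "singularity", "exec",
    "--bind", dataset_dir + ":/data",                # raw data, downloaded beforehand
    "--bind", genome_dir + ":/ref",                  # reference genome, never stored in the image
    cfg["workflow"]["singularity_image"],
    "run_workflow.sh",                               # pipeline script packed inside the image
    "--input", "/data",
    "--genome", "/ref",
    "--threads", cfg["parameters"]["threads"],
]
subprocess.run(cmd, check=True)                      # runs as the calling user, no root required
```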
You don't have to go and open another folder or another email; it is already there for your usage. Along with this, sitting together and discussing, once again in the kitchen or at happy hour with the researchers, they say: it would be nice to have that plot, can you give it to me? I have been asking the functional genomics guys, and they give it to me in PDF format, but I would like to have it as HTML, I would like to have it interactive, can you give it to me? Okay, let's try to do this, and we came up with a list of the plots they are asking for and what I could do from my side. It can go from simple volcano plots and simple heat maps to more advanced clustering and PCAs. We operate here, I would say, like a startup: deliver fast and often. This mostly started with the possibility of running the infrastructure at affordable prices, making it possible to rerun the application whenever you want: if you find these results tricky, let's rerun under different parameters, let's apply this, this, and this, so don't wait. I have been talking to people who, still in 2018, are running every single tool manually, and you know what they say: why am I doing this? I am doing it because I have my visual quality check. But you are wasting your time; why not run it all at once, in an accelerated way, and you will get the answer whether your experiment worked or not, whether it is valuable or not. So we are trying to operate in this kind of domain.

With the list of possible plots that came from the users, we provide that report. That is about the HTML file. Along with the report in HTML format, we provide the files they ask for, and we provide the Jupyter notebook file: all the plots I showed you before were generated within a Jupyter notebook, and this notebook file is saved and put back. Which means the user can download that file and start working on it, extending my work, taking it to the next level. You have the collection of the data, you have everything that is needed for Jupyter to work, it is in the same folder; you open it, you download it, and you start adding additional functions or additional visualizations. When you finish and you are happy with your results, you can save that Jupyter notebook and upload it back into openBIS; the system will ask you where you want to upload it, so your work is saved, and I want to link it to the raw data and the results from which I got the report. So you basically have the whole sequence now: you started with the raw data, uploaded from the machine or by the user, you annotated your data in your preferred way, you ran the workflow, you got the report, and you have the ability to continue your work, maybe not on the raw data but, let's say, on the generated count matrices, and you start digging deeper, getting the upregulated genes, the downregulated genes. With the Jupyter notebook you are simply unlimited, because it has an R kernel, it has Python, it has MATLAB, C++, Java, everything is possible with Jupyter, and it is open source: get it from GitHub and it can be installed either on your local computer or centrally. So again, I have described to you how it works behind the scenes. With this system, with this kind of approach, I get a much better night's sleep, because I can be sure, not 100% but maybe 90%, that if someone knocks on my door and asks, listen, remember we did something one year ago or six months ago, I will be able to answer: yes, maybe, let's take a look.
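To give a feeling for how such interactive, mouse-over plots can be produced inside a notebook, here is a minimal sketch of a volcano plot with gene IDs on hover; the results file and its column names are assumptions, and Plotly is just one of several libraries that could be used for this.

```python
# Sketch only: an interactive volcano plot in a Jupyter notebook.
# The results table and its column names (gene_id, log2FoldChange, padj) are assumed.
import numpy as np
import pandas as pd
import plotly.express as px

results = pd.read_csv("differential_expression.csv")
results["neg_log10_padj"] = -np.log10(results["padj"])
results["significant"] = results["padj"] < 0.05

fig = px.scatter(
    results,
    x="log2FoldChange",
    y="neg_log10_padj",
    color="significant",
    hover_name="gene_id",      # mouse over a point to see which gene it is
    labels={"neg_log10_padj": "-log10(adjusted p-value)"},
)
fig.show()
```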
Following the guidelines from this publication, we believe that in our understanding of data reproducibility there are four major components which should work in synergy. First of all, data: data should be uploaded, data should be annotated, your data set cannot be changed, and your data set should be accessible and shareable, and to cover those tasks we use open-source software. Second, software: software should be discussed with the researcher directly. I have cases where I thought I knew what they were doing and what they wanted; I have to confess, I used different tools, we generated totally different results, and I did not tell them; then, in our endless meetings, we found out that the tool I had picked was a perfectly practical tool producing those results, but they wanted something different. So whatever workflow you bring in, it should be discussed with the researchers; it is better to invest three good meetings, because many of them know what they have been doing and what kind of analysis they are going to run. So you have to discuss it. The software should be tested, containerized, saved, and also shareable; with Singularity, having a Singularity container or image, you can easily download it and share it with your colleague, because you save the environment, you make a snapshot of the environment in time.

Then the workflows: once again, along with the software, they should be discussed in direct contact with the researchers; they should be universal in terms of the language you use, because everyone has their own preferred language, Perl, Python, or MATLAB; they should be scalable, and preferably annotated in a more or less decent way, so you know what it is all about and what you have been doing. Talking about scalability: the workflow should be applicable on GPU clusters, on high-memory machines, and potentially on cloud computing, and we still keep in mind the somewhat conservative way of doing research, "I have a server under my desk which keeps me safe". For this we use GC3Pie. And finally, data visualization. I personally believe that data annotation and data visualization are really crucial things in the era of big data, because what good does it do you if your data is not annotated? You can have terabytes of data, but if it is not labelled you cannot use it; for machine learning you need annotated data, so annotations should be provided. And people, I think because of the era of social media, got used to data visualization: they don't want to read, they want to see it, and through seeing, the understanding comes. For data visualization we use Jupyter because it gives us interactivity: users can interact with the code directly; if you feel strong enough, type in your code and start, you are not the first one doing this, the forums and Google groups are packed with examples, so there is no reason to be afraid to try. This interactivity comes with Jupyter. It should be universal, supporting your preferred language; the Jupyter notebook should be linked to the original results and the raw data, and it should be shareable. All of this together, in a sort of sequence, makes our life easier in terms of data reproducibility when that question comes from the customers, and it comes quite often. Now a little bit of numbers, because in the end everything boils down to the numbers.
As of now we have 145 active users; we have been running this since 2015, and I am talking about the central instance, the case where we installed openBIS once. This is quite different from the model run by ETH, where they try to come up with a separate instance for each research group; in our case we don't have that luxury, so we installed one central instance, packed with storage, where users basically profit from using the same system. They have been developing the metadata models, and please believe me, there is some overlap between the models, like the sample types, experiment types, properties, and conditions they use, so they found this quite interesting; if new things need to be developed, we can bring them in. Those users cover, as you could guess, mostly different departments of the University of Zurich. Among them we have really advanced programmers and bioinformaticians who need just one component of the system: we need only the data management, or can you help us develop a Singularity container, or can you help us run GC3Pie and make it scalable for our cluster, or can you give me access to or build me a JupyterHub cluster. And we have cases like the one I put first, the Department of Molecular Mechanisms of Disease, which is our pilot, our pride, because they have been using everything, the whole system, running the workflows along with the others, but they were the first ones and the ones testing the system. We also have collaborators from the University of Zurich and other departments. As of now we have 11,000 samples registered and 23,000 data sets, and until now we have executed altogether 412 workflows. Those workflows, if you ask me, don't think of anything too complicated; we call workflows some kind of data processing, let's say: can you do the quality check, can you run SortMeRNA for me, can you link all of these three together, and then I will do the rest by myself. And there are the workflows like RNA-seq developed by us, where we take data from the functional genomics core facility, put it in openBIS, annotate it according to the sheet provided by the user, then launch the workflows and generate the results, and others, as I said, depending on their needs; it is all about the discussions, about the collaboration. So this is what worked for us, and we will try to keep on doing this.

Special thanks to my colleagues, especially Dr. Lars Malmström, who was the one who came up with this vision; he got this vision in the Ruedi Aebersold lab at ETH, where openBIS was used from the beginning in 2010, the lab where at that moment they had five mass spectrometers running 24/7 and were looking for a solution just for the data management and the data analysis. I am really grateful to the S3IT team for all the input, to our director, to the director of the SIS team at ETH, and to Professor Michael Hengartner, who is also a fan of our system. Thank you very much, and thank you to those online; I hope you all found this talk interesting.