First of all, hi. I'm Michael Denker. I wanted to be here much earlier, but as you can probably hear from my voice, I caught a little something from a little friend that basically kept me in bed for the last few days. So I'll try this, and if I lose my voice or something happens, well, then there's an early dinner. Anyway, I'll basically pick up where Thomas left off, and I'll probably even repeat a few things you've just heard, maybe from a slightly different perspective. I'm from the lab of Sonja Grün, and I've also been working with Thomas for quite some time now; I think we share quite a few interests in this general domain of making science more reproducible, working on data, and so on. What I want to talk about now moves a bit away from the data itself and a bit more towards topics like analysis, and how to cover the different aspects of the life cycle of your data.

When talking about reproducibility, I want to start by clarifying a few terms, beginning with a general set of definitions, all starting with the letter R, that was put forward in several publications; there is also a very interesting read on this in an issue of ReScience. It is one set of definitions, you will find others, but I think it is one of the better ones. First, we can talk about something like repeatability. Repeatability means that it is me who does something, and at a later point I do exactly the same thing again. For instance, this could be me having given this talk a few months ago and trying to repeat it now. Obviously that is not going to work, and it is essentially a technical problem: my voice is raspy today, so it is not really going to be the same talk. But in essence I know what I did, so in principle I should be able to reproduce it, and I have the same tools available. So this is often a quite technical type of problem: getting the laptop connected, having a voice, and so on.

You could also think of Sonja Grün giving the talk, and this is something we could call replicability: a different team, but the same setup. I would simply give these slides to someone else and Sonja would give the talk. Now the problem is a bit different. Not only does the technical side have to match, but it is also a question of whether the slides and the words I use on them actually convey what I want to say. Does Sonja perceive exactly the same thing? Does she know what she has to say? Are the speaker notes correct? Is everything written down the way it needs to be formulated? That is a slightly different level of the problem.

Finally, you can also ask for something like reproducibility: a different team with a different experimental setup. Maybe Sonja decides to convey the same message in a talk, but makes a new talk with new slides, maybe calls it "FAIR data" or something, while trying to convey the same message. Here we have an even more complex problem, where we need to be able to dig into what is actually being conveyed by the research, or the talk in this case, and find ways of re-establishing that insight. This also means that we must have correctly described what we did in the reference we are trying to replicate, and that we have accurate means of validating our new replication of the experiment against what we did in the beginning. In other words, it is
a question of how we compare different experiments in order to validate an experimental finding. These are, more or less, the different levels you can go to when you ask these reproducibility questions.

Let me go into one topic beforehand: you can ask why this is actually such a big issue, and especially so in the neurosciences. This is a picture I like to show, the Rosetta Stone. You probably know it; it is one of these stones carrying the same text in three scripts, hieroglyphs, Demotic, and Greek, and in principle it helped in the translation of the hieroglyphs. With the brain we often think of something a bit similar: we have some sort of brain dynamics, we have stimuli coming into the brain, and we have some output behavior, and we try to match these. This is very classic neuroscience, and I think from the beginning there were always two problems we faced. One is variability: there is not such a nice correspondence between these three things as on the Rosetta Stone. Stimuli vary a lot, it is very hard to reproduce stimuli or even behavior, and if you talk about brain dynamics, you all know that it is noisy, or variable, whatever you want to call it. The other thing we observe is complexity: it is just very hard to control for all of these things. How do you measure stimuli and behavior accurately enough to really pin them down? How do you really characterize a stimulus? Either you simplify it extremely so that you can describe it, or you have some complex stimulus that is not adequately described. It is very difficult to do. And in terms of brain dynamics, that is why we are all here: how do we characterize them, how do we deal with the complexity of the brain?

I think this has led to two things in terms of reproducibility. On the one hand, you could argue that it was undervalued in the past because of the variability: if I am going to measure again in another mouse, and it is going to be roughly the same but a bit different anyway, why should I care how reproducible it is? On the other hand, you can say that reproducibility is simply a difficult task because of this complexity: it is very hard to describe these extremely complex setups down to the minute detail, and it is also very difficult to describe the kinds of analyses we run on brain data. These are two reasons that have probably impeded the push for reproducibility in the past. The good news is that now, in the age of digitization, we have at least one way to combat the complexity problem, namely by employing software tools that help us get a hold of this complexity and automate tasks, so that we can automatically obtain characterizations of our stimuli instead of measuring them with a ruler like in the sixties, so that we have accurate measurements, plus the tools Thomas talked about for gathering and storing metadata. So we get a lot of help from the software side, and hopefully it will also be realized that, in order to understand the variability we observe in the brain, it is crucial to be extremely reproducible, because that is the only way we can actually discern what is our measurement imprecision versus what is the
actual variability in the brain.

In general, what I want to talk about today is the typical flow of an experiment: you start from some setup where raw data are recorded; the data are typically post-processed; from both the post-processing and the experiment you get metadata, which is compiled as Thomas described; in the next step the post-processed data and the metadata are combined into a common data model; and then you perform some complex, parallel, and collaborative analysis on this until you reach an interpretation. To do this in a reproducible way, it is probably not a good idea to do it the way I guess all neuroscientists, including myself, did it twenty years ago, which was to start up MATLAB, or actually other things at the time, and just hack away. Instead, try to use community-tested and standardized software components all along the way, wherever possible, in order to profit from the knowledge of others, but also in order to be able to actually describe the individual steps of this workflow. So today I want to go through some of these tools, partly in a more practical manner, and describe how one can already do many of these things. Which is not to say that there is not still a lot to be done before we reach a totally rigorous, completely documented, completely reproducible workflow; that remains a challenge.

Just to motivate this: a few years ago we asked people whether they also see this overall complexity in the work we are doing as a problem. When we asked to which degree people think that the complexity of the data sets and the analysis influences their work, it was quite clear that most people really consider this an issue, and that having best-practice guiding principles and solutions for workflows is something that would be really appreciated. When we asked what is most needed, the top answers were common guidelines describing the entire workflow process, public software toolboxes, and standardized data formats. So the community is aware of this, and this is already old data; I am sure the numbers would be even higher today, which is a good sign. But be aware that this is a luxury situation, because it has not always been like this; this is a process that has been going on for quite some time, and as Thomas said in his lecture earlier, if you had come along with, I think he called it, file names following some specific pattern ten years ago or so, people were not so happy about such things.

In general, I would like to talk about data models first, and in particular about the Neo data model. Thomas already introduced it very briefly in his talk; I would like to go into a bit more depth now, also because you will be exposed to it in the exercises later in this course. The idea of Neo is really that you have a wide variety of data sources in electrophysiology. Most prominently, of course, these are recording systems, and here are just some of the commonly found vendors that you probably all know about. What you also know about these vendors is that they are extremely protective about their methods, and they all define their own data formats, which they find are always the best and which are always somewhat difficult to read; it would not surprise me if there were not still many bugs here and there, given the fact that I also fixed a few in the loading routines they provide.
But it is not only a question of file formats. These recording systems do not just have different labels on the sticker; they also differ in how they work. You have arrays, you have tetrodes, you have patch clamp, you have whatever, so there is a huge variety in the types of data and in how these data are structured within the recordings. On the computational side there is, of course, also simulation going on, and as you are aware, in neuroscience we are in the happy situation that the community realized early on that you have to distinguish between the description of a neural network model and the simulation engine, which led to the good situation that we now have a variety of network simulation engines available. You have probably heard of NEURON, there is MOOSE, there is Brian, there is NEST, and there are older ones like GENESIS, and so on. All of these are software packages that take some description of a neural network and perform a simulation of it. The output differs: there are simulators that produce essentially only spike times, such as NEST, there are simulators that can produce more complex outputs, including membrane voltages, and there are even tools that generate local field potentials from such simulations, and so on. What is common to all of these, however, is that the generic, overall types of data are typically quite comparable to those you get from recording systems. And finally, we may also have other models, for instance theoretical models you implemented yourself in some other language, but also for these, when we look at the types of data objects we can get out of them, we see a large overlap: spike rates, spike times, potentials, continuous rates, correlations, these kinds of things. So it is a very compact set of generic data types, and this is the main idea of Neo.

Neo tries to be an internal data representation in Python. This is something I would like to emphasize, maybe three times: it is not a file format. Neo is not something you write to disk; it is a library in Python that allows you to represent your data in memory, in a standardized form, while you work with it. Of course you can save it to disk and load it from disk, but it is not a specification of how to save something to disk; that is done by other things, such as NWB, which was mentioned earlier. However, as Thomas also said, it is able to load data from a number of proprietary formats, and it delivers semantics by an annotation mechanism. What all of this means I will try to show in more depth now. The generic concept of Neo, as I always see it, is a bridge. You can see Neo as the middle library; at the bottom sit different types of data providers, which in general will be file formats. These could be vendor ephys file formats, but also custom file formats such as NIX or HDF5-based formats, and there is also a way to export to a MATLAB structure. What is not written here, but could also be used, is input from neural network simulations; this is also an option. On the other side of the bridge wait the different applications, and these basically rely on Neo to load data from these different data providers into something generic, so that the applications can work with
the data without having to know where the data came from. Since the data structure is now completely identical, independent of the data source, it does not really matter anymore, and this allows you to write either real applications or your analysis scripts in such a way that they become as reusable and generic as possible, and it also makes it as easy as possible for other people to read your code, because they do not first have to understand how you personally think data should be represented in memory.

In terms of the file IOs, this is a fairly recent list of currently supported vendors. You can see that it is quite a lot, and if we add little red dots indicating the different supported file extensions, it is even more; it is quite a zoo. This already tells you that in many cases it is quite convenient: if you give me a data set in a not completely obscure format, the beauty of it is that I can just load it and display it more or less instantaneously. The downside at the moment is that we still get very little support from hardware vendors for actually writing these IOs, so much of this is community developed, and some IOs definitely work better than others. Of course, if one does not work, there is always an incentive to make it work, and we do hope that with research data management becoming more and more of a political issue, the vendors will in the future feel a greater urge and a greater need to be represented well in these kinds of projects.

So how does this data structure look? In general it is divided into several types of objects, which are actual Python objects. The simplest of these is probably the AnalogSignal: it is basically a time series, and it can actually contain several time series. Another one is the SpikeTrain, which is a series of time points signifying a train of spikes. There is also the IrregularlySampledSignal, which is again a time series, but sampled not at a fixed sampling interval but at given time stamps. There are Events, which signify, for instance, behavioral events in the data, and there are Epochs, which signify a span of time in the data. In addition to these, there are objects which group these primary data objects. One of these is the Segment, which groups objects in time, so basically everything that occurred during the same stretch of the recording. There is the Unit, which groups spike trains across segments of time, signifying that these spike trains came from the same source. There is the ChannelIndex, which is able to group individual analog signals into a larger channel group. And finally there is the Block object, which is more or less the container for the complete thing as a whole. So there are basically two kinds of objects: one kind represents data, and the other relates the data. The idea is that these objects are completely semantically bland: nothing says what a Segment is; it just means grouping by time. At this stage there is no meaning attached to any of this yet.

If you go into more detail, you see here the four data objects I just talked about, and you see the attributes they contain. What is good to know is that all of these data objects are ultimately derived from a NumPy array, so intrinsically they can do
everything that a NumPy array can. They are then derived from a Quantities array; Quantities is a library that allows you to attach a physical SI unit to an array, such as milliseconds or volts, in order to make this explicit. On top of that, Neo adds all the properties that are additionally required to precisely define the object in its neuroscience context. For instance, a SpikeTrain will obviously have the times of the spikes, which is itself a Quantities array, but in order to fully describe the spike train you also need to supply a t_start and a t_stop, the starting time and the stopping time of the recording. If you did not have these two values, but only the time stamps, and you asked something like "was there a spike at t equals 5000?", you would not know whether you were still measuring at t equals 5000 or whether there really was no spike. So it is absolutely necessary to have this information. There are also some things which are not strictly necessary but at least very desirable, for instance the sampling rate, so that you know roughly at which rate the spikes were digitized, and correspondingly the waveforms, which can also be attached. Similarly, for an AnalogSignal you have the signal itself, you also have the times array, which is sadly missing on this slide, the sampling rate, the t_start, and things like this. So there are a number of attributes which are prescribed because they are necessary to describe these objects, and you always have access to them, but again, nothing semantic is attached yet; it is still very generic, and you do not know anything about the content.

There are also some general attributes: there is always a name and a description, which are both strings, there is always a file_origin, which says which file or which source the object came from, and there is something called annotations. These annotations are the place where you can start to add information about what these things actually mean. For instance, for an Event this could be a drug injection, a behavioral cue, or a trial start; for an Epoch it could be a stimulation period, a behavior type, a trial period, and so on.

If we go one step further, to the grouping objects, and start with the Segment, what you see is that the Segment links directly to all of the objects it contains. It has, for instance, a list of all the spike trains that fall into this temporal segment, so that it is easy for you to traverse the individual objects that belong to it. Again, the name and the annotations are where you give meaning to what the Segment is: it could be a trial, it could be something like pre- and post-treatment, it could be before and after an artifact, or it could be something completely arbitrary. The same holds for the ChannelIndex and the Unit: the ChannelIndex links to the Units and to the AnalogSignals and IrregularlySampledSignals, the Unit links to the SpikeTrains, and the Block, which I almost forgot, is the container that ties the Segments and ChannelIndexes together.
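Just to make this concrete, here is a minimal sketch of how such objects might be built up by hand; the numbers and the annotation keys are invented for illustration, but the constructors and the annotate call are the standard Neo API:

    import numpy as np
    import quantities as pq
    import neo

    # a spike train needs t_start and t_stop in addition to the spike times
    st = neo.SpikeTrain([0.1, 0.5, 1.2] * pq.s,
                        t_start=0.0 * pq.s, t_stop=2.0 * pq.s)
    st.annotate(unit_id=1, sorted=True)      # arbitrary key-value annotations

    # an analog signal carries its sampling rate as a quantity
    sig = neo.AnalogSignal(np.random.randn(3000, 3) * pq.mV,
                           sampling_rate=30000 * pq.Hz, t_start=0.0 * pq.s)
    sig.annotate(electrode_id=7)

    # grouping objects: a Segment collects everything from one stretch of time
    seg = neo.Segment(name='trial 1')
    seg.spiketrains.append(st)
    seg.analogsignals.append(sig)

    blk = neo.Block(name='example session')
    blk.segments.append(seg)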
This structure has turned out to capture most of the recordings you will come across, and it delivers a segmentation that is meaningful for the user without being convoluted by objects that already carry a semantic flavor. For instance, there is nothing like a trial in here, because a trial can always be represented either by an Epoch or by cutting the data beforehand into Segments, and I think this is one of the biggest challenges: how do you design these things in a generic way, so that they are applicable to everybody but still easy to use?

Here is an example of how to actually use this in practice. We have an object a in Python, which is an AnalogSignal. If I just print it, it will directly tell me what it is: an AnalogSignal with three channels of length 3000, units in millivolts, and so on. You see it has a description, "raw signal of a three-contact electrode", and you see that annotations are essentially realized as a dictionary, so you can add arbitrary key-value pairs as annotations to this object in order to further specify what it means. If you ask for the sampling_rate attribute of this AnalogSignal, it directly gives you one of these Quantities objects. Here it looks a bit ugly, because it is not simplified in this output: it says 1/30000 of a second per sample, which corresponds to 30 kHz, admittedly written in a very ugly way. But one of the really nice things about using Quantities to specify something like the sampling rate is that it is not divided out; it is kept as an exact expression, so you do not lose any accuracy by actually computing 1 over 30000, which would become some horrible decimal millisecond value that will cause you pain here and there. It allows you to keep the exact values and still calculate with them. For instance, I can call the function time_slice on my AnalogSignal and ask it to cut from 0.5 seconds to 1 second; the pq here is the Quantities library and pq.s is seconds, so I can tell it directly in which units I want the signal cut, and you will see that despite this ugly sampling rate up here, it cuts at the correct time indices, in a very readable but also very well-defined manner. And of course you can also inspect things like which Segment this AnalogSignal belongs to: you can just ask a.segment, and it will tell you exactly in which Segment this AnalogSignal is grouped. (Whether a signal should be able to belong to multiple segments was a question at the time; this is a slightly older slide, and I think that has been addressed in the meantime.) To the question from the audience: yes, this is the unit of the sampling rate; it is just not printed here, that is the problem. The beauty is that the unit can be specified as a string, so the expression here is still a symbolic expression that you can directly insert, and only when a calculation requires it to be evaluated, or when you call simplify or whatever it is called, will it actually be worked out.
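As a rough sketch of what these calls look like in code, assuming a is the AnalogSignal from the slide (the annotation contents are whatever you put there; time_slice, sampling_rate and segment are the real Neo attributes):

    import quantities as pq

    print(a)                  # AnalogSignal, 3 channels, units mV, ...
    print(a.annotations)      # a plain dict of key-value pairs
    print(a.sampling_rate)    # a quantity, e.g. 30000.0 * 1/s

    # cut out 0.5 s to 1.0 s; the units are given explicitly via quantities
    a_cut = a.time_slice(0.5 * pq.s, 1.0 * pq.s)

    print(a.segment)          # the Segment this signal is attached to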
What is also important to know is that although this is the full Neo tree, you are not required to build everything up: you can work with just AnalogSignals, or with just Segments; you can have a Block, a Segment, and Events without any ChannelIndexes, and so on. The idea is to provide you with building blocks which you can put together to create a structure that represents what you want to represent. When you use the Neo IOs, of course, things are a bit different in the first loading step, because the IO has to make some initial decision about how to do it, and the idea there is more or less what you would expect: the file is represented by the Block, and then, depending on the vendor, the IOs will typically create a Segment for each continuous piece of recording, a SpikeTrain per spike train if something like that is in the file, an AnalogSignal for each recording channel, and so on. In the further process you may then want to manipulate this, for instance by cutting trials and the like; if this is a convenient structure for you to work in, this is the way to restructure your data by cutting, shifting, realigning, whatever, and then you could, for instance, save the new Neo structure to disk and use that to collaborate with others.

The question now is: we heard from Thomas about metadata, and we just heard a bit more about the data representation. We have lots of information, maybe in an odML file, and we have a way to load our raw data; how do you merge the two, how do we add these semantics? For this I would just like to remind you of this picture again, a sketch of an experiment where you see the individual machinery needed to keep this monkey happy and the different types of files producing data here and there. Again, only two of these files hold the actual data; the rest is all what we would consider metadata, which needs to be recorded, so it is quite a mess. On the other hand, there is also the temporal evolution of such an experiment to consider. Typically you start off designing the experiment; assuming a monkey experiment, you start about a year before you take the first data, so you have a year of collecting metadata beforehand. At some point you have the surgery, and something about the recording device, and then you start recording for a year, session by session; at some point you find a poor soul to do the spike sorting, who will start at some point, and at some point maybe you get to the quality checks. This is of course not a strictly sequential progression; ideally these things would run in parallel as much as possible to avoid any data loss or bad data. But essentially you see that there is also a temporal dimension to collecting metadata. So the question is how you then structure the metadata into such an odML structure, how you form these hierarchical metadata representations, and here it turns out that we tend to group the metadata according to our own categories of how we think about our experiment, both in terms of the equipment we use and in terms of its temporal evolution. For instance, we may have one category of metadata we consider post-processing, something that comes at the end; one category that belongs to the monkey, the animal, the subject; one category that somehow belongs to the recording system; and
these kinds of things. So we tend to think in these broad categories and to structure our metadata according to these principles. What then typically happens is that we need some way of collecting the metadata from these individual files. I do not want to go into the details; there is a publication describing this at greater length, but in general what happens is that we create templates which give a rough structure of how we want to represent the metadata, this heterogeneous gush of metadata comes in from the left, and we have scripts which merge the two and spit out an odML file with the combined information.

How this looks I would like to show quickly with an example of one published data set, which is now on the GIN infrastructure that I guess you have already seen and that Thomas talked about. Maybe coming back to a question raised in the discussion of the last talk, why put it on GIN in this case, why use a repository that is specific to this type of data: I think one of the best answers is that a repository specific to a particular type of data often gives you certain little conveniences. For instance, in this data repository, if we go to the actual data sets and click on the odML file containing the metadata, then, network speed permitting, there is a viewer integrated into the GIN repository that lets us directly view the odML file in the repository. So you can see the data to a better degree than if I had deposited it on something more generic. I have to turn my back to you for a moment because of the screen setup, but in general, this is what a former student of ours, Lyuba Zehl, came up with after quite some time: a quite complex set of metadata structured according to the generic categories I just talked about. You will see that information is often scattered across these individual categories. For instance, in the pre-processing category there is a section on offline spike sorting, which is contained here, but if I go to the recording equipment, then to the array that was implanted, and I look at the information per electrode, then again I see information from the offline spike sorting, namely the specific information about units on this particular electrode. What this means is that information about the spike sorting, which you obviously want to have at hand later when working with the data, lives in several locations in this odML file, and there is always a big tension: how do I structure the metadata so that, on the one hand, it makes sense when I want to view it, in the way Thomas showed, for generating overviews and understanding the data set, and, on the other hand, so that I understand which of these metadata make sense in the context of which of the actual data objects that I have? For instance, when I look at this spike train, which information do I need? I probably want to know from the artifact rejection whether I should use the spike train at all, I may want to know something from the spike sorting, and maybe I am even interested in
where this electrode was, whether it sat in prefrontal cortex or who knows where, and so on. In general, there is no real consensus yet on how to map this, and that means, essentially, that at the moment we need to do this mapping ourselves. However, the fact that odML gives us key-value pairs, and that we are able to annotate Neo objects with key-value pairs, makes this a much less complicated task than it may seem. Essentially, we could say, for instance: for the Units, which represent the neurons in my recording, what I really want to know is the signal-to-noise value of each of these units, and these may be stored in the odML as metadata coming from some post-processing script that calculates SNR values for all the units. Then the thing to do is simply to annotate each Unit in the Neo structure with the corresponding value from the odML. So the workflow that, to my understanding, is currently the best thing to do is to have something I like to call a derived IO: an experiment-specific IO which has basically two tasks. First, it loads the metadata using the odML library; then it loads the data using a specific IO of the Neo library, for instance for a Blackrock data file; and then the job of this IO is simply to put selected key-value pairs from the metadata onto the Neo objects that come from the data. You probably do not want to put everything on there, because it is just far too much, something like 11,000 key-value pairs, and that is not going to make things easier for you; you just want to select the few that are really meaningful for your specific analysis task, for instance the signal-to-noise ratio of a unit. Then you can say later on: do this analysis on all the units which have a signal-to-noise ratio larger than three, then do the same thing again on all units with a signal-to-noise ratio larger than one, and see whether you still get something similar, in order to corroborate a result. The resulting Neo object is something you can work on, and of course you can then save it, either to NIX or to plain HDF5 or whatever, so that you also have a fixed copy on disk for further use. If you collaborate with people, you always have to decide: either you work jointly on this derived IO, that is, jointly on the code, or you agree on some newly saved version of the data and share that. Both have advantages and disadvantages: the saved version means a bigger file size but can lead to fewer problems, while the shared code requires more discipline among the collaborators to keep their code up to date on all sides. So that is a personal preference; I would recommend the saved data for larger numbers of collaborators and the shared code for smaller ones. If you have ten collaborators, do not share the code; save the data and give your collaborators something on disk. If you know the people well, then maybe the code is easier.

Just to show you how this looks in Neo: this is exactly the same data file, where I now use this derived IO to load the data. It does exactly what I described: it loads the odML file I just showed, and it loads the corresponding data file, using a command that is similar for all the Neo IOs, called read_block, which typically takes some parameters that are specific to the file format.
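A minimal sketch of what such a derived loading routine might look like; the file names, the odML section path, and the property names are made up for this illustration, while odml.load, the Neo IOs, read_block, annotate, and write_block are the actual library entry points (the odml-python 1.4 API with .values is assumed here):

    import odml
    import neo

    # load the metadata and the raw data (hypothetical file names)
    metadata = odml.load('i140703-001.odml')
    reader = neo.BlackrockIO(filename='i140703-001')
    block = reader.read_block()

    # pick the odML section that holds the spike sorting results
    # (this section path and the property names are experiment specific)
    sorting = metadata.sections['PreProcessing'].sections['SpikeSorting']

    # copy a selected key-value pair onto each spike train as an annotation
    for seg in block.segments:
        for st in seg.spiketrains:
            prop_name = 'SNR_unit_%i' % st.annotations.get('unit_id', 0)
            try:
                st.annotate(snr=sorting.properties[prop_name].values[0])
            except KeyError:
                pass  # no SNR recorded for this unit

    # optionally fix a curated copy on disk for collaborators
    neo.NixIO('i140703-001_curated.nix', mode='ow').write_block(block)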
The output is, obviously, a Neo Block object, and what you can then do is print lots of information about the Block and about the Segment (this is just shorthand for the Neo Segment attached to the Block); in particular, you can print the annotations of all of these objects. If I do this, just to give you an idea, you see here the annotations of the Neo Block, and you see that there are several key-value pairs taken from the odML which describe the general experiment: where it was recorded, what the subject name was, and so on. And if we look, for instance, at an AnalogSignal, which would be a raw recorded signal, we see things like the connector ID, the channel ID, some things from the preprocessing, some filter settings, and so on: all the things it is sometimes useful to just have handy next to your data.

Currently the main focus of development is on the IOs, to have IOs that read more or less on demand. That is one of the big things at the moment, because one of the things Neo currently suffers from, and I think basically all of the readers you get from your vendors as well, is that they either read in the complete file or they are very limited in what you can select to read. So one of the ideas is to extend Neo to be able to read the structure of the file as a whole without loading the actual data yet, so that you can selectively say: now I need this channel, but only this stretch of data, say from 500 to 1000 seconds, and to have this available in a generic manner, not on a good-will basis depending on whether the file IO supports it or not. The way I see it, Neo is currently mainly a test of what types of representations and what types of objects we actually need. There was quite some blowing up in the past, where people kept adding "we need this and this and this", and currently it is shrinking again, so there will probably be one more step of simplification in the future, but it seems to be converging to something that many people can live with. Once that is settled, I think one could also build something database-like in the back end, but for now something like this is probably the right level, because a database always brings its own set of problems.

Okay, so now we are happy: we have these Neo objects, and now of course we need to do something with them. This is again the data, spiking activity of many units in parallel, and what you probably know, and what you will also realize over the next one and a half weeks, is that in order to understand such data it is always useful to look at many, many things. It is not enough to look at just the firing rate; you also want to look at CVs, at interspike intervals, at rate distributions, and at many other things; the data are just so rich. So the question is, if you are just a simple researcher and your time is limited, how can you try so many different things, especially if you are maybe not even the best programmer? Obviously what we need is a way to simplify the process of trying things out on data, looking at the things other people have looked at, and sharing all these methods that people describe in their papers and make nice figures of.
So I would like to talk about one project I am involved in, the Electrophysiology Analysis Toolkit, Elephant. It is meant as an umbrella, a library to bring together analysis approaches from the community, in particular to bring together experiment and simulation, with a focus not just on single-neuron dynamics but really on population-type measures that characterize multi-dimensional recordings, and also with a focus on multi-scale data, so measures which relate, for instance, field potentials and spikes. The idea of Elephant is that it is based on the Neo library, so it is one of these applications sitting on top of the Neo data structure, and by this means it should close the gap between having the data loaded and being able to easily apply analysis methods on top of them. It has a rather modular design, and the idea is to have it community-centered and completely open source: there is no hidden development or any closed parts; it is really meant as a resource for the community. At the moment there are already quite a few contributions to this library, some from our own lab, some from other labs, and some of it actually started at this course, interestingly. In general, Elephant has different submodules, which each tackle either a set of smaller analysis methods that topically belong together, for instance spike train correlation or spike train dissimilarity, where there are lots of measures you can use, or one specific, more complex and involved analysis method, for instance unitary events, which you will hear about in one or two days, and which needs the space of a module of its own. Here are just the slides on where to find documentation and support; you will find them in the PDF.

So what types of functions would you want to have in such a toolbox? On the one hand, you want terribly simple and stupid functions. Something like a time histogram of spikes is obviously super simple to do, and something you would never load a toolbox for, but it has two advantages. First, you can do even these simple operations in a consistent way, and the advantage of that is that you can then modularly build up the toolbox on top of such elementary operations: if you have two very complex methods and you think they should be comparable, and you know that they do elementary things like binning with the same underlying algorithm, then at least you can exclude a few places where the two analysis results could differ. The second thing is standard analysis functions to compare analysis results across data sets. There are certain things that we keep doing over and over, but when you read the methods section of a paper, it is typically never really explained how exactly it was done and which options were set. By providing specific, standardized analysis functions and specifying the parameters to set, this gives a more formalized way of representing your analysis workflow.
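As a small illustration of these elementary functions: assuming sts is a list of Neo SpikeTrain objects, the basic statistics live in elephant.statistics (keyword names such as the bin width have shifted between Elephant versions, so treat this as a sketch rather than the exact current signature):

    import quantities as pq
    from elephant.statistics import time_histogram, mean_firing_rate, isi, cv

    # population time histogram with 5 ms bins
    hist = time_histogram(sts, 5 * pq.ms)

    # simple per-neuron statistics
    rates = [mean_firing_rate(st) for st in sts]
    cvs = [cv(isi(st)) for st in sts]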
But the most important thing, of course, is to have methods in the toolbox which are intrinsically difficult to implement, first of all because you are probably not the expert, since somebody else spent their whole PhD developing the method, and second because the experts have some know-how on how to do certain things, for instance numerically, that you are simply not aware of. I would like to give you one example, the unitary events method. You will hear about this in detail in the talk by Sonja later, so I will not tell you exactly what is happening here, but roughly it is a method to detect synchronized spiking activity between neurons, for instance here between neurons one and two, across many trials. A student wanted to implement this method for the Elephant library. There was no Python code available; as a matter of fact there were many MATLAB and other codes available, it was no longer really clear how the original publication from the nineties had been produced exactly at the time, and so on; I guess you all know the situation. What he did was to program the method from scratch, as it was described, and, I might add, completely correctly, so there was no bug in the method or anything. He got a result on the original data, which we found again after some time, and when this was compared to the original publication, then at first glance, without knowing exactly what these red dots marking the synchronous spikes mean, and without knowing exactly what this significance measure means, you would think it looks reasonably good. And it actually is reasonably good, but it is really not precise. If you look, for instance, at these little blips here, they are quite different from there, and if you zoom into the exact spikes that are marked here and here, it is not one hundred percent the same. The bottom line is: even if you follow an algorithm in a paper and re-implement it, you will get something very similar, yes, but there are so many decisions you have to make for a complex method, do it like this or like that, put your first if-statement here or there, and so on, and these will make differences. In this case, we worked on the method in Elephant until we were able to completely reproduce the original finding, which you see here: the final result, where the original time points of these red dots are compared against the reproduced ones, so we are quite sure that it now does exactly the same thing. This was published in the journal called ReScience, a GitHub-based journal, I do not know who of you knows it; it is a very nice journal that publishes replications of old science, so you can reproduce papers and publish that, and what is very nice is that it is published on GitHub, on a version-controlled repository, along with the reproduction itself. So what happens now is that the reproduction, and the data leading to this reproduction, are downloaded from GitHub when the testing framework of Elephant runs, and each time somebody changes something in the unitary events code, we can check whether it still reproduces this old result. I think this is one of the big advantages you can have by contributing to such toolboxes, whether Elephant or anything else: really trying to have tests incorporated in your library that do not just test whether your function fails if you pass a string instead of a number, but that also test whether you can reproduce something that has been done in the past.
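Such a regression test can be quite small. Here is a minimal sketch of the idea, not the actual Elephant test: the reference file name and the tolerance are made up, the call roughly follows elephant.unitary_event_analysis.jointJ_window_analysis, and the keyword names have changed between Elephant versions, so check the documentation for your release:

    import numpy as np
    import quantities as pq
    from elephant.unitary_event_analysis import jointJ_window_analysis

    def test_reproduces_published_result(spiketrains, reference_file):
        # reference_file holds the values extracted from the original paper
        reference = np.load(reference_file)
        # spiketrains: trials-by-neurons nested list of Neo SpikeTrains
        result = jointJ_window_analysis(spiketrains, bin_size=5 * pq.ms,
                                        win_size=100 * pq.ms,
                                        win_step=5 * pq.ms)
        # the recomputed surprise values must match the published ones
        np.testing.assert_allclose(result['Js'], reference['Js'], rtol=1e-6)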
Because then you can be much more sure that what you do with the method is really comparable to what has been done in the past, and especially for these higher-order correlations, these very difficult methods, it really can make a difference. One thing I want to mention quickly is that complex methods also require more cores and parallelization, and here too we try to implement methods in a way that reduces compute time. One of the higher-order correlation methods was optimized by a computing expert, who brought the run time down from 300 seconds to 20 seconds; that is the scale of what you gain when a scientist implements something versus somebody who is an expert in programming. He also parallelized it, so that it scales with parallel compute architectures.

How does this work in practice? I have a very short mock-up toy example; again, the method itself is something you will hear about from Sonja in a few days. Assume we want to find some synchronous spiking patterns in data. The first thing we might do is generate some mock data, just so we understand what we are doing. What we do here is call elephant.spike_train_generation.compound_poisson_process, a method that generates a Poisson process containing higher-order events of a given amplitude; this is implemented in Elephant. In a second stage we add some homogeneous Poisson processes without any higher-order correlations. What comes out of this, sts, is a list of Neo SpikeTrain objects. In the second step we call SPADE in Elephant; SPADE is the name of the method we use to detect these patterns, and as you can see, the main thing we pass to it is data=sts, so we pass it this list of spike trains in the common Neo format, plus some parameters. When we let this run, this is the output: of the 100 neurons we created, neurons 0 to 9, or 1 to 10, whatever, contained these higher-order patterns, and overlaid in red is what the SPADE method picked out. With this toy mock-up we can assure ourselves that we at least roughly know what the method is doing; in the real world you would do something smarter than this, but this is the proof of concept. Now assume we want to go to the data set we have just been talking about and repeat the analysis. I think this shows the beauty of Neo very nicely. All I do is use the NIX IO of Neo, for the NIX file format; I give it the file name and call read_block, and I get out a Neo Block, here called real_block. In the next step I want to extract the spike trains from this loaded file, so I call the filter function on real_block. The filter function extracts certain objects from the Block, and in this case I am telling it: get me all spike trains which have an annotation unit_id equal to 1, in other words, all spike trains which are the first unit on each recording channel, the best unit, in our bookkeeping, from each channel. What I get out is again, by definition, a list of spike trains, sts.
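Put together as code, the two steps might look roughly like this. The function names are the actual Elephant and Neo entry points, but the SPADE parameters, the amplitude distribution, and the file name are placeholders, and keyword spellings differ between Elephant versions, so this is a sketch rather than a recipe:

    import numpy as np
    import quantities as pq
    import neo
    from elephant.spike_train_generation import (compound_poisson_process,
                                                 homogeneous_poisson_process)
    from elephant.spade import spade

    # mock data: 10 correlated neurons with occasional order-10 events ...
    amp_dist = np.array([0, 0.98] + [0] * 8 + [0.02])
    sts = compound_poisson_process(rate=5 * pq.Hz, A=amp_dist, t_stop=10 * pq.s)
    # ... plus 90 independent homogeneous Poisson processes
    sts += [homogeneous_poisson_process(5 * pq.Hz, t_stop=10 * pq.s)
            for _ in range(90)]

    # pattern detection on the mock data: data, bin size, window length
    patterns = spade(sts, 3 * pq.ms, 1)

    # the same kind of input extracted from real data, loaded via Neo
    real_block = neo.NixIO('reach_to_grasp_session.nix').read_block()
    sts = real_block.filter(unit_id=1, objects=[neo.SpikeTrain])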
This I can plug into exactly the same SPADE call as I just showed, and immediately I get the same analysis for real data. That is the overall idea. Besides Elephant there are of course other things around, and this comes back to the Neo picture: the beauty is that if we can agree on common data formats and underlying data representations, then we can pool the tools we need from all over the community. I have here a list of the things I know of, or at least have heard about, which may serve as starting points. There is NeuroTools, which I want to mention specifically because it is actually the father of both Neo and Elephant; it was a project started in the early 2000s during the EU FACETS project and was one of the first Python-based initiatives to start such a tooling endeavour as a whole. It was still a more or less cluttered combination of Neo-like objects and analysis tools, and that is where the idea came from to separate the two into a data representation and an analysis library. There are also graphical tools, for instance OpenElectrophy by Samuel Garcia, though I am not sure how actively it is still developed, and there is a viewer tool by Robert Pröpper, which I think still works, that can visualize generic Neo objects and has some basic analysis tools on board, which have in the meantime also been integrated into Elephant. Then there are some other things, to mention just a few: PySpike, for instance, by Thomas Kreuz, who keeps publishing his spike train metrics in this toolbox; I am not sure whether it is Neo-aware yet, I did not check. And of course there are tools from other subfields of neuroscience which could definitely be of interest: there is NiPy, the neuroimaging-in-Python initiative, which has a subproject called nitime with time-series analysis, spectral analyses and that kind of thing; the MNE community has a tool for MEG and EEG data; and so on, I will not go through all of it, you will find it in the slides. Since people keep asking me about spike sorting: yes, spike sorting is currently exploding, as you may have seen. There are many initiatives, some more classic projects such as these, but also newer projects such as SpyKING CIRCUS and Tridesclous, which are both Neo-aware I think, SpyKING CIRCUS I believe and Tridesclous for sure, and MountainSort, for instance. So I think support for many different fields of application is on the rise.

In the remaining time I would like to spend a few words on workflow management as a whole. We have discussed a few tools themselves, but of course we need to embed these tools in some sort of broader workflow, and here I would just like to mention that there is a large number of tools available for this. Personally, I like to think of them in several categories. First, you want something to represent your analysis pipeline as a whole: this is partly the tools themselves, maybe Elephant, but also things like workflow engines, which orchestrate your workflow, and I will say something about one particular one, Snakemake, in a second. Then there is the issue of versioning and provenance: you want to keep track of the data you generate and of how you
generated it. Here too there are lots of tools that come into play. GitHub, and versioning using git, is probably the single most important thing that every scientist should be able to do, and as you know there will also be a part of the course devoted to this topic later on. There are also things like Docker environments and other tools which give you consistent environments, so that if you rerun your analysis a year later on your own PC, you do not get the message "sorry, this version of the software does not work anymore". Then you will often need access to high-performance computing; here the tools are typically a bit more specific to your research institution and need to be integrated into your workflow from that side, be it queuing systems or generic systems like UNICORE or other things which let you distribute your work on HPC machines. And finally there is data management; I think we already heard a lot from Thomas about this, about GIN, and DataLad is another interesting initiative, and so on.

Coming to Snakemake: I do not have time for a complete Snakemake tutorial, but what is Snakemake? It is a tool you may find interesting to check out; it is a way to describe your work in terms of the flow of scripts you need in order to get to a certain analysis result, and it does so in a Makefile style. I do not know who of you knows make; okay, not so many. Make is a Linux tool that originally comes from programming: make is a tool that builds a program, and what Snakemake does is build your research. It is called Snakemake because it is Python-based, and it works particularly nicely with research projects that are in Python, because you can write your workflow descriptions in a pseudo-Pythonic language that integrates nicely. You describe your workflow in terms of so-called rules, and a rule tells you how you get from an input file to an output file by running a certain set of commands, or, in Snakemake, by running a certain set of Python scripts. The beauty of Snakemake is that it allows you to specify your inputs and outputs not just as literal file names but programmatically, which is something that comes up in research all the time: you do an analysis for every trial, say, and for animal A you have a hundred trials and get a hundred files out, say a JPEG with the firing rate for each, and for the next animal you get 130 trials' worth of JPEGs, and so on. So there is variability, and you often do not know in advance exactly which files will come out. Again, I cannot do a full tutorial here, that would take another two hours, but it gives you a very nice way of describing your workflows. For instance, what we have here is a rule move_to_manuscript; the idea is to take the figures produced by your plotting script and put them into the manuscript folder. You take the output files of a previous rule, the rule that makes the plots themselves, as the input files of this rule; the outputs are the corresponding files in the figure directory of your manuscript; and what you want to run is just a copy command, copying the file from A to B. This is a deliberately simple example, but you can see that you can build up a very nice sequence for your individual research steps this way.
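A minimal sketch of what such a Snakefile could look like; the rule names, file names, and environment file are invented for illustration, while input, output, conda, script, shell, and the rules.<name>.output reference are the standard Snakemake directives:

    rule make_plots:
        input:
            "results/firing_rates.csv"
        output:
            "plots/firing_rates.png"
        conda:
            "envs/analysis.yaml"       # pin the exact environment for this rule
        script:
            "scripts/plot_firing_rates.py"

    rule move_to_manuscript:
        input:
            rules.make_plots.output    # reuse the outputs of the previous rule
        output:
            "manuscript/figures/firing_rates.png"
        shell:
            "cp {input} {output}"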
The really nice thing is that you do not specify the workflow by saying run this, then that, then this; instead, in the end you say, I want to have this file, and Snakemake figures out how to build it, and if it does not know how, it just tells you, and then you have a look. Snakemake knows, from these rules, which files exist, which files are too old and need to be recreated, and which rules to run in order to have a consistent output and get you exactly that file based on updated versions of all the other files; this is essentially timestamp-based. What is also very nice, coming back to the earlier question, is that this supports trivial parallelization on HPC architectures for a number of queuing systems: if a rule needs to be executed many times, you can say, execute this, please, using a cluster wrapper. And, what is very nice when we talk about reproducibility, you can state directly, for each rule, which environment, which precise conda environment or which Docker image, the rule should run under, in order to be sure that whatever you generated was generated with exactly these Python versions, these Neo versions, whatever. It is a heavily used project, in bioinformatics at least, not so much in neuroscience yet, but we are working on it, and I would very much recommend it if you are looking for a solution for structuring your work nicely. The documentation on Read the Docs is a nice read; it maybe takes a bit of thinking until you get used to the style of writing Makefile-like rules instead of the typical thinking of do A, do B, do C, but in the end it is much more flexible. Okay, I will skip this, and since we still have five minutes I want to do one last step, which goes one step further; when you start on your projects you may be slightly reminded of these slides. I want to talk a bit about model validation. You may ask, why am I actually doing this, why am I analyzing data, why do I care? The reason you care is typically not because it is fun; well, that is not true, maybe it is because it is fun, but often you have some system of interest, you want to understand something, and just analyzing data does not by itself create much understanding. Understanding comes when you try to build some sort of model out of it. This can be a mathematical model, and often it will be, but it does not have to be; it can be some other model of understanding, where you try to make sense of what you actually analyzed in your data. In neuroscience, however, what we often do is really try to come up with mathematical models of the neural dynamics we observe, and I think a crucial step to understand here is that the mathematical model is something we come up with; all we can do at that point is a step of confirmation, thinking about whether this roughly matches the system, whether it is a good and plausible description of the system, and so on. What we then typically want to do is see whether this model really reproduces something interesting in the environment, and for that we need some sort of executable model, that is, a simulation, some sort of script that generates
Okay, I'll skip this slide, since we still have about five minutes, because I want to take one last step and go one step further; I'm sure that when you start on the projects you may be slightly reminded of these slides. I want to talk a bit about model validation.

You may ask: why am I actually doing this? Why am I analyzing data? Why do I care? The reason you care is typically not because it's fun. Well, that's not true, maybe it is because it's fun, but often you have some system of interest that you want to understand, and just analyzing data doesn't by itself create much understanding. Understanding comes when you try to build some sort of model out of it. This can be a mathematical model, and I think it often will be, but it doesn't have to be; it can also be some other model of understanding by which you try to make sense of what you analyzed. In neuroscience, however, we often really do try to come up with mathematical models of the neural dynamics we observe. A crucial point to understand here is that the mathematical model is something we come up with ourselves; all we can do at this stage is a step of confirmation, where we think about whether it matches the system, whether it's a good and plausible description, and so on. What we then typically want to do is see whether this model really reproduces something interesting in the world. For that we need some sort of executable model, that is, a simulation, a script that generates a representation based on our original mathematical model.

One thing we typically have to do here is a step called verification, which means assessing that the implementation of our executable model is a correct one: computationally correct, and numerically correct in particular. Then comes the interesting part. Our simulation produces something, and the question is whether it reproduces our system of interest; this system could be reality, but it could also be an abstract concept. That step is what we call validation: I built a model from my understanding, and I validate it against the original system of interest. Be mindful that there is no parameter tuning here. We are not talking about a step where I take the system of interest, tune the parameters of an executable model against it, and then validate it back; in that case validation would be rather trivial, because I built the model to be exactly this system of interest, and I would have to cross-validate instead. Here it's really a question of how far the model matches or not.

In this validation there are many different scenarios we can account for. There is model-to-experiment validation, which I think is what typically happens, and there is model-to-model validation: sometimes we have two models and want to see whether a new implementation actually fits the original one. We can also ask on which structural level we validate: does the dynamics of a neural network simulation as a whole match the activity I see in the brain, or do I try to match a single synapse? That is also something I could validate. And finally I can ask about the acceptable agreement, which indicates when I would be happy with the validation: is it enough that things look roughly the same, or do I want something that is the same spike by spike? That is also something to think about.
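Just to give the acceptable agreement a concrete shape before we look at an example, a validation test can be written down schematically like the following. This is a deliberately simple Python sketch, not the tooling we actually use; the statistic, the function name, the synthetic data, and the significance level of 0.05 are all assumptions made purely for illustration:

    # Schematic validation test: compare a statistic of the simulation
    # (here standing in for single-unit firing rates) against the same
    # statistic from the reference, and decide using an explicit,
    # pre-defined acceptable-agreement criterion.
    # Purely illustrative; names and the 0.05 threshold are assumptions.
    import numpy as np
    from scipy import stats

    def acceptable_agreement(simulated, reference, alpha=0.05):
        """Two-sample Kolmogorov-Smirnov test as one possible score."""
        statistic, p_value = stats.ks_2samp(simulated, reference)
        print(f"KS statistic = {statistic:.3f}, p = {p_value:.3f}")
        # Agreement counts as "acceptable" here if the two distributions
        # are not distinguishable at the chosen level.
        return p_value > alpha

    rng = np.random.default_rng(seed=1)
    rates_model = rng.gamma(shape=2.0, scale=3.0, size=200)
    rates_reference = rng.gamma(shape=2.0, scale=3.0, size=200)
    print("acceptable:", acceptable_agreement(rates_model, rates_reference))

Whether a p-value, an effect size, or a visual criterion is the right notion of agreement is exactly the design decision I just mentioned; the sketch only shows that the criterion should be made explicit before you start comparing.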
In this case I'd just like to go through one little example scenario, really quickly, because I think it could be a nice way of starting to think about your projects on the level of network validation testing. The data I'm presenting to you now is from an old paper by Izhikevich, which some of you may know, called Polychronization. The reason we took this paper is that we have to be really thankful to Izhikevich, because he was one of the first people who actually published his model code, on his webpage, so we have his original simulation code from 2006; at that time this was not yet standard, and it was actually a big crisis at the time, when people were starting to think about how to actually convey our theoretical research. It was basically a model of three populations, and the gist of the paper at the time was these polychronous groups: the network was able to learn sequences of spikes it was provided with. The implementation was a custom-written C program, in addition to a MATLAB program. What we did was port this both to the NEST simulator and to the neuromorphic hardware called SpiNNaker, which is a hardware chip performing neural network simulations. The question was then: now we have three different simulations, are they doing the same thing, and how can we even start assessing whether something is the same or not, what is real and what is wrong?

Yes, exactly, and that's the first thing to realize: you will never get agreement spike by spike for something like this, never; that is something we need to live with. Of course we would hope that when we run it twice we get the same result, for something like NEST for instance, but for neuromorphic hardware even that becomes very difficult. So this is what happened, and I think there are three things we learned. What you see here is a comparison of the C simulation, the original Izhikevich model, against the first attempt, and this was not so convincing; you can see that by eye. The second attempt was a bit better, and for the third attempt you could almost say it looks reasonably good. So if your acceptable agreement is very loose, you might just stop here and be happy. However, I'd like to point out that if you then look at a measure such as the pairwise correlations in the C versus the SpiNNaker simulation, you see some spurious differences: there was this long tail of correlations for the SpiNNaker simulation, and in the correlation matrices this showed up as these crosses somewhere. Only when you looked closely at the activity did you see that sometimes these blips of activity came up, which were obviously still a bug, even in this third implementation. So you may not be happy after all.

The bottom line is that you keep improving these simulations, and what you really have to do is look at many different measures in parallel. Here, for instance, we looked at the firing rates, the local coefficient of variation (LV), and the cross-correlation structure, again over the three iterations, and what you see is that the distributions progressively get closer together, so that in the last iteration we would say the LV, so the regularity, is already pretty good, the cross-correlations as well, but the firing rates are still too low for SpiNNaker. Now you could say, okay, the LV is good, so that's fixed, it works; but be aware that you might make fixes such that in the next iteration the cross-correlations fit perfectly and the firing rates fit perfectly, and then suddenly the LV doesn't fit anymore. Essentially this means that when you're comparing different network states, it's always a good idea to look broadly at many different features. Just assuming that, because the correlation structure seems to be preserved between the two, the lower-order measures must also match, does not hold: the regularity can differ independently of the correlation structure, and that's something that is intuitively maybe not always so clear.
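As an aside, this kind of broad comparison is easy to set up yourself. The following is a minimal Python illustration with synthetic Poisson spike trains, not the analysis code from this study: it computes three measures per simulation (firing rate, coefficient of variation of the inter-spike intervals, and pairwise correlation coefficients of binned spike trains) and compares each distribution between two simulations; all numbers and parameters are made up.

    # Compare two sets of spike trains on several measures at once.
    # Minimal illustration with synthetic data; not the original analysis.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)
    T = 10.0          # duration in seconds
    n_units = 50
    bin_size = 0.005  # 5 ms bins for the correlation measure

    def random_spike_trains(rate_hz):
        """Homogeneous Poisson spike trains, one array of spike times per unit."""
        return [np.sort(rng.uniform(0.0, T, rng.poisson(rate_hz * T)))
                for _ in range(n_units)]

    def measures(trains):
        rates = np.array([len(st) / T for st in trains])
        cvs = np.array([np.std(np.diff(st)) / np.mean(np.diff(st))
                        for st in trains if len(st) > 2])
        bins = np.arange(0.0, T + bin_size, bin_size)
        binned = np.array([np.histogram(st, bins=bins)[0] for st in trains])
        corr = np.corrcoef(binned)
        pairwise = corr[np.triu_indices(n_units, k=1)]
        return {"rate": rates, "cv": cvs, "correlation": pairwise}

    sim_a = measures(random_spike_trains(rate_hz=5.0))
    sim_b = measures(random_spike_trains(rate_hz=6.0))  # slightly different rate

    for name in ("rate", "cv", "correlation"):
        stat, p = stats.ks_2samp(sim_a[name], sim_b[name])
        print(f"{name:12s} KS={stat:.3f} p={p:.3f}")

The point of looking at the whole table of tests, rather than at any single one, is exactly the lesson from the slides: one measure can match while another still differs.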
So in the end you come to some final result and you think: this is really nice, we now have something like six different measures, maybe this is good enough; I can see that the firing rates are still a bit off, but maybe that's acceptable. And then I think: wait, what was this, the polychronization model? It's all about patterns. And now we remember the middle of this lecture and think: we have a method for patterns, so let's find patterns in this data. So we use again the SPADE method I talked about earlier, now with spatio-temporal patterns, which is a small difference but essentially the same thing. And what we do is simply count the patterns, and voilà: although in this last iteration all the measures look so nice, the numbers of patterns we find are vastly different in the two simulations, even though the activity looks basically the same by eye. The regularity of the patterns is the same, and so is their frequency, but not their number, and this is probably just due to a slight difference in rate. So always be aware, when comparing different data sets, whether they are model simulations or real data, that you should look at many different things and think about how some measures may influence other measures.

So this is my summary: share your code, not by dumping it somewhere, but by building and contributing to community projects; it's worthwhile. Don't reinvent the wheel every time, even if it sounds fun and tempting. As scientists, focus on the science, and think of the technical assurance, but also of the time for fun activities you will have, when you know that 90 percent of your results rest on a triple-checked and error-corrected community effort, as opposed to 100 percent on you alone. And with this I'd just like to thank the people who contributed to most of the things on these slides, but also, in a broader scope, all the people contributing to these software projects, and also Thomas's team. Thanks.