Good afternoon. I will talk about evaluating neural network models within a formal validation framework. This is work done by my colleagues and me at the research center in Jülich.

I will start at the very beginning, with models: what are they, and how can we learn from them? As you certainly know, models are abstract, simplified descriptions of some system of interest, and their key attribute is that they can generate testable predictions. When a model predicts very accurately, we may want to transfer our knowledge of the model onto the system itself. This is called inductive inference and is good scientific practice. However, it doesn't mean that we have actually uncovered some truth about the system; it just means that we have found a useful description. But how can we find useful descriptions, and how can we decide which one is the most useful? This is exactly where validation comes in. Validation is the process of quantifiably evaluating a model's prediction accuracy with respect to the system of interest. But let me place it in a broader modeling context.

Here you see a schematic of the modeling environment with three key components. At the top is the system of interest, which I already mentioned. To give a very simple example, this could be a ripe apple tree with falling apples. By careful analysis and some theoretical considerations, one might be able to construct a mathematical model of it, as Newton did; you see where I'm going with this. The interesting part is the explicit separation between the mathematical model and an executable model. The executable model is the part that can run simulations and actually produce the predictions. For Newton's laws, it could be, for example, a standard physics engine. This code, of course, needs to be checked for correctness by a process called verification, and only a verified executable model can then go on to be validated. The validation step directly compares the simulation outcome to observations of the system of interest. And when the two match, well, we are in luck: we have found a very useful model.

This rather straightforward workflow enables us to compare and evaluate models and is therefore an indispensable part of simulation science. However, there are exceptions and additional aspects to consider, especially when the models become more complex. I will briefly mention three such aspects.

First: assuming you ran the test and got some result, what now? The result of a validation test is called a score, and the score tells you how much credibility you are allowed to place in your model. But no single score can ever encapsulate all the features over the whole domain of interest of the model. Therefore, you need to employ many tests, with different statistics and focusing on different features. This also means that the score doesn't necessarily tell you directly whether a test passes or fails; that depends highly on the intention behind the model. In the ideal case, the modeler knows a priori what the acceptable agreement is, that is, the range in which the score should lie, and can state it explicitly. To give a brief example, assume you have a spiking model with which you want to describe some data. If you are interested only in the rate profile, you might be in luck, because this matches quite nicely.
However, if, when you created the model, you were rather focusing on the exact spike times, for example the latency of the first spike, you would find a discrepancy, and the resulting test would fail. And there might be an additional test, for example on the regularity of the inter-spike intervals, which uncovers yet another discrepancy that is totally invisible to the first two tests. So you need many tests, and you need to be aware of what you actually want to model.

Secondly, in engineering it is common practice to start at the very lowest level with simple building blocks, validate them, and then build on top of that to validate the larger structures. This is not possible in neuroscience, for several reasons. On the model side, we have all these different scales, from ion channels to neurons, networks, and up to behavior, and the relations between them are very complex and often unknown. On the data side, we don't have these kinds of multi-scale data available. Additionally, if you have, for example, a single-cell model and validate it in some parameter regime, this is not necessarily the parameter regime the neuron model occupies within a network model. Therefore, the validity of a network model is not implied by the validity of the neuron model. And the other way around doesn't work either: as, for example, Potjans and Diesmann showed, large-scale network dynamics can be accurately modeled by a network of simple leaky integrate-and-fire neurons, which are not very biologically detailed.

Beyond this conceptual framework, you might face additional considerations in practice when doing validation. For example, if you want to validate a model, you might lack an appropriate data set for that model and be unable to come to a strong conclusion. In these cases it can be very helpful to validate not against data, but against another model which is more trusted. Or to validate against the same model, but in a different version or variant. This can be helpful to quantify the influence of small changes, for example in the underlying neuron model or in the solver for the ordinary differential equations. One of these changes can also be the choice of the simulation engine used, and I will come back to this precise example later on.

But how could a software tool look that actually reflects these different aspects and is versatile enough to adapt to the challenges specific workflows can have? The first thing you do when you develop a tool is look around for whether something like it already exists. And in this case, indeed, there is: the Python package SciUnit provides a general framework for validating scientific models. Our tool, NetworkUnit, for network-level validation of neuroscientific models, sits directly on top of it and also uses the electrophysiology analysis methods provided by the open-source project Elephant.

So let me show you the structure of this tool. The key components are the data and the model. The model is a class object which is able to run the simulation; if Newton had been into coding, this is what he would have written. The data would be our measurements of the falling apple. And the corresponding test could, for example, be a falling-velocity test, which tests the velocity of the apple after a given falling distance. The test is also a class object; it is able to compute the velocity from the model simulation and then compute the corresponding score in comparison to the data. A minimal sketch of these pieces in code follows below.
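To make this concrete, here is a minimal, hypothetical sketch in the spirit of the SciUnit API. The class names, the drop height, and the observation values are invented for this apple example; only the general pattern of capability, model, test, and score follows SciUnit, and NetworkUnit's real test classes are considerably more elaborate.

```python
import sciunit


class ProducesVelocity(sciunit.Capability):
    """Capability: the model can report a velocity after a given fall height."""

    def get_velocity(self, height):
        raise NotImplementedError()


class NewtonApple(sciunit.Model, ProducesVelocity):
    """A falling-apple model based on Newtonian mechanics (hypothetical)."""

    def __init__(self, g=9.81, name="NewtonApple"):
        self.g = g
        super().__init__(name=name)

    def get_velocity(self, height):
        # v = sqrt(2 * g * h): constant acceleration, no air drag
        return (2 * self.g * height) ** 0.5


class FallingVelocityTest(sciunit.Test):
    """Compares predicted and measured velocity after a fixed drop height."""

    required_capabilities = (ProducesVelocity,)
    score_type = sciunit.scores.ZScore

    def generate_prediction(self, model):
        return {"value": model.get_velocity(height=2.0)}  # height in meters

    def compute_score(self, observation, prediction):
        z = (prediction["value"] - observation["mean"]) / observation["std"]
        return self.score_type(z)


# Observation: mean and spread of (hypothetical) measured velocities in m/s.
observation = {"mean": 6.2, "std": 0.3}
score = FallingVelocityTest(observation).judge(NewtonApple())
print(score)  # the score object keeps test, model, and result together
```

The `judge` call is the point of the design: it checks that the model has the required capability, generates the prediction, and returns a score object rather than a bare number.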
This score is not only the quantified result of the test; the score object also contains all the information about the test and the model, to help with the interpretation of the result and to ensure its reproducibility.

Besides reproducibility, another design objective here was modularity. Aspects of the test, such as the type of score used or the statistic, are not hard-coded into the test but attached via a separate object. The parameters used are also attached separately. Consequently, there can be a more general test class which is agnostic about score type and parameters and can be reused for different variations of the test. Here, this would be a general falling test, and a scientist might be interested in a still more general test, evaluating the velocity of an apple when it is thrown horizontally. So there can be a more general base test which handles all movement with arbitrary initial conditions. This is an effort to write no calculation twice, so that everything depends on the same calculations. For network neuroscience, as it is implemented in NetworkUnit, one base test handles all the calculation of correlations. A child test of this can then, for example, directly compare the distributions of correlation coefficients, while another child test could instead use these correlation coefficients to build a graph, with further child tests that evaluate graph centrality measures. Although these are very different tests, they necessarily agree on the definition of correlations, because they use the exact same code; a sketch of this inheritance pattern follows below. Finally, this framework also makes sure that a test actually makes sense: it checks whether the model can produce the property, in this case movement, that the test wants to evaluate.

And indeed, this framework proves versatile enough to also incorporate the practice of validating one model against another model. Simply by inheriting from one dedicated test class, we can use the same test not against data but against a model, and the test is formally equivalent. For our apple example, the reference could be a relativistic model of motion, and we could identify the limitations of Newton's laws without having to actually measure superfast apples.
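Before moving on: to make the inheritance idea concrete, here is a hedged sketch of how such a test hierarchy could be organized. The class names, the `get_binned_activity` model method, and the threshold parameter are schematic assumptions, not NetworkUnit's literal API; the point is only that both children inherit the one correlation computation from the base class.

```python
import numpy as np
import sciunit


class BaseCorrelationTest(sciunit.Test):
    """Base test: owns the single shared definition of pairwise correlations."""

    # The score type is attached as an object, not hard-coded into the logic;
    # CohenDScore stands in here for an effect-size score.
    score_type = sciunit.scores.CohenDScore

    def __init__(self, observation, **params):
        self.params = params  # e.g. bin size, threshold; attached separately
        super().__init__(observation)

    def compute_correlations(self, binned_activity):
        # Every child test reuses exactly this calculation.
        return np.corrcoef(binned_activity)


class CorrelationCoefficientTest(BaseCorrelationTest):
    """Child: predicts the distribution of pairwise correlation coefficients."""

    def generate_prediction(self, model):
        cc = self.compute_correlations(model.get_binned_activity())
        return cc[np.triu_indices_from(cc, k=1)]  # off-diagonal entries


class GraphCentralityTest(BaseCorrelationTest):
    """Child: turns the same correlations into a graph and predicts a
    centrality measure; further children could refine which measure."""

    def generate_prediction(self, model):
        cc = self.compute_correlations(model.get_binned_activity())
        adjacency = np.abs(cc) > self.params.get("threshold", 0.1)
        np.fill_diagonal(adjacency, False)
        return adjacency.sum(axis=1)  # node degree as a simple centrality
```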
Okay, let's jump directly to a real-world application. This is in the context of a reproducibility study of the polychronization model, published in 2006 by Izhikevich. The model is interesting for at least two reasons. The first is that it produces very rich network dynamics of spatio-temporally organized, repeating spike patterns. The second reason is arguably more important: Izhikevich also published his code, which made it possible to reproduce the model in the first place. A first study focused on the exact reproduction of this model using the simulation engine NEST; this was work by Pauli and colleagues, who discuss the model specifications necessary to actually achieve such an exact reproduction. In parallel, we worked on porting the model to the neuromorphic hardware system SpiNNaker. Neuromorphic hardware computes in a very different way than a conventional computer running a conventional simulator, so an exact reproduction is out of the question. We therefore necessarily had to employ validation techniques to quantitatively compare the port to the original simulation, which, by the way, was written in custom C code.

So one study focused on the implementation details and the verification techniques, and a second study on the validation techniques for the comparison to the original; let me show a few of the things we learned. We started off by porting the model very naively to the neuromorphic hardware, using all the default settings. In the first row you see the resulting spiking activity, which we obviously had to improve on. We did this in two iterations, shown in rows two and three, by tweaking the solver for the ordinary differential equations, which improved things a lot; by eye, the third iteration already looks much more similar to the original C simulation. But it is very hard to judge by eye whether this is a good reproduction or not.

So already at this early stage of the development we used validation tests. You see again three rows for the three iterations and three columns for three tests, focusing on the firing rates, on the regularity, measured by LV, the local coefficient of variation, and on the correlation coefficients. Each comparison is quantified with an effect size measuring the distance between the two distributions; a brief sketch of these statistics follows below. Here we could actually quantify that each iteration became more similar to the C simulation. However, this was still not a good agreement. To jump to the end of the story: we dove further into the model and improved upon the temporal resolution and the threshold-detection algorithm, and on the right you see the result. Now we have a better fit for the firing rates, for the LV, and also for the correlation coefficients, and we introduced three additional tests for a more complete picture: the inter-spike intervals, the rate correlation, and the eigenvalues of the rate-correlation matrix. At this stage we can say that we were able to qualitatively reproduce the model on the neuromorphic hardware and to quantify the level of agreement with these effect sizes. However, this is not a strong statistical agreement; there are still discrepancies, most obvious in the somewhat higher firing rates and slightly higher correlations. And although these differences may seem small, they can potentially have a very large effect on the occurrence of the spatio-temporal patterns. This should illustrate the upsides validation testing can have, even beyond the standard scenario of comparing a model to experimental data.
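To give a feeling for the statistics behind those columns, here is a small, self-contained sketch using plain NumPy, with spike times assumed to be given as arrays in seconds; in NetworkUnit these quantities would come from Elephant's statistics functions instead. It covers the firing rate, the local coefficient of variation LV, and a Cohen's-d-style effect size between two distributions.

```python
import numpy as np


def firing_rate(spike_times, duration):
    """Mean firing rate in Hz over the recording duration."""
    return len(spike_times) / duration


def local_variation(spike_times):
    """LV of the inter-spike intervals (Shinomoto et al., 2003):
    LV = 3/(n-1) * sum_i ((T_i - T_{i+1}) / (T_i + T_{i+1}))**2,
    where T_i are consecutive inter-spike intervals."""
    isi = np.diff(spike_times)
    num = (isi[:-1] - isi[1:]) ** 2
    den = (isi[:-1] + isi[1:]) ** 2
    return 3.0 * np.mean(num / den)


def effect_size(sample_a, sample_b):
    """Cohen's d: standardized distance between two distributions,
    e.g. the firing rates of all neurons in two simulations."""
    pooled_std = np.sqrt(
        (np.var(sample_a, ddof=1) + np.var(sample_b, ddof=1)) / 2
    )
    return (np.mean(sample_a) - np.mean(sample_b)) / pooled_std


# Usage sketch: rates of the same neurons in the reference and ported runs.
# d = effect_size(rates_c, rates_spinnaker)  # |d| near 0 means good agreement
```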
To take a step back, let me give some broader context of where we are going with this. In NetworkUnit we have so far mostly focused on measures characterizing the network activity through single-neuron and pairwise statistics of spiking activity. We are now also starting to integrate measures describing the spatial dynamics of networks, as you see here, for example, in an LFP recording from an implanted electrode array that shows wave-like activity. I also really have to mention that there are many other tools like this, specialized for particular validation scenarios, for example NeuronUnit, which handles single-cell validation, and the aim is to harmonize these efforts, especially as they are all meant to integrate with the validation framework of the HBP, which basically provides a searchable database of models, tests, and scores.

To wrap this up, there are three points you should really take away from this talk. First, validation is important: any simulation without quantification is just guesswork. Second, there are tools out there, like NetworkUnit and many others, to help with it, so make use of them. And third, if you run validation tests, please use more than just one test; and even if you don't have enough data, you can use other models to evaluate your model. With this, I would like to thank you for your attention and all my wonderful collaborators. If you want to dig deeper into this topic, there is a poster on the simulator comparison study, and there is also a demo on SciUnit by its developer Richard Gerkin. And with this, I would like to thank you again.

Okay, thank you very much. Questions?

A: Well, I think there is no general recipe for how to tweak a model. You really have to dig deep inside and understand the model. Fortunately, we had collaborators who understood the model very well, as well as the underlying structure of the neuromorphic hardware, and who were able to find the points where some dynamics of the model can't be represented on the hardware system. There we dug deeper to identify these mismatches, and found, for example, that the temporal resolution originally chosen for the polychronization model doesn't work on this neuromorphic hardware, so we had to go to a finer time resolution, at least for the integration of the membrane potentials.

Q: So in this case you had a fully defined simulation as your reference.

A: Yes.

Q: Suppose your reference was a bunch of noisy experimental data.

A: Yeah, this is of course always a problem. Data is variable, as we know, and a model fitting one data set might not fit another data set. So, as I showed, we are able to compare two models, but we are also able to compare two data sets and actually quantify the variability within these data sets. This can be helpful to define the acceptable agreement I mentioned, that is, how accurately we can expect the model to match.

Q: Thank you. About NetworkUnit and the tests you have there now. Just to step back for a second: in single-cell studies, people have been thinking a lot about which features of a spike train or of the membrane potential they might want to look at and put into an optimization or into some sort of validation of a model. For network models, that has not been as well established, given the amount of time people have been working on the science of it. And I know that in NetworkUnit you have seven or eight main things that you test, like you showed in the slides comparing the models. Looking forward, what else do you think people should be asking about when they are trying to validate network models beyond that? What are some future tests you could see in NetworkUnit that would be ways of looking into whether two models are similar, or whether models are a good representation of the system?

A: Yep, well, there are several aspects to this question. First, it highly depends on the intended use of the model what you want to test, what the important features are, and which features you might want to neglect. As in the earlier example: if you are only interested in rates, you don't care about spike regularity or latency. Another aspect, related to the variability in data, is that you probably want to identify features of the model which are fairly stable across data sets and don't vary too much.
These would be the features you want to focus on, and they might again differ between species or kinds of data. So this set of measures certainly can, and probably should, be expanded. One aspect of this is certainly graph measures, which can be employed independently of what the underlying edge measure is, be it correlation, functional connectivity, or maybe causality; it is certainly very useful to then explore graph measures on top of these kinds of activity measures (a small sketch of this idea follows below). I hope this answers the question.

Okay, if there are no further questions, then let's thank Robin again.
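As a rough illustration of that last answer, here is a hedged sketch of computing graph measures on top of a correlation matrix, assuming NetworkX. The threshold, the function name, and the choice of centrality measures are illustrative assumptions, not NetworkUnit's actual implementation.

```python
import numpy as np
import networkx as nx


def correlation_graph_measures(corr_matrix, threshold=0.1):
    """Build a graph whose edges are strong pairwise correlations and
    compute simple graph measures on it. The threshold is arbitrary here;
    in practice it would be a test parameter."""
    adjacency = (np.abs(corr_matrix) > threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-loops
    graph = nx.from_numpy_array(adjacency)
    return {
        "degree": nx.degree_centrality(graph),
        "betweenness": nx.betweenness_centrality(graph),
        "clustering": nx.average_clustering(graph),
    }


# Usage with a random stand-in for an activity-derived correlation matrix:
rng = np.random.default_rng(0)
activity = rng.normal(size=(20, 1000))  # 20 neurons, 1000 time bins
measures = correlation_graph_measures(np.corrcoef(activity))
```

The same measures apply whether the edge weights come from correlations, functional connectivity, or a causality estimate, which is what makes graph-level tests attractive as a common layer.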