1) What can we learn from models?
- Let's start with models. What are they? What can we learn from them?
- Models are abstract and simplified descriptions of some system of interest, which allow us to formulate predictions about this system.
- And if a model does this very accurately, we might want to transfer our knowledge of the model onto the system of interest.
- This is called inductive inference. And although this is good practice in science, it does not imply that we have uncovered some truth about the system; it just means that we have found a useful way to describe it.
- But how do we find out which description is useful, and which is the most useful?
- This is where validation comes in. Validation is the process of quantitatively evaluating a model's prediction accuracy, and thus its usefulness.
2) How can we learn from models?
- Let's define validation in the broader modeling context.
- Here you see an adapted version of a schematic introduced in 1979 by the Society for Modeling and Simulation, which has since been used and refined in various fields.
- It shows the key components of the modeling environment. On the top there is the system of interest, which is a specific and bounded domain the modeler aims to describe.
- This could be, to pick a very simple example, an apple tree with ripe apples.
- Through analysis of this system and some theoretical considerations, a conceptual, mathematical model of the system can be formulated, as Newton did, for example.
- The interesting part of the schematic is the explicit separation of the model into a mathematical one and an executable one.
- The executable model is the part of the model which can run simulations and thus make predictions.
- Typically, a model becomes executable by implementation in either software or hardware.
- The correctness of this translation from mathematical formulas into code has to be checked by a process called verification.
- For Newton's laws, the implementation would be represented, for example, by a standard physics engine.
- (And only if an executable model is verified does a subsequent validation step actually make sense.)
- The validation step then directly compares the predictions of the model, in the form of simulation results, against observations of the system of interest.
- And when they match, the model proves to be useful.
- This rather straightforward validation workflow enables the evaluation and comparison of models, and is thus indispensable for rigorous and reproducible simulation science.
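- To make this concrete with the apple example, here is a minimal sketch (my own illustration, not from the original slides) of an executable free-fall model and a validation-style comparison against a hypothetical measurement:
```python
import math

# Executable model: Newton's free fall, implemented as code.
# Verification asks: does this code correctly implement v = sqrt(2*g*h)?
# Validation asks: do its predictions match measured apple drops?

G = 9.81  # gravitational acceleration in m/s^2 (model parameter)

def predict_velocity(drop_height_m):
    """Predicted speed of an apple after falling drop_height_m meters."""
    return math.sqrt(2 * G * drop_height_m)

# Hypothetical observation of the system of interest: an apple dropped
# from 2 m is measured to hit the ground at roughly 6.2 m/s.
observed = 6.2
predicted = predict_velocity(2.0)

# The validation step compares prediction against observation.
print(f"predicted: {predicted:.2f} m/s, observed: {observed} m/s, "
      f"difference: {abs(predicted - observed):.2f} m/s")
```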
3) Aspects of validation
- However, along this basic workflow there are exceptions and additional aspects to consider, which become more relevant the more complex the modeling and validation scenario becomes.
- The structure of a software tool supporting the validation workflow should therefore also reflect these aspects, and should be versatile enough to adapt to the challenges of specific workflows.
- I will briefly mention three such aspects.
- 1) Ran the test, got a score, what now?
- First, assume we performed a validation test and got a result. What now?
- The result of a validation test is called a score. And this score quantifies how much credibility we can ascribe to the model.
- However, a single test and a single statistic can never evaluate all the aspects of the model over the entire domain of interest.
- Thus generally, one should apply many different tests, evaluating different features, and using different statistical measures.
- This also means that a validation score is generally not a pass-or-fail assessment. The interpretation of the quantitative score as a pass-or-fail statement depends strongly on the intended use of the model. Ideally, the modeler should formulate a priori an 'acceptable agreement': a range in which the score should lie for the test to be considered passed.
- To give an example: when testing a spiking model against a dataset, a test focusing on the instantaneous rate might show good agreement. However, if the modeler was actually interested in the exact spike times, the corresponding test might be considered failed. And even if the exact spike times were not of interest, an additional test, for example a spiking regularity test, might reveal a discrepancy of the model that is invisible to the rate test.
- [analogy in software: easy combination of test and score classes to build a range of tests]
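- As a rough illustration of that software analogy (schematic only, not the actual NetworkUnit API), one test class can be paired with interchangeable score classes to build a range of tests:
```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

class KSDistanceScore:
    """Kolmogorov-Smirnov distance between two samples."""
    @staticmethod
    def compute(observation, prediction):
        return ks_2samp(observation, prediction).statistic

class WassersteinScore:
    """Wasserstein (earth mover's) distance between two samples."""
    @staticmethod
    def compute(observation, prediction):
        return wasserstein_distance(observation, prediction)

class FiringRateTest:
    """One test feature (firing rates); the comparison is delegated to a score class."""
    def __init__(self, observation, score_class):
        self.observation = observation
        self.score_class = score_class

    def judge(self, predicted_rates):
        return self.score_class.compute(self.observation, predicted_rates)

# The same test evaluated with two different statistics:
observed = np.random.default_rng(0).gamma(shape=3.0, scale=2.0, size=200)
predicted = np.random.default_rng(1).gamma(shape=3.5, scale=2.0, size=200)
print(FiringRateTest(observed, KSDistanceScore).judge(predicted))
print(FiringRateTest(observed, WassersteinScore).judge(predicted))
```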
- 2) Approaches: Bottom-up & Top-down
- In many engineering scenarios, the typical approach is to validate the elementary structures and then iteratively build up to the validation of the larger system. However, this is generally not possible in neuroscience, for several reasons.
- On the model side, the relations between scales of ion-channels, neurons, networks, and behavior are very complex and often unknown.
- On the experimental side, corresponding multi-scale data is typically not available.
- And additionally, the parameter regime in which, for example, single cells are validated does not necessarily reflect the relevant parameter space they populate within the network dynamics.
- Thus, having a well-validated single-cell model does not directly translate to accurate dynamics on the network level. And the other way around, inferring single-cell validity from network-level validation also does not work, as has been shown for example by Potjans & Diesmann, who demonstrated that large-scale network dynamics can be accurately modeled by networks of simple LIF neurons, which are not very biologically detailed models.
- [analogy in software: ??]
- 3) 'Validation' beyond validation
- Besides a sound conceptual framework, there are several considerations to face in practice when working with real data.
- Often you would like to validate a model but lack an appropriate dataset to validate against, or have too little data to come to a strong conclusion. In these cases it is very helpful to instead validate against the simulation outcome of a more trusted model.
- Such comparisons, although practically identical to common validation testing, are not actual validation tests, as they don't evaluate the predictiveness with regard to the system of interest. However, they can gather evidence that a model is reasonable, and identify and quantify limitations and deviations from other models.
- What can also be helpful is to compare a model not to another model but to another version of the same model. This can be used to evaluate incremental changes during model development, or to quantify the influence of small tweaks in the model on the network activity, like changing the solver for the ordinary differential equations, or the underlying neuron model.
- One such change could also be the choice of the simulator engine used to run the model. As long as the simulators and the implementations are correct, the results should be identical, right? And if not, at least one simulator is less useful. So this comparison can, in practice, 'validate' simulators.
- I will come back to this scenario and show a study of such a simulator comparison.
- [analogy in software: M2M Test class]
- So what could a software tool look like that is able to facilitate such varied validation workflows, and also help to ensure their replicability?
- The first step when building a software tool is to check whether something like it already exists. And indeed there is: the Python package SciUnit provides a general framework for validating scientific models. So the tool we built for network-level validation, NetworkUnit, builds directly on top of it.
- Furthermore, it uses the basic electrophysiology analysis methods provided by the open-source project Elephant.
- The basic ingredients are the implemented model, and the experimental data.
- The model is represented in a class object, which is able to run a simulation. If Newton had been into coding, this is where he would have written his formulas to test his theory.
- In the most general case, the data may come in any form, as long as the associated test knows how to deal with it.
- For example the validation test here could test the velocity of a falling apple after a given distance.
- The test is also a class object, which can compute the corresponding velocity values from the model simulation and compare the two to generate the validation score object.
- Besides the quantitative validation outcome, the score object also carries all the metadata from the model, the test, and the data. This enables not only the interpretation of the result but also its reproducibility.
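- As a sketch of this model-test-score structure in the apple example (following the SciUnit pattern as I understand it; the class names and details here are my own illustration, not NetworkUnit code):
```python
import math
import sciunit
from sciunit.scores import BooleanScore

class NewtonFallModel(sciunit.Model):
    """Executable model: this is where Newton would have written his formulas."""
    g = 9.81  # m/s^2

    def run_simulation(self, drop_height_m):
        # Predicted speed after falling a given distance.
        return math.sqrt(2 * self.g * drop_height_m)

class FallingVelocityTest(sciunit.Test):
    """Compares the predicted falling velocity against a measured one."""
    score_type = BooleanScore

    def generate_prediction(self, model):
        return model.run_simulation(drop_height_m=self.observation['height_m'])

    def compute_score(self, observation, prediction):
        # Pass if the prediction lies within the declared acceptable agreement.
        agreement = abs(prediction - observation['velocity_mps']) < observation['tolerance_mps']
        return BooleanScore(agreement)

# Hypothetical measurement of a falling apple (the 'experimental data'):
observation = {'height_m': 2.0, 'velocity_mps': 6.2, 'tolerance_mps': 0.5}
score = FallingVelocityTest(observation, name='falling apple test').judge(NewtonFallModel())
print(score)  # the score object also carries the model, test, and data metadata
```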
- Besides aiding reproducibility, another design principle here is modularity. So the type of score is not hardcoded into the test, but attached via a score type object (this could be, for example, an effect size, a Student's t-test, or something else).
- Similarly, any parameter settings are set manually for the test.
- Consequently, there exists an abstract test class implementation which is agnostic about the score type and the parameter settings, and is therefore reusable for other test variations without the need to rewrite any code.
- In the example of the apple, this would be the falling velocity test. However, some scientist might also want to test the apple's velocity when thrown horizontally. Such a test obviously has some similarities and uses some of the same calculations as the falling test. Therefore, to avoid coding any calculation twice, there is then another parent test class which implements the more general case of apple movement with arbitrary initial conditions.
- To give another example of this modularity from network neuroscience: one typical base test would calculate correlations from neural activity; a derived child test could either directly compare the distribution of the correlation coefficients, or generate a corresponding weighted graph. This graph test class would then be the parent of various tests of graph centrality measures. These are then very different tests, but they all use the same calculation of correlations.
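- Schematically, such a test hierarchy could look like this (my own illustrative class names, not NetworkUnit's):
```python
import numpy as np

class CorrelationBaseTest:
    """Parent test: computes pairwise correlations from binned spiking activity."""
    def correlation_matrix(self, binned_activity):
        return np.corrcoef(binned_activity)

class CorrelationDistributionTest(CorrelationBaseTest):
    """Child 1: compares the distribution of correlation coefficients."""
    def feature(self, binned_activity):
        cc = self.correlation_matrix(binned_activity)
        return cc[np.triu_indices_from(cc, k=1)]  # off-diagonal coefficients

class CorrelationGraphTest(CorrelationBaseTest):
    """Child 2: interprets the correlation matrix as a weighted graph."""
    def graph(self, binned_activity):
        weights = np.abs(self.correlation_matrix(binned_activity))
        np.fill_diagonal(weights, 0.0)
        return weights

class DegreeCentralityTest(CorrelationGraphTest):
    """Grandchild: a centrality measure computed on that graph."""
    def feature(self, binned_activity):
        return self.graph(binned_activity).sum(axis=1)
```
- Both CorrelationDistributionTest and DegreeCentralityTest reuse the single correlation_matrix implementation of the base class.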
- Finally, this framework also ensures that a test against a particular model can actually be run, i.e. that the model is able to produce the property that was measured in the dataset. This is realized by a capability class.
- Here, the capability the test makes use of would be the movement of the apple; thus, Newton's model simulation should have the capability to produce a movement that can be evaluated.
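- A minimal sketch of such a capability (again my own naming, following the SciUnit pattern):
```python
import sciunit

class ProducesMovement(sciunit.Capability):
    """Capability: the model can simulate the movement of the apple."""
    def produce_movement(self):
        # A concrete model class must override this method.
        raise NotImplementedError()

# A test then declares: required_capabilities = (ProducesMovement,)
# so that models lacking this capability are rejected before any simulation is run.
```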
- This modular structure is indeed versatile enough to also incorporate the validation practice of comparing two models. This requires only inheriting from one additional dedicated test class. This way, validation tests against experimental data and comparison tests between two models can be formally equivalent (the code example in the next section shows exactly this pattern).
- So, in our apple example, Newton's model could be directly compared to a relativistic model of motion to evaluate the domain in which Newton's model has sufficient accuracy, without needing to measure very fast apples.
5) Minimal code example
- What does this actually look like in code?
- Here I show a simple example of comparing the inter-spike-interval (ISI) distributions of three versions of a Poisson-process network model with different average firing rates.
- Such basic models and tests are provided by NetworkUnit, but can be expanded on by the user.
- In these three lines, the three models are initialized with rates of 5, 15, and 45 Hz.
- The test class is specified as a child class of a more general test and of the model-to-model test class.
- As the score statistic we choose the Kolmogorov-Smirnov distance, and set the parameters as far as necessary.
- The validation is then performed by calling the judge function of the test, which returns the score (or here, the scores).
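- Since I don't want to misquote the NetworkUnit class names from memory, here is a self-contained sketch of the same workflow built directly on SciUnit's model-to-model test class (NetworkUnit ships these pieces as ready-made model, test, and score classes):
```python
import numpy as np
import sciunit
from sciunit.scores import FloatScore
from scipy.stats import ks_2samp

class PoissonModel(sciunit.Model):
    """Placeholder model: homogeneous Poisson spiking at a given rate."""
    def __init__(self, rate_hz):
        self.rate_hz = rate_hz
        super().__init__(name=f"{rate_hz} Hz Poisson model")

    def produce_isis(self, n=10000, seed=0):
        # ISIs of a Poisson process are exponentially distributed.
        rng = np.random.default_rng(seed)
        return rng.exponential(1.0 / self.rate_hz, size=n)

class ISIKSTest(sciunit.TestM2M):
    """Model-to-model test on the inter-spike-interval distribution."""
    score_type = FloatScore

    def generate_prediction(self, model):
        return model.produce_isis()

    def compute_score(self, prediction1, prediction2):
        # Kolmogorov-Smirnov distance between the two ISI distributions.
        return FloatScore(ks_2samp(prediction1, prediction2).statistic)

models = [PoissonModel(5), PoissonModel(15), PoissonModel(45)]
score_matrix = ISIKSTest(observation=None, name='ISI KS test').judge(models)
print(score_matrix)
```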
- And as you would expect, the scores illustrate that a larger firing rate difference also yields a larger ISI difference.
- But tripling the firing rate from 5 to 15 Hz produces about the same change in the inter-spike intervals as tripling it from 15 to 45 Hz.
6.1) Application: Simulator comparison
- Ok, let's move to a less trivial example, and apply the validation workflow with NetworkUnit to compare simulators.
- This is in the context of a reproducibility study of the 'Polychronization' model, published in 2006 by Izhikevich.
- This model is interesting for at least two reasons. First, it produces rich network dynamics, e.g. spatio-temporally organized and repeating spiking patterns. Second, the actual implementation of the simulation was published alongside the paper, which is what made it possible to start reproducing the study in the first place.
- A first paper, by Pauli and Weidel, focuses on an exact reproduction of the original simulation, which was written in custom C code, with the simulator engine NEST. Following this example, they discuss the model specifications that are necessary to enable such an exact reproduction.
- In parallel, we pursued a reproduction of the model on the neuromorphic hardware system SpiNNaker.
- Since the underlying computation of the neuromorphic hardware is substantially different from that of a conventional computer, there is no way to achieve a bitwise reproduction as would be possible between software simulators. Therefore, the evaluation of the simulation on SpiNNaker requires validation techniques.
- This work is split into two papers: one focuses on the implementation details relevant for reproducibility, and the corresponding verification techniques; the other focuses on the evaluation of the simulation in comparison to the original results.
- I will briefly show what we learned from this comparison.
6.2) Application: Simulator comparison
- We started off by naively porting the model to the hardware using its default settings, and then improving on the underlying ODE solver.
- Here you see the spiking activity of the first three iterations of the SpiNNaker model implementation in shades of green, and the original C simulation in blue.
- By eye, you can already see that the activity resembles the original more and more with each iteration.
- However, by the third iteration at the latest, it becomes hard to judge by visual inspection whether this is already a good reproduction of the model.
- Therefore, we employed validation tests already while working on the model implementation, and used the results not to calibrate but to guide the development process.
6.3) Application: Simulator comparison
- Here you see an illustration of three tests, comparing the distributions of three different features of the network activity: firing rates, spiking regularity measured by the local coefficient of variation (LV), and correlations.
- The difference between the distributions (blue for the original C simulation, and green for the SpiNNaker simulation) is quantified by an effect size.
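- For reference, an effect size in this sense is a standardized difference between two distributions; here is a minimal sketch, assuming a Cohen's-d-style measure, which may differ in detail from the one used in the study:
```python
import numpy as np

def effect_size(sample_a, sample_b):
    """Standardized mean difference (Cohen's d with pooled standard deviation)."""
    pooled_std = np.sqrt((np.var(sample_a, ddof=1) + np.var(sample_b, ddof=1)) / 2)
    return (np.mean(sample_a) - np.mean(sample_b)) / pooled_std

# Hypothetical firing-rate samples standing in for the C and SpiNNaker simulations:
rng = np.random.default_rng(42)
rates_c = rng.normal(4.0, 1.0, size=1000)
rates_spinnaker = rng.normal(4.3, 1.0, size=1000)
print(f"effect size: {effect_size(rates_c, rates_spinnaker):.2f}")
```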
- Each row is one iteration of the SpiNNaker implementation. In the second and third row, we improved on the underlying ODE solver, and used the validation test results as feedback on whether the model improvements had the expected effects.
- So with these tests you can assess that just tweaking the ODE solver doesn't lead to a satisfactory agreement for the firing rates and the correlation coefficients. Only the LV regularity measure shows a good match.
- So, further investigation of the model dynamics led us to use a finer time resolution and a better threshold detection.
6.4) Application: Simulator comparison
- Introducing finer integration steps switches the picture. Now, there is a fairly good agreement for the firing rates and the correlation coefficients, while the match for the LV measure became worse.
- This is particularly interesting, since one might assume that a good agreement of statistically more complex measures, like the correlation coefficient, which takes into account the interplay between pairs of neurons, would entail agreement of single-neuron measures, like the LV.
- However, this example nicely shows that there is no such hierarchy of failure.
- But thanks to this revealed discrepancy in the LV measures, we became aware of a shortcoming in the implementation of the threshold crossing detection.
6.5) Application: Simulator comparison
- With this fixed, in the final iteration of the comparison we also compared additional measures to get a more complete evaluation of the quality of the SpiNNaker simulation: the inter-spike intervals, the rate correlations, and the eigenvalue distribution of the rate correlation matrix.
- From this comparison it is evident that although SpiNNaker can qualitatively reproduce the simulation results, an exact reproduction was not possible and there are still small discrepancies in the simulations, most evident here in the slightly higher firing rates and correlations in the SpiNNaker simulation.
- As a short note: even though the discrepancies are small, they can have a potentially large influence on the spatio-temporal patterns (or 'polychronous groups') of the model.
- The big advantage of having quantitative validation scores like this is that they can be used as a reference when the model is run on the next version of SpiNNaker, or when another model is being reproduced.
- This application example illustrates the upsides of using validation techniques for quantitative model evaluation, even beyond the standard scenario.
7) Outlook
- Ok, taking a step back now I'd like to put this work in some more context.
- So far, the tests of NetworkUnit mainly focus on single-neuron and pairwise measures of spiking activity to characterize the network dynamics. To provide another angle on the network activity, we are currently working on methods to characterize it in terms of spatial dynamics, such as what you see here in an LFP recording from an implanted electrode array.
- I'd also like to mention that besides NetworkUnit there are similar specialized validation packages in development: most prominently NeuronUnit for single-cell validation, but also HippoUnit for hippocampus models, MorphoUnit to test morphologies, and others.
- To make validation more collaborative, the individual validation workflows are designed to be integrable with the Validation Framework of the Human Brain Project, currently in development, which basically provides a searchable database of models, tests, and scores.
8) Take-aways
- Finally, to wrap this up, I'll give you the three main points you should take away from this talk.
- 1) Validation is important! Simulation without quantification is just guesswork. Validation makes modeling and simulation science quantifiable and also more reproducible.
- 2) NetworkUnit and other tools exist to help. Use them.
- 3) When you run validation tests, please run more than just one. And even if you don't have an appropriate data set, validate against another model.