Practical experience and framework for sensitivity analysis of hydrological models: six methods, three models, three criteria

. Sensitivity Analysis (SA) and Uncertainty Analysis (UA) are important steps for better understanding and evaluation of hydrological models. The aim of this paper is to briefly review main classes of SA methods, and to presents the results of 10 the practical comparative analysis of applying them. Six different global SA methods: Sobol, eFAST, Morris, LH-OAT, RSA and PAWN are tested on three conceptual rainfall-runoff models with varying complexity: (GR4J, Hymod and HBV) applied to the case study of Bagmati basin (Nepal), and also initially tested on the case of Dapoling-Wangjiaba catchment in China. The methods are compared with respect to effectiveness, efficiency and convergence. A practical framework of selecting and using the SA methods is presented. The result shows that, first of all, all the six SA methods are effective. Morris and LH-15 OAT methods are the most efficient methods in computing SI and ranking. eFAST performs better than Sobol, thus can be seen as its viable alternative for Sobol. PAWN and RSA methods have issues of instability which we think are due to the ways CDFs are built, and using Kolmogorov-Smirnov statistics to compute Sensitivity Indices. All the methods require sufficient number of runs to reach convergence. Difference in efficiency of different methods is an inevitable consequence of the differences in the underlying principles. For SA of hydrological models, it is recommended to apply the presented practical 20 framework assuming the use of several methods, and to explicitly take into account the constraints of effectiveness, efficiency


Introduction
Hydrological models are widely used to simulate natural phenomena, mainly for the purpose of generating forecasts.
Deterministic forecasts inevitably raise the issue of its uncertainty.This uncertainty mainly comes from the error of gathering input data, e.g.rainfall and evapotranspiration, parameters of the model and the model structure itself.Nowadays, the interests to Uncertainty Analysis (UA) methods and procedures have grown considerably.The study of the UA will not only improve the credibility of the model itself but also be conductive to decision making under uncertainty.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.due to that in inputs or parameters, whereas UA provides a more general and often more detailed and rigorous account of model uncertainty.Saltelli et al. (2008) formulates the three main specific purposes of SA:


Factor Prioritization (FP): ranking the factor in terms of their relative sensitivity;  Factor Fixing (FF), or screening: determining the factors are influential or not to the output uncertainty;  Factor Mapping (FM): given specific output values or ranges, locating the regions in the factor space that produces them.
In this study, we only focus on ranking and screening.
SA is typically categorized into Local Sensitivity Analysis (LSA) and Global Sensitivity Analysis (GSA).LSA concentrates on the sensitivity of factors at particular points in the factor space, for example, around the vector of the calibrated parameters.GSA, on the other hand, assesses the sensitivity of the factor through the whole factor space.By design, LSA is simpler and faster.
A simplest expression of local sensitivity is the first-order partial derivatives of output to the factors.Define a model y = f(x), where y is the output of the model; x is factor of the model.The sensitivity of the factor (S) is defined as: where i is the i-th factor of the model.(Note, that in quite many studies instead of model output y the model error is used, e.g. Root Mean Squared Error or Mean Absolute Error.)Higher value of Si indicates higher sensitivity of the factor.Such measure of sensitivity is often called Sensitivity Index (SI). Figure 1 shows the expression of sensitivity of a model with two parameters (factors).
If we randomly sample several points in the whole parameter space, and obtain Si for each sample point.After that we can aggregate the results (e.g.calculating the mean value of these Si), assessing thus the global sensitivity of the model.
Global Sensitivity Analysis methods can be classified into Generalised Sensitivity Analysis method, variance-based methods, GLS (globally aggregated measure of local sensitivities) methods, density-based methods and meta-modelling methods.Different methods arew based on different theories and principles, and as a result, have different efficiencies.Which method is the best to use is always an issue to discuss in the field.There are various studies comparing different SA methods.
In the study of Tang et al. (2007), four SA methods have been analysed and compared on SAC-SMA coupled with SNOW-17.
The results of the study show that the choice of SA methods has great impacts on the parameter sensitivity of the model.Pappenberger et al. (2008) tested five SA methods on a flood inundation model (HEC-RAS).It is demonstrated that different methods result in different ranking of factors, thus solid conclusions about the sensitivity of the factors are impossible to draw.Gan et al. (2014) have evaluated the effectiveness and efficiency of ten widely used SA methods on SAC-SMA model.The result demonstrates qualitative SA methods are more efficient than quantitative SA methods, whereas quantitative SA methods are more robust and accurate.Song (2015), Razavi and Gupta (2015) and Pianosi et al. (2016) gave systematic reviews of SA concepts, methods and framework respectively.Suggestions on how to choose SA methods are provided.However, these With respect to sample-based SA methods, the coverage of the factor space is the key point in SA accuracy: with small samples the SA results are imprecise.In other words, there is also uncertainty in SA (in fact, the same can be said about UA).
In order to deal with this issue, convergence of the Sensitivity Indices should be studied.
In spite of many reviews and comparison of SA methods that have been carried out, there are not too many studies that investigate the convergence and uncertainty of the SA results.Yang (2011) assessed the convergence of Sensitivity Indices for five different Global Sensitivity Analysis methods using Central Limit Theory (CLT) and bootstrap techniques.In her study, the estimates of mean and Confidence Interval (CI) are plotted against increasing base sample size for each method.Once there is no significant fluctuation in the values, the convergence is reached.Sarrazin et al. (2016) proposed a methodology to study the convergence of Sensitivity Indices, ranking and screening.They have defined quantitative criteria for the convergence of Sensitivity Indices, ranking and screening, and tested the methodology on the three widely-used GSA methods applied to three hydrological models.
Yet another aspect worth attention is the choice of SA method(s).Most of the studies concerning SA could not draw firm conclusions about how to choose the best SA method (and this is understandable since there are many ways to define what is the "best" one).Also, the uncertainty in SA is not investigated much.
The first objective of the study is to test and compare the widely used classic SA methods as well as the SA methods developed recently (e.g., PAWN, Pianosi and Wagener, 2015a) in the aspects of efficiency, effectiveness and convergence.
The second objective is to give suggestions on how to choose SA methods for various hydrological (or hydraulic models) based on their computational cost, robustness and easiness of implementation.The third objective of the study is to formulate a practical framework of sensitivity and uncertainty analysis of hydrological models, thus contributing to and complementing the guidelines published earlier (e.g.Saltelli et al., 2008;Baroni and Tarantola, 2014;Song et al., 2015;Pianosi et al., 2016).
Each individual SA study has its specifics so it is hardly possible to have a unified framework or procedure that would fit all possible requirements.Each researcher or practitioner would have a choice of various approaches, principles and components to combine and follow in SA.
The structure of the paper is as follows.Section 2 gives detailed introduction and description of Global Sensitivity Analysis methods.Section 3 presents the methodology and case study of this study to evaluate GSA methods.The results of the study are shown in Sect. 4 and followed by discussions in Sect. 5. Finally, conclusions are drawn in Sect.6.

Global Sensitivity Analysis methods
This section is in no way a detailed presentation of the methods, but rather a brief introduction to the techniques compared in this study.For comprehensive reviews, please refer to Song et al. (2015), and Pianosi and Wagener (2015), and for a relatively recent interesting insight into the SA problem to Razavi and Gupta (2015, 2016a, 2016b).

Generalised (Regionalised) Sensitivity Analysis method
Generalized Sensitivity Analysis method (also referred to as Regionalised SA) has gained popularity in environmental and water-related research in the end of 1970s, especially after the papers by Spear andHornberger (1980, 1981); to some extent Whitehead and Young (1979), and it is also worth checking the earlier work by Spear (1970).This approach was positioned as a Monte Carlo framework used for "probabilistic calibration" which aimed at finding regions in parameter space leading sets of "behavioural" (good) and "non-behavioural" (bad) models (which point at the regions of critical uncertainty), rather than aiming at finding one "best" model.Simulation results are split into these two groups based on their performance (e.g.model error), the Cumulative Distribution Function (CDF) of each factor is generated for each group, and their difference is analysed.Typically, the Kolmogorov-Smirnov statistic (Massey Jr, 1951) is used to compute the discrepancy between these CDFs.GSA allows for identifying the regions of the model parameter space in which parameters have the significant effect on the model behaviour.One can see also that GSA, being based on the Mote Carlo framework and using statistical analysis of outputs, can be also seen as a representative of UA.
One of the drawbacks of RSA is that the results are influenced by the selection of different thresholds and so this undermines its objectivity.To resolve this problem, Wagener et al. (2001) presented an extension of this method.The parameter sets are grouped into ten groups instead of two, based on the model performance.They are sorted from best to worst, in which the first group produces the best 10% results (e.g. the results with least 10% model error), the second group produces the best 10%-20% results and so on.Empirical CDFs of the parameters are also plotted for each group, if the curves are concentrated or overlapped, the parameter are not sensitive, and vice versa.For detailed description and implementation of the method, please refer to Jakeman et al. (1990) and Wagener et al. (2001).

Variance-based methods
Variance-based methods are today the most popular approaches for SA.The underlying assumption of variance-based methods is that the sensitivity can be measured by the contribution of the factor's variance (the contribution of the factor itself, or interactions with two or more factors) to the variance of the output.The biggest advantage of a variance-based method is that it can compute the main effect and higher-order effect of factors respectively, and make it distinguishable which factor have high influence on the output by its own, and which factor have high interaction with others.It is normally unrealistic to analytically compute the Sensitivity Index because of the complexity of hydrological models.
Instead, Sobol' proposed an efficient sample-based approach to compute first and total-order Sensitivity Indices -the called Sobol' method -which is perhaps the most popular variance-based SA method (Sobol', 1993).A detailed description of the method and its implementation can be also found in Saltelli et al. (2008).
Though the result of Sobol' method is robust, often considered as benchmark run for study, it is computationally expensive, requiring large number of base samples.Another popular approach to numerically compute variance-based Sensitivity Indices is the Fourier Amplitude Sensitivity Test (FAST), presented by Cukier et al. (1973).The key idea of FAST is applying the ergodic theorem to transform the n-dimension integral to one-dimension integral.Saltelli and Bolado (1998) provide a detailed description of principles and procedures for implementation of the method.One of the drawbacks of FAST method is that it can only compute the main effect.However, an improved version of FAST, which is extended FAST (eFAST, Saltelli, et al., 1999), can compute first and total order Sensitivity Indices.

Globally aggregated measure of local sensitivities (GLS) method
As mentioned in sect. 1, the globally aggregated measure of local sensitivities methods use average value of SA measures (e.g.first-order derivative) at each local sample points in the factor space as Sensitivity Index for each factor.Morris (1991) proposed an approach which he referred to as Elementary Effects Test (EET) to compute the sensitivity.It is also called Morris Screening method.Its modification was proposed by Campolongo et al. (2007).Its principle concept is to use the mean and standard deviations of the gradients of each sample as the measure of the overall effect and interaction effect of each factor across the p level factor space.Morris Screening is a simple but effective method, widely used for screening in hydrological modelling.A more detailed description of the method can be found in Saltelli et al. (2008).
Since sampling is time-consuming, it is reasonable to use economical techniques for it, and e.g., van Griensven et al. (2006) employed Latin Hypercube Sampling, followed by assessments of the local error derivatives at each point "one at a time" (OAT), which they named LH-OAT method.The Sensitivity Index of each factor is obtained by averaging the derivatives of all perturbed samples.
All GLS methods conceptually are quite simple and their reported implementations typically do not require large number of runs.However, Razavi and Gupta (2015) have pointed out that they may suffer from scale issue, that is, the selection of the step size may influence the results due to the complexity of response surface of the model.Krykacz-Hausmann, 2001;Liu et al., 2006) and the δ-sensitivity measure (Borgonovo, 2007;Plischke et al., 2013) are implementations of this concept.

Both
The production of empirical PDFs is a crucial step in most of the density-based methods.However, the derivation of empirical PDFs is either too simple, so that the results may not be accurate, or too complex to implement.Recently, Pianosi and Wagener (2015a) proposed a novel method called PAWN that partly overcomes this difficulty.The key idea of PAWN method is to compare the unconditional CDF of output with conditional CDFs of output which prescribe one parameter at a fixed value (the conditioning value) while others vary randomly.

Use of meta-modelling to reduce running times
Sampling used in SA requires considerable computational time, for complex models prohibitively long.The basic idea of meta-modelling is to substitute the original model (and hence its response function linking factors and model output) with a simpler function or a model.This substitution is typically done by using statistical or machine (statistical) learning techniques, and employing methods of the so-called experimental design for generating data by the model runs to be used for training the meta-model.SA is carried out using the meta-model, and for this mostly variance-based method is used.
Techniques used for this purpose include Radial-basis function network (RBF, Broomhead and Lowe, 1988), multivariate adaptive regression splines (MARS, Friedman, 1991), support vector machine (SVM, Cortes and Vapnik, 1995), Gaussian processes (GP, Rasmussen, 2004) and treed Gaussian processes (TGP, Gramacy and Lee, 2008).The advantage of metamodelling is that by simplification of the original complex model, the overall running time is considerably decreased; the trade-off is a possible loss of accuracy.

Methodology and Experimental set-up
3.1 Methodology for evaluating SA methods

What aspects do we evaluate
Different SA methods have different concepts and principles behind them, and, accordingly, the Sensitivity Indices may have different meaning and metrics.However, it would be logical to try to follow the general principles behind any method for a model (method) evaluation, i.e. effectiveness and efficiency.The evaluation of SA methods' effectiveness is aimed at finding out whether the relative Sensitivity Indices, ranking and screening of parameters have sense and indeed can be used in SA.
Efficiency of SA methods is assessed by how fast (in terms of computational time) they provide the result: the lesser number of model runs is required, the more efficient the method is.Therefore, the evaluation of SA methods efficiency is to figure out the minimum number of runs required for each SA method to get satisfactory results -and it is not always clear and explicitly defined what "satisfactory" actually means.Due to the fact that sampling is employed, there is always uncertainty in the SA results, and the values of the Sensitivity Indices calculated depend on the sample size.In order to take into account the uncertainty nature of SA results, the convergence of the SA results should be studied, and this forms the last aspect of the evaluation of SA methods.

Evaluation of effectiveness
The result of SA is not an absolute one and nobody can say what is the "correct answer".Unlike assessing the accuracy of a hydrological model, which can be compared with the observation values, for sensitivity there are no 'observations' to be compared with.To start somewhere, we will initially randomly sample a large number (say, 10,000) parameter (factor) vectors and run the model for each of them.The RMSE of the model output will be plotted against parameter values as a scatter plot which will provide a rough image of the sensitivity of each parameter.The preliminary assessment of the sensitivities of each parameter will be treated as a reference.Then all considered SA methods will be run, and their results will be compared with the reference to assess their performance.Effectiveness will be evaluated on the three aspects: Sensitivity Indices values, ranking and screening.
We realize that constructing a reference this way provides quite a rough estimation of sensitivity, and this is an inevitable limitation.Therefore, the results of all the methods will be taken into account, compared and analysed to see the differences and similarities between them and not only with the reference.

Evaluation of efficiency
For each method, one benchmark test will be run with a considerable size of the base sample set of 10,000.Different base sample sizes will be set for each SA method, to be compared with the results of its benchmark run.From the results, the minimum base sample size will be found for each SA method to ensure the effective results in terms of Sensitivity Indices stability and factors ranking.

Evaluation of convergence
Convergence of Sensitivity Indices will be analysed by calculating 95% confidence intervals, mean and variance for various sample sizes.To increase the confidence of estimates, bootstrapping (see e.g.Efron and Tibshirani, 1986) will be used as well.
The following procedure will be employed (adapted from Yang, 2011):

Case study
The presented methods have been tested on two case studies: Dapoling-Wangjiaba catchment in China, and the Bagmati catchment located in central Nepal.Due to data limitations issues not all experiments with the first case have been finalised, so it is not reported here, and is left for the future publications.
Bagmati catchment covers an area of approximately 3700 km 2 (see Fig. 2).The altitude of the region varies from 2913 m in the Kathmandu Valley, to Terai Plain, where it reaches the Ganges River, in India with an altitude of 57 m.The Bagmati River has an extension of about 195 km, flowing from Shivapuri to the Ganges River in the south.In this study, focus is put on the part of the basin that drains to the Pandheradobhan station, with an area of 2900 km 2 and river length of 134 km.
In this study, daily precipitation and air temperature from Kathmandu, Hariharpurgadhi, and Daman station and daily discharge in Pandheradobhan station from 1 March 1991 to 31 December 1995 are used.The daily average precipitation was assessed using Theissen polygon method and the potential evapotranspiration is calculated by the modified Penman method recommended by the Food and Agriculture Organization-FAO (Allen, 1998).The hydrograph is shown in Fig. 3.

Test model
The SA will be tested on three conceptual rainfall-runoff models: GR4J, Hymod and HBV, with increasing complexity and the parameters number.
The modè le du Gé nie Rural (Agricultural Engineering Model) à 4 paramè tres Journalier (4 parameters Daily, GR4J) was developed by Perrin et al. (2003).It uses daily precipitation and evapotranspiration as input to simulate the runoff discharge.
The model structure assumes that after neutralization of precipitation by evapotranspiration, a portion of net rainfall goes to production store, where percolation takes place.The leakage flow, together with the remaining part of the net rainfall, go to routing store, where they are split into two parts and routed by two unit hydrographs.After exchanging with groundwater, the total runoff is generated by adding these two parts.The four parameters with their meaning and ranges are shown in Table 1.
The Hymod model, first introduced by Boyle (2001) and presented in Wagener et al. (2001) has been used quite widely for rainfall-runoff modelling because of its simplicity.It consists of a simple rainfall excess model with two parameters and a routing module with three parameters.In the rainfall excess model, the soil moisture storage capacity is assumed to be variable, described by a distribution function.The routing module contains two sets of parallel linear reservoirs.Three identical linear reservoirs account for the fast runoff component and a single linear reservoir accounts for the slow runoff component.The name, meaning and ranges of the parameters are shown in Table 2.
The HBV (Hidrologiska Bryå ns Vattenbalansaldevning) model is a conceptual rainfall-runoff model widely used in Europe.It was developed by the Swedish Meteorological and Hydrological Institute (Bergström, 1976) and then promoted by Lindström et al. (1997) 3.

Experimental set-up
The experimental set-up is presented in Table 4.The evaluation is done on six SA methods: Sobol, eFAST, Morris, LH-OAT, RSA and PAWN.All software is implemented in MATLAB.For Sobol method, eFAST.Morris screening, RSA and LH-OAT, the codes are constructed by the first author.For PAWN method, the codes from the SAFE toolbox (Pianosi et al., 2015b) are used.In the present study we follow a widely adopted approach when instead of studying the sensitivity of the model directly, the sensitivity of the model error (deviation from observations) is analysed instead.For the model error, we use the Root Mean Squared Error (  ): where    is the simulated model output at time step t,    is the observation value at time step t, n is the number of time steps.To avoid the influences of model initial states, the first three months (90 time steps) are excluded when computing Due to the characteristics of FAST sampling in eFast method, bootstrapping resample is not applicable, evaluation of convergence will not be done for eFAST method.The resample size for other SA methods for evaluation is 100.

Preliminary assessment of sensitivity
The model was run 10,000 times; the scatter plots of the   against parameters for the three models are shown in Fig. 4-6.
From the scatter plot, the relative sensitivity of the parameter can be seen from the randomness of its distribution (i.e.proximity to the uniform distribution).The more randomly the RMSEs are distributed, the less sensitive the parameter is.
In the GR4J model (Fig. 4) X4 is shown to be the most influential parameter, followed by X3 and X2, while X1 seems to be of little influence.X4 is the time when the ordinate peak of flood hydrograph is created, which actually determines the shape of the hydrograph, so it is no surprise it appears to the most influential parameter in the model.X1 is the storage of rainfall in the soil surface, which does not affect the routing process too much, thus it is the least sensitive parameter.
In Hymod model (Fig. 5), ALFA, RS and RF have high influence on RMSE.SM and BETA, however, seem to be noninfluential.This is understandable, because ALFA, RS and RF controls the fast and slow pathway in flow-routing module which is more important in determining flows, while SM and BETA only account for soil moisture routine which is less important.
In Fig. 6, it can be seen that in HBV model, MAXBAS is obviously the most influential parameter; FC, ALFA, BETA and PERC also shows certain degree of sensitivity; LP, K, K4 and CFLUX are non-influential.The reason is that MAXBAS is the (routing) transfer function parameter which controls the shape of the hydrograph.

Effectiveness
Figure 7 shows the results of benchmark runs of each method for three models, and one may see the following: 1) all the methods identify the same set of sensitive parameters (X3 and X4 for GR4J, ALFA, RS and RF for Hymod, MAXBAS for HBV); 2) for less influential or non-influential parameters, different methods show relatively large discrepancy in results; 3) the results of Sobol and eFAST are close, and it is also so for Morris and LH-OAT, RSA and PAWN, which indicates that the methods of the same category have similar results.This is due to the reason that both Sobol and FAST are variance-based methods, they all calculate the contribution of the variance to the output.Both Morris and LH-OAT compute the first-order partial derivatives of the output.Similarly, RSA and PAWN use empirical CDFs and KS statistics to quantify the sensitivity.
These groups of methods share the same principle; 4) comparatively, the results of RSA and PAWN are always quite different from other methods.There may be two reasons: firstly, the generation of empirical CDFs may be inaccurate; secondly, the use of KS statistics to compute Sensitivity Index in both methods may bring instability into the results (sensitivity to sampling) because KS statistics takes into account only the maximum difference between CDFs; 5) ranking of parameters for the three models by different SA methods has many differences, but they are quite close in identifying sensitive and insensitive parameters, which means they are effective in screening.
In general, it can be said that all six methods are effective in computing SI.The results of RSA and (to a smaller extent) PAWN are to be treated with care because they use the (sensitive) KS statistics based only on the maximum difference in CDFs between the behavioural and non-behavioural models' sets.

Efficiency
Figure 8 demonstrates the results of six SA methods for the three models for different number of runs.The minimum number of runs needed to get stable ranking of the parameters can be found in Table 5.As can be seen, among all the methods, Morris and LH-OAT converge quickly and are very stable across all numbers of run: they can get reliable SI and ranking at a very small number of runs (100 base sample for each model for both methods).eFAST is also quite stable, and it can get reliable results after approximately 300 base sample size.Comparing eFAST with Sobol method, it can be concluded that eFAST is more stable and more reliable (note also that at some point Sobol method results even in negative SI). RSA and PAWN are For all methods, it can be seen that with the increase of the model complexity and number of parameters, the results of SA become less stable.Especially for HBV model, except MAXBAS, all other parameters seem to be of similar sensitivity, therefore there is a considerable fluctuation in results.This can also be seen in GR4J and Hymod in which parameters have similar sensitivity (X1 and X3 for GR4J, SM and BETA for Hymod).

Convergence
Figure 9 presents the estimates of the mean and the 95% Confidence Interval of all SA methods for three models with different number of runs.Overall, with increasing number of runs, the width of CI become narrower and have less and less variation.
There are still differences in the width of CI and speed of convergence between the methods.It can be seen that Morris, LH-OAT and RSA converge well already at early runs, and the width of CI are quite narrow across all runs.PAWN method converges comparatively slower and the width of CI is also wider.Sobol method is slowest, especially at small number of runs.
The upper and lower bound of SI significantly exceed the range 0 to 1, which is quite unacceptable.
For all methods, similar conclusions as in efficiency can be drawn that with the increase of the model complexity and number of parameters, the uncertainty of SA also goes up.This increase of uncertainty also results in unstable results when the sensitivities of the parameters are close as shown in results of efficiency.
From the results shown above, it is proven that all six methods are effective in calculating Sensitivity Indices, screening and ranking.Their efficiencies, however, differ.The minimum number of runs for computing Sensitivity Indices, ranking and reaching convergence with each method are presented in Table 5.
In general, it takes many more runs to reach convergence, but many less runs is sufficient to obtain reliable ranking of the parameters.Sobol method requires large number of runs to be stable and reach convergence, which is very inefficient.Same as variance-based method, eFAST method is much more efficient and stable.It is a good alternative for Sobol method with high efficiency.Morris and LH-OAT are also quite efficient and can provide results of ranking after relatively small number of runs.Also, the uncertainties of the values of Sensitivity Indices are not so high, and especially they are good at ranking and screening.The density-based methods, however, need sufficient number of runs to produce reliable eCDFs, thus the efficiency is not so high.Furthermore, using KS statistics to compute Sensitivity Indices may be problematic for some types of distributions.Comparing RSA with PAWN, one can see that RSA performs better, especially for ranking, however, due to its design, it provides less detailed analysis of sensitivity.Therefore, the propagation of the uncertainty in the SA is also simpler and more direct.Variance-based methods, however, with much more complex principles, result in higher uncertainty in SA.On the other hand, density-based methods may suffer from the necessity to producing reliable and accurate eCDFs, and the fact that using K-S statistics to compute SI.As a result, they are highly unstable and uncertain.RSA performs better than PAWN, owing to its relatively idea (dividing the factor vectors into only two or several sets).
Although variance-based methods seem to be less efficient in computational cost, they use more sophisticated mathematical and statistical apparatus and quantify sensitivity most accurately.Comparatively, GLS methods use only the first derivatives to compute SI, which is of course carries less information (and we can say, less accurate).Density-based methods are moment-independent, they do not need complex equations or computation to get SI, but their strength in quantifying sensitivity is problematic, as stated earlier.Generally speaking, the efficiency and depth of quantification are in inverse relationship.To obtain greater degree of quantification, it takes more model runs, and aiming to reach higher efficiency will lead to inevitable sacrifices in accuracy and reliability of the results.The method that best balances these two aspects seems to be the eFAST method.It uses variance to quantify the sensitivities, and at the same time, requires much smaller number of runs than Sobol method.
Another aspect to be mentioned is the easiness of methods implementation and their integration with (hydrological) models.If the method is too difficult to implement and integrate with the already existing and operational models, its use may be quite limited.This is especially true for distributed models, when sampling may be required at every grid cell, so it is not realistic to use too complex sampling methods, such as in eFAST.In these situations, methods with very simple principals like RSA and LH-OAT are more suitable.
Density-based methods seem attractive due to their simplicity, but they have two problems.On the one hand, the reliability of eCDF produced is questionable.On the other hand, the use of Kolomogorov-Smirnov statistic to compute the Sensitivity Indices, unstable by design, may lead to slow convergence.However, they have two advantages: first, they are moment independent methods, which do not need complicated computational process; second, the results of SA can be expressed in graphs which provide yet another instrument for analysis.One of the idea that can be explored is to quantify the results not by K-S statistics, which is the maximum difference between the two eCDFs, but to consider an integral difference (the area between two CDFs).

Recommendations for choosing SA methods
Based on the experiments and considerations presented above, we can formulate the following recommendations for choosing SA method(s): • For simple conceptual hydrological models (not requiring much time for running them multiple times), variance-based methods as Sobol and eFAST are recommended, because they have a strong theoretical background and provide more insight into sensitivity.

•
For more complex hydrological or hydraulic models, that need considerable time to run, GLS methods can be used, since they are more efficient.

•
For distributed models, methods with simple concepts and sampling techniques are more suitable, such as RSA and LH-OAT.

•
For very complex models, e.g.2D (or even 3D) models, like flood inundation models, or high resolution groundwater models of large aquifers, the Local SA instead of Global SA can be used (Hill and Tiedeman, 2007), or LSA at a selected limited number of points in the factor (sub)space, for a reduced number of factors.

•
In situations when only relative sensitivity of the factors is needed, rather than the exact value of SI, it is advisable to aim only at determining ranking or screening of SA which needs significantly less time than the calculation of global SI.

•
If time allows, it is recommended, however, to employ several different SA methods rather than using only one method.

Practical framework
Based on the analysis of effectiveness, efficiency and convergence of methods, and the recommendations above, we may suggest a practical framework for Sensitivity Analysis and Uncertainty Analysis, as shown in Fig. 10.We consider both SA and UA to be important phases of model analysis, both focusing on certain aspects of model uncertainty, so it is reasonable to bring them together under one framework.
This framework assumes the model is already calibrated, however, it is also applicable to uncalibrated models for choosing a (limited) set of the (sensitive) parameters to calibrate which can improve the efficiency of calibration process.
In case there is a possibility to employ several methods, we can suggest to select one method from each category: variancebased methods, methods aggregating the local sensitivity measures, and density-based ones; the overall judgement about sensitivity will be then better informed.If time does not allow for a large number of runs, Local Sensitivity Analysis method can also be used for the calibrated or observed values of the factors.
It is also recommended to first start with a small number of sample size, and then gradually increasing the sample size until the Sensitivity Indices or ranking converges or stabilizes.The stopping criteria is subjective, depending on one's requirement of accuracy or limitation on number of runs.Note that one should balance between the accuracy of the results and the efficiency to obtain these results.

Limitations
We see the following main limitations of this work: First, the models used in this study are only conceptual rainfall-runoff models with similar structures, so the results may be different for other types of models.
Second, the evaluations we did to SA methods are still qualitative, so to evaluate each aspect of SA methods some more rigorous quantitative standard should be set.For example, when evaluating convergence, a threshold of the CI width should be defined for reaching convergence.Quantitative assessment will strengthen the conclusions of the comparisons.

Conclusions
SA and UA are important steps for better understanding and evaluation of hydrological models.For complex hydrological models, sample-based SA methods are often used.In this study, six different Global Sensitivity Analysis methods: Sobol, FAST, Morris, LH-OAT, RSA and PAWN are tested on the three conceptual rainfall-runoff models: GR4J, Hymod and HBV with increasing complexity and the number of parameters.The methods are compared according to the three criteria: effectiveness, efficiency and convergence.
The results of each method are not exactly identical, but still similar to each other.All of the methods are proven to be effective.Methods from the same category show similar results as they are based on similar principles.The credibility of density-based methods is slightly undermined for two reasons: first, the reliability of eCDF produced may not be always high; second, the use of Kolomogorov-Smirnov statistic to compute the Sensitivity Indices lead to slow convergence.
The evaluation of each method's efficiency demonstrates that GLS methods as Morris and LH-OAT are very efficient and stable in computing SI and ranking.Sobol method can provide quantitative results of SA, but it requires large number of runs to obtain stable results.eFAST is much more stable and efficient than Sobol, thus it may be seen as a good alternative for Sobol method.The efficiency of density-based methods is not so high, but RSA can give reliable results of ranking with small number of runs.
All the methods need significant number of runs (>8000) to reach convergence.The uncertainty in the values of Sensitivity Indices is not negligible.One should be careful when interpreting the results if the number of samples is not sufficiently large.
The difference in efficiency of different methods may be due to the difference in the underlying principles.Methods based on simple concepts are more efficient and stable.Methods based on the more complex concept seem to be less stable and efficient, however, their quantification of sensitivity is more accurate and reliable.
The presented recommendation for choosing SA methods, and the framework for SA and UA based on effectiveness, efficiency and convergence, as well as ease of integration with the models, add to other useful SA frameworks (workflows) (e.g.Pianosi et al., 2016), and may be of assistance for practitioners assessing reliability of their models.

Notes:
k is the number of parameters; N is the base sample size, Ms is the number of higher harmonics to be considered; Ncs is the number of search curves; Nu is the number of samples for constructing unconditional CDFs; n is the number of conditioning values for each parameter, Nc is the number of samples for constructing conditional CDFs.For the detailed explanation of 5 the parameters within each method please refer to the literature referred in Sect.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.suggestions are only made based on dissimilar studies and literature reviews, a comprehensive comparison of SA methods applied to one case study is lacked in their studies.
GLS methods and variance-based methods are moment-dependent approaches, which use the first moment (first-order derivatives) or the second moment (variance) to compute Sensitivity Indices.The density-based methods do more, and explore PDFs or CDFs of the output.Sensitivity is measured by the comparison of unconditional PDF derived from purely random samples and conditional PDF derived when prescribing one factor.Entropy-based sensitivity measures (Park and Ahn, 1994; Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.
to become the HBV-96 model.In this study, a simplified version of the HBV-96 model is used.It consists of the three main modules, which is characterized as tank respectively, with 13 parameters: four of the parameters are related to snow accumulation and melt module, four with soil moisture accounting module and five with river routing and Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.response module.For river routing and response module, two runoff reservoirs are included.The upper non-linear reservoir accounts for the quick flow and the lower linear reservoir accounts for the base flow.Since there is little snowfall in the applied case study, the snow accumulation and melt module are excluded, so only nine parameters will be analysed.The name, meaning and ranges of the parameters are shown in Table Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.not very efficient, for the reason already stated in the previous section.RSA performs better than PAWN, especially for GR4J and Hymod model, for it can get stable SI at early runs.For PAWN method, the minimum number of runs to obtain reliable results is larger than for the other methods, and the reason is that it needs sufficiently large number of samples to create smooth eCDFs.Besides, the sample size of the conditioning values will affect the conditional eCDFs.It also needs sufficient number of samples of conditioning values to cover the factor space well: this results in high computational cost since for each conditioning value k*Nc runs of the model are needed.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.From the results above we can see that different methods show different performance in computing SI, ranking and convergence.One of the reason is that there are different theories, concepts and principles behind each method, and methods of the same category (sharing similar principles) show similar results.Comparing their performance within each category, it can be seen that GLS methods have the highest efficiency and fastest convergence speed.Variance-based and density-based methods perform less well.GLS methods use first-moments to compute SI and the principles they use are relatively simple.
Hydrol.Earth Syst.Sci.Discuss., https://doi.org/10.5194/hess-2018-78Manuscript under review for journal Hydrol.Earth Syst.Sci. Discussion started: 28 February 2018 c Author(s) 2018.CC BY 4.0 License.Future work will be aimed at considering more SA methods (the first candidate being VARS,Razavi and Gupta, 2016a,   2016b), developing quantitative and more informed measures for their assessment, and testing the results and recommendations against other types of models and scenarios of their practical use.

Fig. 6 .
Fig. 6.Scatter plot of RMSE against parameter values with 10000 runs for HBV model.

Table 4 .
Experimental set-up for evaluation of SA methods. 2.