7 Statistical Analysis of Hydrologic data, Hydrology Frequency Analysis

Ranjana Ray Chaudhuri

Objectives:

To explain, at an introductory and theoretical level, why statistics is needed in hydrology.
To understand basic definitions in statistics and probability, the occurrence of extreme events, and probability distributions.

Introduction

Many hydrological processes exhibit substantial variability that cannot be explained by the laws of physics, chemistry, biology or climatology alone. They are therefore subject to chance, hence the importance of statistics in hydrology. Hydrologic variables such as precipitation are difficult to explain because of their inherent randomness and because of the randomness of the hydrologic system in which they operate, such as the watershed. A second source of variability is sampling error: hydrologists must often predict from small samples of the population, because data are available for a short period only. Precipitation data, soil data and infiltration values are also collected at only a few points in the watershed, yet these limited data sets are used to describe the characteristics of the entire watershed. As the length of record grows over the years, the accuracy of prediction improves. However, statistics must go hand in hand with an understanding of hydrological processes; only then will the study be robust. Although a large literature on statistics and statistical methods exists, general texts on statistics do not explain its need and application in hydrology.

Basic concepts of statistics and probability in hydrology

In hydrology most data are observations rather than the results of controlled experiments. Once an event such as a rainstorm has occurred it cannot be repeated, and an extreme event such as heavy rainfall or a flood does not recur in the same form. Statistics and probability therefore offer insight into the expected magnitude and variability of future observations. Statistics is a tool for inferring the properties of a population from the properties of a sample, while probability gives the likelihood of an event when the population characteristics are known.

Hypothesis testing

Once the hydrologist has established the characteristics of a sample of hydrologic data such as stream flow (annual flood flow), aquifer flow or rainfall, causative relationships remain to be established. Typical questions are: have the annual flood peaks increased over time due to anthropogenic changes? Does the groundwater in an aquifer meet drinking water quality standards? Has the concentration of a pollutant in river water increased over time, or does it increase only during one season? Each of these practical questions involves a causative agent that the hydrologist must account for. The agent may be river basin development or a change in land use and land cover in the case of rising annual flood peaks, agricultural return flow in the case of groundwater contamination, or an industry that operates only in certain seasons in the case of river water quality. These questions can be translated into statistical hypotheses, such as the following:

Null hypothesis H0, which is usually a hypothesis of no change. Examples of no change in hydrology are as follows:

The distribution of aquifer hydraulic conductivity is identical at two widely separated points in the same aquifer.

The concentration of a pollutant in the river does not depend on flow.

Alternative hypothesis H1, which is the hypothesis of change: some departure from H0 is expected. In contrast to the null hypotheses above, the distribution of hydraulic conductivity at two points in the same aquifer may differ because of certain causative factors, and the concentration of the pollutant is related to river flow during the season under consideration.

One-sided test, a hypothesis test in which H1 is a departure from H0 in one direction only. For example, the hydraulic conductivity in the aquifer changes only from a point of higher gradient to a point of lower gradient, or the concentration of the pollutant only increases with increasing flow.

Two-sided test, in which H1 is a departure from H0 in either direction. The hydraulic conductivity can change in either direction, or the concentration changes with both an increase and a decrease in river flow.

Hypothesis testing measures the strength of the statistical evidence: does the evidence give sufficient reason to reject the null hypothesis H0 in favour of the alternative hypothesis H1?

A common test of hypothesis concerns the mean: the null hypothesis is that the population mean equals a specified value, so that any difference from the sample mean is due to chance, while the alternative hypothesis is that the two differ.
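A test about the mean can be sketched in a few lines. The example below computes the one-sample t statistic for a two-sided test. The flood peaks, the hypothesized mean `mu0` and the critical value 2.262 (two-sided, 5% level, 9 degrees of freedom, from standard t tables) are illustrative assumptions and are not taken from this module.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t statistic for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)            # sample standard deviation (n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))

# Hypothetical annual flood peaks in m^3/s (illustrative values only)
peaks = [520, 610, 580, 700, 640, 590, 660, 720, 610, 670]
mu0 = 600.0                                  # hypothesized population mean
t_crit = 2.262                               # two-sided, 5 % level, 9 d.o.f.

t = t_statistic(peaks, mu0)
reject_h0 = abs(t) > t_crit                  # two-sided decision rule
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

For these made-up values the statistic is smaller in magnitude than the critical value, so H0 is not rejected. A one-sided test would compare `t` itself (not its absolute value) with a one-sided critical value.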

Extreme rainfall events and floods are very complicated natural events. They occur when many parameters and variables combine, so conventional models such as the rational method for peak runoff from a catchment or the unit hydrograph method do not yield good results for them. Some variables are always involved: the catchment characteristics, the rainfall intensity and duration, and the antecedent conditions. Each of these factors in turn depends on a host of other parameters. The statistical approach to hydrology is therefore used for the prediction of flood flows and rainfall events.

However, the use of statistics in hydrology is limited by the fact that many hydrological processes cannot be reduced to a formula, because of the variability between events. There are three major obstacles to statistics in hydrology: the inherent randomness of water-related events and hence of the variables, substantial sampling errors, and an incomplete understanding of the processes involved.

The last of these can be reduced (or controlled) by exploring a process fully before applying statistical methods, although the extent to which such understanding can be used is itself limited by the availability of data and the type of system studied. Applying statistics without understanding the process is over-reaching, because of the inherent limitations of the situation itself. The exploration of a process must be equally broad: the fundamental laws of physics, chemistry and biology should all be incorporated in any mathematical model, respecting the need for concept-based models. It must also be understood that some models have to be regression-based to capture the minutiae of hydrological processes, and that for some processes no model may exist at all.

Furthermore, many hydrological data already collected show anomalies when subjected to statistical enquiry. These often take the form of skewed distribution functions, lack of independence among variables, censoring due to natural events, or seasonal patterns. The reason is that statistical formulae are derived from results that repeat across experiments, whereas hydrological events do not repeat; in hydrology the method can therefore really be used only to define expected outcomes, not to model the events themselves.

Statistics works with the characteristics of a sample taken from a population, where a population can be defined as a collection of objects whose measurable properties are of interest. The characteristic of interest is often the median of the population. Probability gives an idea of the outcomes to expect, given the characteristics of the population. Statistics, on the other hand, is the process of drawing inferences about those same characteristics from the sample. In estimating a characteristic such as the median, we test hypotheses that we set up on the basis of probability.

A further problem in defining the population is that of sampling. A population is sometimes finite, so that the characteristics of every member can be determined, but usually the researcher has only a sample of the total population. It is then important to understand the characteristics of the sample first and then relate them to the properties of the population.

Sampling is of four basic types. The first is random sampling, in which each member of the population under study has an equal chance of being selected. Random sampling can also be applied after dividing the population into groups, with the method applied to each group; this is called stratified random sampling. In contrast to the random methods, uniform sampling imposes a strict rule on the sampling points, making them equally distant from one another. Finally, in convenience sampling, data are collected wherever and whenever it is convenient for the investigator. The two forms of random sampling and uniform sampling are usually considered ideal. Uniform sampling has the logistical advantage of minimizing serial dependence between samples, while stratified sampling is used only when the groups thus formed show substantial variability.
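The three rule-based sampling schemes can be sketched as follows, using Python's standard `random` module. The station IDs, the two strata and the function names are hypothetical and serve only as illustration; convenience sampling is left out because it follows no rule that can be coded.

```python
import random

def simple_random_sample(population, k, rng):
    """Random sampling: every element has the same chance of selection."""
    return rng.sample(population, k)

def stratified_random_sample(strata, k_per_stratum, rng):
    """Stratified random sampling: sample randomly within each group."""
    return {name: rng.sample(items, k_per_stratum)
            for name, items in strata.items()}

def uniform_sample(population, step, start=0):
    """Uniform sampling: take equally spaced elements (every step-th one)."""
    return population[start::step]

rng = random.Random(42)                        # fixed seed for repeatability
stations = list(range(1, 21))                  # hypothetical station IDs 1..20
strata = {"upland": stations[:10], "lowland": stations[10:]}

print(simple_random_sample(stations, 5, rng))
print(stratified_random_sample(strata, 2, rng))
print(uniform_sample(stations, 4))
```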

Convenience sampling is often avoided because of the possibility of sampling bias. Biased or non-representative sampling can lead to biased conclusions, in which the relationship between the properties of the sampled population and those of the target population is distorted; inferences drawn from a biased sample apply only to the sampled population. Bias can arise with the other sampling methods as well, for example when a stratified sample is analysed as if it were a simple random sample. The problem also arises with historical samples, because of missing data or gaps in knowledge of historic conditions.

An estimator is a procedure for describing a property of a population from a sample, and the value it produces is an estimate, also called a statistic. An estimator is biased when its results, over repeated use, depart systematically from the true population value. The basic statistics used are the arithmetic mean, median, mode and standard deviation.

The statistical approach to flood frequency analysis estimates the design flood from past stream flow data of maximum annual flood flow, which may be taken from direct observations or estimated by a suitable method. Frequency analysis is likewise conducted on the available record of maximum annual rainfall events of the region. The probability that an event, here the maximum flood discharge likely to occur in a year at a location, has a magnitude equal to or greater than a given magnitude X is denoted by P. The return period T is defined as the inverse of P, that is, T = 1/P.
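The relation T = 1/P can be illustrated with a short sketch. Here P is estimated crudely as the fraction of years in which the annual maximum equalled or exceeded X. The discharge values and function names are hypothetical, and the raw relative frequency is only one simple way of estimating P from a record.

```python
def exceedance_probability(annual_maxima, x):
    """Fraction of years whose annual maximum is equal to or greater than x."""
    exceed = sum(1 for q in annual_maxima if q >= x)
    return exceed / len(annual_maxima)

def return_period(p):
    """Return period T = 1 / P, in years."""
    if not 0 < p <= 1:
        raise ValueError("P must lie in (0, 1]")
    return 1.0 / p

# Hypothetical annual maximum discharges in m^3/s (illustrative only)
annual_maxima = [410, 530, 380, 620, 450, 700, 490, 560, 430, 510]

p = exceedance_probability(annual_maxima, 600)   # 2 of the 10 years reach 600
print(p, return_period(p))
```

In this made-up record two years out of ten reach 600 m³/s, so P = 0.2 and T = 5 years.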

In hydrology, estimating the magnitude of an event (storm or flood) corresponding to a given return period is of utmost importance. This is done through statistical analysis of past records of flood and rainfall events in order to predict future events. The studies use records of daily, monthly and annual rainfall and stream flow to estimate large storm events and flood flows. Past records often do not contain events as large as those to be estimated, so extrapolation techniques are used. However, the sample may be too small to allow accurate extrapolation, and in most situations the data are inadequate to determine the risk due to large flood peaks, rainfall events, pollutant loadings and low flows. Statistical methods then help to make such predictions with reasonable accuracy. Low flows are important in times of drought, when pollutant loading combined with low flow aggravates the situation. Hydrologists combine practical knowledge of the events with robust statistical techniques to develop reasonable estimates of risk. The distributions explained here describe the magnitude of a single variable, such as annual flood peaks, low flows (usually 7-day low flows) or maximum daily rainfall (24-hour period).

The collection of past data, that is the records of rainfall or flood events, is termed the population. Annual maximum floods are recorded at a particular point on the river, and there are numerous such points along a river where data are collected for flood frequency studies. Similarly, for rainfall events, rainfall station data are collected and the annual maximum is used. This is the population in the usual statistical sense, and the sample is taken from it. The record of annual maximum floods observed at a site covers a finite number of years and therefore serves as the sample of the population. The sample is assumed to represent the population: although the record covers a finite number of years, the same behaviour is assumed to continue in the future. The sample is also assumed to be random, so that each peak flow or rainfall event is independent of the others and the value of the variable does not depend on the previous or next value.

The data are first checked for consistency, and features such as trends and extreme rises and falls are identified. Trend analysis is carried out over a 30-year or 50-year data set and often shows a rising or falling trend. A trend points towards anthropogenic interference, such as a change in watershed characteristics from agricultural land to urban land, deforestation, or new afforestation drives. All of these affect the hydrological cycle at the local level, which shows up as rising or falling runoff and can be detected through trend analysis. Similarly, outliers may be seen when studying a data set covering thirty or fifty years. Exceptional runoff values often point towards events such as landslides, earthquakes and cloudbursts.
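One simple way to look for a trend is to fit a straight line to the series by least squares and inspect the slope. The module does not prescribe a particular trend test, so the method, the function name and the synthetic 30-year runoff series below are assumptions made for illustration. The sketch does not assess whether the slope is statistically significant.

```python
def linear_trend(years, values):
    """Least-squares slope and intercept of values against years."""
    n = len(years)
    mean_t = sum(years) / n
    mean_v = sum(values) / n
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, values))
    sxx = sum((t - mean_t) ** 2 for t in years)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t

# Synthetic annual runoff (mm): rises by 2 mm per year with a +/-5 mm wobble
years = list(range(1991, 2021))
runoff = [300 + 2 * (y - 1991) + (5 if y % 2 else -5) for y in years]

slope, intercept = linear_trend(years, runoff)
print(f"trend = {slope:.2f} mm per year")
```

A clearly positive slope suggests rising runoff and a negative slope falling runoff. A formal test would be needed before attributing the trend to a cause such as urbanization.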

The statistical population represented by the data set is assumed to follow a convenient probability distribution. A distribution is an attribute of the statistical population. The maximum annual flood values, also known as observed peak values, are random values representative of the population. The probability of a variate, for example an annual flood flow of a given magnitude, is defined as the number of occurrences of that variate divided by the total number of occurrences. The probabilities of all the variates sum to one. The distribution of probabilities over all the variates is called the probability distribution, and the curve describing it is the probability density function (pdf), plotted on the y axis as f(x) in figure 1. The cumulative distribution function (cdf) is shown in figure 2.

Statistics uses many frequency distributions f(x); those commonly used in hydrology are as follows:
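The idea that relative frequencies sum to one, and that accumulating them gives a cumulative distribution, can be sketched with a simple class count. The flood peaks and class boundaries are hypothetical, and the result is an empirical approximation to f(x) and the cdf rather than a fitted distribution.

```python
from itertools import accumulate

def relative_frequencies(values, edges):
    """Share of values falling in each class [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

# Hypothetical annual flood peaks in m^3/s (illustrative only)
peaks = [410, 530, 380, 620, 450, 700, 490, 560, 430, 510]
edges = [300, 400, 500, 600, 700, 800]         # class boundaries

freq = relative_frequencies(peaks, edges)       # empirical counterpart of f(x)
cum = list(accumulate(freq))                    # empirical counterpart of the cdf
print(freq)
print(cum)
```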

Normal

Log Normal

Pearson Type III

Gumbel

The given data set is checked against these standard distributions to find which offers the best fit. The chosen distribution is then used to calculate flood magnitudes for the corresponding return periods. Fitting the data to the most appropriate distribution is important so that extrapolation to frequencies beyond the range of the data set can be carried out. Whichever of the standard distributions below is fitted to the annual flood flows, low flows, rainfall or water quality variables, its parameters must be estimated from the data in order to define the distribution. In the case of the normal distribution, the two parameters µ (mean) and σ² (variance) are estimated using the method of moments or the method of maximum likelihood.
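For the normal distribution the method of moments and the method of maximum likelihood lead to the same estimates: the sample mean for µ and the mean squared deviation (dividing by n) for σ². The sketch below computes these, together with the common bias-corrected variance that divides by n - 1. The data and the function name are hypothetical.

```python
def normal_parameters(sample):
    """Estimate mu and sigma^2 of a normal distribution from a sample."""
    n = len(sample)
    mu = sum(sample) / n
    ss = sum((x - mu) ** 2 for x in sample)
    var_ml = ss / n                  # moments / maximum-likelihood estimate
    var_unbiased = ss / (n - 1)      # bias-corrected sample variance
    return mu, var_ml, var_unbiased

# Hypothetical mean annual flows in m^3/s (illustrative only)
flows = [520, 610, 580, 700, 640, 590, 660, 720, 610, 670]

mu, var_ml, var_unbiased = normal_parameters(flows)
print(mu, var_ml, round(var_unbiased, 2))
```

The two variance estimates differ noticeably for short records and converge as the record length grows.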

Normal Distribution

The normal distribution is one of the most commonly used distributions. It is bell shaped (figure 1) and symmetrical, with a coefficient of skewness equal to zero. The normal distribution is used to study the average annual stream flow or the average annual pollutant loading in a stream. The natural parameters of the normal distribution are µ and σ².
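Once µ and σ are known, the magnitude with a given return period follows from the inverse of the normal cdf. The sketch below uses `statistics.NormalDist` from the Python standard library and takes the non-exceedance probability as 1 - 1/T. The parameter values are hypothetical.

```python
from statistics import NormalDist

def normal_quantile(mu, sigma, T):
    """Magnitude equalled or exceeded on average once in T years (normal model)."""
    p_non_exceed = 1.0 - 1.0 / T
    return NormalDist(mu, sigma).inv_cdf(p_non_exceed)

# Hypothetical parameters of mean annual flow in m^3/s (illustrative only)
mu, sigma = 630.0, 60.0

for T in (2, 10, 100):
    print(T, round(normal_quantile(mu, sigma, T), 1))
```

Because the distribution is symmetrical, the 2-year value coincides with the mean.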

Log normal distribution (two parameters)

This distribution is also seen in hydrology, when the factors governing a hydrological variable act multiplicatively rather than additively. The frequency distribution of such a variable is skewed, but the logarithm of the variable follows a normal distribution. µ, σ and x0 are called the scale, shape and location parameters; x0 is usually equal to zero.

The Pearson Type III distribution is a three-parameter distribution, also called the three-parameter gamma distribution. µ, β and γ are its location, scale and shape parameters.

The most commonly used distribution for flood frequency analysis is the Gumbel distribution, which is the extreme value Type I (EV I) distribution.
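Fitting a log-normal distribution amounts to taking logarithms and treating the result as normal. The sketch below estimates the mean and standard deviation of ln(x - x0) and converts a return period into a magnitude. The flood peaks and function names are hypothetical, and x0 is set to zero as is usual.

```python
import math
from statistics import NormalDist, mean, stdev

def fit_lognormal(sample, x0=0.0):
    """Mean and standard deviation of ln(x - x0)."""
    logs = [math.log(x - x0) for x in sample]
    return mean(logs), stdev(logs)

def lognormal_quantile(mu_log, sigma_log, T, x0=0.0):
    """Magnitude with return period T under the fitted log-normal model."""
    z = NormalDist().inv_cdf(1.0 - 1.0 / T)     # standard normal variate
    return x0 + math.exp(mu_log + sigma_log * z)

# Hypothetical, positively skewed flood peaks in m^3/s (illustrative only)
peaks = [120, 340, 95, 410, 230, 180, 760, 150]

mu_log, sigma_log = fit_lognormal(peaks)
for T in (2, 10, 50):
    print(T, round(lognormal_quantile(mu_log, sigma_log, T), 1))
```

The 2-year value equals the geometric mean of the sample, which lies below the arithmetic mean for skewed data.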

In terms of the reduced variate, $z = (x - \mu)/\beta$

The probability density function (pdf) becomes $f(z) = e^{-z - e^{-z}}$
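The density of the reduced variate can be checked numerically. The sketch below codes f(z) and, as an addition not stated in the text, the standard Gumbel cumulative form F(x) = exp(-exp(-(x - µ)/β)). A crude trapezoidal sum confirms that the density integrates to about one. The function names are hypothetical.

```python
import math

def gumbel_pdf_reduced(z):
    """Density of the reduced variate: f(z) = exp(-z - exp(-z))."""
    return math.exp(-z - math.exp(-z))

def gumbel_cdf(x, mu, beta):
    """Non-exceedance probability F(x) = exp(-exp(-(x - mu) / beta))."""
    z = (x - mu) / beta
    return math.exp(-math.exp(-z))

# Trapezoidal check that the density integrates to about one over [-10, 20]
step = 0.01
grid = [-10 + i * step for i in range(3001)]
area = sum(0.5 * (gumbel_pdf_reduced(a) + gumbel_pdf_reduced(b)) * step
           for a, b in zip(grid, grid[1:]))
print(round(area, 4))
```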

The Gumbel distribution is often applied in the simplified form below. The general equation of hydrologic frequency analysis, as given by Chow et al. (1988), is:

$X_T = \bar{x} + K\sigma$

where $X_T$ is the value of the variate X of a random hydrologic series with return period T, $\bar{x}$ is the mean of the variate, K is the frequency factor, which depends on the return period T, and $\sigma$ is the standard deviation of the variate. For the Gumbel distribution,

$K = (y_T - \bar{y}_n)/S_n$

The values of $\bar{y}_n$ and $S_n$ are read from tables of Gumbel's extreme value distribution according to the sample size N. When the record is very long (N ≥ 100 years), $\bar{y}_n$ and $S_n$ are taken as their limiting values, 0.577 and 1.2825 respectively.

$y_T = -\ln[\ln(T/(T-1))]$, where the reduced variate $y_T$ is a function of the return period T and ln is the natural logarithm.
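The frequency-factor equations translate directly into code. The sketch below computes the reduced variate, the frequency factor and the flood magnitude for a chosen return period, using the large-sample values 0.577 and 1.2825 as defaults. The function names are hypothetical. The mean of 600 m³/s and standard deviation of 150 m³/s are those of the second solved example in this module.

```python
import math

def reduced_variate(T):
    """y_T = -ln(ln(T / (T - 1))) for a return period T > 1."""
    return -math.log(math.log(T / (T - 1.0)))

def gumbel_quantile(mean, sd, T, yn=0.577, sn=1.2825):
    """X_T = mean + K * sd with K = (y_T - yn) / sn."""
    k = (reduced_variate(T) - yn) / sn
    return mean + k * sd

for T in (10, 50, 100):
    print(T, round(gumbel_quantile(600.0, 150.0, T), 1))
```

For shorter records the defaults `yn` and `sn` should be replaced by the tabulated values for the actual sample size.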

Limitations

This module does not discuss in detail non-normal, skewed distribution functions; outliers (a few values much larger or much smaller than most of the data); lack of independence among observations; distributions that depend on other variables; or the effect of seasonal patterns (statistical characteristics tend to vary with season). These may be considered in the next module.

Solved example

1. The monthly rainfall recorded in millimetres at a station for a period of twelve months is as given below. Determine the mean rainfall, variance, standard deviation, coefficient of variation and coefficient of skewness (taken from Ojha et al., 2008).

The mean and standard deviation have the same dimension as the original data, while the variance is the mean squared deviation and has squared units, so it is not used for direct comparison. The standard deviation is important in statistical hydrology because, owing to extreme events, the highest (or lowest) value in a record is often far from the second highest (or lowest). The coefficient of skewness is an indicator of the symmetry of the distribution: the smaller its magnitude, the closer the distribution is to symmetric, and Cs = 0 for the normal distribution.
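The statistics asked for in this example can be computed as sketched below. The twelve rainfall values are hypothetical and are not the data from Ojha et al. (2008). The skewness formula with the n/((n - 1)(n - 2)) correction is one common choice in hydrology and is an assumption here, since the module does not state which formula it uses.

```python
import math

def describe(sample):
    """Mean, variance, standard deviation, CV and coefficient of skewness."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    sd = math.sqrt(var)
    cv = sd / mean                               # coefficient of variation
    m3 = sum((x - mean) ** 3 for x in sample)
    cs = n * m3 / ((n - 1) * (n - 2) * sd ** 3)  # assumed skewness formula
    return {"mean": mean, "variance": var, "std_dev": sd,
            "cv": cv, "skewness": cs}

# Hypothetical monthly rainfall in mm (illustrative only)
rain = [12, 8, 15, 30, 65, 140, 260, 240, 150, 60, 20, 10]

for name, value in describe(rain).items():
    print(f"{name}: {value:.3f}")
```

A monsoon-type series like this one is positively skewed, whereas a perfectly symmetric sample gives a skewness of zero.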

2. The mean annual flood of a river is 600 m³/s and the standard deviation of the annual flood time series is 150 m³/s. Determine the return period of a flood of magnitude 1000 m³/s occurring in the river.

Use Gumbel's method and assume the sample size to be very large (taken from Engineering Hydrology by K. Subramanya, 2008).

Answer

Using $X_T = \bar{x} + K\sigma$,

Given that $\bar{x}$ = 600 m³/s, $\sigma$ = 150 m³/s and $X_T$ = 1000 m³/s, substituting in the above equation gives the frequency factor K = (1000 - 600)/150 = 2.667.

But $K = (y_T - \bar{y}_n)/S_n$, and for large N, $\bar{y}_n$ and $S_n$ are 0.577 and 1.2825 respectively.

Therefore, 2.667 = ($y_T$ - 0.577)/1.2825

$y_T$ = 3.9970, and $y_T = -\ln[\ln(T/(T-1))]$

Therefore T = 54.9 years (inverting the expression for $y_T$)

So, the return period of a flood of magnitude 1000 m³/s is about 55 years.
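The same calculation can be checked with a short sketch that inverts the reduced-variate relation for T. The function name is hypothetical. The numbers are those of the example, with the large-sample values 0.577 and 1.2825.

```python
import math

def gumbel_return_period(x, mean, sd, yn=0.577, sn=1.2825):
    """Return period of magnitude x by Gumbel's method."""
    k = (x - mean) / sd                  # frequency factor K
    y = yn + k * sn                      # reduced variate y_T
    ratio = math.exp(math.exp(-y))       # equals T / (T - 1)
    return ratio / (ratio - 1.0)

T = gumbel_return_period(1000.0, 600.0, 150.0)
print(round(T, 1))
```

This reproduces the return period of about 55 years obtained by hand.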

Summary

So far we have discussed the extrapolation of data to predict high flood magnitudes and their return periods. It is also possible to assign a probability to each observed data point. If the sample has N data points, for example N annual flood peaks, they are arranged in descending order. The event with the highest magnitude is assigned rank m = 1, the next rank 2, and so on, until the lowest value has rank N. With the Weibull method, the exceedance probability of the event of rank m is P = m/(N + 1). For example, if the highest annual flood value is 1000 m³/s and it is ranked 1 in a data set of 30 years (N = 30), the probability is P = 1/31 and the return period is 31 years. Other formulas, such as the California and Hazen methods, can also be used to calculate the probability of exceedance of X. The common applications of statistics in hydrology and frequency analysis have been discussed, and the solved examples illustrate these applications.
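The ranking procedure with the Weibull formula can be sketched as follows. The 30 annual peaks are synthetic, constructed so that the largest is 1000 m³/s as in the example, and the function name is hypothetical. Tied values are simply given consecutive ranks in this sketch.

```python
def weibull_plotting_positions(peaks):
    """Rank peaks in descending order and assign P = m / (N + 1), T = 1 / P."""
    n = len(peaks)
    ranked = sorted(peaks, reverse=True)
    rows = []
    for m, x in enumerate(ranked, start=1):
        p = m / (n + 1)
        rows.append((m, x, p, 1.0 / p))
    return rows

# Synthetic 30-year record in m^3/s: one peak of 1000 and 29 smaller values
peaks = [1000] + [300 + 20 * i for i in range(29)]

rows = weibull_plotting_positions(peaks)
for m, x, p, T in rows[:3]:
    print(m, x, round(p, 4), round(T, 1))
```

The first row gives rank 1, P = 1/31 and T = 31 years for the 1000 m³/s peak, matching the example in the text.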


References

Chow, V.T., Maidment, D.R. and Mays, L.W. Applied Hydrology, McGraw-Hill, New York, 1988.
Maidment, D.R. Handbook of Hydrology, McGraw-Hill, New York, 1992.
McCuen, R.H. Modeling Hydrologic Change: Statistical Methods, Lewis Publishers, a CRC Press Company, USA, 2003.
Ojha, C.S.P., Berndtsson, R. and Bhunya, P. Engineering Hydrology, Oxford University Press, 2008.
Al-Mashidani, G., Pande, B.B. Lal and Mujda, M.F. A simple version of Gumbel’s method for flood estimation / Version simplifiée de la méthode de Gumbel pour l’estimation des crues, Hydrological Sciences Bulletin, 23(3), 373-380, 1978. DOI: 10.1080/02626667809491810, http://dx.doi.org/10.1080/02626667809491810
Subramanya, K. Engineering Hydrology, Tata McGraw-Hill, 2008.