A time series is any set of data collected at regular intervals over a period of time. Hydrological time series underpin many important water decisions. In this article we look at six things to consider when analysing them.

## Why time series are important

Time series can be used to:

- study the past behaviour of hydrological extreme events
- forecast water availability in the future
- evaluate how well a water decision has performed in practice.

We are in an opportune era for obtaining and analysing time series records, for several reasons: real-time monitoring datasets have become widespread, data storage technology has improved significantly, data can be shared quickly via the internet, and access to international archives has never been easier.

However, analysing time series has not become any easier; we still need to overcome many issues. Being aware of them yields valuable insights. Ignoring them can produce misleading conclusions.

## 1: The curse of the Digital Era: Data is rich, but quality is poor

We can now easily get access to data. A *lot* of data! But the key aspect of time series analysis is not data quantity, but *quality*. It is definitely better to have 100 time series with no missing data than 1,000 time series that are each missing 50% of their values.

Figure 1a is a graph of mean daily discharge from 1962 to 2009 at a stream gauge in South Australia. With that temporal coverage, you might expect to extract some useful information. Unfortunately, the record has a lot of missing data, especially from 1977 through 1996, which significantly reduces its value.

A time series with many missing data points is not the worst case: the quality problem is obvious, and we can quickly decide to exclude the series from our analysis.

Unfortunately, there are subtler problems. Figure 1b illustrates a typical example of an “interesting” pattern: the peak discharge values in this series remain identical for over 50 years *(exactly 6.5 cubic metres per second from the 1950s through to the 2000s)*. In this situation, we need to look carefully at the upstream catchment for an explanation (e.g. a dam capping the maximum streamflow).

The problem is that you usually have to deal with numerous time series (as we have a lot of data), and you are unlikely to have time to carefully check every single one showing an “interesting” pattern. You therefore need your own strategy for dealing with data quality; in some cases, removing all series with implausible patterns may be the best option.
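One practical way to triage a large batch of series is a quick automated screen for the two problems above. Here is a minimal sketch in Python; the thresholds (20% missing, five identical peaks) are illustrative assumptions, not standard values.

```python
import numpy as np

def screen_series(q, max_missing_frac=0.2, n_repeat_peaks=5):
    """Flag a daily discharge series (NaN = missing) for quality problems.

    Both thresholds are illustrative assumptions, not standard values.
    """
    q = np.asarray(q, dtype=float)
    flags = []
    # 1. Too much missing data greatly reduces the value of the record.
    missing_frac = np.isnan(q).mean()
    if missing_frac > max_missing_frac:
        flags.append(f"{missing_frac:.0%} missing")
    # 2. A maximum that repeats exactly (like 6.5 m3/s for five decades)
    #    often signals an upstream control such as a dam capping the flow.
    valid = q[~np.isnan(q)]
    if valid.size and (valid == valid.max()).sum() >= n_repeat_peaks:
        flags.append(f"max {valid.max()} repeats {(valid == valid.max()).sum()} times")
    return flags

# A short series capped at 6.5, with one gap:
print(screen_series([1.2, 3.4, 6.5, np.nan, 6.5, 2.1, 6.5, 6.5, 6.5, 0.8]))
```

A screen like this does not replace careful inspection; it only tells you which series deserve a closer look (or removal) first.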

## 2: We need strategies when data assumptions are not met

Many time series analyses involve statistical tests, and every test comes with assumptions about the data. Before starting an analysis, we should appreciate not only the quality of our data but also the assumptions behind the test we are about to use.

A common assumption is that the data are normally distributed, and test results may be meaningless if the data follow a strongly non-normal distribution. Extreme variables, however, are usually not normally distributed. For example, annual peak discharge series generally have a positively skewed distribution, and the generalized extreme value (GEV) and log-Pearson Type III (LP3) families are widely considered appropriate [1].

The independence of data values is another assumption whose violation can lead to misleading results. In hydrology and climatology this assumption is often not met: we need to consider autocorrelation within a single time series (correlation from one time step to the next) and spatial correlation among two or more series. This is one reason flood behaviour is so difficult to characterise.
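A quick way to check for persistence before testing is the lag-1 autocorrelation coefficient. A minimal sketch with synthetic data (the AR(1) persistence value of 0.7 is purely illustrative, chosen to mimic the carry-over common in streamflow records):

```python
import numpy as np

rng = np.random.default_rng(3)

def lag1_autocorr(x):
    """Lag-1 autocorrelation coefficient of a series."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Synthetic AR(1) series: each value carries over 70% of the previous
# one plus noise -- illustrative, not a real gauge record.
n, phi = 500, 0.7
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

print(round(lag1_autocorr(x), 2))  # close to 0.7
```

A value well above zero warns you that the "independent observations" assumption of many standard tests does not hold for this series.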

If you need to run a statistical test whose assumptions may not be met, **prepare a strategy** for evaluating the results. For example, resampling techniques can be used to evaluate the significance of tests that rely on a normality assumption.
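As a sketch of such a strategy, here is a simple permutation (resampling) test for a trend: instead of assuming normality, the null distribution of the fitted slope is built by shuffling the observations in time. The series below is synthetic, with an illustrative skewed (Gumbel) distribution and an added trend.

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_trend_pvalue(y, n_perm=2000):
    """Two-sided permutation p-value for a linear trend in y.

    No normality assumption is needed: the null distribution of the
    slope is built by randomly shuffling the observations in time.
    """
    t = np.arange(len(y))
    slope = np.polyfit(t, y, 1)[0]
    null = np.array([np.polyfit(t, rng.permutation(y), 1)[0]
                     for _ in range(n_perm)])
    return float((np.abs(null) >= abs(slope)).mean())

# Synthetic skewed annual-maximum series with an added upward trend:
y = rng.gumbel(loc=10, scale=3, size=60) + 0.08 * np.arange(60)
print(permutation_trend_pvalue(y))
```

Note that simple shuffling assumes the values are exchangeable; for strongly autocorrelated series (see the previous point) a block-resampling variant would be more appropriate.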

## 3: Choosing the right reference period is a challenging task

To obtain insight from a time series, choosing the right period to analyse is very important. The choice depends on the objective of the analysis and, sometimes, on the series itself (i.e. its temporal coverage). Unfortunately, different periods can produce very different results.

**Imagine** that you are deciding whether to invest more money in a flood defence system for a flood-prone region. You may want to analyse the peak flow record over the last four decades. Detecting an increasing trend in this series (figure 2a, below), you might conclude that improving the flood defence system is a wise decision. But if you extend the period to the last eight decades (figure 2b, below), it turns out that what you found was just the transition from a flood-poor to a flood-rich phase within an overall cycle of flooding. With this bigger picture, you can expect another flood-poor phase in the near future. The “great decision” is no longer so great, because shifting the money in another direction (e.g. to water storage, to prepare for a potential dry period) may be more cost-efficient. To ensure the reliability of your analysis, it is therefore good practice to analyse a time series over several different periods.
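As a toy illustration (with a synthetic record, not the data behind figure 2), fitting a trend to the last four decades versus the full eight decades of a cyclical series can give very different answers:

```python
import numpy as np

rng = np.random.default_rng(0)

def trend_per_decade(y):
    """Least-squares slope, expressed per 10 time steps."""
    t = np.arange(len(y))
    return 10 * np.polyfit(t, y, 1)[0]

# Synthetic 80-year peak-flow record: a slow flood-poor/flood-rich
# cycle (period ~80 years) plus noise -- illustrative values only.
years = np.arange(80)
flow = 100 + 20 * np.sin(2 * np.pi * (years - 60) / 80) + rng.normal(0, 5, 80)

print("last 40 years:", round(trend_per_decade(flow[40:]), 1))  # strong rise
print("full 80 years:", round(trend_per_decade(flow), 1))       # near zero
```

The last four decades sit on the rising limb of the cycle and show a strong upward trend, while the full record shows almost none; the same effect can occur with real flood-rich/flood-poor phases.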

## 4: Uncertainty is all around us

A common mistake in time series analysis is expecting too much certainty. Uncertainty is all around us, and we should account for it. We can, to some extent, “quantify” uncertainty through probability, but probability always comes with conditions.

In a hypothesis test, the condition is that the null hypothesis is true. We then check whether the observed series is consistent with this hypothesis, typically by estimating the *p-value* of a test statistic.

The *p-value* is the probability, assuming the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed in our data. A low *p-value* means the observed result would be unlikely under the null hypothesis, and therefore provides strong evidence against it. Some of us may assume that a low *p-value* indicates low uncertainty in the analysis, but one thing remains uncertain: whether the null hypothesis is true in the first place.
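To make the conditional nature of the p-value concrete, here is a toy calculation with made-up numbers (a hypothetical flood-count example, not from any real gauge):

```python
import numpy as np

rng = np.random.default_rng(1)

# Null hypothesis (assumed true for the calculation): the annual count
# of large floods follows a Poisson distribution with mean 2.0.
# Both numbers are hypothetical, chosen only to illustrate.
null_mean = 2.0
observed = 5  # floods observed this year

# The p-value is the probability, UNDER THE NULL, of a result at least
# as extreme as the one observed.
sims = rng.poisson(null_mean, 100_000)
p_value = float((sims >= observed).mean())
print(round(p_value, 3))  # about 0.05
```

The whole calculation happens inside a world where the null is true; the p-value says nothing about the probability that the null itself is correct.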

Because of uncertainty, we need to be extremely careful in communicating the results of time series analyses. The assumptions need to be clarified, the conditions need to be explained, and the probability of the results needs to be clearly discussed. Lacking these components, we risk over-interpretation.

## 5: Causation or correlation? That is the question.

Time series can tell you about important changes in the real world. If you find one, you may want to know the reason behind it, and so you may try to identify consistent patterns across different time series.

Here comes the tragedy: *correlation* is easily confused with *causation*. If two things are related to each other, some of us will conclude that one causes the other. Keeping the difference between correlation and causation in mind is a lesson that never gets old.

In figure 3, you will find that the time series of precipitation in Illinois (US) correlates very well with US cheese consumption (from 2000 to 2009, the correlation coefficient is 0.84). If correlation were enough to establish causation, precipitation would be the reason behind cheese consumption (or vice versa)! So much for common sense.
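A toy sketch (with synthetic numbers, not the real datasets behind figure 3) shows how two series that merely share a trend can correlate strongly:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two unrelated series that both drift upward over ten years.
# The values are synthetic stand-ins, not the real Illinois or
# cheese-consumption data.
years = np.arange(10)
a = 50 + 2.0 * years + rng.normal(0, 1.5, 10)  # "precipitation index"
b = 30 + 1.2 * years + rng.normal(0, 1.0, 10)  # "cheese consumption"

r = np.corrcoef(a, b)[0, 1]
print(round(r, 2))  # high, although the series share nothing but a trend

# Correlating year-to-year *changes* removes the shared trend and
# usually deflates a purely trend-driven correlation.
r_diff = np.corrcoef(np.diff(a), np.diff(b))[0, 1]
print(round(r_diff, 2))
```

Any two quantities that grow steadily over the same decade will correlate; differencing (or detrending) is a simple first check before reading anything causal into the number.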

You can find 30,000 more graphs of “spurious correlations” on a single website. Be careful: one day, your finding may become one of them.

Of course, in your own analysis it is far more difficult to work out what is merely a correlation. You understand your field so well that every factor you consider looks like a plausible cause. You can, however, reduce the uncertainty in your findings by analysing both consistent and inconsistent relationships.

By removing all factors that show inconsistent patterns, step by step, you may pinpoint the true cause.

## 6: Time Series Analysis may not be well communicated

As a time series analyst, the last thing you want to hear is a complete misinterpretation of your findings. Carefully communicating the results of a time series analysis is therefore very important, as we mentioned back in point 4.

However, if your research is significant, your analyses produce interesting insights, and your findings are echoed by many news channels, the results may be over-interpreted by other people. You can see the problem in the news cycle pictured below. It is not you, but others, who forget the assumptions, ignore the conditions, and mistake correlation for causation. Scientific findings, obtained under strict conditions, are then transformed into alarming news. Unfortunately, there is not much you can do in this situation, so all you can do is be happy that your analysis has been noticed.

## Time series analysis is not easy, but it is important

Analysing time series has never been an easy task, but it has never stopped attracting people, because so much important information can be revealed by it. The Intelligent Water Decisions team has been working with millions of time series over the years to help create great water and environmental decisions. Get in touch if you’d like help with yours.

**Are you also working with time series? We would love to hear your stories!**

**Hong Do is a PhD candidate in the Intelligent Water Decisions team. He is working with time series to find out what is causing changes in global floods.**

[1] Kuczera G and Frank S, Australian Rainfall and Runoff, Book IV, Estimation of Peak Discharge-Draft, Engineers Australia, Jan 2006, viewed 17 March 2016.

[2] Hall J et al., Understanding flood regime changes in Europe: a state-of-the-art assessment, Hydrol. Earth Syst. Sci., 18, 2735-2772, doi:10.5194/hess-18-2735-2014, 2014.

## What do you think?