Influential data have a disproportionate impact on model predictions. This post introduces the concept of influence, a largely untapped resource that can provide new insights into the role of data in model calibration.
Influential data can have a disproportionate impact on predictions
To introduce concepts of influence we will start with a simple hypothetical water decision.
Let’s say we are allocating water supply and want to predict annual crop water use. Crop water use is difficult to measure, so we decide to predict annual crop water use as a function of the annual crop yield (i.e. Water Use = F(Crop yield)). The figure below provides some example data of this relationship between crop yield and water use for a location on the eastern coast of Australia.
Our current crop yield is around 300 hectares/annum and we wish to predict the water usage in 2016. Unfortunately, we only have access to measurements from a field experiment from the 1960s, a time when crop yield and water use were much lower. We fit a regression line which looks something like this:
Reconsidering the spread of data in the scatterplot, it is reasonable to question the highest historical annual crop yield (~200 hectares/annum). The data point is quite a distance from the other data points in both the X and Y directions, so we decide to see what happens if we remove this point from the regression.
To our surprise we see that the highest historical annual crop yield has had quite a large influence on our regression line:
In fact we have reduced our extrapolated water usage at a crop yield of 300 hectares/annum from 250 to 130 mm/annum, an alarming reduction of 48%.
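To make the comparison concrete, here is a minimal sketch of the same experiment: fit a line with and without the highest-yield point, then compare the extrapolated predictions. The numbers in `crop_yield` and `water_use` are made up for illustration and are not the field data behind the figures, so the percentage differs from the 48% above; the helper name `fit_line` is also just an assumption of this sketch. The percentage itself is computed the same way as in the post: (250 - 130) / 250 = 48%.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: the last point plays the role of the
# highest historical crop yield (~200 hectares/annum).
crop_yield = [20, 40, 60, 80, 200]   # hectares/annum
water_use = [40, 50, 60, 70, 180]    # mm/annum
target_yield = 300                   # the yield we extrapolate to

# Prediction using every data point.
intercept, slope = fit_line(crop_yield, water_use)
pred_with = intercept + slope * target_yield

# Prediction after dropping the last (highest-yield) point.
intercept, slope = fit_line(crop_yield[:-1], water_use[:-1])
pred_without = intercept + slope * target_yield

# Relative change caused by removing that single point.
change_pct = 100 * (pred_without - pred_with) / pred_with

print(f"with point:    {pred_with:.1f} mm/annum")
print(f"without point: {pred_without:.1f} mm/annum")
print(f"change:        {change_pct:.1f}%")
```

With only a handful of observations, one point far from the rest in X can set the slope almost by itself, which is why the extrapolated value moves so much.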
So we are faced with a dilemma. Should we include the point or remove it?
On the one hand, if we remove the point we have a resource saving of 48%. Also, if one point is having a disproportionate influence on our model predictions, then maybe it shouldn’t be trusted.
On the other hand, there is no evidence to suggest the point is erroneous. Because it is the data point closest to the value we want to extrapolate to, it may well be the most valuable point in the dataset.
But of concern is this: Had we not recalculated the regression then we would be blissfully ignorant of this dilemma and would have simply gone with the larger annual water use value.
So the question is: What does influence look like? And, perhaps more importantly, what can we do about it?
Playing with influence in a linear regression model
To improve our understanding of how influential data points impact model predictions we will use the influence regression app. This app allows you to explore how an influential point can impact model predictions.
Including Point A in the linear regression has a large impact on the prediction line (in red) and so we call this a highly influential data point. In contrast, the inclusion of Point B has little influence on the prediction line and so it is not influential.
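One common way to put a number on the difference between a point like A and a point like B is Cook's distance, which combines how unusual a point's X value is (its leverage) with how far it sits from the fitted line (its residual). The sketch below is an assumption on my part: the post does not say which diagnostic the app uses, the data are invented, and the function name `cooks_distances` is hypothetical.

```python
def cooks_distances(xs, ys):
    """Return (leverages, Cook's distances) for a straight-line fit."""
    n, p = len(xs), 2  # p = number of fitted parameters (intercept, slope)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance

    # Leverage grows with squared distance from the mean of X.
    leverages = [1 / n + (x - x_mean) ** 2 / sxx for x in xs]

    # Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    cooks = [
        e ** 2 / (p * s2) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]
    return leverages, cooks


# Hypothetical data: "B" sits in the middle of the cloud, "A" far to the right.
xs = [20, 40, 60, 80, 50, 200]
ys = [40, 50, 60, 70, 58, 180]
labels = ["", "", "", "", "B", "A"]

leverages, cooks = cooks_distances(xs, ys)
for label, x, h, d in zip(labels, xs, leverages, cooks):
    print(f"{label:>1} x={x:>3}  leverage={h:.3f}  cooks_d={d:.3f}")
```

A frequently quoted rule of thumb flags points with a Cook's distance above 1 for a closer look. In this toy dataset only Point A exceeds it, while Point B scores close to zero.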
So what can we do about influence in water decisions?
At the moment we rarely think about influential data in water decisions, but the implications of including influential data could be huge. The examples above should be enough to alarm most water decision makers.
For example: What if the ‘great flood of 1970’ never happened? If this was a 1 in 100 year event, then erasing it from the record will cause the next largest event (i.e. the ‘not so great flood of 1985’) to step up, impacting our ability to predict ‘great flood events’.
What happens if we include an influential data point that is erroneous?
If we had had more flood events in our historical data, might we have been able to better predict the 2010-11 Queensland floods?
We applied influence diagnostics to a hydrologically orientated case study and found that a single point could change mean and maximum streamflow predictions by 7% and 9%, respectively, for a rating curve model, and by 13% and 25% for a hydrological model.
So next time you are preparing data as input to a model that is designing water or flood infrastructure, or making a water resource decision, you may want to think about assessing influence to better understand how each data point impacts your decisions.
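A simple way to start assessing influence is a leave-one-out check: remove each data point in turn, refit the model, and record how much the prediction you care about changes. The sketch below does this for a straight-line model on invented data. It is only an illustration under those assumptions, not the diagnostics used in the case study, and the names `predict` and `leave_one_out_changes` are hypothetical.

```python
def predict(xs, ys, x_new):
    """Fit y = intercept + slope * x by least squares and predict at x_new."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean + slope * (x_new - x_mean)


def leave_one_out_changes(xs, ys, x_new):
    """Percent change in the prediction at x_new when each point is dropped."""
    base = predict(xs, ys, x_new)
    changes = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]   # all points except point i
        ys_i = ys[:i] + ys[i + 1:]
        changes.append(100 * (predict(xs_i, ys_i, x_new) - base) / base)
    return changes


# Hypothetical data (same made-up style as a small field experiment).
crop_yield = [20, 40, 60, 80, 200]
water_use = [40, 50, 60, 70, 180]

changes = leave_one_out_changes(crop_yield, water_use, x_new=300)

# Index of the point whose removal moves the prediction the most.
most_influential = max(range(len(changes)), key=lambda i: abs(changes[i]))

for x, c in zip(crop_yield, changes):
    print(f"drop yield={x:>3}: prediction changes by {c:+.1f}%")
print("most influential point index:", most_influential)
```

The same loop works for any model you can refit cheaply; for expensive hydrological models, the refitting step is the part that would need a more efficient diagnostic.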
David Wright is a PhD student in the Intelligent Water Decisions research group.