The Outlier Paradox and Global Warming
© Ken Osborn
Hot winters ahead? I can’t say for sure, but this last January seemed a lot warmer than a couple of years ago, when the winter was mild. I checked the records for the city of Oakland (airport), and the daily highs for Jan 2012 averaged 2.3 degrees Fahrenheit (dF) warmer than Jan 2010. That doesn’t seem like a lot. But when I compared the warmest days in 2010 with those in 2012, I saw that my impression of a warmer year was correct. In 2010 two of the Jan daily high temperatures exceeded 60 dF; in 2012 there were 7 days in Jan exceeding 60 dF.
So my impression that this winter was warmer was based not on the average of the daily high temperatures but rather on the number of warmer than typical days. This I call the Outlier Paradox: an increasing mean accelerates extremes. Small changes in a data set's average are associated with large changes in the frequency of values exceeding some threshold value. These extreme values are also referred to as outliers. This is a statistical property of numerical distributions, as I will try to explain.
Measurements repeatedly performed on any item and collected into a data set are frequently randomly distributed. Given a large enough data set, if the individual results scatter randomly around a central value, the shape of the plotted numbers will tend to be symmetrical: values exceeding the average will balance nicely against values below the average in a mirror image fashion. This is called a Normal distribution. So are temperature records Normally distributed?
Chart 1 is a plot of the Jan 2012 daily high temperature readings for the City of Oakland. The data are ranked from low to high and plotted against the probability that a given measurement will exceed all other measurements in the set. Except for the four highest values, the plotted temperature readings (red) nicely fit the curve for a Normal distribution (blue). The curve is symmetrical around the center with roughly the same number of readings on either side. Measurements close to the center have a higher probability and measurements far from the center a lower probability. This is visualized by the flattening of the curve at the extreme ends.
Chart 1: Testing temperature readings for Normality
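The ranking and plotting steps behind a chart like Chart 1 can be sketched in a few lines of Python. This is only an illustration: the readings are hypothetical values, not the Oakland airport record, and the plotting position (rank - 0.5) / n is one common convention that I am assuming here; the chart itself may have used another.

```python
from statistics import NormalDist, mean, stdev

def normal_plot_points(readings):
    """Return (observed value, cumulative probability, fitted Normal value) per ranked reading."""
    ranked = sorted(readings)  # rank the data from low to high
    n = len(ranked)
    # Fit a Normal curve using the sample mean and sample standard deviation.
    fitted = NormalDist(mean(ranked), stdev(ranked))
    points = []
    for rank, value in enumerate(ranked, start=1):
        # Assumed plotting position: keeps the probability strictly between 0 and 1.
        p = (rank - 0.5) / n
        points.append((value, p, fitted.inv_cdf(p)))
    return points

# Hypothetical daily highs in dF, NOT the actual Oakland readings.
sample_highs = [52, 54, 55, 55, 56, 57, 57, 58, 58, 59, 60, 61, 63, 66, 68]

for observed, p, expected in normal_plot_points(sample_highs):
    print(f"observed {observed:5.1f}  p = {p:.3f}  Normal fit {expected:5.1f}")
```

Where the observed and fitted columns track each other closely, the readings follow the Normal curve; where they pull apart at the top of the list, the highest values are behaving like the outliers discussed above.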
An interesting property of Normal distributions is that even large changes in the extreme values have only a small effect on the average. When examining changes over time, averages alone may be a poor predictor of environmental impacts if the outliers are ignored.
As an example, see Chart 2. Data set 1 represents a collection of measurements with an average of 100 and a standard deviation of 10 (10% of the average). Approximately 4 out of 1000 measurements will exceed a threshold of 125. Change the average to 105 and the number of measurements exceeding 125 increases to 16 out of 1000. A 5% change in the average becomes a fourfold change in the number of values exceeding a threshold just 25% above the original average (16/4 = 4, or 400% of the original count). An increase in the standard deviation will magnify the spread of extreme values even more.
Chart 2: A small change in the mean is associated with large changes in the frequency of outliers
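The same comparison can be checked directly against the Normal curve. The sketch below assumes the standard deviation stays at 10 units when the average moves from 100 to 105; the function name is mine, chosen for illustration. The exact tail areas come out somewhat higher than the counts read from the chart (about 6 and 23 per 1000), but the ratio between them is similar, a little under fourfold.

```python
from statistics import NormalDist

def exceedance_per_1000(mean, sd, threshold):
    """Expected number of readings per 1000 that exceed the threshold under a Normal model."""
    return 1000 * (1 - NormalDist(mean, sd).cdf(threshold))

# Assumption: the standard deviation is held at 10 while the average shifts.
before = exceedance_per_1000(100, 10, 125)
after = exceedance_per_1000(105, 10, 125)

print(f"average 100: {before:.1f} per 1000 above 125")
print(f"average 105: {after:.1f} per 1000 above 125")
print(f"ratio: {after / before:.1f}x")
```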
This property holds for any set of measurements taken over time when the distribution is Normal. Records of rainfall, temperature, atmospheric pressure, and number of cars per hour passing a given point on the freeway can often be treated as approximately Normal distributions. Even though there are causal factors associated with each of these, any given measurement is randomly determined relative to the measurements that immediately precede or follow. Though we know it may rain tomorrow, the exact number of inches of rain that will fall is an unknown until after the event.
Returning to the records for the City of Oakland for Jan of 2010 and 2012, let’s take a statistical look at the distribution of daily high readings (Chart 3). The average high temperature for Jan 2010 was 55.7 dF and the variability, as measured by the standard deviation, was 2.6 dF. For 2012 the Jan average high temperature was 58 dF and the standard deviation 4.9 dF. Using the lower 2010 standard deviation for both years, the fitted Normal curves predict 5% of days exceeding 62 dF for 2012 and 0.8% for 2010.
Chart 3: Comparison of Jan High Temperatures for 2010 and 2012 using 2010 Variance
Chart 3 demonstrates the effect of changing the mean of a distribution of data but not the standard deviation. The curve shifts horizontally to the right for an increase in the mean with each individual point moving the same amount so that the two curves are parallel to each other.
Of course the standard deviation was not the same for the two years. When that change is considered (Chart 4), the difference is even greater: the fitted curves predict 21.8% of the days exceeding 62 dF for 2012, compared with 0.8% for 2010.
Chart 4: Comparison of Jan High Temperatures for 2010 and 2012
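For readers who want to reproduce the comparison, here is a minimal sketch using only the summary figures quoted in the text. Those means and standard deviations are rounded, so the output should be expected to land near, rather than exactly on, the percentages read from the fitted curves. The function name and the layout of the cases are mine.

```python
from statistics import NormalDist

THRESHOLD = 62.0  # dF, the cutoff used in the text

def share_above(mean, sd, threshold=THRESHOLD):
    """Fraction of days that a Normal(mean, sd) model places above the threshold."""
    return 1 - NormalDist(mean, sd).cdf(threshold)

# Inputs are the rounded summary statistics from the article, not the raw daily readings.
cases = {
    "Jan 2010 (mean 55.7, sd 2.6)": share_above(55.7, 2.6),
    "Jan 2012 with the 2010 sd (mean 58.0, sd 2.6)": share_above(58.0, 2.6),
    "Jan 2012 with its own sd (mean 58.0, sd 4.9)": share_above(58.0, 4.9),
}

for label, share in cases.items():
    print(f"{label}: {100 * share:.1f}% of days above {THRESHOLD:.0f} dF")
```

The first two cases correspond to Chart 3, where only the mean changes. The third corresponds to Chart 4, where the larger 2012 standard deviation pushes far more of the curve past the threshold.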
When both the standard deviation and mean are changed, the curve not only shifts laterally but also rotates. Chart 5, using a hypothetical set of temperatures, demonstrates what happens when the mean is fixed but the standard deviation changes. Here the curve rotates around the center but the center does not move horizontally. Thus if two sets of temperature records (or any measurement records) have the same average but different standard deviations, the set with the higher standard deviation will have more extreme values at both the high and low temperatures.
Chart 5: Changing only the standard deviation rotates the curve around the mid-point
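The effect of changing only the standard deviation can be illustrated with a short calculation. All the numbers here are hypothetical and chosen by me for the sketch: a fixed mean of 58 dF, standard deviations of 3 and 5 dF, and thresholds 8 dF either side of the mean.

```python
from statistics import NormalDist

def tail_shares(mean, sd, low, high):
    """Return (fraction of readings below low, fraction of readings above high)."""
    dist = NormalDist(mean, sd)
    return dist.cdf(low), 1 - dist.cdf(high)

# Hypothetical values: same mean, two different standard deviations.
MEAN = 58.0
narrow_below, narrow_above = tail_shares(MEAN, 3.0, low=50.0, high=66.0)
wide_below, wide_above = tail_shares(MEAN, 5.0, low=50.0, high=66.0)

print(f"sd 3: {100 * narrow_below:.2f}% below 50 dF, {100 * narrow_above:.2f}% above 66 dF")
print(f"sd 5: {100 * wide_below:.2f}% below 50 dF, {100 * wide_above:.2f}% above 66 dF")
```

Because the thresholds sit the same distance from the mean, the two tails match within each data set, and both tails grow together when the standard deviation increases.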
The extra warm days of Jan 2012 should not be taken in isolation to determine whether global warming is a reality. These data represent a narrow temporal and spatial snapshot. Next winter may bring even warmer winter days or it could bring winter lows that are the lowest of the decade. While we might take note of unusually extreme temperatures, it is the preponderance of data that must answer the question of whether an apparent trend is merely a statistical excursion or a real trend.