Welcome to Dijemeric Visualizations

Where photography and mathematics intersect with some photography, some math, some math of photography, and an occasional tutorial.


Sunday, February 26, 2012

The Outlier Paradox and Global Warming

© Ken Osborn
Feb 2012

Hot winters ahead?  I can't say for sure, but this last January seemed a lot warmer than a couple of years ago, when the winter was mild.  I checked the records for the city of Oakland (airport): the daily highs for Jan 2012 averaged 2.3 degrees Fahrenheit (dF) warmer than Jan 2010.  That doesn't seem like a lot.  But when I compared the warmest days of the two years, my impression that this year was warmer held up: in Jan 2010 only two daily highs exceeded 60 dF, while Jan 2012 had seven.

So my impression that this winter was warmer was based not on the average of the daily high temperatures but on the number of warmer-than-typical days.  This I call the Outlier Paradox: increasing means accelerate extremes.  A small change in a data set's average is associated with a large change in the frequency of values exceeding some threshold; these extreme values are also referred to as outliers.  This is a statistical property of numerical distributions, as I will try to explain.

Repeated measurements of the same quantity, collected into a data set, are frequently randomly distributed.  Given a large enough data set, if the individual results are randomly distributed, the plotted values will be symmetrical: values above the average will balance nicely against values below it in mirror-image fashion.  This is called a Normal distribution.  So are temperature records Normally distributed?

Chart 1 is a plot of the Jan 2012 daily high temperature readings for the City of Oakland.  The data are ranked from low to high and plotted against each reading's cumulative probability — the estimated chance that a measurement drawn from the set will fall below it.  Except for the four highest values, the plotted temperature readings (red) nicely fit the curve for a Normal distribution (blue).  The curve is symmetrical around the center, with roughly the same number of readings on either side.  Measurements close to the center have a higher probability and measurements far from the center a lower probability; this is visualized by the flattening of the curve at the extreme ends.
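The ranking-and-plotting step behind a chart like Chart 1 can be sketched in Python with the standard library's NormalDist.  The temperatures below are made up for illustration — they are not the actual Oakland record:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical January daily highs (dF) -- illustrative, not the Oakland record
temps = [52, 54, 55, 55, 56, 57, 57, 58, 58, 59, 60, 61, 63, 66]

# Rank from low to high and assign each reading a plotting position:
# the estimated probability that a random reading falls below it.
temps.sort()
n = len(temps)
positions = [(i + 0.5) / n for i in range(n)]  # simple midpoint plotting position

# The fitted Normal curve: for each plotting position, the temperature a
# Normal distribution with the sample mean and sd would predict.
fit = NormalDist(mean(temps), stdev(temps))
predicted = [fit.inv_cdf(p) for p in positions]

for t, p, q in zip(temps, positions, predicted):
    print(f"observed {t:5.1f}  cumulative prob {p:.3f}  Normal predicts {q:5.1f}")
```

Where the observed and predicted columns track each other closely, the data behave Normally; where they diverge (as at the four highest readings in Chart 1), the tail is heavier or lighter than a Normal curve predicts.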

Chart 1: Testing temperature readings for Normality

An interesting property of Normal distributions is that a large change in the frequency of extreme values corresponds to only a small change in the average.  When examining changes over time, the average alone may be a poor predictor of environmental impacts if the outliers are ignored.

As an example, see Chart 2.  Data set 1 represents a collection of measurements with an average of 100 and a standard deviation of 10%.  Approximately 4 out of 1000 measurements will exceed a threshold of 125.  Change the average to 105 and the number exceeding 125 increases to 16 out of 1000.  A 5% change in the average becomes a fourfold (16/4 = 400%) change in the count of values exceeding a threshold just 25% above the average.  An increase in the standard deviation would magnify the spread of extreme values even more.
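The Chart 2 arithmetic can be sketched with Python's statistics.NormalDist.  The exact per-1000 counts from a pure Normal model come out a bit different from the chart's 4 and 16 (likely rounding or fit details), but the multiplying effect is the same:

```python
from statistics import NormalDist

threshold = 125

# Data set 1: mean 100, standard deviation 10 (10% of the mean)
tail1 = 1 - NormalDist(100, 10).cdf(threshold)   # P(X > 125)

# Data set 2: same spread, mean shifted up 5% to 105
tail2 = 1 - NormalDist(105, 10).cdf(threshold)

print(f"per 1000: {tail1 * 1000:.1f} -> {tail2 * 1000:.1f}, "
      f"ratio {tail2 / tail1:.1f}x")
```

A 5% shift in the mean multiplies the tail count several times over — the Outlier Paradox in two lines of arithmetic.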

Chart 2: A small change in the mean is associated with large changes in the frequency of outliers

This property holds for any set of measurements taken over time when the distribution is Normal.  Records of rainfall, temperature, atmospheric pressure, and the number of cars per hour passing a point on the freeway can often be approximated by Normal distributions.  Even though there are causal factors behind each of these, any given measurement is effectively random relative to the measurements that immediately precede or follow it.  Though we know it may rain tomorrow, the exact number of inches that will fall is unknown until after the event.

Returning to the records for the City of Oakland for Jan 2010 and Jan 2012, let's take a statistical look at the distribution of daily highs (Chart 3).  The average high for Jan 2010 was 55.7 degrees Fahrenheit (dF), with a spread, as measured by the standard deviation, of 2.6 dF.  For Jan 2012 the average high was 58 dF and the standard deviation 4.9 dF.  Using the lower 2010 variance for both years, the fitted Normal curves predict 5% of days exceeding 62 dF for 2012 versus 0.8% for 2010.

Chart 3: Comparison of Jan High Temperatures for 2010 and 2012 using 2010 Variance

Chart 3 demonstrates the effect of changing the mean of a distribution of data but not the standard deviation.  The curve shifts horizontally to the right for an increase in the mean with each individual point moving the same amount so that the two curves are parallel to each other. 

Of course the variance was not the same for the two years, and when the change in variance is considered (Chart 4), the differences are even greater, with a prediction of 21.8% of days exceeding 62 dF for 2012 compared to the 0.8% for 2010.
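Both the Chart 3 and Chart 4 predictions can be reproduced, approximately, straight from a Normal model with the means and standard deviations quoted above.  Small differences from the charted percentages are likely rounding or curve-fitting details:

```python
from statistics import NormalDist

threshold = 62.0  # dF

# Jan 2010: mean 55.7 dF, sd 2.6 dF (figures from the text)
tail_2010 = 1 - NormalDist(55.7, 2.6).cdf(threshold)

# Jan 2012 with 2010's variance (Chart 3), and with its own variance (Chart 4)
tail_2012_shared = 1 - NormalDist(58.0, 2.6).cdf(threshold)
tail_2012_actual = 1 - NormalDist(58.0, 4.9).cdf(threshold)

print(f"2010: {tail_2010:.1%}, 2012 (2010 sd): {tail_2012_shared:.1%}, "
      f"2012 (own sd): {tail_2012_actual:.1%}")
```

The ordering is the point: shifting the mean alone multiplies the days above 62 dF several times over, and letting the standard deviation widen as well multiplies them again.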

Chart 4: Comparison of Jan High Temperatures for 2010 and 2012

When both the mean and the standard deviation change, the curve not only shifts laterally but also rotates.  Chart 5, using a hypothetical set of temperatures, demonstrates what happens when the mean is fixed but the standard deviation changes: the curve rotates around the center while the center itself does not move horizontally.  Thus if two sets of temperature records (or any measurement records) have the same average but different standard deviations, the set with the higher standard deviation will have more extreme values at both the high and low ends.
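The same-mean, different-spread case can be checked numerically.  The mean, spreads, and 52/64 dF thresholds below are hypothetical choices for illustration:

```python
from statistics import NormalDist

mean_t = 58.0  # same mean for both hypothetical record sets
narrow = NormalDist(mean_t, 2.6)
wide = NormalDist(mean_t, 4.9)

for d in (narrow, wide):
    hot = 1 - d.cdf(64.0)   # share of days above 64 dF
    cold = d.cdf(52.0)      # share of days below 52 dF
    print(f"sd {d.stdev}: hot tail {hot:.1%}, cold tail {cold:.1%}")
```

The wider distribution fattens both tails at once — more unusually hot days and more unusually cold ones, with the average unchanged.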

Chart 5:  Changing only the standard deviation rotates the curve around the mid-point

The extra warm days of Jan 2012 should not be taken in isolation to determine whether global warming is a reality.  These data represent a narrow temporal and spatial snapshot.  Next winter may bring even warmer days, or it could bring the lowest winter lows of the decade.  While we might take note of unusually extreme temperatures, it is the preponderance of data that must answer whether an apparent trend is a mere statistical excursion or the real thing.


Friday, February 10, 2012

Predator - Prey: Part III - The Interactive Model

Forage (aka carrying capacity) feeds the deer; deer feed the wolves; wolves keep the deer herd in balance with the forage.  It's a nice model, but things don't always work that smoothly.  Two previous posts discuss a mathematical model written in Excel that interactively explores some of the possible outcomes given initial conditions: deer herd size, number of wolves, intrinsic growth rates of deer and wolves, and the carrying capacity and its variance factor.  This last is a random element designed to incorporate the unpredictable effects of climatic variation, forage decline, or anything else that can change the amount of forage but cannot be predicted from one year to the next.

Previous posts on this topic are at:

Ready to try your hand at creating your own scenarios?  A simplified interactive model programmed in an Excel spreadsheet is available for the curious in Google Docs at

There are eight variables in the model, shown in the diagram below, that you can work with.

The population size of the deer herd (here 5000), the deer growth rate, the carrying capacity for the herd, the variance in carrying capacity (K), the wolf population (here 10), the wolf pack growth rate, the predator efficiency, and the number of deer required for wolf survival.  You can change any and all of these starting numbers to see what happens.  For example, to evaluate the effect of dramatic swings in climate, try changing the variance in K (now at 2000) to a higher or lower number.  A higher number represents greater unpredictability; a lower number, a more stable environment.
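Since the Excel formulas themselves aren't reproduced here, the sketch below is only one plausible Python translation of those eight variables — logistic deer growth plus a simple predation term.  The spreadsheet's actual equations, and the growth-rate and efficiency values chosen here, may well differ:

```python
import random

# A hypothetical discrete-time translation of the eight-variable model.
# Parameter names follow the post; the numeric values are illustrative.
deer = 5000.0          # starting deer herd
wolves = 10.0          # starting wolf pack
deer_growth = 0.30     # intrinsic deer growth rate per year
wolf_growth = 0.20     # intrinsic wolf growth rate per year
K = 20000.0            # carrying capacity: deer the forage can support
K_variance = 2000.0    # random year-to-year swing in K
efficiency = 0.005     # per-wolf kill rate per available deer
deer_per_wolf = 15.0   # deer a wolf needs each year to survive

random.seed(1)
for year in range(1, 21):
    k_now = K + random.uniform(-K_variance, K_variance)  # this year's forage
    per_wolf = efficiency * deer        # deer killed per wolf this year
    kills = per_wolf * wolves           # total predation losses
    # Logistic growth toward this year's carrying capacity, minus predation
    deer += deer_growth * deer * (1 - deer / k_now) - kills
    # Wolves grow when kills per wolf exceed what each wolf needs, else shrink
    wolves += wolf_growth * wolves * (per_wolf / deer_per_wolf - 1)
    deer, wolves = max(deer, 0.0), max(wolves, 0.0)
    print(f"year {year:2d}: deer {deer:7.0f}, wolves {wolves:5.1f}")
```

Depending on the starting numbers, runs like this can settle into balance, oscillate, or crash one population — which is exactly the kind of exploration the spreadsheet invites.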

Have fun.