Random Events, Trends, and Hot Temperatures
(c) Ken Osborn, Jan 17, 2015
Today's papers (Jan 17, 2015) say 2014 was the hottest year on record since
1880. An engineering friend says the trend may be just a random thing. Could
that be? Though the data do show a reasonable closeness to an independently
produced set of random values with the same average and variability (standard
deviation), I decided to test it by comparing the first half of 1880-2014 to
the second half.
If the data are random, dividing them into two halves and ordering the values
from lowest to highest within each half would produce two sets that are
practically indistinguishable when superimposed. If the data are part of a
trend of increasing temperatures, the sets would not superimpose and the second
set would be shifted to the right on the chart. So which is it? Check out the
graphs.
First, let me state that I do understand that random events can generate what
appears to be a trend, as in this example:
Chart 1: Example of a 'trend' from a random generation of values
The Random Walk chart displays what appears to be a strong trend with a
correlation coefficient of 0.96, sufficient to earn an award for superior
performance by an inebriated soul staggering along a mostly straight path. I
should note that it took over 50 runs of the program to get this result, and I
doubt I could reproduce this particular set again.
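The hunt for a trend-like random walk can be sketched roughly as follows. This is only an illustration, not the program behind chart 1: the Gaussian step, the 100 steps per walk, the seed, and the helper names (pearson_r, random_walk, best_trend_lookalike) are all assumptions.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def random_walk(n_steps, rng):
    """Cumulative sum of independent Gaussian steps (assumed step model)."""
    position = 0.0
    path = []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path


def best_trend_lookalike(n_runs=50, n_steps=100, seed=0):
    """Run several walks and return the strongest correlation with step number."""
    rng = random.Random(seed)
    steps = list(range(1, n_steps + 1))
    best = 0.0
    for _ in range(n_runs):
        r = pearson_r(steps, random_walk(n_steps, rng))
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print(best_trend_lookalike())
```

Keeping only the best of many runs is exactly the kind of selection that makes a random result look like a trend, which is why the number of runs matters.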
But if we can get a randomly generated collection of results to look like a
trend, how do we know that our trend was not generated randomly?
One way, of course, would be to rerun the events and see if we get the same
trend. However, since I'm looking at data collected from 1880 to 2014, that is
not a realistic approach. Another way would be to take the data, split them
into two halves, and compare the two halves. If the data are a random
distribution around a central value, the first half of the data should match
the second half of the data. But first, let's try that with some randomly
generated data.
Chart 2: Plot of 100 random values generated by a Monte Carlo simulation
In chart 2, 100 random values ordered from low to high are plotted against
their rank (probability). Half of the values exceed the average of zero, half
fall below it, and they are symmetrically distributed around the center. In
other words, these data exhibit a nice Normal-type distribution. Note that the
average and standard deviation for these data are the same as those of the
temperature data to follow.
So now if those data are split into two halves and each half is ordered
independently from lowest value to highest value and plotted against its rank,
what do we get?
Chart 3: Comparison of two halves of a randomly generated set of values after
ranking each half
The values in blue represent the first half of 100 random values and the ones
in red the second half. Each half was then ranked from lowest to highest value
and plotted against the ranking (1 to 50). The superimposition is not exact,
but these are randomly generated results, so one should not expect an exact
agreement between individual values. But maybe a real trend would also show
something like this. Let's try it.
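The split-and-rank comparison for random values can be sketched as below. It is a minimal illustration, assuming Gaussian values with mean 0 and standard deviation 1 and an arbitrary seed; the helper names ranked_halves and max_rank_gap are hypothetical and are not taken from the program that produced chart 3.

```python
import random


def ranked_halves(values):
    """Split a series at its midpoint and sort each half independently."""
    mid = len(values) // 2
    return sorted(values[:mid]), sorted(values[mid:])


def max_rank_gap(first, second):
    """Largest absolute difference between values that share the same rank."""
    return max(abs(a - b) for a, b in zip(first, second))


if __name__ == "__main__":
    rng = random.Random(42)  # arbitrary seed, chosen only for repeatability
    values = [rng.gauss(0.0, 1.0) for _ in range(100)]  # assumed mean 0, sd 1
    first, second = ranked_halves(values)
    for rank, (a, b) in enumerate(zip(first, second), start=1):
        print(rank, round(a, 3), round(b, 3))
    print("largest gap at equal rank:", round(max_rank_gap(first, second), 3))
```

For random data the two columns should track each other loosely, with the largest gaps usually appearing at the extreme ranks, where single values have the most room to wander.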
Chart 4: A trend (Y = 10X + 5) of 100 values partitioned into two ordered halves
The trended data set of 100 values was generated from the formula Y = 10X + 5.
The set was divided into two halves, and each half was ordered from lowest
value to highest value, then plotted against its rank from 1 to 50. Unlike the
conformable data sets in chart 3, these two sets show no overlap at all. So how
will this work with real data?
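The same split applied to the trended set can be checked in a few lines. This sketch assumes X runs over the integers 1 to 100, which the text does not state; ranked_halves is a hypothetical helper name.

```python
def ranked_halves(values):
    """Split a series at its midpoint and sort each half independently."""
    mid = len(values) // 2
    return sorted(values[:mid]), sorted(values[mid:])


# Assumption: X takes the integer values 1..100.
xs = list(range(1, 101))
ys = [10 * x + 5 for x in xs]

first, second = ranked_halves(ys)

# The halves overlap only if the largest value of the first half
# reaches the smallest value of the second half.
halves_overlap = first[-1] >= second[0]
print(first[-1], second[0], halves_overlap)  # prints: 505 515 False
```

With a pure trend every value in the second half exceeds every value in the first, so the two ranked curves cannot superimpose anywhere.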
Chart 5: Comparing the ranked temperature anomalies from 1880 to 2014 with a
random data set (Source: http://data.giss.nasa.gov/)
The values in green in
chart 5 represent the temperature anomalies from 1880 to 2014 ranked from
lowest to highest value. Each value represents the deviation from the
average for the 20th century. The values in blue were randomly generated
using the mean and standard deviation from the temperature anomaly set.
They do look as close as two separate runs of a random number generator.
But remember, the real test is to see if the first half of the data (1880
to 1946) matches the second half (1947 to 2014). Any guesses?
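A random comparison set like the blue one in chart 5 could be produced along these lines. This is a sketch under assumptions: a Normal generator, the sample (n - 1) standard deviation, and the hypothetical function name matched_random_set. The short input list is made up purely to show the call and is not GISS data.

```python
import random
import statistics


def matched_random_set(observed, seed=None):
    """Draw as many Normal values as there are observations, using the
    observed mean and sample standard deviation, and return them ranked."""
    rng = random.Random(seed)
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # assumption: sample (n - 1) standard deviation
    return sorted(rng.gauss(mean, sd) for _ in observed)


if __name__ == "__main__":
    # Made-up placeholder numbers, NOT real temperature anomalies.
    placeholder = [-0.3, -0.1, 0.0, 0.2, 0.4, 0.1, -0.2, 0.3]
    simulated = matched_random_set(placeholder, seed=1)
    for obs, sim in zip(sorted(placeholder), simulated):
        print(f"{obs:6.2f} {sim:6.2f}")
```

The generated values are drawn from a distribution with the observed mean and standard deviation, so their own mean and standard deviation will be close to, but not exactly equal to, the targets.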
Chart 6: Comparison of two halves of the 1880-2014 temperature anomaly data
In chart 6, the values in red are for the years 1880 to 1946 and the values in
green for 1947 to 2014. Each set is ordered from its lowest to highest value
and plotted against the corresponding year. They do not match and are clearly
two separate distributions.
I leave the conclusion to you as to whether these data have been
generated by random events.
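For readers who would rather put a number on the comparison than judge it by eye, one simple summary is the share of second-half values that exceed the median of the first half. This is not the method used for the charts above, only a hedged sketch, and the function name share_above_first_half_median is hypothetical. A share near 0.5 is what two halves of a random series should give; a share near 1.0 points to an upward shift.

```python
import statistics


def share_above_first_half_median(series):
    """Fraction of the second half that lies above the first half's median.

    The series must be in time order. For 135 yearly values the split is
    67 and 68, which matches 1880-1946 and 1947-2014.
    """
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    cutoff = statistics.median(first)
    return sum(value > cutoff for value in second) / len(second)


if __name__ == "__main__":
    # Demonstrated on the Y = 10X + 5 trend with X assumed to be 1..100.
    trend = [10 * x + 5 for x in range(1, 101)]
    print(share_above_first_half_median(trend))  # prints: 1.0
```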