Random Events, Trends, and Hot Temperatures
(c) Ken Osborn Jan 17, 2015
Today's papers (Jan 17, 2015) say 2014 was the hottest year on record since 1880. An engineering friend says the trend may be just a random thing. Could that be? Though the data do show a reasonable closeness to an independently produced set of random values with the same average and variability (standard deviation), I decided to test it by comparing the first half of 1880-2014 to the second half.
If the data are random, dividing them into two halves and ordering the values from lowest to highest within each half would produce two sets that are indistinguishable when superimposed. If the data are part of a trend of increasing temperatures, the sets would not superimpose and the second set would be shifted to the right on the chart. So which is it? Check out the graphs.
[Data source: http://data.giss.nasa.gov/]
First, let me state that I do understand that random events can generate what appears to be a trend, as in this example:
Chart 1: Example of a 'trend' from a random generation of values
The Random Walk chart displays what appears to be a strong trend with a correlation coefficient of 0.96, sufficient to earn an award for superior performance by an inebriated soul staggering along a mostly straight path. I should note that it took over 50 runs of the program to get this result, and I doubt I could reproduce this particular set again.
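For readers who want to try this themselves, here is a minimal sketch of the idea in Python. It is not the program used for chart 1: the step distribution (Gaussian), the walk length (100 steps), the number of runs (50), and all function names are assumptions made for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(n_steps, rng):
    """Cumulative sum of random steps (Gaussian steps are an assumption)."""
    position = 0.0
    path = []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path


def strongest_apparent_trend(n_runs=50, n_steps=100, seed=0):
    """Run several walks and return the correlation of largest magnitude."""
    rng = random.Random(seed)
    steps = list(range(1, n_steps + 1))
    best = 0.0
    for _ in range(n_runs):
        r = pearson_r(steps, random_walk(n_steps, rng))
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print("Strongest apparent trend (r):", round(strongest_apparent_trend(), 2))
```

Keeping only the best of many runs is the point: a convincing 'trend' shows up now and then by chance, but most individual runs look far less tidy.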
But if we can get a randomly generated collection of results to look like a trend, how do we know that our trend was not generated randomly?
One way, of course, would be to rerun the events and see if we get the same trend. However, since I'm looking at data collected from 1880 to 2014, that is not a realistic approach. Another way would be to take the data, split it into two halves, and compare the two halves. If the data are a random distribution around a central value, the first half of the data should match the second half of the data. But first, let's try that with some randomly generated data.
Chart 2: Plot of 100 random values generated by a Monte Carlo simulation
In chart 2, 100 random values ordered from low to high are plotted against their rank (probability). Half of the values exceed the average of zero and half fall below it, and they are symmetrically distributed around the center. In other words, these data exhibit a nice Normal-type distribution. Note that the average and standard deviation for these data are the same as those of the temperature data to follow.
So now if those data are split into two halves and each half is ordered independently from lowest value to highest value and plotted against its rank, what do we get?
Chart 3: Comparison of two halves of a randomly generated set of values after ranking each half
The values in blue represent the first half of 100 random values and the ones in red the second half. Each half was then ranked from lowest to highest value and plotted against the ranking (1 to 50). The superimposition is not exact, but these are randomly generated results so one should not expect an exact agreement between individual values. But maybe a real trend would also show something like this. Let's try it.
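The split-and-rank step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind chart 3: the values are drawn with mean 0 and standard deviation 1 as placeholders (the article used the mean and standard deviation of the temperature data), and the function names are hypothetical.

```python
import random


def ranked_halves(values):
    """Split a sequence at its midpoint and sort each half independently."""
    mid = len(values) // 2
    return sorted(values[:mid]), sorted(values[mid:])


def mean_rank_gap(first, second):
    """Average of (second - first) at matching ranks; near zero if the halves agree."""
    n = min(len(first), len(second))
    return sum(second[i] - first[i] for i in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(2015)  # fixed seed so the run is repeatable
    # Placeholder mean and standard deviation (assumed, not the article's values)
    values = [rng.gauss(0.0, 1.0) for _ in range(100)]
    first, second = ranked_halves(values)
    print("Values per half:", len(first), len(second))
    print("Mean gap between matching ranks:", round(mean_rank_gap(first, second), 3))
```

For random data the gap at matching ranks should be small compared with the spread of the values, which is what the near-superimposition in chart 3 shows visually.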
Chart 4: A trend (Y = 10X + 5) of 100 values partitioned into two ordered halves
The trended data set of 100 values was generated from the formula Y = 10X + 5. The set was divided into two halves, and each half was ordered from lowest value to highest value and then plotted against its rank from 1 to 50. Unlike the conformable data sets in chart 3, these two sets show no overlap at all. So how will this work with real data?
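The same test on a pure trend is easy to check by hand or in code. The sketch below assumes X runs from 1 to 100, which the text does not state; the function names are hypothetical.

```python
def trend_values(n=100, slope=10, intercept=5):
    """Y = slope * X + intercept for X = 1..n (the range of X is an assumption)."""
    return [slope * x + intercept for x in range(1, n + 1)]


def ranked_halves(values):
    """Split a sequence at its midpoint and sort each half independently."""
    mid = len(values) // 2
    return sorted(values[:mid]), sorted(values[mid:])


def halves_overlap(first, second):
    """True if the value ranges of the two halves intersect."""
    return min(first) <= max(second) and min(second) <= max(first)


first, second = ranked_halves(trend_values())
print(first[-1], second[0])           # 505 515
print(halves_overlap(first, second))  # False
```

Under that assumption, the largest value in the first half (505) is below the smallest value in the second half (515), so the two ranked curves cannot touch anywhere.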
Chart 5: Comparing the ranked temperature anomalies from 1880 to 2014 with a random data set (Source: http://data.giss.nasa.gov/)
The values in green in chart 5 represent the temperature anomalies from 1880 to 2014 ranked from lowest to highest value. Each value represents the deviation from the average for the 20th century. The values in blue were randomly generated using the mean and standard deviation from the temperature anomaly set. They do look as close as two separate runs of a random number generator. But remember, the real test is to see if the first half of the data (1880 to 1946) matches the second half (1947 to 2014). Any guesses?
Chart 6: Comparison of two halves of the 1880-2014 temperature data anomalies
In chart 6, the values in red are for the years 1880 to 1946 and the values in green for 1947 to 2014. Each set is ordered from its lowest to highest value and plotted against its rank. They do not match and are clearly two separate distributions. I leave the conclusion to you as to whether these data have been generated by random events.
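Anyone who wants a number to go with the visual comparison could try something like the sketch below. It swaps in a technique not used in this article, the two-sample Kolmogorov-Smirnov statistic, which is the largest vertical gap between the two ranked (cumulative) curves: 0 means the halves are identical and 1 means they do not overlap at all. The `records` input, a list of (year, anomaly) pairs that you would build yourself from the GISS download, and the function names are assumptions; no temperature values are included here.

```python
from bisect import bisect_right


def split_by_year(records, last_year_of_first_half=1946):
    """Split (year, anomaly) pairs at a cutoff year and sort each half by value."""
    first = sorted(a for year, a in records if year <= last_year_of_first_half)
    second = sorted(a for year, a in records if year > last_year_of_first_half)
    return first, second


def ks_statistic(first, second):
    """Largest gap between the empirical cumulative distributions.

    Both inputs must already be sorted in ascending order.
    """
    largest = 0.0
    for value in first + second:
        cdf_first = bisect_right(first, value) / len(first)
        cdf_second = bisect_right(second, value) / len(second)
        largest = max(largest, abs(cdf_first - cdf_second))
    return largest


# Usage, once `records` holds the real (year, anomaly) pairs:
# first, second = split_by_year(records)
# print(ks_statistic(first, second))
```

Running the same function on the two halves of a randomly generated set gives a baseline for how large the statistic gets by chance alone.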