Monday, May 6, 2013

A Risk of Big Data: Armchair Statisticians

In the mid-1980s, laser printers became affordable, word processor software became more capable, and many people found that they were able to publish their own documents. They proceeded to do so. Some showed restraint in the use of fonts; others created documents that were garish.

In the mid-1990s, web pages became affordable, web page design software became more capable, and many people found that they were able to create their own web sites. They proceeded to do so. Some showed restraint in the use of fonts, colors, and the blink tag; others created web sites that were hideous.

In the mid-2010s, storage became cheap, data became collectable, analysis tools became capable, and I suspect many people will find that they are able to collect and analyze large quantities of data. I further predict that many will do so. Some will show restraint in their analyses; others will collect some (almost) random data and create results that are less than correct.

The biggest risk of Big Data may be the amateur. Professional statisticians understand the data, understand the methods used to analyze the data, and understand the limits of those analyses. Armchair statisticians know enough to analyze the data but not enough to criticize the analysis. This is a problem because it is easy to misinterpret the results.

Typical errors are:

  • Omitting relevant data (or including irrelevant data) due to incorrect "select" operations.
  • Identifying correlation as causation. (In an economic downturn, the unemployment rate increases, as do the payments for unemployment insurance. But the UI payments do not cause the unemployment rate; both are driven by the economy.)
  • Identifying the reverse of a causal relationship. (Umbrellas do not cause rain.)
  • Improper summary operations (Such as calculating an average of a quantized value like processor speed. You most likely want either the median or the mode.)
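
The last error is easy to show with a few lines of code. What follows is a minimal sketch, not a recipe: the clock speeds are made-up values chosen for illustration, and the names speeds_ghz and summary are my own. It uses only Python's standard statistics module.

```python
from statistics import mean, median, mode

# Hypothetical clock speeds in GHz (invented for illustration).
# Processors ship at a handful of fixed speeds, so the value is quantized.
speeds_ghz = [2.4, 2.4, 2.4, 3.0, 3.0, 3.6, 3.6, 3.6, 3.6]

summary = {
    "mean": mean(speeds_ghz),      # arithmetic average; need not be a real speed
    "median": median(speeds_ghz),  # middle value of the sorted list
    "mode": mode(speeds_ghz),      # most common value
}

for name, value in summary.items():
    # Flag whether each summary is a speed that actually occurs in the data.
    print(f"{name}: {value:.2f} GHz (occurs in data: {value in speeds_ghz})")
```

With this sample, the mean lands between the available speeds and describes a processor that nobody owns, while the median and the mode each name a speed that really appears in the data. All three numbers would look equally respectable on a chart.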

It is easy to make these errors, which is why the professionals take such pains to evaluate their work. Note that none of these errors is obvious in the results.

When the cost of performing these analyses was high, only the professionals could play. The cost of such analyses is dropping, which means that amateurs can play. And their results will look (at first glance) just as pretty as the professionals'.

In desktop publishing and web page design, it was easy to separate the professionals and the amateurs. The visual aspects of the finished product were obvious.

With big data, it is hard to separate the two. The visual aspects of the final product do not show the workmanship of the analysis. (They show the workmanship of the presentation tool.)

Be prepared for the coming flood of presentations. And be prepared to ask some hard questions about the data and the analyses. It is the only way you will be able to tell the wheat from the chaff.
