Sherlock Holmes and the nature of data(mining)

Holmes… thinking.

I spend many days sitting at my desk just thinking, reading, writing and then thinking some more about survey method, instruments, data and analysis. It’s all great fun, because while I’m comfortable with qualitative methods, I’m equally at home with quantitative ones.

But after a solid day of switching from one spreadsheet to another (fisher socioeconomics and mobility preferences of SE DC), my mind drifted off, and I found myself recalling quotes from Sir Arthur Conan Doyle’s Sherlock Holmes on the subject of data, research and hypothesizing.

Holmes rather ingeniously contradicts at least some of our ideas of scientific method and hypothesis testing. This is hardly just Holmes being fanciful; he actually does a rather good job of showing us why we need to be careful about putting too much stock in our brilliant hypotheses.

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
A Scandal in Bohemia

Our scientific method generally tells us that we should do the opposite. We should theorize and then design experiments or collect data to test our hypotheses. My master’s thesis, for example, examined a macroeconomic theory — of productivity conditioned upon physical and human capital — using socioeconomic census data of coastal fishers in India.

My own hypothesis, based on the political economy/ecology literature, was that the basic elements of the production function theory were inadequate to explain poverty. I tested the data and largely confirmed my hypothesis. Holmes, on multiple occasions, suggests we do the opposite.

It is a capital mistake to theorise in advance of the facts.
The Adventure of the Second Stain

It is a capital mistake to theorize before you have all the evidence. It biases the judgment.
A Study in Scarlet

It seems that Holmes would advocate the kind of research that is castigated by some scientists as “data mining” or “data dredging.” Described negatively, data dredging involves looking at a whole range of statistics and picking obscure ones to form a thin hypothesis about any observed patterns and relationships. I suspect that some folks who dismiss statistical analysis (“Anyone can say anything with statistics”) may be thinking of data mining. More generously described, however, data dredging is simply post-hoc analysis or looking at data after the fact for trends or patterns that were unknown or inconceivable prior to the experiment/data collection.

The critical view has some merit; the more one looks at the data, the more one finds connections that may have no logic or good theoretical basis; in short, one may find trends that just don’t make sense and may only be artifacts of the data rather than descriptions of reality.

However, hypothesis-experiment designs can be just as flawed. They rely on the researcher’s own judgment to get the conditions and variables of the experiment or observation right. One might incorrectly reject the null hypothesis (a Type I error) if, for example, the variables in the purported causal chain don’t actually relate but instead happen to proxy a real-but-untested relationship; at the same time, one might fail to reject a false null hypothesis (a Type II error) if the proper model isn’t specified in, say, a regression.
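The proxy pitfall can be sketched with a small simulation. Here a hypothetical predictor `x` has no causal effect on the outcome `y`; it merely tracks the real driver `z`. Regressed alone, `x` looks strongly related to `y`; once `z` enters the model, its coefficient collapses. (All names and numbers are illustrative, not drawn from my thesis data; this is just ordinary least squares via NumPy.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# z is the real-but-untested driver; x merely proxies z; y is caused by z alone
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)   # x correlates with z but plays no causal role
y = 2.0 * z + rng.normal(size=n)

def ols(y, *cols):
    """Least-squares coefficients for y ~ intercept + columns."""
    X = np.column_stack([np.ones_like(y)] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_naive = ols(y, x)    # x alone: a large, "significant"-looking slope
b_full = ols(y, x, z)  # x and z together: x's slope shrinks toward zero
```

Leaving `z` out invites a false positive on `x`; omitting needed structure is, conversely, how real effects get missed.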

Example from my own work: I’m interested in power, social organization and political economy (of natural resources); along those lines I read literature often originating from political economists and political ecologists. My thesis attempted to show that a supposedly apolitical macroeconomic hypothesis simply didn’t fit the facts when one dug (dredged?) a little deeper into the data. I included sociopolitical variables that began to control away the effects of the neoclassical macroeconomic predictors.

I had some theory to back me up, but at the outset I did not hypothesize the effect that geography would have on my model. Only when I also controlled for fixed geographic effects or removed geographic outliers did I really begin to see the macroeconomic model break apart. Another variable I found to matter greatly: the presence of a post office. This really starts to seem like data mining, but by looking deeper at the statistics, I could see that post offices proxy overall levels of development in a broader economy, so the variable actually made sense. That’s a bit of post-hoc analysis, but without the social, geographic and post-office variables, my research would have actually supported the overarching macroeconomic theory.
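A rough sketch of what controlling for fixed geographic effects can do to a pooled estimate (again illustrative Python, not my actual thesis model): here poverty is driven entirely by region, and a "macro" predictor that happens to track region looks important until region dummies are added.

```python
import numpy as np

rng = np.random.default_rng(1)
n_regions, per_region = 5, 400
region = np.repeat(np.arange(n_regions), per_region)

# Geography drives the outcome; the "macro" predictor merely tracks geography
region_effect = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])[region]
capital = region_effect + rng.normal(size=region.size)
poverty = region_effect + rng.normal(size=region.size)

def ols_slope(y, X):
    """First non-intercept coefficient of y ~ intercept + X."""
    X = np.column_stack([np.ones_like(y), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Pooled regression: capital appears to "explain" poverty
b_pooled = ols_slope(poverty, capital[:, None])

# With region fixed effects (dummy variables), the apparent effect evaporates
dummies = (region[:, None] == np.arange(1, n_regions)).astype(float)
b_fe = ols_slope(poverty, np.column_stack([capital[:, None], dummies]))
```

The same logic applies to a proxy like the post office variable: whether its coefficient survives added controls tells you whether it stands in for something real.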

What’s more: Even my best models didn’t explain two-thirds of the variation in my dependent variable (poverty). A first-order question: What other variables might the theory (macroeconomic or otherwise) be missing? One of the first steps toward answering it: looking closer at the data for unexpected interrelationships.

Says Holmes:

“Data! Data! Data! I can’t make bricks without clay.”
The Adventure of the Copper Beeches

Indeed, most of Holmes’ genius comes as Conan Doyle invents scenarios where the seemingly obvious hypothesis is wrong; only upon dredging up more data and observations does Holmes typically arrive at the correct conclusion.

And, in reality, hypotheses are rarely designed in a purely a priori fashion. In practice, we look at some data, consider some experience, examine the results of other research, design our hypothesis accordingly and go out and look at data. After a first-pass analysis, we may alter our thinking on the fly, which perhaps approaches data dredging but gets us closer to describing a real relationship or explaining a real trend.

My own statistics professor, for whom I have great respect, told me that looking deeply at the data wasn’t wrong — I do tend to nerd out on my spreadsheets — as long as I had good theoretical, logical (sensible) reasons for seeing relationships.

And now, back to work and survey method/instrument design.
