
Understanding Statistical Significance: Clearing Up Common Misconceptions


Statistical significance is a concept that has been repeatedly misinterpreted, particularly in the social sciences. This article aims to illuminate some of the key misconceptions surrounding this idea and its practical implications.

We will begin with a look at our intuitive grasp of statistical significance, followed by a deeper exploration of its scientific underpinnings. Finally, we will address the main issues currently associated with the use of statistical significance and suggest possible solutions. Let’s get started.

Intuitive Understanding of Statistical Significance

People often rely on intuitive statistical reasoning in their daily lives. We observe events, form hypotheses, and eventually accept them as truths based on our intuitive models, even if we don’t use technical terms like “statistical” or “significance.” Consider this scenario:

Imagine you’re driving through town and stop at a traffic light near the square. You let out a sigh of exasperation: in your experience, catching this light red means facing a series of red lights ahead. You wish you had caught a green light instead.

In this case, you have created a mental database from your past experiences and developed an “If-Then” hypothesis. As you gather more experiences (red and green lights), you confirm your initial hypothesis as a truth. You won’t reconsider this “truth” unless a significant number of experiences contradict your original hypothesis.

In this intuitive model, any data point that confirms your hypothesis adds to its perceived statistical significance. Confirmation is what makes the belief feel well-founded.

However, there are significant issues at play. One major philosophical question I have discussed elsewhere involves the raven paradox. Additionally, the scientific interpretation of statistical significance often differs sharply from our intuitive understanding.

Scientific Statistical Significance

When it comes to scientific methods, intuition frequently falls short. One of the trailblazers of modern statistical methodology was Ronald Aylmer Fisher. His method, null-hypothesis significance testing, can be summarized in the following steps (a short code sketch follows the list):

  1. Establish the null hypothesis.
  2. Conduct an experiment and collect data.
  3. Compute the probability of observing results at least as extreme as those obtained, assuming the null hypothesis holds true. This probability is known as the p-value.
  4. If the p-value falls below a predetermined threshold (Fisher favored 5%), the results are considered statistically significant, leading to the rejection of the null hypothesis. A higher p-value means the data give no grounds to reject the null hypothesis (which is not the same as proving it true).
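
Here is a minimal sketch of these four steps in Python, assuming scipy is available. The experiment itself (61 heads in 100 coin flips) is a made-up stand-in for any null-hypothesis test:

```python
# A minimal sketch of the four steps above, assuming scipy is installed.
# The data (61 heads in 100 flips) are invented for illustration.
from scipy.stats import binomtest

# Step 1: the null hypothesis -- the coin is fair, P(heads) = 0.5.
# Step 2: run the experiment and collect data.
heads, flips = 61, 100

# Step 3: probability of a result at least this extreme, assuming
# the null hypothesis is true (the p-value).
result = binomtest(heads, flips, p=0.5, alternative="two-sided")

# Step 4: compare against a predetermined threshold (Fisher's 5%).
alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.4f}: reject the null hypothesis")
else:
    print(f"p = {result.pvalue:.4f}: the null hypothesis stands")
```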

The initial step involves framing the null hypothesis, which highlights a critical distinction between intuitive and scientific statistical significance. In our intuitive models, confirming events are marked as statistically significant.

In contrast, the scientific approach requires that the data be sufficiently inconsistent with the negation of the hypothesis (the null hypothesis) for it to be labeled statistically significant.

If this concept seems abstract, consider this example: Suppose I claim I can control the sun's rising. Using intuitive reasoning, if you observe the sun rising consistently, you might deem my hypothesis statistically significant.

However, employing the scientific method, the null hypothesis would state that I cannot influence the sun's rising. You could test this by asking me not to make the sun rise. If the sun rises anyway, the null hypothesis stands.

A More Practical Scenario

Let’s say you are testing a cream designed to protect individuals from radio waves, and you hypothesize that it has a genuine shielding effect.

The corresponding null hypothesis would be: the cream has no effect on people. To test this hypothesis, you conduct an experiment with 100 participants, splitting them into two groups—50 receive the cream and 50 receive a placebo. You then measure the radio wave-shielding effectiveness for both groups.

It is not enough for the cream group to show better shielding results than the placebo group on average; the difference must also be large enough that it would rarely arise by chance if the cream did nothing (this is precisely what the p-value measures). Only then can the results be deemed statistically significant, meaning the null hypothesis is rejected.

While I’ve simplified this example by not delving into natural shielding probabilities or assumed distributions, it suffices for the purpose of this discussion.
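
For the curious, one way to run this test is a permutation test. Below is a hedged sketch assuming numpy; the shielding scores are simulated, not real data, and the 50/50 split follows the example above. Under the null hypothesis the group labels are interchangeable, so we shuffle them and count how often chance alone produces a difference at least as large as the observed one:

```python
# A sketch of the cream experiment as a permutation test (numpy assumed).
# The shielding scores below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(42)
cream   = rng.normal(55.0, 10.0, size=50)   # hypothetical scores, 0-100 scale
placebo = rng.normal(50.0, 10.0, size=50)

observed = cream.mean() - placebo.mean()

# Null hypothesis: the cream has no effect, so group labels are
# interchangeable. Shuffle labels many times and count how often a
# difference at least as large as the observed one arises by chance.
pooled = np.concatenate([cream, placebo])
exceed, n_perm = 0, 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    if pooled[:50].mean() - pooled[50:].mean() >= observed:
        exceed += 1

p_value = exceed / n_perm  # one-sided p-value
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```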

Having covered Fisher's scientific method of null-hypothesis significance testing, let’s turn our attention to the challenges associated with this approach.

Challenges with Statistical Significance

Most challenges arise at what I refer to as interface points. A typical statistical workflow runs as follows: formulate the null hypothesis, design and run the experiment, apply the statistical test, and interpret the result.

The experimental and statistical steps in the middle are largely objective protocols. Complications arise at the two ends: when formulating the null hypothesis and when interpreting the results of statistical significance.

The First Interface Point

One major challenge stems from how we articulate the null hypothesis. We often overlook that the null hypothesis is, strictly speaking, nearly always false. In reality, the cream either enhances radio wave shielding or diminishes it, if only by an immeasurably small amount; an effect of exactly zero is all but impossible.

When effects are minimal, we often treat them as negligible. The way we formulate the null hypothesis represents the first interface point, where poor formulation leads to poor outcomes.
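
One consequence is worth seeing concretely: with a large enough sample, even a negligible effect clears the 5% bar. The sketch below assumes numpy and scipy, and the effect size (0.05 points on a 0-to-100 scale) is invented purely for illustration:

```python
# Why "the null hypothesis is nearly always false" matters: with enough
# data, even a negligible effect becomes statistically significant.
# The tiny effect below is invented for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n = 1_000_000                                # a very large sample per group
control   = rng.normal(50.00, 5.0, size=n)
treatment = rng.normal(50.05, 5.0, size=n)   # a practically negligible shift

stat, p = ttest_ind(treatment, control)
print(f"p-value = {p:.2e}")  # far below 0.05, despite a trivial effect
```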

The Second Interface Point

The second interface point occurs during the interpretation of statistical results. In our example, statistical significance (in the scientific sense) indicates only that the data are inconsistent with the cream having no effect. Contrary to intuition, it does not denote the “significance” or “importance” of the results.

It is our responsibility to draw conclusions based on these “statistically significant” results, often using subjective language. If our interpretations are misleading, the scientific concept of “statistical significance” becomes equally misleading.

For instance, if I claim that the cream enhances shielding by a factor of ten, it sounds impressive. However, if I reveal that the baseline shielding without the cream was 0.0000000000001 (on a scale where 100 is the maximum and 0 the minimum), then the cream’s shielding is only 0.000000000001—not as compelling!
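
To make the arithmetic explicit, here is a tiny sketch that separates the relative improvement from the absolute one, using the made-up baseline from the paragraph above:

```python
# Statistical significance says nothing about practical size. A tiny
# sketch separating relative from absolute improvement, using the
# invented baseline from the text (scale: 0 to 100).
baseline   = 1e-13            # shielding without the cream
with_cream = 10 * baseline    # a "tenfold" improvement

print(f"relative improvement: {with_cream / baseline:.0f}x")
print(f"absolute shielding with the cream: {with_cream:.0e} out of 100")
# Ten times a minuscule number is still a minuscule number; report
# the absolute effect size alongside the p-value.
```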

Addressing the Issues with Statistical Significance

To begin tackling the challenges outlined, we must first focus on the interface points. There is no single solution to these problems.

When confronted with “statistically significant” results, we must critically evaluate the formulation of the null hypothesis and the linguistic interpretation of “statistical significance”.

While there can be flaws in the statistical methods themselves, the interface points pose a more pressing issue—they play with human emotions. A dramatic interpretation could provoke an unwarranted reaction from people (ten times a minuscule number remains a minuscule number).

In conclusion, the term “statistical significance” conveys nothing about the quantitative significance of the results. It merely indicates that we are detecting something worthy of our attention.

Reference and credit: Jordan Ellenberg.

Further reading: Consider exploring How to Perfectly Predict Improbable Events? and The New Industrial Revolution Is Here.

If you wish to support my work as an author, please consider contributing on Patreon.