Should p-value thresholds be cut to raise data standards?

For decades, scientific studies have hinged on showing a p-value of less than 0.05 as evidence that a study readout is genuine—but calls are growing for a new approach.

Almost all published studies rely on that threshold for statistical significance but, according to Stanford University statistician John Ioannidis, M.D., D.Sc., “many of the claims that these reports highlight are likely false” and p-values are often “misinterpreted, overtrusted and misused.”

In fact, less-than-rigorous interpretations of studies that pass the p-value threshold could be a primary reason why even well-established studies across scientific disciplines are often hard to reproduce on retesting, he suggested in a Journal of the American Medical Association (JAMA) article.

“Multiple misinterpretations of P values exist, but the most common one is that they represent the ‘probability that the studied hypothesis is true,’” he wrote, adding that basing scientific conclusions or business and policy decisions on that interpretation is a minefield. Most claims that scrape under the 0.05 threshold are “probably false … i.e., the claimed associations and treatment effects do not exist [and] even among those claims that are true, few are worth acting on in medicine and health care.”
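
To see why that reading is a minefield, consider a rough back-of-the-envelope sketch (the prior, power and threshold figures below are illustrative assumptions, not numbers from the JAMA article): the chance that a "significant" finding is real depends heavily on how plausible the hypothesis was before the study and on the study's power, not on the p-value alone.

```python
# Illustrative arithmetic (assumed numbers, not from the JAMA article):
# the probability that a claim crossing the 0.05 threshold is actually true
# depends on how plausible the hypothesis was to begin with and on power,
# not on the p-value alone.

def positive_predictive_value(prior, power, alpha):
    """Share of threshold-passing claims that are genuinely true."""
    true_positives = power * prior          # real effects correctly detected
    false_positives = alpha * (1 - prior)   # null effects that slip under the threshold
    return true_positives / (true_positives + false_positives)

# If only 1 in 10 tested hypotheses is a real effect and studies have 80% power,
# a "significant" result at alpha = 0.05 is true only about 64% of the time,
# far from the near-certainty the common misreading suggests.
print(positive_predictive_value(prior=0.10, power=0.80, alpha=0.05))  # ~0.64
```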

How to fix the problem remains a contentious issue, however. One proposal to simply reduce the threshold for significance by a factor of 10, to a p-value of 0.005—with studies meeting the current threshold deemed “suggestive” of an effect—has met with a mixed response.

Proponents—and Ioannidis himself is among the signatories to that call for a redefinition of significance—argue the change would help reduce false-positive results and address the lack of reproducibility in scientific studies claiming new discoveries. Lowered thresholds have already been used with success in studies looking for associations in population genomics datasets, as the simulation sketched below illustrates.
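
A small simulation makes the argument concrete. The setup below is purely illustrative (a mix of null and real effects analysed with two-sample t-tests) and is not taken from the redefinition proposal itself, but it shows how a stricter 0.005 cutoff shrinks the share of false "discoveries" among significant results.

```python
# A minimal simulation (assumed setup, for illustration only): 10,000 studies,
# one in ten testing a real effect, each analysed with a two-sample t-test,
# then counted against the current 0.05 cutoff and the proposed 0.005 cutoff.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group, true_fraction, effect_size = 10_000, 30, 0.10, 0.8
thresholds = (0.05, 0.005)
false_hits = {alpha: 0 for alpha in thresholds}
true_hits = {alpha: 0 for alpha in thresholds}

for _ in range(n_tests):
    is_real = rng.random() < true_fraction
    group_a = rng.normal(0.0, 1.0, n_per_group)
    group_b = rng.normal(effect_size if is_real else 0.0, 1.0, n_per_group)
    p = stats.ttest_ind(group_a, group_b).pvalue
    for alpha in thresholds:
        if p < alpha:
            (true_hits if is_real else false_hits)[alpha] += 1

for alpha in thresholds:
    total = true_hits[alpha] + false_hits[alpha]
    print(f"alpha={alpha}: {false_hits[alpha]} of {total} significant results are false positives")
```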

He cautioned, however, that this may be a short-term fix in other types of biomedical research, working “as a dam that could help gain time and prevent drowning by a flood of statistical significance,” while other statistical approaches, such as Bayesian inferential tools, are sought.
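
As one rough illustration of what a Bayesian alternative might look like, the sketch below computes an approximate Bayes factor for a two-group comparison using the common BIC approximation; both the method choice and the simulated data are assumptions made here for illustration, not an approach endorsed in the article.

```python
# Rough sketch of one possible Bayesian tool: a Bayes factor for a two-group
# comparison via the BIC approximation. The method and the simulated data are
# assumptions for illustration only.
import numpy as np

def gaussian_bic(residuals, n_params):
    """BIC for a Gaussian model, up to a constant shared by both models."""
    n = residuals.size
    rss = np.sum(residuals ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

def approx_bayes_factor(a, b):
    """BF_10: evidence that the group means differ versus that they are equal."""
    pooled = np.concatenate([a, b])
    bic_null = gaussian_bic(pooled - pooled.mean(), n_params=2)   # common mean + variance
    bic_alt = gaussian_bic(np.concatenate([a - a.mean(), b - b.mean()]),
                           n_params=3)                            # two means + variance
    return np.exp((bic_null - bic_alt) / 2)

rng = np.random.default_rng(1)
bf = approx_bayes_factor(rng.normal(0.0, 1.0, 50), rng.normal(0.5, 1.0, 50))
print(f"Approximate Bayes factor in favour of a real difference: {bf:.1f}")
```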

That could include abandoning statistical significance thresholds or p-values altogether. With big data increasingly being tapped in healthcare, statistical significance is becoming irrelevant as “extremely low P values are routinely obtained for signals that are too small to be useful even if true.”
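
That point is easy to reproduce with a toy example (the sample size and effect below are assumed for illustration): with millions of records per group, a difference of a hundredth of a standard deviation, far too small to matter in practice, still yields a vanishingly small p-value.

```python
# Illustrative only (assumed numbers): with millions of records per group, a
# shift of 0.01 standard deviations, negligible in practical terms, still
# comes out as overwhelmingly "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 2_000_000
baseline = rng.normal(0.00, 1.0, n)
treated = rng.normal(0.01, 1.0, n)   # a trivially small shift
print(stats.ttest_ind(baseline, treated).pvalue)   # vanishingly small, despite a negligible effect
```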

If p-values continue to be used—which seems likely, at least in the near term—Ioannidis does believe that “lower thresholds are probably preferable for most observational research.”