The Limits of Statistical Methodology: Why A “Statistically Significant” Number of Published Scientific Research Findings are False, #3.

Mr Nemo

May 13, 2024

By Joseph Wayne Smith

***

1. Introduction

2. Troubles in Statistical Paradise

3. A Critique of Bayesianism

4. The Limits of Probability Theory

5. Conclusion

***

The essay that follows below will be published in four installments; this is the third.

But you can also download and read or share a .pdf of the complete text of this essay, including the REFERENCES, by scrolling down to the bottom of this post and clicking on the Download tab.

***

3. A Critique of Bayesianism

Some methodologists advocate replacing significance tests with alternative methods, “the new statistics” of estimation, confidence intervals, and meta-analysis (Cumming, 2012, 2014; Gelman, 2014; Morey et al., 2014, 2016). Cumming makes many sound points, including 25 guidelines for improving psychological research: for example, not trusting any p-value, and accepting that any result is “one possibility from an infinite sequence” (Cumming, 2014: p. 8). However, as an alternative statistical framework to null hypothesis significance testing (NHST), the new statistics has attracted many published criticisms. In general, its critics argue from a Bayesian perspective that the frequentist approach using confidence intervals leads to inconsistent inferences, and that confidence intervals do not solve the existing problems with NHST (Dienes, 2011). However, as we will now see, Bayesianism itself fares no better, and has its own conceptual difficulties.

This major competing school of thought, Bayesianism, holds that the inductive support for hypotheses is assessed on the basis of subjective and objective factors. The subjective factor is the prior probability of a hypothesis before the evidence is assessed. It is subjective because epistemic subjects will frequently differ in the prior probability Pr(h) that they assign to a hypothesis h. The objective factor consists of direct inference probabilities that a hypothesis h is supported by evidence e. More explicitly,

Bayes’ theorem combines these direct inference probabilities with a subject’s prior probabilities to produce the subject’s posterior probability, that is, the subject’s probability judgment after the evidence has been considered. It relates the posterior (or later-coming) probability of a hypothesis, Pr(h/e), to Pr(h), Pr(e/h), and Pr(e), so that knowing the values of the last three terms enables the calculation of Pr(h/e) as:

Bayes’ Theorem: Pr(h/e) = [Pr(e/h).Pr(h)] / Pr(e),

for Pr(h), Pr(e) > 0 (Smith et al., 1999: p. 33).

Therefore:

scientific inference involves moving from the prior probability Pr(h) of a hypothesis to its posterior probability Pr(h/e) on the basis of the evidence collected, such that if Pr(h/e) > Pr(h) then e confirms or supports h. If Pr(h/e) < Pr(h) then e disconfirms or refutes h. (Smith et al., 1999: p. 33)
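The definitions above can be made concrete with a small numerical sketch. It is only an illustration: the function names and all of the numbers are hypothetical, and Pr(e) is obtained from the law of total probability for a single hypothesis h and its negation, which is an assumption added here rather than something stated in the text.

```python
def total_probability(prior_h, e_given_h, e_given_not_h):
    """Pr(e) = Pr(e/h).Pr(h) + Pr(e/~h).Pr(~h) for a hypothesis and its negation."""
    return e_given_h * prior_h + e_given_not_h * (1 - prior_h)


def posterior(prior_h, e_given_h, prob_e):
    """Bayes' theorem: Pr(h/e) = Pr(e/h).Pr(h) / Pr(e), for Pr(h), Pr(e) > 0."""
    if not (0 < prior_h <= 1) or not (0 < prob_e <= 1):
        raise ValueError("Pr(h) and Pr(e) must lie in (0, 1]")
    return e_given_h * prior_h / prob_e


def verdict(prior_h, posterior_h):
    """Classify e as confirming, disconfirming, or neutral with respect to h."""
    if posterior_h > prior_h:
        return "confirms"
    if posterior_h < prior_h:
        return "disconfirms"
    return "neutral"


# Hypothetical inputs: one subject's prior and two direct inference probabilities.
prior = 0.3
e_given_h = 0.8
e_given_not_h = 0.2

p_e = total_probability(prior, e_given_h, e_given_not_h)
post = posterior(prior, e_given_h, p_e)
print(round(post, 3), verdict(prior, post))
```

With these made-up inputs the posterior is about 0.63, which exceeds the prior of 0.3, so e counts as confirming h. A subject who started from a different prior would obtain a different posterior from the same evidence.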

Just as the conventional significance testing approach has been subject to extensive criticism, so too has the Bayesian approach. Its critics believe that it has severe limitations and cannot provide a complete statistical methodology for the sciences. They raise problems about the limits of rationality and the cognitive capacities of Bayesian subjects, question the claim that degrees of justification are Bayesian probabilities, and demonstrate the mathematical and computational intractability of Bayesian methods for even simple problems (Kyburg, 1978, 1993; Hylland & Zeckhauser, 1979; Garber, 1983; Sowden, 1984; Humberg, 1987; Van Fraassen, 1988; Earman, 1989; Eells, 1990; Howson, 1991; Zynda, 1995; Wagner, 1997; Barnes, 1999; Gunn et al., 2016). What is interesting about this debate, if one adopts a neutral standpoint, is that the experts on each side seem to make telling criticisms of the opposing statistical methodology without begging the question by assuming that their own position is correct. Examples are (i) the argument from the computational intractability of the Bayesian approach, which holds whether or not significance tests face independent criticisms, such as the criticism that p-values are misunderstood as posterior probabilities of the null hypothesis; and (ii) the criticism that “no difference” is commonly and fallaciously deduced from “no significant difference,” and that “non-significant” is equated with “no effect” (Hill, 1965: pp. 299–300; Greenland, 2011). This raises the threat of epistemological skepticism, only this time for the sciences. It certainly poses a very severe challenge to expert knowledge, and there are many astonishing claims made in the technical literature.

Let us consider one of the core foundational challenges to Bayesianism: the argument that there are no good reasons for believing that epistemic subjects have degree-of-confidence assignments that in general obey the axioms of the Pascalian probability calculus (Kaplan, 1989). The critical allegation to be considered is that there is little reason for supposing that betting provides a method for demonstrating the existence of degrees of belief (Milne, 1991).

The Bayesian claims that degrees of belief exist because he/she can measure them. The standard Bayesian argument for this, to paraphrase Glymour, is as follows. No rational agent will accept a bet on which a loss is expected, but a rational agent will accept a bet on which a gain is expected. The degree of belief in a proposition P is measured by the highest amount U that a person will pay in order to receive U+V, for a fixed V, if P is true, and to receive nothing if P is not true. If U is the greatest amount the agent is willing to pay for the bet, then the expected gain on paying U is zero. If P is the case, then the agent’s gain is V, but if P is not the case, the gain is –U. Therefore:

V.Pr(P) + (–U).Pr(~P) = 0.

Since

Pr(~P) = 1 – Pr(P),

then:

Pr(P) = U / (U+V).

Thus, the rational agent striving to maximize expected gain will make a bet if the expected gain is greater than zero. The degree of belief is then determined by the betting odds accepted (Glymour, 1980: pp. 69–70).
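The algebra of the betting argument can be checked with a short sketch. The stake U = 3 and net win V = 1 are hypothetical values chosen only for illustration, and the function names are not Glymour's; exact fractions are used so that the zero expected gain is not blurred by rounding.

```python
from fractions import Fraction


def betting_quotient(stake_u, net_win_v):
    """Degree of belief implied by the bet: Pr(P) = U / (U + V)."""
    u, v = Fraction(stake_u), Fraction(net_win_v)
    if u < 0 or v <= 0:
        raise ValueError("U must be non-negative and V must be positive")
    return u / (u + v)


def expected_gain(stake_u, net_win_v, prob_p):
    """Expected gain of the bet: V.Pr(P) + (-U).Pr(~P)."""
    return Fraction(net_win_v) * prob_p - Fraction(stake_u) * (1 - prob_p)


# Hypothetical agent: will pay at most 3 units to receive 4 units if P is true.
u, v = 3, 1
p = betting_quotient(u, v)

print(p)                        # implied degree of belief
print(expected_gain(u, v, p))   # zero at the maximum acceptable stake
print(expected_gain(2, v, p))   # positive: a cheaper bet would be accepted
print(expected_gain(4, v, p))   # negative: a dearer bet would be refused
```

On these numbers the implied degree of belief is 3/4, and the expected gain changes sign exactly at U = 3, which is the step from the zero-expected-gain equation to Pr(P) = U / (U+V).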

However, the problem with this argument is that it is circular. For the rational agent to contemplate betting at all in this situation, so that the betting odds are accepted, requires positing a wealth of prior beliefs about the betting set-up itself: namely, that the bet will pay if he/she wins, that the set-up is fair, and so on. Thus, the argument presupposes degrees of belief rather than proving their existence. In addition, there are many beliefs about which we may have a feeling of plausibility, but on which we are not prepared to gamble because maximizing expected gain is socially inappropriate. The juror’s belief about an accused person’s guilt or innocence is an example. Betting language seems inappropriate in that context.

Beyond this, even if there are degrees of belief, most people, as we have seen, cannot reasonably attach a specific number to a required level of confidence, and the task becomes even more difficult, perhaps impossible, when a large body of evidence is presented. The numbers to be plugged into Bayes’ theorem will then be essentially arbitrary (Humphreys, 1988).

No Bayesian has shown how the Bayesian methodology could be practically applied in a real evidential situation, for example, a criminal trial involving thousands of items of evidence. The use of Bayesian methodology in law can serve as a test case. For example, updating probabilities by Bayesian conditionalization where a mere 30 pieces of evidence are introduced would require the consideration of billions of probabilities (Bergman & Moore, 1991). Justice David Hodgson, of the Supreme Court of New South Wales, had this to say about the practical application of Bayes’ theorem to a legal problem in evidence:

As an exercise, I have written a judgment for the hypothetical case, which applies Bayes’ theorem, and set it out in an Appendix. It required two assumptions of prior probabilities of hypotheses, and twelve Bayesian steps, each involving two assumptions of numerical probabilities of evidence, given the truth or falsity of hypotheses: twenty-six guesses in all. In all twenty-six, I found I had virtually no confidence in the numbers I initially selected (in some cases partly because of unsureness of exactly what question I was asking, as well as because I just had to guess the answer); and I felt I had to check the numbers against the plausibility of the results, and then adjust (and re-adjust) the numbers, in order to arrive at numbers in which I had very slightly more confidence. (That is, I had to cheat.) Such little confidence as I ended up with depended very heavily on my common-sense assessment of the plausibility of the intermediate results and the conclusion.

I think my hypothetical case shows that, for ordinary contested cases, it is fanciful to envisage a process by which a court manipulates probabilities fixed upon for certain basic statements (premisses) to arrive at a decision of the case (conclusion). In all steps from the premisses to the conclusion, a judge will generally have in the forefront of her mind the actual particular circumstance of the case, and will be making common sense judgments of (non-quantitative) probability in making these steps (as well as in determining upon the premisses). Indeed, the ultimate decision on the facts will generally itself be a common-sense judgment of non-quantitative probability concerning the overall situation, of very much the same kind as gave rise to the premisses — and very often the judge will (rightly) be more confident of reaching a correct overall conclusion ‘on the balance of probabilities’ than of assigning even approximate numerical probabilities to the premisses. (Hodgson, 1995: p. 56)
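Two of the numerical points in this passage can be illustrated with a rough sketch. First, if each of 30 items of evidence is treated as simply present or absent, a full joint probability table has 2^30 entries, which is one way a figure in the billions could arise; this is an assumption made here for illustration and not necessarily Bergman and Moore's own calculation. Second, a chain of twelve Bayesian steps is very sensitive to the guessed inputs. The sketch below is a simplification and does not reproduce Hodgson's appendix: it uses one prior rather than two, assumes the items are conditionally independent, and all of the numbers are invented.

```python
def joint_table_size(n_items):
    """Entries in a full joint distribution over n present/absent evidence items."""
    return 2 ** n_items


def chained_posterior(prior, likelihood_pairs):
    """Posterior for h after a chain of Bayesian steps.

    Each pair is (Pr(e/h), Pr(e/~h)). ASSUMPTION: the items of evidence are
    conditionally independent given h and given ~h, so the likelihood ratios
    simply multiply into the odds.
    """
    odds = prior / (1 - prior)
    for e_given_h, e_given_not_h in likelihood_pairs:
        odds *= e_given_h / e_given_not_h
    return odds / (1 + odds)


# Hypothetical guesses (not Hodgson's figures): a prior of 0.5 and twelve
# steps, each with a guessed Pr(e/h) and Pr(e/~h).
prior = 0.5
steps = 12

central = chained_posterior(prior, [(0.6, 0.5)] * steps)
low = chained_posterior(prior, [(0.5, 0.5)] * steps)    # each Pr(e/h) guessed 0.1 lower
high = chained_posterior(prior, [(0.7, 0.5)] * steps)   # each Pr(e/h) guessed 0.1 higher

print(joint_table_size(30))
print(round(low, 3), round(central, 3), round(high, 3))
```

In this toy chain, moving each guessed Pr(e/h) by 0.1 in either direction carries the final posterior from 0.5 to roughly 0.98, which is consistent with Hodgson's report that he had to check his numbers against the plausibility of the results.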

Philosophers Kevin Kelly and Clark Glymour are skeptical that Bayesianism captures the logic of scientific justification and have said:

the sweeping consistency conditions implied by Bayesian ideals are computationally and mathematically intractable even for simple logical and statistical examples. (Kelly & Glymour, 2004: p. 95)

Indeed, Kelly and Glymour claim that Bayesian confirmation “is not even the right sort of thing to serve as an explication of scientific justification” (Kelly & Glymour, 2004: p. 95), because:

Bayesian confirmation is just a change in the current output of a particular strategy or method for updating degrees of belief, whereas scientific justification depends on the truth-finding performance of the methods we use, whatever they might be. (Kelly & Glymour, 2004: pp. 95–96)

In particular,

conditional probabilities can fluctuate between high and low values any number of times as evidence accumulates, so an arbitrary high degree of confirmation tells us nothing about how many fluctuations might be forthcoming in the future or about whether an alternative method might have required fewer. (Kelly & Glymour, 2004: pp. 95–96)

For more critical argumentation along these lines, see also Kelly & Schulte (1995), Allen (1996–1997), Ligertwood (1996–1997), and Norton (2011).
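Kelly and Glymour's point about fluctuating conditional probabilities can be illustrated with a toy example. The set-up is entirely hypothetical: h says that a coin lands heads with probability 4/5, its single rival says 1/5, the prior is 1/2, and the sequence of tosses is hand-picked to make the posterior swing. The thresholds of 9/10 and 1/10 for "high" and "low" confirmation are likewise assumptions.

```python
from fractions import Fraction


def posterior_path(prior, p_heads_h, p_heads_rival, tosses):
    """Posterior Pr(h/e) after each toss, for h against a single rival hypothesis."""
    odds = prior / (1 - prior)
    path = []
    for toss in tosses:
        if toss == "H":
            odds *= p_heads_h / p_heads_rival
        else:
            odds *= (1 - p_heads_h) / (1 - p_heads_rival)
        path.append(odds / (1 + odds))
    return path


def count_swings(path, high=Fraction(9, 10), low=Fraction(1, 10)):
    """Count moves between 'high' (> high) and 'low' (< low) confirmation."""
    swings, state = 0, None
    for p in path:
        if p > high:
            new_state = "high"
        elif p < low:
            new_state = "low"
        else:
            new_state = state   # in between: keep the previous classification
        if state is not None and new_state != state:
            swings += 1
        state = new_state
    return swings


# Hand-picked, hypothetical evidence sequence.
tosses = "HH" + "TTTT" + "HHHH" + "TTTT" + "HHHH"
path = posterior_path(Fraction(1, 2), Fraction(4, 5), Fraction(1, 5), tosses)

print(path[-1])             # final posterior
print(count_swings(path))   # number of high/low reversals along the way
```

The final posterior is 16/17, which looks like strong confirmation, yet the same path passed through four reversals between high and low values on the way there. Nothing in the final number indicates how many further reversals might follow.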

Bayesianism also faces the difficulty of explaining where the initial priors come from in order to start the inferential process (Simpson & Orlov, 1979–1980). A logician sympathetic to Bayesianism, Patrick Suppes, has pointed out that “there is an almost total absence of a detailed discussion of the highly differentiating nature of past experience in forming a prior” (Suppes, 2007: p. 441). About this problem R.A. Fisher has said that Bayesians

seem forced to regard mathematical probability, not as an objective quantity measured by observable frequencies, but as measuring merely psychological tendencies, theorems … which are useless for scientific purposes. (Fisher, 1960: pp. 6–7)

Similarly, Redmayne concluded this about subjective Bayesianism:

When the only constraint on rational belief is coherence among a belief set, it can seem that anything goes. (Redmayne, 2003: p. 276)

For example, in a criminal law context, if the prior probability of guilt is taken to be zero, then as Eggleston puts it, “no amount of evidence could justify a conviction, since to assume an initial probability of zero is to postulate that guilt is impossible” (Eggleston, 1991: p. 276). On the other hand, “no lawyer would accept the proposition that the case should start with any particular presumption as to the probability of guilt” (Eggleston, 1991: p. 276). Rawling has argued that a Bayesian juror starting from an initial presumption of innocence will virtually never reach a judgment of “guilty beyond a reasonable doubt” (Eggleston, 1991: p. 276; Rawling, 1999).
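The role of the prior in this example can be sketched numerically using the odds form of Bayes' theorem. The figures are assumptions made purely for illustration and do not come from Eggleston or Rawling: "beyond a reasonable doubt" is modelled as a posterior of at least 99/100, and a presumption of innocence as a prior of one in a million.

```python
from fractions import Fraction


def update(prior, likelihood_ratio):
    """Posterior probability of guilt, given the prior and the combined
    likelihood ratio Pr(e/guilt) / Pr(e/innocence)."""
    if prior == 0 or prior == 1:
        return prior    # dogmatic priors are unmoved by any evidence
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)


def required_likelihood_ratio(prior, threshold):
    """Combined likelihood ratio needed to lift `prior` up to `threshold`."""
    if prior == 0:
        raise ValueError("no finite likelihood ratio can move a zero prior")
    return (threshold / (1 - threshold)) / (prior / (1 - prior))


# Hypothetical figures, not taken from the sources cited in the text.
threshold = Fraction(99, 100)
tiny_prior = Fraction(1, 1_000_000)

print(update(Fraction(0), Fraction(10**12)))    # zero prior stays at zero
needed = required_likelihood_ratio(tiny_prior, threshold)
print(needed)                                   # evidence strength required
print(update(tiny_prior, needed))               # just reaches the threshold
```

With a zero prior, even evidence a trillion times more probable under guilt than under innocence leaves the posterior at zero. With the one-in-a-million prior, the evidence as a whole would need a likelihood ratio of nearly 99 million to reach the assumed threshold.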

Finally, to continue my legal example, even if we did grant that jurors had degrees of belief that ideally obeyed the Pascalian probability calculus, there is another reason for regarding Bayesianism as unsatisfactory. As Shafer has observed, there is a constructive character to personalistic probability judgments: these probability opinions are not ready-made in a subject’s mind. Rather, such probabilities arise from matching the problem at hand to background canonical examples where there are known probabilities (Shafer, 1986). As Shafer puts it, this process of construction involves us

constructing an argument, an argument that draws an analogy between our actual evidence and the knowledge of objective probabilities in a complex physical experiment or game of chance. (Shafer, 1986: p. 802)

In doing this, that is, in constructing an explanatory structure that accounts for the evidence at hand, the juror (or epistemic subject in general) does not attempt to form a conjunction C1 & C2 & … & Cn of statements and then obtain a probability for it via the multiplication rule. Rather, the juror attempts to assess whether the plaintiff’s or the defendant’s explanatory structure more adequately accounts for the evidence as a whole. In this sense, personalistic probabilities at the final stage will be relevant only to entire systems of evidence, not to isolated evidential propositions as Bayesians suppose (Pardo, 2000).
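For contrast, the conjunction calculation that the juror is said not to perform can be sketched as follows. The sketch assumes, purely for illustration, that the component statements are probabilistically independent, so that the multiplication rule reduces to a simple product; the value 0.9 for each component is hypothetical.

```python
import math


def conjunction_probability(component_probs):
    """Pr(C1 & C2 & ... & Cn) by the multiplication rule.

    ASSUMPTION: the components are independent, so the probability of the
    conjunction is the product of the individual probabilities.
    """
    return math.prod(component_probs)


# Hypothetical case: every component statement is judged 0.9 probable.
for n in (1, 5, 10, 20):
    print(n, round(conjunction_probability([0.9] * n), 3))
```

On these numbers a case resting on ten components, each judged 0.9 probable, has a conjunction probability of about 0.35, and one resting on twenty has about 0.12. The item-by-item product falls quickly even when every component is individually well supported, which is a very different exercise from comparing two explanatory structures against the evidence as a whole.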

I conclude that the Bayesian position is flawed as a general decision theory for many good reasons, but in particular because rationality, as defined by Bayesians, is simply not a general feature of human interaction (Colman, 2003).

Download