Definition of Probability

7/15/19

Frank Harrell, a noted frequentist-turned-Bayesian, wrote a Glossary of Statistical Terms. It is a good resource, and Harrell is an expert statistician of the 'we disagree with him on certain points, but we ignore him at our peril' kind. Still, the entry for probability, and its description of the frequentist definition of probability in particular, strikes me as odd and biased toward subjective Bayesianism.

Below I give the full entry (minus footnotes) for probability as of 7/14/19. Afterwards, I break down a few places where I disagree.

probability: The probability that an event will occur, that an invisible event has already occurred, or that an assertion is true, is a number between 0 and 1 inclusive such that (1) the probability of some possible alternative occurring is 1, and (2) the probability of any of a set of mutually exclusive events (i.e., union of events) occuring is the sum of the individual event probabilities. The meaning attached to the metric known as a probability is up to the user; it can represent long-run relative frequency of repeatable observations, a degree of belief, or a measure of veracity or plausibility. In the frequentist school, the probability of an event denotes the limit of the long-term fraction of occurrences of the event. This notion of probability implies that the same experiment which generated the outcome of interest can be repeated infinitely often. Even a coin will change after 100,000 flips. Likewise, some may argue that a patient is "one of a kind" and that repetitions of the same experiment are not possible. One could reasonably argue that a "repetition" does not denote the same patient at the same stage of the disease, but rather any patient with the same severity of disease (measured with current technology). There are other schools of probability that do not require the notion of replication at all. For example, the school of subjective probability (associated with the Bayesian school) "considers probability as a measure of the degree of belief of a given subject in the occurrence of an event or, more generally, in the veracity of a given assertion". de finetti defined subjective probability in terms of wagers and odds in betting. A risk-neutral individual would be willing to wager $P that an event will occur when the payoff is $1 and her subjective probability is P for the event. The domain of application of probability is all-important. 
We assume that the true event status (e.g., dead/alive) is unknown, and we also assume that the information the probability is conditional upon (e.g. Pr{death | male, age=70}) is what we would check the probability against. In other words, we do not ask whether Pr(death | male, age=70) is accurate when compared against Pr(death | male, age=70, meanbp=45, patient on downhill course). It is difficult to find a probability that is truly not conditional on anything. What is conditioned upon is all important. Probabilities are maximally useful when, as with Bayesian inference, they condition on what is known to provide a forecast for what is unknown. These are "forward time" or "forward information flow" probabilities. Forward time probabilities can meaningfully be taken out of context more often than backward-time probabilities, as they don't need to consider "what might have happened." In frequentist statistics, the P-value is a backward information flow probability, being conditional on the unknown effect size. This is why P-values must be adjusted for multiple data looks ("what might have happened") whereas the current Bayesian posterior probability merely override any posterior probabilities computed at earlier data looks, because they now condition on current data. As IJ Good has written, the axioms defining the "rules" under which probabilities must operate (e.g., a probability is between 0 and 1) do not define what a probability actually means. He also states that all probabilities are subjective, because they depend on the knowledge of the particular observer.

On to the breakdown.

"This notion [frequentist - J] of probability implies that the same experiment which generated the outcome of interest can be repeated infinitely often." In the previous sentence, Harrell talks about a limit and the long term (which I do not object to). But a limit and the long term do not necessarily imply literally the same experiment, nor that anything needs to literally be repeated an infinite number of times. The notion of a limit of relative frequency does not require an actual literal infinity any more than, say, R computing an area under a curve requires an actual literal infinity of rectangles.
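To make the rectangle analogy concrete, here is a minimal sketch. It is written in Python rather than R, and the function names (midpoint_area, std_normal_pdf) and the choice of the standard normal curve on [-1.96, 1.96] are my own assumptions for illustration. It shows a finite number of rectangles getting as close as we like to the exact area.

```python
import math

def midpoint_area(f, a, b, n):
    """Approximate the area under f on [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Exact value of P(-1.96 < Z < 1.96), via the error function.
exact = math.erf(1.96 / math.sqrt(2.0))

# Absolute error of the rectangle approximation for a few finite n.
errors = {}
for n in (10, 100, 1000):
    approx = midpoint_area(std_normal_pdf, -1.96, 1.96, n)
    errors[n] = abs(approx - exact)
    print(f"{n:>5} rectangles: area = {approx:.8f}, error = {errors[n]:.2e}")
```

No infinity is ever computed; the error simply shrinks as the number of rectangles grows, which is all the word "limit" is claiming.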

I also don't like the wording on "invisible event", but think I know what he is saying, and will let that one go.

"Even a coin will change after 100,000 flips." If that is the case, it can be tested for. One could also simply substitute an identical coin (say, identical in the sense of coming from the same manufacturing process and passing quality control), much like casinos do for dice and cards. We can also just simulate "coins" on the computer to see easily that the long-term relative frequencies converge, without worrying about any wear-and-tear arguments. And how wear and tear happens might well be more predictable and easier to understand than people's beliefs.
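Here is a minimal sketch of such a simulated coin. The function name, the seed, and the checkpoint sizes are my own choices for illustration, and the "coin" is just a pseudorandom number generator assumed to land heads with probability 0.5.

```python
import random

def running_relative_frequency(n_flips, p_heads=0.5, seed=2019):
    """Flip a simulated coin n_flips times and record the relative
    frequency of heads at a few checkpoints along the way."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = 0
    checkpoints = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < p_heads  # True counts as 1, False as 0
        if i in (10, 100, 1000, 10000, 100000):
            checkpoints[i] = heads / i
    return checkpoints

freqs = running_relative_frequency(100_000)
for n, freq in freqs.items():
    print(f"{n:>7} flips: relative frequency of heads = {freq:.4f}")
```

Changing the seed changes the early values but not where the relative frequency settles, and no coin gets worn out along the way.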

"Likewise, some may argue that a patient is 'one of a kind' and that repetitions of the same experiment are not possible." I'm not sure I'd even want literally identical settings. I want to make inferences, so if everything is literally identical, findings may not generalize but only apply to that exact setup. I'd say that people are one of a kind but have similarities too, so large-number arguments can apply very well, as with life insurance tables.

I'd also note that Harrell is a distinguished biostatistician, so he is understandably very biostatistics focused. However, there are tons of other areas, of course, that use probability and statistics (survey sampling, quality control, experimental design, econometrics, etc.) that look at things other than patients.

"There are other schools of probability that do not require the notion of replication at all." Then, as both a practitioner and a consumer, I'd have to ask those schools how much credence we can put in their results, considering that we could have sampled other data than what we did, and that just one experiment, no matter how good, is not enough to make an informed decision or establish an effect.

"For example, the school of subjective probability (associated with the Bayesian school)..." I just wanted to mention that it is not just loosely "associated with"; there is literally a "subjective Bayesian" school.

"de finetti defined subjective probability in terms of wagers and odds in betting. A risk-neutral individual would be willing to wager $P that an event will occur when the payoff is $1 and her subjective probability is P for the event." This is only what the subjectivist would be willing to bet against him/herself, however, not what everyone would be willing to bet. Betting doesn't convert subjective into objective.

"Probabilities are maximally useful when, as with Bayesian inference, they condition on what is known to provide a forecast for what is unknown." Since frequentist methods use known data and can provide forecasts, I don't understand this point.

"These [Bayesian - J] are 'forward time' or 'forward information flow' probabilities. Forward time probabilities can meaningfully be taken out of context more often than backward-time probabilities, as they don't need to consider 'what might have happened.'" What Harrell is calling "what might have happened" is counterfactual reasoning, modus tollens logic, falsification, and argument by contradiction. All of these things are valid and essential in science and causality. Just because a Bayesian approach doesn't need to consider this for the math to work doesn't mean Bayesians shouldn't consider it.

"This is why P-values must be adjusted for multiple data looks ('what might have happened') whereas the current Bayesian posterior probability merely override any posterior probabilities computed at earlier data looks, because they now condition on current data." But from another viewpoint, correcting this way is proper, and not correcting (i.e., look all you want, don't worry about optional stopping, etc.) is highly suspect.
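To show why uncorrected repeated looks worry me, here is a minimal simulation sketch. Everything in it is an assumption of mine for illustration: a two-sided z-test with known variance 1, data generated under a true null, an unadjusted 1.96 cutoff at every look, and a made-up schedule of five looks. It compares the false-positive rate of a single look at n = 100 with that of stopping at the first "significant" look.

```python
import math
import random

def rejection_rate(looks, n_sims=4000, z_crit=1.96, seed=2019):
    """Share of simulated null experiments that reject at ANY of the
    given sample sizes, using the same unadjusted cutoff at each look."""
    rng = random.Random(seed)
    last = max(looks)
    look_set = set(looks)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, last + 1):
            total += rng.gauss(0.0, 1.0)  # null is true: mean 0, sd 1
            # z-statistic for the mean with known sd 1 is sum / sqrt(n)
            if n in look_set and abs(total / math.sqrt(n)) > z_crit:
                rejections += 1
                break  # stop the experiment at the first "significant" look
    return rejections / n_sims

single = rejection_rate([100])
repeated = rejection_rate([20, 40, 60, 80, 100])
print(f"one look at n=100:       false-positive rate = {single:.3f}")
print(f"five unadjusted looks:   false-positive rate = {repeated:.3f}")
```

The single look should land near the nominal 5%, while peeking five times without any adjustment pushes the rate well above it. That inflation is what the multiplicity adjustments are there to control.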

Elsewhere in the document Harrell states "frequentist testing requires complex multiplicity adjustments but provides no guiding principles for exactly how those adjustments should be derived." This is semantics, and maybe even false, as there are many methods with detailed reasoning behind them.

"He [Good - J] also states that all probabilities are subjective, because they depend on the knowledge of the particular observer." This cannot be true, because relative frequencies converge regardless of the observer and their beliefs. Even a robot designed to keep track of frequencies would record the same convergence.

Harrell also references Kolmogorov's axioms of probability. It is important to note that Kolmogorov himself rested his axioms on von Mises' work.

"The basis for the applicability of the results of the mathematical theory of probability to real 'random phenomena' must depend on some form of the frequency concept of probability, the unavoidable nature of which has been established by von Mises in a spirited manner."

Just perusing through, there are a few other odd entries (and again, I'm talking about two or three entries out of many more useful ones, so I don't want to sound too critical here). For example, the variance entry:

variance: A measure of the spread or variability of a distribution, equaling the average value of the squared difference between measurements and the population mean measurement. From a sample of measurements, the variance is estimated by the sample variance, which is the sum of squared differences from the sample mean, divided by the number of measurements minus 1. The minus 1 is a kind of "penalty" that corrects for estimating the population mean with the sample mean. Variances are typically only useful when the measurements follow a normal or at least a symmetric distribution.

The last line is not correct for a general entry on "variance". In complex survey statistics, say in economic establishment surveys, which have a lot of skewness, variance is still calculated and useful, especially as input into a coefficient of variation. For example, one might calculate stratified jackknife, Taylor series approximation, random groups, delete-a-group jackknife, and other variance estimates. See Introduction to Variance Estimation by Wolter.
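As a toy illustration only, here is a sketch that uses made-up lognormal "revenue" values as a stand-in for skewed establishment data. It assumes a simple random sample and a plain delete-one jackknife, not the stratified or delete-a-group designs a real survey would use, and all names in it are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(2019)
# Hypothetical right-skewed "establishment revenue" values (lognormal draws).
revenue = [rng.lognormvariate(10.0, 1.0) for _ in range(200)]

n = len(revenue)
mean = statistics.mean(revenue)
median = statistics.median(revenue)
s2 = statistics.variance(revenue)  # sample variance, divides by n - 1

def jackknife_variance_of_mean(values):
    """Delete-one jackknife estimate of the variance of the sample mean."""
    m = len(values)
    total = sum(values)
    # Mean of the sample with each observation left out in turn.
    leave_one_out = [(total - v) / (m - 1) for v in values]
    center = sum(leave_one_out) / m
    return (m - 1) / m * sum((x - center) ** 2 for x in leave_one_out)

var_mean_jk = jackknife_variance_of_mean(revenue)
cv = math.sqrt(var_mean_jk) / mean  # coefficient of variation of the mean

print(f"mean = {mean:.0f}, median = {median:.0f} (mean > median: right skew)")
print(f"jackknife variance of the mean = {var_mean_jk:.1f}")
print(f"s^2 / n                        = {s2 / n:.1f}")
print(f"coefficient of variation of the mean = {cv:.3f}")
```

For the sample mean, the delete-one jackknife reproduces s^2 / n, so the two printed variances should agree up to rounding. The point is that the variance, and the coefficient of variation built from it, remains a perfectly usable precision measure even though the data are far from symmetric.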

For discussions of frequentism, Bayesian, etc., I strongly recommend checking out Mayo's book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.

Any comments/corrections appreciated, and thanks for reading.

