Nobel Prize and Statistical Significance
Recently, three scientists, William Kaelin Jr, Sir Peter Ratcliffe, and Gregg Semenza, jointly won the 2019 Nobel Prize in Physiology or Medicine for their pioneering work on how human cells respond to changing oxygen levels. I understand that science is much more than just statistics, but I thought to myself, "We hear from critics how p-values, statistical significance, frequentism, etc., are supposedly bad for science, but they never seem to talk about the good these techniques do, or they outright deny there is any. What if these Nobel Prize winners use p-values and statistical significance?"
So I checked some of the laureates' papers, old and recent, in the general area of research their Nobel Prize recognized. Note that this is not to say the researchers never use any Bayesian techniques or that they always use p-values and statistical significance. In fact, I found a small number of papers using only Bayesian techniques, some using frequentist and Bayesian techniques together, and some using neither. Rather, this is to show that scientists of the highest caliber, doing some of the most important work, use p-values and statistical significance. I also didn't compare dates closely, but I'm pretty sure some of their papers were published after the ASA II and Nature pieces advising against saying "significant" and against dichotomizing.
Here is a small sampling of the results:
William Kaelin Jr
Mutant p53 induces a hypoxia transcriptional program in gastric and esophageal adenocarcinoma
- "P values from pairwise Wilcoxon's rank-sum tests"
- "P value was calculated by Pearson's correlation"
- "P values were calculated by Students' t test"
- "P values were estimated from the empirical Bayes moderated t statistics, and q values were estimated using the Benjamini-Hochberg method"
- "Pairwise comparisons between groups (experimental versus control) were performed using an unpaired 2-tailed Student's t test or Kruskal-Wallis test as appropriate. P < 0.05 was considered to be statistically significant."
- "For all panels, data presented are means ± SD; *p < 0.05; **p < 0.01; ***p < 0.001. Two-tailed p values were determined by unpaired t test."
- "Two-tailed p values were determined by unpaired t test. n.s. = nonsignificant."
- "P value was determined by a mean-based longitudinal mixed-effects model to accommodate repeated measurements within animals."
- "P value was determined by log-rank test."
- "Two-tailed P values were determined by unpaired t test. n.s., nonsignificant."
- "P values for all comparisons other than those pertaining to tumor growth, survival, and gender composition of mouse cohorts were calculated by unpaired two-tailed t test. For comparisons of two groups with significantly different variances, Welch's t test was used. For comparisons of two groups without significant differences in variances, Student's t test was used."
- "...was performed using Fisher's exact test. Statistical significance for all comparisons was determined using a nominal P value <0.05"
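One of the quotes above mentions q values estimated with the Benjamini-Hochberg method. To give a rough idea of what that adjustment does, here is a minimal Python sketch. It is my own illustration, not the authors' code; the function name `benjamini_hochberg` and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values, invented purely for illustration.
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_q = benjamini_hochberg(example_p)
```

In this made-up example the third and fourth p-values sit below 0.05 on their own, but their adjusted values land just above it (0.05125), which is the kind of correction the quoted methods sections describe.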
Sir Peter Ratcliffe
Inherent DNA-binding specificities of the HIF-1α and HIF-2α transcription factors in chromatin
- "Pvi is the P-value for gene i"
- "(P = 2 × 10^-16, Wilcoxon signed-rank test)"
- "(P-value < 0.0001)"
- "Data are shown as the mean ± SEM. Statistical analyses were performed using unpaired Student's t tests. For repeated measures, data were analysed by ANOVA followed by Tukey's multiple comparison test or t test with Holm–Sidak correction for multiple comparisons as appropriate and as described in Hodson et al. (2016). P < 0.05 was considered statistically significant."
- "Significance was tested using two-way ANOVAs (right hand column P value = chronic hypoxia factor; bottom row P value = genotype factor; right column, bottom row P value = chronic hypoxia/genotype interaction factor), followed by t tests (with Holm–Sidak correction) for analysis of individual time points; P < 0.05 highlighted in bold."
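Two of the quotes above mention t tests with Holm-Sidak correction for multiple comparisons. As a sketch only, assuming the usual step-down definition (the smallest p-value gets the strongest Sidak adjustment, and adjusted values are never allowed to decrease), it could look like this in Python. The function name and the example p-values are hypothetical, and statistical packages may differ in details.

```python
def holm_sidak(p_values):
    """Return Holm-Sidak step-down adjusted p-values, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        remaining = m - step  # hypotheses still in play at this step
        candidate = 1.0 - (1.0 - p_values[i]) ** remaining
        # Enforce monotonicity: an adjusted p-value never drops below
        # the adjusted value of a smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for three time points.
raw = [0.01, 0.04, 0.03]
adj = holm_sidak(raw)
```

With these made-up numbers only the first comparison stays below 0.05 after adjustment; the other two both end up at about 0.059.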
Gregg Semenza
Glutaminase 1 expression in colorectal cancer cells is induced by hypoxia and required for tumor growth, invasion, and metastatic colonization
- "Mann–Whitney U-test or analysis of variance (ANOVA) followed by Bonferroni post-test for multiple comparisons was used to determine p-values."
- "*P<0.05, **P<0.01, ***P<0.001 compared to normoxia, unpaired Student's t-test. n=3 independent experiments from 3 biological replicates"
- "Kaplan–Meier curves were generated using Kaplan-Meier plotter (kmplot.com) and the log-rank test was performed. For tumorigenicity assays, the Fisher exact test was performed. For all other assays, differences between two groups were analyzed by Student t test, whereas differences between multiple groups were analyzed by ANOVA with Bonferroni posttest. P values < 0.05 were considered significant for all analyses."
- "P < 0.0001"
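Fisher's exact test shows up in several of the methods sections quoted so far, for example for tumorigenicity assays. Here is a small stdlib-only Python sketch of a two-sided version for a 2x2 table. It assumes one common two-sided convention (sum the probabilities of all tables no more likely than the observed one); the function name and the counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= observed * (1 + 1e-9))


# Hypothetical assay: tumors formed in 5 of 5 control mice and 0 of 5 knockdown mice.
p_tumor = fisher_exact_two_sided(5, 0, 0, 5)
```

For this made-up table the p-value is 2/252, or roughly 0.008.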
Banerjee, Duflo, Kremer
The winners of the 2019 prize in Economics (Banerjee, Duflo, and Kremer) also use p-values and statistical significance language in their work.
- "One year after the end of the intervention, 36 months after the productive asset transfer, 8 out of 10 indices still showed statistically significant gains, and there was very little or no decline in the impact of the program on the key variables (consumption, household assets, and food security). Income and revenues were significantly higher in the treatment group in every country."
- "All treatment effects are presented as standardized z-score indices and 95% confidence intervals"
- "The aggregate test, reported in Panel C, finds that we are not able to reject equality of means across all ten measures (p-value = 0.689)"
- "Second, given that multiple families of outcomes are being reported, we correct for the potential issue of simultaneous inference using multiple inference testing."
- "An exception is Peru, where we see three results out of ten statistically significant at the 5% level."
- "P-value from t-test of equality of means"
- "Finally, for each of these outcomes, we report both the standard p-value and the p-value adjusted for multiple hypotheses testing across all the indices."
- "* significant at the 10% level, ** at the 5% level, *** at the 1% level."
- "Confidence intervals are cluster-bootstrapped at the neighborhood level."
- "F-statistics (and corresponding p-values) are from a joint test of significance in a regression of treatment on all eight variables in each round."
"Within economics, Duflo and her colleagues are sometimes referred to as the randomistas. They have borrowed, from medicine, what Duflo calls a "very robust and very simple tool": they subject social-policy ideas to randomized control trials, as one would use in testing a drug. This approach filters out statistical noise; it connects cause and effect. The policy question might be: Does microfinance work? Or: Can you incentivize teachers to turn up to class? Or: When trying to prevent very poor people from contracting malaria, is it more effective to give them protective bed nets, or to sell the nets at a low price, on the presumption that people are more likely to use something that they've paid for? (A colleague of Duflo's did this study, in Kenya.) As in medicine, a J-PAL trial, at its simplest, will randomly divide a population into two groups, and administer a "treatment" - a textbook, access to a microfinance loan - to one group but not to the other. Because of the randomness, both groups, if large enough, will have the same complexion: the same mixture of old and young, happy and sad, and every other possible source of experimental confusion. If, at the end of the study, one group turns out to have changed—become wealthier, say - then you can be certain that the change is a result of the treatment. A researcher needs to ask the right question in the right way, and this is not easy, but then the trial takes over and a number drops into view. There are other statistical ways to connect cause and effect, but none so transparent, in Duflo's view, or so adept at upsetting expectations. Randomization "takes the guesswork, the wizardry, the technical prowess, the intuition, out of finding out whether something makes a difference," she told me. And so: in the Kenya trial, the best price for bed nets was free."
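The passage above describes the basic design: randomly split a population into two groups, treat one, and compare. A toy Python sketch of that idea follows, paired with a permutation test as one simple frequentist way to judge the difference. Everything here is simulated and hypothetical (the household count, the outcome scale, the assumed effect, the function names); it is not how any particular J-PAL study was analyzed.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(treated, control, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # re-randomize who counts as "treated"
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical trial: 200 simulated households, half randomly assigned to treatment.
rng = random.Random(42)
baseline = [rng.gauss(100, 15) for _ in range(200)]  # made-up outcome scale
ids = list(range(200))
rng.shuffle(ids)                                     # the random assignment
treated_ids, control_ids = ids[:100], ids[100:]
assumed_effect = 10                                  # made-up treatment effect
treated = [baseline[i] + assumed_effect for i in treated_ids]
control = [baseline[i] for i in control_ids]
p_value = permutation_p_value(treated, control)
```

The permutation test leans on the same randomness the quote emphasizes: if the treatment did nothing, the group labels would be arbitrary, so reshuffling them shows how large a difference chance alone tends to produce.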
Those were some of the 2019 laureates, but what about laureates from earlier years?
James P. Allison
Combination CTLA-4 Blockade and 4-1BB Activation Enhances Tumor Rejection by Increasing T-Cell Infiltration, Proliferation, and Cytokine Production
- "Each curve represents 3 independent experiments of 10 mice per group. P values were calculated using the Log-rank (Mantel-Cox) test (* - p<=0.05, ** - p<=0.01, ***-p<0.001)"
- "We conclude that the combination therapy creates higher CD8/Treg ratios than CTLA-4 blockade alone (23:1 versus 11:1, p=0.0002), while also providing a significantly higher CD4 T-effector/Treg ratio compared to either α4-1BB alone (2.8:1 versus 2:1, p=0.027) or to FVAX alone (2.8:1 versus 1.8:1, p=0.0077) which α4-1BB therapy alone lacks."
- "Student's t-tests were performed to determine statistical significance between samples (* - p<=0.05, ** - p<=0.01, ***-p<0.001)"
- "Mice receiving FVAX and both αCTLA-4 and α4-1BB show enhanced ratios of CD8+ T-cells relative to CD11b+GR-1+ MDSC when compared to either FVAX (3.9 vs. 0.7, p<0.0001) or αCTLA-4 alone (3.9 vs. 2.2, p=0.03)"
- "P values were calculated using the Log-rank (Mantel-Cox) test (* - p<=0.05, ** - p<=0.01, ***-p<0.001)"
- "The treated mice showed significant increases (P=0.02) in the numbers of LCMV-specific CD8 T cells, as measured by three different MHC class I tetramers"
- "As shown in Fig. 3h, there were significant reductions in virus levels in the spleen (P=0.008), liver (P < 0.0001), lung (P=.0002) and serum (P=0.003) in the treated mice."
- "Values compared by using paired t test."
- Table 2 shows p-values
- "The graph indicates the percentage of surviving mice over time. Significant (P=0.04, log rank test) difference was found between B16GM-CSF and anti–CTLA-4–treated mice injected with control antibody and mice injected with CD4-depleting Ab"
- "Significant (A, P = 0.025; B, P = 0.0004, log rank test) differences were found between B16-GM-CSF vaccinated mice that received either anti–CTLA-4 Ab or were injected with anti-CD25 Ab plus anti-CTLA-4 Ab"
- "For each group the mean is shown ± SEM. Significant (P < 0.02, Student's t test) difference was found between B16-GM-CSF vaccinated mice that received either anti–CTLA-4 Ab alone or in combination with anti-CD25 Ab"
- "Significant difference (P = 0.03, Student's t test) was found between mice from groups 3 and 4"
- "Significance was determined by one-way ANOVA with Tukey post hoc analysis"
- "Data are represented as mean ± SEM. Unless otherwise noted, significance was determined using t tests (*p<0.05, **p<0.01, ***p<0.001, **** p<0.0001). ns, not significant"
- "Significance was assayed using chi-square tests"
- "Significance was determined using t tests"
- "Statistical analysis of pathological scores, flow cytometry, and immunohistochemical quantifications were performed by using Student's t test, one-way ANOVA, or Fisher's exact test with GraphPad Prism (GraphPad Software). Limiting dilution assay was evaluated using SPSS (SPSS) with chi-square tests. For survival analysis, Kaplan-Meier plots were drawn and statistical differences evaluated using the log rank Mantel-Cox test. A p value < 0.05 was considered statistically significant."
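The log-rank (Mantel-Cox) test appears again and again in these survival comparisons. Below is a bare-bones two-group version in Python, written as a sketch under the standard textbook formulation (observed minus expected events in one group, with hypergeometric variance, referred to a chi-square distribution with 1 degree of freedom). The function name and the survival times are hypothetical; a real analysis would use an established package such as GraphPad Prism, as the quote says.

```python
from math import erfc, sqrt


def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-square statistic and p-value.

    events_* hold 1 for an observed event and 0 for a censored time.
    Assumes at least one event time with subjects at risk in both groups
    (otherwise the variance is zero and the statistic is undefined).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in data if u >= t and g == 1)
        n = at_risk_a + at_risk_b
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d = sum(1 for u, e, _ in data if u == t and e)
        observed_a += d_a
        expected_a += d * at_risk_a / n
        if n > 1:
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square with 1 degree of freedom.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical survival times in days; every mouse has an observed event.
chi2, p = log_rank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

For these invented times the statistic is about 5.05, giving a p-value of roughly 0.025.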
Frances H. Arnold
Structure-Guided Recombination Creates an Artificial Family of Cytochromes P450
- "In order to avoid overfitting the data, p-value testing is used to determine which fragments make a significant contribution to predicting chimera folding status"
- "Blocks 1, 5, 7, and block pair 1–7 remained highly significant in the second round, whereas pairs 1–5 and 5–8 dropped in significance to p > 10^-3, a threshold established previously."
- "We observed that lower consensus energies are associated with higher T50 values (Fig. 2a; Pearson r = -0.58, P << 10^-9). Furthermore, folded proteins tend to have lower consensus energies than unfolded ones (Fig. 2b; Wilcoxon signed rank test P << 10^-9)."
- "The random field's expected fraction of functional sequences shows quantitative agreement with experimental results (r=0.95 with p<0.005). Error bars represent the binomial 95% confidence intervals calculated using the Clopper-Pearson method. (B) The expected additivity agrees well with experimentally determined values (r=0.78 with p=0.21). While the small data set limits the statistical significance of this correlation, all E[A]s are large and within the ranges that are observed experimentally."
- "A three-way analysis of variance shows the protein fold (p<0.001), specific breakpoints (p<0.001), and parent sequence identity (p<0.001) all make significant contributions to the E[fF]."
- the paper also references "Fisher's fundamental theorem of natural selection", the same R.A. Fisher of statistics fame
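One of the quotes above mentions binomial 95% confidence intervals calculated using the Clopper-Pearson method. As a rough sketch of how such an "exact" interval can be computed without special libraries, here is a Python version that inverts the binomial tail probabilities by bisection. The function names and the counts in the example are hypothetical, and this is not the authors' code.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _solve_increasing(f, target, iterations=100):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, confidence=0.95):
    """Clopper-Pearson interval for k successes out of n trials."""
    alpha = 1.0 - confidence
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - alpha / 2)
    return lower, upper


# Hypothetical count: 18 functional sequences out of 24 tested.
low, high = clopper_pearson(18, 24)
```

The interval is conservative by construction, which is why it tends to be wider than the familiar normal-approximation interval for small samples.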
Remember 2013
And lest we forget, in 2013, the Nobel Prize in Physics was shared by Peter Higgs and François Englert "for the theoretical discovery of a mechanism that contributes to our understanding of the origin of mass of subatomic particles, and which recently was confirmed through the discovery of the predicted fundamental particle, by the ATLAS and CMS experiments at CERN's Large Hadron Collider". The research papers in this area tend to use p-values and statistical significance:
- "This observation, which has a significance of 5.9 standard deviations, corresponding to a background fluctuation probability of 1.7 × 10^-9, is compatible with the production and decay of the Standard Model Higgs boson."
- "95% confidence level (CL)"
- "Figure 1 shows the expected local p-values ..."
- "Both the local and global p-values can be expressed as a corresponding number of standard deviations using the one-sided Gaussian tail convention."
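The last quote says p-values can be expressed as a number of standard deviations using the one-sided Gaussian tail convention. Here is a short Python sketch of that conversion in both directions, using only the standard library. The function names are my own.

```python
from math import erfc, sqrt
from statistics import NormalDist


def sigma_to_p(z):
    """One-sided Gaussian tail probability P(Z >= z)."""
    return 0.5 * erfc(z / sqrt(2))


def p_to_sigma(p):
    """Number of standard deviations whose one-sided tail probability is p."""
    # Use the lower tail and flip the sign to keep precision for tiny p.
    return -NormalDist().inv_cdf(p)


p_from_sigma = sigma_to_p(5.9)     # about 1.8e-9
sigma_from_p = p_to_sigma(1.7e-9)  # about 5.91
```

Plugging in 5.9 gives a tail probability of about 1.8 × 10^-9, in line with the quoted 1.7 × 10^-9; the small gap is what you would expect if the reported 5.9 was rounded from roughly 5.91.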
More Examples
If we look back a few more years in Economics (the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel), Chemistry, Physiology or Medicine, and Physics, we can find more examples (that I do not list here). I did not include Literature and Peace in my review, although it is possible there was some data analysis done in those areas. If we include other frequentist notions such as sampling and confidence intervals, and standard frequentist methods like Pearson/Spearman correlation, linear regression, logistic regression, ANOVA, large sample asymptotics, and bootstrapping, the counts would increase further. They would also increase if we counted other researchers who build on the Nobel laureates' work and use p-values, statistical significance, etc.
Also, I and others (Mayo comes to mind) are not convinced that researchers who leave p-values and statistical significance out of their published papers aren't using them. Researchers would have to use something like this if they are making claims about the strength (or lack of strength) of relationships, terms in models, differences in distributions, and so on. In other words, researchers could still be checking p-values and statistical significance "offline" or "behind the scenes", and then write up their paper and have it published without mentioning p-values and statistical significance, according to their personal tastes and/or arbitrary journal standards. Therefore, it is quite possible that the examples of p-values and statistical significance language here are an undercount.
Conclusion
It seems that learning about and using p-values, statistical significance, adjustments for multiple comparisons, and nonparametric frequentist tests might not hurt science, as critics claim, but instead help do really good science. If Nobel Prize winning science isn't good enough evidence to convince critics of the merits of p-values and statistical significance, I really don't know what is.
Thanks for reading.