“Armiamoci e partite!” is a proverbial Italian phrase that can be roughly translated as “Let’s get armed, and then you go!”. It can be found in a poem by Olindo Guerrini:
«Tell us, why do you say “Let’s go!”
and then stay at home?
Why, far away from the blows and the conflicts,
comfortably suffer from putting on weight
while inciting the poor recruits
“Let’s get armed, and then you leave!”»
The “Prince of Laughter” Antonio de Curtis a.k.a. Totò added to this:
«I will follow you later.»
That’s basically Nature’s attitude in its postmodern call for a change in the use of statistics, published in its journal:
Nature 567, 305, 21 March 2019
In this comment, completely sound and agreeable points are glued together into an overall irresponsible and self-interested position, so we have to do a bit of work to disentangle the individual agreeable facts from the ideology that hides behind them. The devil is in the details, and we will go after them.
Before digging into it, for all those interested in how scientists misuse statistics let me suggest the great book by Alex Reinhart, Statistics Done Wrong, which is also available online, and from which I learnt most (but not all) of this subject matter. And just to make it clear, I am 100% with Reinhart that “a little knowledge of statistics is not an excuse to reject all of modern science. A research paper’s statistical methods can be judged only in detail and in context with the rest of its methods: study design, measurement techniques, cost constraints, and goals”. What this post aims at is discussing how Nature (the journal) places itself in this context.
* * *
“When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’? If your experience matches ours, there’s a good chance that this happened at the last talk you attended.”
Nice rhetorical strategy to establish a connection to the reader. However, the two experiences could not possibly be more different. The perception of a scientist who is evaluating the internal mechanisms of the work of a peer from his own or nearby communities is one thing. That of an editor who has no specific understanding of the subject matter, and who attends conferences to capture a second-degree “overall message” and to impose his own “vision of the future” on scientists, is quite another. Overemphasis on derivative forms of knowledge may hinder true understanding, and the external pressure exerted by publishers on communities of scholars may bias and derail research.
“We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.”
Yeah, so why do we do science after all, if things “actually” are one way or another to the naked eye? What does “actually” mean to Nature’s editors? Isn’t it the whole purpose of science to find rigorous methods (among which, statistics) that allow us to “see the invisible”, and at the same time to avoid seeing what is not there? Well, it seems like Nature’s editors have better criteria to propose, a new form of postmodern scientific method…
“How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see?”
Seriously? Are you kidding? Centuries after Galileo turned the telescope to the moon and the whole question of “what it means to see” kicked in, giving rise to epistemology as a discipline and shipping us into modernity (for good and for bad), a journal that has the ambition to call itself “Nature” now reveals to us that all we need to do is take a quick look at things, and they will show themselves for what they are. In an incredible twist, it’s the prejudice of the old boring scientific method (which uses statistics as one of its tools) that leads us astray. We should become children again, take off those glasses that make us blind, and see things for what they are, plain and clear!
Wow you really didn’t expect this from the leading scientific journal, did you?
The epistemological slovenliness of these few lines is disconcerting.
“For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome). Nor do statistically significant results ‘prove’ some other hypothesis.”
Definitely so, let’s keep teaching this for several more generations.
“Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.”
So here is Nature’s main point: let’s not deal with that huge and overwhelming amount of bad literature that is generated by misusing statistics in an audacious and nonconservative way, giving rise to false positives, truth inflation, and overstated claims (the kind of literature that has been shown to be selected for the worse in the top journals including Nature, which are known to have a huge problem with scientific reproducibility). No: let’s cherry-pick a few cases where over-conservative and shallow use of statistics has hindered “truth” (whatever that is…).
After all, Nature being a publishing company that makes a profit by selling academics the results of their own work with little to no editorial work (well, at least Nature re-draws the pictures – and they are damn nice!), it’s not surprising that its interest in the whole P-value discussion is to turn it into an opportunity to loosen things up and deregulate scientific publishing a little bit further. It is this subtle logical twist that may have gone unnoticed by the 800 signatories of Nature’s appeal (see below), and that I want to make evident.
“We have some proposals to keep scientists from falling prey to these misconceptions.”
Good! So let’s see what these proposals are.
* * *
“Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P-value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not.”
None of the old boring statisticians with glasses has ever taught anything like that.
(A little reminder: the P-value is the probability of observing an effect at least as extreme as the one measured, assuming that a “null hypothesis”, which is the most reasonable neutral scenario, is true. It is not the probability that the null hypothesis itself is true. For example, if we test whether a coin is biased it’s reasonable to take 50%-50% as the null hypothesis; if we study the effect of a drug we might compare it to the placebo effect, or to the best drug already available on the market. Building the null hypothesis is not easy, and it involves a lot of subjective trimmings that will hide in the statistical analysis. If the P-value goes below a conventional value (say, 5%), then the result is called “significant”. This 5% can be interpreted as the rate of false positives, among the cases where there is really no effect, that is tolerated by whoever wishes to publish results (notice that deciding whether 5% makes sense is an editorial decision: here Nature is re-positioning itself, which is perfectly legit – if only they could take some blame and explain why they have to refocus!). It would be best if this were the maximal rate of false positives tolerated – that is, if good practices beyond reporting the P-value kept the rate of false positives way below this threshold. Unfortunately, bad practices make 5% the minimal rate of false positives. Cautious estimates show that not publishing negative results, and other systematic biases due to the pressure to publish, may boost the share of false positives among published results to an actual 50%: true negatives being sold as positives. So, for example, is Nature willing to publish negative results to create more reliable science?)
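To make the coin example concrete, here is a minimal sketch of an exact two-sided P-value under the 50%-50% null hypothesis. The function name and the 60-heads-in-100-flips numbers are my own illustration, not something taken from the Nature comment.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided P-value for the fair-coin (50%-50%) null hypothesis."""
    # How far the observed count is from what a fair coin gives on average.
    observed_distance = abs(heads - flips / 2)
    # Count the equally likely head/tail sequences whose number of heads
    # is at least that far from the middle, on either side.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    return extreme_sequences / 2 ** flips


p = binomial_two_sided_p(60, 100)
print(f"60 heads in 100 flips: P = {p:.4f}")
```

With these illustrative numbers the P-value lands just above 0.05 (roughly 0.057): a coin that looks visibly lopsided to the naked eye, and that still sits on the “non-significant” side of the conventional threshold.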
But if two studies on the same subject come out one significant and one not, and they only report the P-value, there’s not much you can do about it. The only message you can draw is that that scientific community should come up with a better common strategy to plan their trials or experiments and give more convincing results on that same subject matter. But there is no way one can re-use those data: it’s called “double-dipping” and it is a source of systematic error (the whole exploding literature of systematic reviews is affected by this problem – including, therefore, this very article of Nature, where a ludicrous “systematic analysis” consisting of only two articles is conducted…).
“These errors waste research efforts and misinform policy decisions.”
My opinion: what wastes research effort is the pressure to publish fast, more, and in higher-impact journals, in an environment of perpetual competition between groups. These groups are not incentivized to collaborate towards one common goal, share data, plan strategies, and produce “powerful” experiments. Rather, they are incentivized to atomize research into tiny fractions of under-powered experiments and trials that have insufficient sample sizes and that are doomed to generate insignificant results, because each individual group – and each individual in the group – has to publish their own thing to constantly update CVs, in a system of career evaluation where Nature and other publishers retain a monopoly over such a delicate thing as “scientific reputation”.
Does Nature have anything to say about this?
“For example, consider a series of analyses of unintended effects of anti-inflammatory drugs. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome. Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).”
Here Nature’s strategy is to go technical, so as to make it difficult to counter the arguments and to distract from the systematic error they commit, which is cherry-picking their favorite case study of a “true positive” that has been made into a “false negative”. I don’t think cherry-picking visible cases does any good to the cause of better use of statistics: maybe we should be more interested in the overwhelming number of hidden cases of “true negatives” being made into “false positives” by the pressure to publish.
I cannot really evaluate the case under examination, and that’s not the point. But I am dazzled by the P-value P = 0.0003 (which is remarkably low for an epidemiological study). Is it plausible given the sample size of 32 602 and the hypothesis being tested? I don’t know: the statistical analysis in the original paper is complex and intertwined with clinical considerations. The danger of overly advanced statistics is that errors may hide here and there, for example in the hundreds of trimmings of the null hypothesis, and that one could just resort to the statistical tool one favours to turn things the way one wants. In any case the readership will trust the analysis, because real statisticians rarely check on the use of statistics in scientific publishing. And why should they? They are too busy creating even more advanced tools…
So, establishing a serious pipeline of statistical-peer-review would definitely be up to the editor if he cares about his market-share. Here is a constructive proposal to Nature (so that I won’t be pointed at as just a troublemaker): hire statisticians to create a third-party pipeline that systematically reviews submitted papers’ consistency with claimed P-values (and design power). If we have to pay to buy our own research, let’s get some added value at least (apart from the nice figures)!
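The most basic step of such a consistency check can be sketched in a few lines: given a reported risk ratio and its 95% interval, recompute the P-value and compare it with the one claimed. This is a minimal sketch assuming a normal approximation on the log scale (my assumption, the usual textbook shortcut, and not necessarily what the original papers or Nature did); the function name is hypothetical, and the inputs are the risk ratio of 1.2 and the two intervals quoted above.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def p_from_ratio_ci(ratio: float, lower: float, upper: float,
                    level: float = 0.95) -> float:
    """Two-sided P-value against 'ratio = 1', rebuilt from a reported interval.

    Assumes the interval is symmetric on the log scale (normal approximation).
    """
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)          # about 1.96 for 95%
    standard_error = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(ratio) / standard_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


# Intervals quoted in the Nature comment: -3% to +48%, and +9% to +33%.
p_later = p_from_ratio_ci(1.2, 0.97, 1.48)
p_earlier = p_from_ratio_ci(1.2, 1.09, 1.33)
print(f"non-significant study: P = {p_later:.3f}")
print(f"earlier study:         P = {p_earlier:.4f}")
```

Under this approximation the two values come out close to the P = 0.091 and P = 0.0003 that Nature reports as “our calculation”, which suggests that those numbers were derived from the intervals in much the same way. It says nothing about whether the intervals themselves were computed properly, and that is the part a real statistical review would have to check.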
In the absence of such a serious chain of statistical checking, the P-value is the simplest thing anybody can understand and evaluate (if statistical training is so poor that even that is not understood, how do you expect that reporting confidence intervals – if even possible – might help?).
But of course the P-value is not all. Coming back to the cherry-picked controversy above, one way towards making a third-party judgement would be to have the second-simplest statistical tool. The power (see Chapter 2 in Statistics Done Wrong) is a tool for meta-analysis which is hardly ever reported. It is the probability that, assuming an effect of a given size is real, the experiment reaches a given P-value threshold. While it is a little tricky to calculate, it’s not crazy difficult if one puts one’s mind to it.
This probability basically depends on the hypothesis to test and on the sample size. So, if for example the probability of reaching P = 0.0003 comes out at 99%, you can pat the authors on the back and say: this was a well-designed experiment, congratulations! If it’s 50%, you’ll say: damn, you got lucky! And if it’s 1%, well, you know there’s something fishy going on. Maybe they did 100 such experiments and they just reported on the one that came out significant?! Nobody knows, because Nature & Friends certainly did not publish the other 99 papers.
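To show that the calculation really is not crazy difficult, here is a minimal sketch of the power of a two-sided z-test. It assumes the estimate is normally distributed with a known standard error, and that we commit in advance to a hypothesized true effect (plugging in the observed effect afterwards would be circular). The function name and the numbers in the loop are hypothetical illustrations of mine.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided_z(true_effect: float, standard_error: float,
                      alpha: float = 0.05) -> float:
    """Probability that a two-sided z-test reaches P < alpha,
    if the real effect equals true_effect."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_effect / standard_error  # effect measured in standard errors
    # Significant either in the right direction or (rarely) in the wrong one.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical design: unit-variance measurements, assumed true effect of 0.2,
# so the standard error of the mean shrinks as 1 / sqrt(n).
for n in (50, 200, 800):
    se = 1.0 / n ** 0.5
    print(f"n = {n:4d}: power = {power_two_sided_z(0.2, se):.3f}")
```

The same function with a stricter `alpha` answers the question asked above: how likely was it, by design, to reach a P-value as small as the one reported?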
Thus knowledge of the power would probably solve the controversy over these two papers, as (assuming everything was done properly) the power calculation is an indicator of the quality of the experiment. If you don’t know the power or other similar measures of quality, you cannot make any such claim. So, going back two lines, notice that Nature makes an assessment of the quality of the study based on the P-value, and not on any structural property that qualifies and quantifies the study design:
“That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).”
You cannot claim that the study was more precise because it was more significant! That is mixing a judgement on the quality of the study with a judgement on the result of the study. The result could just be due to luck, and the whole point of science is that we want to factor out luck! Unless we become children again and “take off the glasses”…
If hiring real statisticians were “impractical” for Nature, a good first strategy on their side would be to require that the power of the study be calculated and reported along with the P-value. This would also force people to think a bit more consciously about what a P-value is at all. At the same time it would create a higher, simple, universal standard, inducing people to design their experiments better, maybe to collaborate to reach reasonable sample sizes, and definitely reducing the number of publishable papers by a considerable amount.
But this is not among Nature’s proposals. They go for something more postmodern.
“It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see ‘Beware false conclusions’). These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).”
This is definitely a mistake: finding no significance does not imply that the hypothesis being tested is not true.
But is this the real problem after all? We have so many hypotheses to test: modern sampling tools allow entire fields to basically generate random hypotheses by the millions (genetic correlations? neural patterns? metabolic pathways? correlations between dietary patterns? you name it…). If we don’t impose more rigid standards and provide tools to thin out what is plausible, and to disqualify useless hypotheses somehow, if we keep everything always open, even those results that have been deemed insignificant by very generous standards, how are we to ever make progress?
Even high-energy physics had this problem back in the ’70s: they were generating way too many hypotheses of new particles and interactions compared to the 3-sigma threshold then in use (one order of magnitude higher than the P = 0.0003 reported above…), and the new machines were observing lots of them just because there were so many. What they did was not to abolish the P-value and the notion of statistical significance altogether, but to make it much stricter, establishing the present 5-sigma standard.
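For readers who do not think in sigmas, here is a minimal sketch of the conversion between “n sigma” and P-values for a normally distributed test statistic, together with the number of false alarms one should expect when many true-null hypotheses are tested. The figure of 1000 hypotheses is purely illustrative, and the function names are mine.

```python
from math import erfc, sqrt


def one_sided_p(n_sigma: float) -> float:
    """Tail probability beyond n_sigma on one side of a standard normal."""
    return 0.5 * erfc(n_sigma / sqrt(2))


def two_sided_p(n_sigma: float) -> float:
    """Tail probability beyond n_sigma on both sides of a standard normal."""
    return erfc(n_sigma / sqrt(2))


HYPOTHESES_TESTED = 1000  # illustrative number of true-null hypotheses

for n_sigma in (3, 5):
    p_two = two_sided_p(n_sigma)
    print(f"{n_sigma} sigma: one-sided P = {one_sided_p(n_sigma):.2e}, "
          f"two-sided P = {p_two:.2e}, "
          f"expected false alarms in {HYPOTHESES_TESTED} tests = "
          f"{HYPOTHESES_TESTED * p_two:.4f}")
```

At 3 sigma (two-sided P of about 0.0027) a thousand empty hypotheses are expected to yield a couple of “discoveries” by chance alone; at 5 sigma that expectation drops below one in a thousand.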
“In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”.”
Yep, they write “don’t say ‘statistically significant’”. They don’t write “don’t say ‘statistically insignificant’”. Turning this sentence the other way around is an obvious example of the dishonest twist Nature gives to this whole issue.
“Another article with dozens of signatories also calls on authors and journal editors to disavow those terms.”
The first postmodern proposal of Nature is to rename things. This is often the case in our society, where complex social problems are dealt with by creating a linguistic taboo and changing words, and not by establishing better practices. “To change everything in order for nothing to change”, says another famous Italian sentence from Il Gattopardo.
“We agree, and call for the entire concept of statistical significance to be abandoned. We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories — all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a “surgical strike against thoughtless testing of statistical significance” and “an opportunity to register your voice in favor of better scientific practices”.”
Of course bad use of statistics is commonly agreed upon to be a huge problem, and impulsively I would also sign a petition to make things better (and to end war, poverty, carbon emissions etc.). But I wonder if all signatories took time to reflect on the particular twist Nature gave to this issue.
“We are not calling for a ban on P-values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis.”
“We agree, and call for the entire concept of statistical significance to be abandoned”; “We are not calling for a ban on P values”. Wow that’s almost symptoms of bipolar disorder! Jokes aside, what exactly is Nature proposing? What is Nature going to do? Do something for god’s sake!
* * *
“The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different [6–8]. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.”
Yep, humans tend to categorize too much, especially when all social pressures push them to categorize, by quantifying the impact factor of the journals they publish in, optimizing the H-index, etc. Proposal to Nature: quit your own obsession with maximizing the impact factor! Be the first to set a good example. That’s the only way to teach a lesson: by example.
“Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature.”
Indeed. So, again, what is Nature going to do about it?
“Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased.”
This is a very subtle example of the logical mistake of mixing causation and correlation. Here Nature suggests that the bias is intrinsic to the tool (the P-value), which somehow on its own pushes for more false positives and for fewer true negatives. But the tool is obviously neutral (I’ve never seen P-values threatening PhD students). So its misuse is due to social pressures that may bias the very way we conduct scientific discussion. Using the same logic I could just rewrite the last sentence, swapping “estimates chosen for their significance” for something else: “Consequently, any discussion that focuses on papers published in Nature will be biased”. But while I do believe that Nature is part of the problem, it certainly is not all of the problem.
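The statistical fact in the quoted passage (what Reinhart calls truth inflation) is easy to reproduce, and the reproduction makes clear where the bias enters: at the moment somebody selects results by their significance, not in the P-value itself. Below is a minimal simulation sketch; the true effect, the standard error, the number of studies and the seed are all hypothetical choices of mine, describing an under-powered design.

```python
import random
from statistics import NormalDist, mean


def simulate_significance_filter(true_effect=0.2, standard_error=0.15,
                                 n_studies=20000, alpha=0.05, seed=0):
    """Simulate many identical studies and average their estimates,
    overall and split by whether they reached P < alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = [rng.gauss(true_effect, standard_error)
                 for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e) / standard_error > z_crit]
    non_significant = [e for e in estimates
                       if abs(e) / standard_error <= z_crit]
    return mean(estimates), mean(significant), mean(non_significant)


all_mean, sig_mean, nonsig_mean = simulate_significance_filter()
print(f"true effect:                0.200")
print(f"mean of all estimates:      {all_mean:.3f}")
print(f"mean of significant ones:   {sig_mean:.3f}")
print(f"mean of non-significant:    {nonsig_mean:.3f}")
```

Averaged over all studies the estimates are unbiased. Keep only the significant ones, as a journal hungry for positive results would, and the average overshoots the true effect; keep only the rest and it undershoots.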
“On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.”
Indeed: overanalysis and choosing the statistical test at will would create even worse troubles. That’s why the P-value was agreed upon as a universal tool, simple enough for all. What higher standards of acceptance is Nature going to impose?
“The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan [9]. This occurs even with the best of intentions.”
Pre-registration is a great tool, though not perfect. Where are Nature’s pre-registration protocol and requirements? Here we find: “Authors who wish to publish their work with us have the option of a registered report.” It’s an option, not a requirement; and there is no real incentive. It’s just a possible opportunity, like so-called ‘open-access’ and all those tools of empowerment that the industry is more than willing to have a share of.
“Again, we are not advocating a ban on P-values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors. One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30. Whether a P value is small or large, caution is warranted.”
Again some of this completely gratuitous technical discourse that is all smoke in the eyes. Yet no proposal, apart from very vague “we should not treat them categorically”.
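For what it’s worth, the numbers in the quoted example can be checked instead of taken on trust. Here is a minimal sketch under a plain normal model (my assumption; the comment does not say which model the authors had in mind): two independent studies, each with 80% power to reach P < 0.05 in a two-sided z-test. The names are mine.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(threshold: float, shift: float) -> float:
    """Chance that a two-sided z-test gives P < threshold when the true
    effect sits `shift` standard errors away from the null."""
    z = STD_NORMAL.inv_cdf(1 - threshold / 2)
    return STD_NORMAL.cdf(shift - z) + STD_NORMAL.cdf(-shift - z)


# Shift giving 80% power at the 0.05 threshold (the wrong-direction tail is
# neglected when choosing it, which changes the power by about one part in
# a million).
shift_80 = STD_NORMAL.inv_cdf(0.975) + STD_NORMAL.inv_cdf(0.80)

p_small = prob_p_below(0.01, shift_80)        # one study reaches P < 0.01
p_large = 1 - prob_p_below(0.30, shift_80)    # one study ends with P > 0.30
p_pair = 2 * p_small * p_large                # one of each, in either order

print(f"power at 0.05:          {prob_p_below(0.05, shift_80):.3f}")
print(f"P(P < 0.01):            {p_small:.3f}")
print(f"P(P > 0.30):            {p_large:.3f}")
print(f"P(one of each in pair): {p_pair:.3f}")
```

In this model a single such study lands below 0.01 about 59% of the time and above 0.30 about 4% of the time, so a pair splitting that way turns up in a bit under 5% of cases. Whether that counts as “not very surprising” is left to the reader; either way, it is an argument for larger samples and reported power, not for dropping thresholds.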
“We must learn to embrace uncertainty.”
How nice! But I would rather rephrase this as: “We will support uncertainty (= freedom) in all the preparatory phases of the experiment, and we will enforce higher certainty standards in the communication of results”. We should not allow publishers to interfere with the creation of scientific hypotheses!
“One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence.”
Again with this renaming business. I can hardly confuse “confidence” with “certainty”, and actually “compatibility” sounds to me like a much more dichotomous word than “confidence”, which was more human and humble. But maybe it’s just me.
“Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.”
We are heading towards the grand finale. This is Nature’s new protocol: to encourage authors to write even more words so they can bullshit* their way through publication in a more postmodern way.
“We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials.”
Hear the voice of the master! All of a sudden the tone of the article goes from friendly to patronizing. Sounds like: You are working for us, you are doing a bad job, and we are sick and tired of it: behave! Zero self-criticism.
“An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.”
What is “high practical importance” if we cannot discern it with the tools of the scientific method? Again, by the naked eye? What does it mean to “deem”: isn’t the whole scientific process an attempt to get rid of personal subjective opinions?
Furthermore, notice that here Nature is asking you not to discard values that are neither statistically significant nor relevant in your opinion – you just have to apply some cosmetics to the words here and there. They really don’t want to give up any asset.
“When talking about compatibility intervals, bear in mind four things. First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. In fact, values just outside the interval do not differ substantively from those just inside the interval. It is thus wrong to claim that an interval shows all possible values.”
On the same page as above.
“Second, not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits. This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval.”
Again, bullshit* your way through and we’ll give you a pass.
“For example, the authors above could have written: ‘Like a previous study, our results suggest a 20% increase in risk of new-onset atrial fibrillation in patients given the anti-inflammatory drugs. Nonetheless, a risk difference ranging from a 3% decrease, a small negative association, to a 48% increase, a substantial positive association, is also reasonably compatible with our data, given our assumptions.’ Interpreting the point estimate, while acknowledging its uncertainty, will keep you from making false declarations of ‘no difference’, and from making overconfident claims.”
Bla bla bla. Again, rephrasing. How is this going to help make a point?
“Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision. A different level can be justified, depending on the application. And, as in the anti-inflammatory-drugs example, interval estimates can perpetuate the problems of statistical significance when the dichotomization they impose is treated as a scientific standard.”
See how Nature is very good at riding the wave of statistical unrest: yes, 95% is not the probability that this particular interval contains the *true* value (whatever that is). So what is it according to Nature, and how is Nature going to make this fundamental, defining property of the confidence interval into an actual policy? No clue given.
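Since Nature gives no clue, here is the textbook answer in the form of a minimal simulation sketch: the 95% is a property of the procedure over many repetitions, not of any single interval. The sketch assumes normally distributed measurements with a known standard deviation, and every number in it (true mean, sigma, sample size, seed) is a hypothetical choice of mine.

```python
import random
from statistics import NormalDist


def coverage(true_mean=1.0, sigma=2.0, n=25, level=0.95,
             n_experiments=10000, seed=1):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sigma / n ** 0.5  # known-sigma z-interval
    hits = 0
    for _ in range(n_experiments):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The interval is sample_mean +/- half_width; check whether it
        # covers the true mean.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / n_experiments


print(f"fraction of intervals covering the true value: {coverage():.3f}")
```

Each individual interval either contains the true value or it does not; what the 95% promises is that the recipe succeeds in about 95 out of 100 honest repetitions, provided the assumptions hold.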
“Last, and most important of all, be humble:”
“compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty. Make these assumptions as clear as possible and test the ones you can, for example by plotting your data and by fitting alternative models, and then reporting all results. Whatever the statistics show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical. Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.”
Of course, in principle this is all right. But is it doable in practice if Nature and other journals do not raise their own standards? Abandoning the P-value in favour of more refined statistics, emphasis on discursive analysis, etc. may have a terrible impact on that unregulated shithole that is the marketplace of scientific ideas. We don’t even have statisticians checking on the credibility of simple statistical tests, so how will we ever be better placed when even more discursive blabla kicks in?
“The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance.”
Yet another level of postmodernism. The question of regulatory, policy and business use of scientific knowledge should be completely external to the mechanisms of the production of scientific knowledge itself. Keep this whole stuff away, don’t even weigh in on this argument, or science is doomed to become an advanced product of the industry of opinions (as is nearly every aspect of the information business…). And my god, do policy-makers need some categorical yes/no inputs! If scientific facts also become liquid, these guys will be allowed to just say anything they want, whenever they want.
“Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.”
But this is not an editor’s problem! The discursive flexibility in shaping one’s own research hypothesis and in trusting one’s own intuition is definitely the most fun part of science: the long and artisanal way of constructing and designing an experiment. But editors: keep away from it, it’s none of your business! Early preparation should not be published. Only the final, less poetic study results should be published, in the most dichotomous way possible.
A wonderful example of this is given by the physicists of the gravitational wave community, who just earned a Nobel prize. When they receive a signal, before opening it they go through a highly creative and flexible “playtime” period in which they decide how to analyze the signal. Once that is agreed upon, they first write the full paper, with all the details of the analysis and everything in place for a dichotomous message. Only then do they open the signal box. No flexibility whatsoever is allowed after that moment, and it all comes down to the very dichotomous: yes we have seen it / no we have not (and in fact they do not publish in Nature).
The problem with pressure-to-publish is that it does not allow people to clearly separate these two phases, and it forces people to publish even speculative attempts that belong in the “playtime” area of their activities, but are not yet real science. Rushing results undermines the possibility of even more robust results later on, because of the “double-dipping” problem hinted at above. Unless one just hides what he’s doing…
“What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals.”
“We hope”… Armiamoci e partite!
“They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13)”
For what purpose? To give a semblance of rigor? And why require the second decimal digit, when we could choose other systems for representing numbers? In base ten, the second decimal digit can only be achieved if more statistics are collected. Is Nature going to enforce standards (e.g. on the power of studies) by which the second decimal digit in base ten is reached?
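To make the point about digits a bit more tangible, here is a minimal Python sketch. It is my own toy illustration and not anything taken from the Nature comment: the function names, the true effect of 0.5 standard deviations, the sample size of 30 and the known unit variance are all arbitrary assumptions. It repeats the very same experiment many times and looks at how much the P value of a simple two-sided z-test moves from one replication to the next.

```python
import math
import random
import statistics


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test P value for H0: mean = 0, assuming a known sigma."""
    n = len(sample)
    z = statistics.fmean(sample) / (sigma / math.sqrt(n))
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def replicate_p_values(true_effect=0.5, n=30, replications=1000, seed=0):
    """Repeat the same hypothetical experiment and collect its P values.

    All defaults are illustrative assumptions, not values from any real study.
    """
    rng = random.Random(seed)
    return [
        two_sided_p([rng.gauss(true_effect, 1.0) for _ in range(n)])
        for _ in range(replications)
    ]


if __name__ == "__main__":
    ps = sorted(replicate_p_values())
    p10 = ps[len(ps) // 10]       # roughly the 10th percentile
    p50 = ps[len(ps) // 2]        # roughly the median
    p90 = ps[9 * len(ps) // 10]   # roughly the 90th percentile
    print(f"10th percentile: {p10:.4f}")
    print(f"median:          {p50:.4f}")
    print(f"90th percentile: {p90:.4f}")
```

Under these assumptions the P values of identical experiments spread over orders of magnitude, so a second decimal digit quoted from a single run says little unless the study is powered well enough to pin it down.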
“— without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.”
No, they will spend less time trying to have their software spit out some number whatsoever, and more time trying to have their prose spit out some rhetorical figure whatsoever.
“Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea.”
“Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community.”
Again: Armiamoci e partite! Notice that Nature does not consider itself as part of the scientific community, and rightly so.
“But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.”
So, to conclude, Nature will not do anything at all, will not impose higher standards, will not coordinate with other public institutions to make pre-registration of clinical trials mandatory, will not hire statisticians to do what an editor is supposed to do: editorial work. Because of course they don’t want to change even a comma in their business model. Nature’s only real concern is to appeal to the audience who had their false negatives rejected or contested (which is a noble intent, I admit), but Nature has hardly anything to say about the monstrous problem of false positives and truth inflation, which infest its own journals. And the way they propose to attack this is by asking people to rename things and to renounce quantitative measures in order to focus on qualitative discourse, thus buying into the internal narratives of the preparatory phases of scientific work, without setting any quantitative standard to evaluate those results. No shadow whatsoever of self-criticism: the call is on the scientific community. That this journal serves so well.
* Bullshit is an officially scientific word. In physics corridors and in all those informal situations at conferences and workshops where the ear of the friendly editor is not allowed, Nature has a reputation for being a receptacle of inflated claims and of self-referential communities. People writing papers for Nature focus much more on story-telling than on the actual message; they are obsessed with getting nice pictures and making connections to fads and fashionable topics; it is crucial to be able to write in proper English and to have an elegant exposition, which creates an obvious bias against researchers from non-English-speaking countries. But this is my perception, and it’s not scientific.