Sir Ronald Fisher in England, who pretty much invented statistical significance testing, came across a woman who claimed she could tell whether her afternoon tea had been poured tea first or milk first. He gave her eight cups, four prepared each way, and randomized the order of presentation. She got all eight correct. Chances of this with random guessing were 1 in 70.
LOL! Excellent example. Sorry for the diversion, but this is a good illustration of the weakness of null hypothesis significance testing (NHST), and one I will use in my teaching. The main charge against NHST in psychology is now being led by Geoff Cumming, who published a good primer on why NHST is inappropriate in many cases in science. I am reproducing a figure from the article, which I think you can read freely in the journal Psychological Science here
Basically, the figure shows that for any given experiment, even one with a medium-strength effect (which is considered good in behavioral data), we can get results that vary widely in their statistical "significance" (scare quotes used intentionally), from p = .001 to p = .75. A p value < .05 is traditionally considered statistically significant, meaning a result at least that extreme would be expected only about once in 20 times if there were no real effect. 1 in 70 is p = .014, which is a level of statistical significance that would have me and my students jumping for joy in the lab.
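For anyone who wants to check the 1-in-70 figure, here is a minimal Python sketch. It assumes the taster knows that exactly four of the eight cups were milk first, so a pure guess amounts to picking one of the possible sets of four cups. The variable names are mine, chosen for illustration.

```python
from math import comb

cups_total = 8        # cups presented
cups_milk_first = 4   # cups prepared milk first (the rest were tea first)

# Number of distinct ways to pick which 4 of the 8 cups are "milk first".
arrangements = comb(cups_total, cups_milk_first)

# Only one of those picks gets every cup right, so under pure guessing:
p_all_correct = 1 / arrangements

print(arrangements)              # 70
print(round(p_all_correct, 3))   # 0.014
```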
What this all means for things like blind or double-blind testing of coffee and espresso, etc., is that without numerous replications and meta-analysis, we cannot know whether the results we are getting are "true" per NHST. We may be getting successes when there is no real difference but, most importantly, if we fail to detect a difference, it could simply be a chance miss in that one experiment, whereas the next replication may succeed. Thus, NHST gives us little confidence in any one result without replication and meta-analysis. Geoff's argument is that estimating confidence intervals gives us a much richer information base for interpreting the strength and importance of any single result. But that is getting into extremely esoteric ideas better left for another day.
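To make the replication point concrete, here is a rough simulation sketch in Python. Everything in it is an assumption for illustration and is not taken from the figure: two groups of 32, a true difference of half a standard deviation, a known standard deviation of 1 (so a simple z-test can stand in for a t-test), and 25 replications. The function names are hypothetical.

```python
import math
import random

def two_sided_p_from_z(z):
    """Two-sided p value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_p_values(n_per_group=32, effect_size=0.5, replications=25, seed=1):
    """Rerun the same two-group experiment many times and collect the p values.

    Assumed setup: both groups are normal with a known sd of 1, and the
    true difference between group means is `effect_size`.
    """
    rng = random.Random(seed)
    standard_error = math.sqrt(2.0 / n_per_group)  # sd = 1 in both groups
    p_values = []
    for _ in range(replications):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        mean_diff = sum(treated) / n_per_group - sum(control) / n_per_group
        p_values.append(two_sided_p_from_z(mean_diff / standard_error))
    return p_values

p_values = simulate_p_values()
significant = sum(p < 0.05 for p in p_values)
print(f"smallest p = {min(p_values):.3f}, largest p = {max(p_values):.3f}")
print(f"{significant} of {len(p_values)} replications reached p < .05")
```

Under these assumed settings the test has roughly a 50% chance of reaching p < .05 on any single run, even though the effect is real in every run, so a single "failure" says very little by itself.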
This is not to diminish the importance of the double-blind and triangulated tasting protocols that some of the more scientifically inclined folks around here use, but rather to temper our interpretation of those results if they are based on NHST.
Reference: Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25