A note on comparison tests

Team HB

#1: Post by another_jim »

Several of us coffee amateurs do a lot of coffee and coffee equipment testing. And occasionally, people misunderstand the results of such tests. So here is a short, plain-spoken primer on comparison tests:

Comparison Tests are about statements like these:
  • "90% like Brand A better than Brand X."
  • "For the treatment of vegetative myopathy, Simglobulon significantly outperformed placebos in the FDA approval trials."
  • "I can't tell the difference between properly frozen and fresh-roasted coffee when used for espresso."
In each of these statements, two things are being compared, and an assertion is being made about their differences or lack of it. How are such statements tested?

The basics are really easy, and frequently misunderstood. Suppose you are comparing A & B:
  • If A beats B every single time, the pattern becomes obvious in 3 to 4 trials, and pretty much incontrovertible after 7 to 10 trials.
  • But suppose A beats B only 11 times out of 20. Then the pattern only becomes obvious in 100 to 200 trials, and requires around 500 to 1,000 to become pretty much incontrovertible.
In other words, the tighter the race between A & B, the more races it takes to say for certain which one is better.

"Pretty much incontrovertible," "saying for certain," "beyond all reasonable doubt" ... statisticians have a phrase that means the same thing: "a statistically significant result." This does not mean the difference between A & B is big; it means enough trials have been run to say the difference, whatever it may be, exists beyond reasonable doubt. Statisticians put a number on this significance, typically 5%, 1%, or 0.1%. This means that their statement of asserted differences will be true 95 times out of 100, 99 times out of 100, or 999 times out of 1,000.
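This notion can be made concrete with a one-sided sign test: under the null hypothesis that A and B are equally good, each trial is a fair coin flip, and the p-value is the chance of seeing at least the observed number of wins by luck alone. A minimal sketch in Python (the function name is mine, not from the post):

```python
from math import comb

def sign_test_p(wins: int, trials: int) -> float:
    """One-sided p-value for `wins` out of `trials`, under the null
    hypothesis that A and B are equally likely to win each trial."""
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials

# A sweeping 5-for-5 result is already significant at the 5% level:
print(sign_test_p(5, 5))    # 0.03125
# ...and 7-for-7 clears the 1% level:
print(sign_test_p(7, 7))    # 0.0078125
# But 11 wins out of 20 is nowhere near significant:
print(round(sign_test_p(11, 20), 3))  # 0.412
```

The last line is exactly the "tight race" case above: 11 out of 20 could easily happen by chance, which is why so many more trials are needed before it means anything.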

So how many trials does it take to establish a significant difference? That depends on how much better A is than B. Here's a handy table:

Frequency of A Beats B     Trials needed for 5%  for 1% for 0.1%
   100%                                       5       7      10
    90%                                       8      11      14
    80%                                      11      19      27
    70%                                      21      37      65
    60%                                      80     145     245
    55%                                     300     560     980
    51%                                   7,000  14,000  24,000
So the next time you hear about a drug trial that used a cast of thousands, ask whether the drug company was being really thorough, or whether the test needed that many people because the drug's potency isn't all that different from a placebo's.
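The pattern behind the table can be sanity-checked with an exact power calculation: for a given number of trials, how often would a tester whose true preference rate is some win rate actually reach significance? A rough sketch (the exact construction of the table above isn't stated, so these numbers illustrate the same trend rather than reproduce it; function names are mine):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, win_rate, alpha):
    """Chance that n trials yield a result significant at level alpha,
    when A truly beats B a fraction `win_rate` of the time."""
    # Critical value: fewest wins that are significant under the fair-coin null.
    c = next(k for k in range(n + 1) if binom_tail(k, n, 0.5) <= alpha)
    return binom_tail(c, n, win_rate)

# Ten trials catch a 90% winner about three times out of four...
print(round(power(10, 0.9, 0.05), 2))   # 0.74
# ...but almost never catch a 55% winner:
print(round(power(10, 0.55, 0.05), 3))  # 0.023
```

In other words, the tighter the race, the more often a short test will simply miss a real difference, which is exactly why the trial counts in the table explode as the win rate approaches 50/50.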

As far as coffee tests are concerned, my guess is that people are hardly interested in making a change unless it beats their current setup at least 6 times out of 10. Unfortunately, even detecting that requires more trials (75 to 200) than any of us amateur testers can easily handle. So, if we discover no difference, the true split is somewhere between 30/70 and 70/30; and if we do discover a difference, it's because it's more extreme than that.

I hope this has been educational; and now back to our regular programming.
Jim Schulman

Supporter ♡

#2: Post by sweaner »

another_jim wrote:"For the treatment of vegetative myopathy, Simglobulon significantly outperformed placebos in the FDA approval trials."
Now Jim, you're not being paid by the Simglobulon people are you? :lol: You are so right regarding drug trials. The companies often quote "relative risk reductions" that look great, but absolute reductions are minuscule.
LMWDP #248


#3: Post by Ken Fox »


I'm so glad you posted this. Your succinct post is actually a good substitute for the "article" we talked about writing on the "Scientific Method," but which we never could get around to writing. Scientific study of common issues is both simpler and more complicated than it appears at first glance. In order to have even a chance of answering an important question, the question you are asking (and hence what you are studying) must be distilled down to something very simple and direct; otherwise, your "results" are apt to be as fuzzy as the question you were trying to answer.

I have tried to live by two major principles when it comes to coffee studies I have done and participated in. The first principle is to try to design a study that answers a question that real people want the answer to. An example of such a question would be, "does freezing so seriously damage coffee that using it as a method of preservation should be absolutely discarded, because the previously frozen coffee is obviously and grossly inferior to fresh, never-frozen coffee?" This is not an unreasonable question to ask, given what has been said by many notable people about the use of freezing to preserve coffee. Here, we aren't asking whether frozen coffee is 9% worse than fresh, never-frozen coffee; in order to answer THAT question, you would have to compare more espresso shot pairs than anyone I know would have the patience for.

The second principle has been to try to design the actual test to be as simple as possible, and few things are simpler than comparing two cups of espresso at the same time and coming up with a preference (or a lack of preference) by asking oneself the simple question, "which one is better?"

Statistics are useful for some things and not so useful for others. In the "not so useful" category I would place the subjective experience of tasting two cups of espresso at the same time, and coming to some sort of a conclusion on the overall comparison of a relatively small number of shot pairs. No doubt it is true that you would need a HUGE NUMBER of such comparisons to detect a small but real difference between the two conditions being tested (in this case, previously frozen coffee vs. never-frozen coffee). It is also true, however, that after one has had the chance to taste a much smaller number of such paired cups, and then to learn, right after forming an opinion, which cup was from frozen coffee and which was from fresh, an open-minded person is simply forced to acknowledge that he prefers one condition one time and the other condition the other time, and that in reality the inter-shot differences overpower the differences one observes between shot pairs.

Taking this "previously-frozen" vs. "fresh, never frozen" example a step further, a taster in such a study (Jim and myself in this case) is forced to conclude that (1) freezing doesn't do that much "damage" to coffee, certainly nothing within the magnitude alleged by its detractors; and, (2) it sure seems to do a pretty good job of preserving coffee, when what would otherwise be undrinkable (4 month old coffee left out exposed to the environment at room temperature) is virtually indistinguishable from fresh, never frozen coffee.

And with all its limitations, a small (repeat) study such as we (Jim and myself) have just completed proves, at least to US, that freezing is a very reasonable method for preserving coffee. Within the limits of how our coffee was handled, the previously frozen coffee is a close enough approximation to fresh that freezing should be considered a very reasonable way to extend the shelf life of coffee; certainly better than any other way we have to deal with coffee that would otherwise oxidize into oblivion if left at room temperature.

Getting back to Jim's above post, the study that we have just done proves the merit of what Jim is trying to explain. One benefits from considering statistics in the design of any study. If one expects or wishes to consider the possibility of a large difference being present in the two studied "conditions," the "thing" being studied has to be simple if one is going to detect the difference within a reasonable number of comparisons. If on the other hand one hopes to find a subtle difference, then this has implications for both the mechanics of the test itself, and for the feasibility of testing it in the first place. Another way of saying this is that one has to decide if a subtle difference is sufficiently important to detect, that it would be worth spending several weeks in tasting hundreds of shot pairs in order to detect it. In day to day espresso preparation, my palate is simply not THAT discerning that I'd be willing to invest the time and effort in trying to detect a difference of that magnitude. There are others here who may feel differently, and I'd encourage them to conduct such a study if they thought it was worth doing.

What, me worry?

Alfred E. Neuman, 1955

another_jim (original poster)
Team HB

#4: Post by another_jim (original poster) »

Actually, I think the sum of all the fresh-versus-frozen tests is enough to assert that fresh won't even beat frozen more than 60% of the time.

If the difference between two coffees is consistent and large enough to overcome shot to shot variation, it would score a win 100% of the time. Both Ken and I have invested a good deal of money and time in reducing that variation in our tests. So the result of not even 60% of fresh beating frozen means the actual difference is very small.
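A claim like "fresh won't beat frozen more than 60% of the time" is, in statistical terms, an upper confidence bound on the true win rate. A sketch of the exact one-sided (Clopper-Pearson style) bound; the 25-wins-in-50-trials figures below are made-up for illustration, not the actual test counts:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(wins, trials, alpha=0.05):
    """One-sided upper confidence bound on the true win rate: the largest
    rate still consistent (at level alpha) with seeing no more than
    `wins` wins in `trials`.  Found by bisection, since binom_cdf is
    strictly decreasing in p."""
    lo, hi = wins / trials, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(wins, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical numbers: even a dead-even 25-of-50 split still leaves the
# true win rate possibly as high as roughly 62% at 95% confidence.
print(round(upper_bound(25, 50), 3))
```

This is why a tester can rule out a large advantage with a modest number of trials while remaining unable to rule out a small one.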
Jim Schulman

Supporter ♡

#5: Post by GC7 »

That was a very good and PRACTICAL primer on the value and reality of statistics. P-values are really not of any practical value when the assays have such a high noise level.

I think the most telling statement Ken made here, and one that folks need to realize, is that statistical measurements and their predictions are easy when the assay you are using has little or no noise; that is, the value measured in each assay is exceedingly accurate. Once you get into an assay such as my interpretation of how good an espresso shot is (or whether it has caramel, chocolate, or blueberry flavors, etc.), the noise level is exceedingly high, and in fact the accuracy of the assay (my taste buds and my interpretation thereof) makes an answer with any accuracy impractical or impossible.


#6: Post by michaelbenis »


But which is better? :D
LMWDP No. 237


#7: Post by GVDub » replying to michaelbenis »

The one you paid more for, of course. :twisted:
"Experience is a comb nature gives us after we are bald."
Chinese Proverb