Best method for comparing home espresso machines?

User avatar
HB
Admin
Posts: 16826
Joined: Apr 29, 2005, 9:13 pm

Postby HB » Jan 18, 2017, 9:16 pm

For years I've been testing espresso machines and grinders and writing reviews. One of the highlights of the research phase is a group taste test, frequently held at Counter Culture's training center in Durham, NC. The tasting itself is done blinded, i.e., we discreetly mark the bottom of one cup and hand the taster two espressos. They simply have to pick their favorite and place that cup on the "winner" side. In the end, we typically have at least 8 pairs. A few times it's ended in a near draw. Only once has there been a complete blowout where one was picked unanimously. Most of the time, there's a clear winner, but it's far from a landslide victory.

In preparation for the tasting, we've tried different formats to isolate variables. For example, we'll use the same grinder type, coffee, basket, temperature, and dose. We also agree to the same brew ratio. In the past, some have argued that by not exploiting a particular piece of equipment's unique features (e.g., pressure profiling for the Vesuvius), we effectively handicap one competitor by "normalizing" acceptable output.

This past Friday, we held a group taste test as part of the Profitec Pro 800 Review. Cognizant of the aforementioned concern, we changed the format slightly: While the coffee, basket, dose and so on were the same, each barista was given 20 minutes to dial in the best espresso they could, with whatever brew ratio or grind setting they deemed appropriate.

Once the final results were tallied, a few participants expressed concerns that the comparison was not valid.

I agreed with them that when cupping coffees, it's absolutely essential to adhere to a precise recipe. But if testing real-world usage of an espresso machine, isn't it valid to compare the results from two espresso machines that were independently dialed in by competent baristas? I argued that allowing two baristas, both of whom are intimately familiar with the equipment before them, sufficient time to dial in a coffee more accurately reflects real-world usage than a rigid protocol that may favor one machine and disfavor another.

Maybe, maybe not. Perhaps we ended up comparing the skills of the two baristas. Or maybe it was preference bias that favored one style of preparation. Or maybe it was a perfectly valid real-world comparison. What do you think?

[Image]
Final score: 6-2. No, it wasn't close. Was the competition fair?
Dan Kehn

User avatar
keno
Posts: 1117
Joined: Feb 26, 2006, 10:22 pm

Postby keno » Jan 18, 2017, 9:49 pm

Interesting points, and this is similar to the difference between efficacy and effectiveness in medical research. How much should you control all potential variables in a study? Efficacy trials compare treatments under ideal circumstances, while effectiveness trials compare treatments under less-than-ideal, real-world circumstances. By controlling for more variables, efficacy trials are better poised to answer whether one treatment is better than another (under ideal circumstances), while effectiveness trials have the advantage of generalizing better to the real world. The FDA is more interested in efficacy because it wants to know for regulatory purposes whether a drug really works, while policy makers tend to be more interested in what's actually going to work better in the real world.

I see the merits of both when it comes to equipment testing, and it would be interesting to see how (or if) the results would differ when conducting the experiment both ways. With the design you're using, the problematic variable now is the skill of the barista in dialing in. Ideally you'd do a highly controlled (same recipe) study and then do the dialing-in version in which you also switch the baristas. I think that's the only way you're really going to know. So c'mon, get back to work! :wink:

User avatar
[creative nickname]
Posts: 1472
Joined: Jun 26, 2013, 12:25 am

Postby [creative nickname] » Jan 18, 2017, 10:17 pm

Like Ken I'd say both kinds of comparison are valuable. "Best shot" comparisons can help highlight that different machines might have different strengths. Of course, this might also illustrate the extent to which preferences for a machine are actually driven by preferences for a particular style of shot. All-other-things-being-equal comparisons can help illustrate the ways in which the machines themselves produce differences, but might mask the fact that such comparisons might require one or the other machine to operate outside of its ideal zone. In the end, either kind of comparison is most informative when it includes thoughtful tasting notes from evaluators, rather than just a vote for which machine was "better" in a subjective sense.
LMWDP #435

User avatar
jeffb
Posts: 65
Joined: Apr 05, 2012, 9:14 pm

Postby jeffb » Jan 18, 2017, 11:12 pm

[creative nickname] wrote:Like Ken I'd say both kinds of comparison are valuable. "Best shot" comparisons can help highlight that different machines might have different strengths. Of course, this might also illustrate the extent to which preferences for a machine are actually driven by preferences for a particular style of shot. All-other-things-being-equal comparisons can help illustrate the ways in which the machines themselves produce differences, but might mask the fact that such comparisons might require one or the other machine to operate outside of its ideal zone. In the end, either kind of comparison is most informative when it includes thoughtful tasting notes from evaluators, rather than just a vote for which machine was "better" in a subjective sense.


Exactly.....you might make an informed decision based on both tests!

MNate
Posts: 92
Joined: Jan 08, 2016, 9:30 pm

Postby MNate » Jan 18, 2017, 11:39 pm

I like the idea of seeing what the machine is good at, though I don't like that it does sound a bit more like testing the barista than the equipment. I'd be more likely to buy a given machine if what it excelled at was the sort of thing I liked. If comparing two machines that take the same approach to espresso, then the same recipe makes sense. But if comparing a lever to a pre-infusion machine to an HX to a stable-temp machine it would be neat to hear the differences in the best shot produced.

gr2020
Posts: 179
Joined: Jul 05, 2016, 6:52 pm

Postby gr2020 » Jan 19, 2017, 11:27 am

Maybe the test should be split - run x pairs of shots as you did, and then have the baristas swap machines, dial in again, and repeat. Then you could isolate whether a preference was for a machine or a barista...

User avatar
SonVolt
Posts: 615
Joined: Mar 04, 2013, 9:42 am

Postby SonVolt » Jan 19, 2017, 12:18 pm

Any type of back-to-back taste test is going to be problematic. Whatever shot you taste first is ultimately going to skew your judgment of the 2nd. Have you ever switched from one beer to another and been slapped in the face with how weird/awful the 2nd beer tastes until your palate readjusts?

User avatar
peacecup
Posts: 3429
Joined: Aug 25, 2005, 1:54 pm

Postby peacecup » Jan 21, 2017, 3:16 pm

I've posted this link before. It is the classic tea tasting test of RA Fisher, one of the fathers of statistical theory:

https://en.wikipedia.org/wiki/Lady_tasting_tea

Following Fisher, the simplest way to test two different drinks (machines, grinders, coffees, whatever) is to give one taster a number of drinks from each (Fisher used 4 of each, with the taster knowing the split). The taster should get all 8 drinks correct to reject the null hypothesis (that there is no difference between the machines).
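A quick way to see why "all 8 correct" is the threshold: under the null hypothesis every way of splitting the 8 cups into two groups of 4 is equally likely, so the chance of a perfect identification by pure guessing is 1 in C(8, 4). A minimal Python sketch:

```python
from math import comb

# Lady-tasting-tea design: 8 cups, taster knows exactly 4 came from
# machine A and 4 from machine B, and must say which 4 are which.
# Under the null (no detectable difference) every 4-of-8 partition is
# equally likely, so P(perfect score by chance) = 1 / C(8, 4).
n_cups, n_from_a = 8, 4
p_all_correct = 1 / comb(n_cups, n_from_a)
print(f"P(all correct by chance) = 1/{comb(n_cups, n_from_a)} = {p_all_correct:.4f}")
# → P(all correct by chance) = 1/70 = 0.0143
```

So a perfect score is significant at the 0.05 level, but even one miss is not, which is why Fisher demanded perfection from his taster.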

This has very practical significance for espresso equipment. If one cannot correctly and reliably identify the "good" machine from the "bad" one, there is no point laying out the cash for it, at least in terms of espresso quality. Of course, many other factors come into play when choosing a machine: ease of use, style, capacity, etc.
LMWDP #049
Hand-ground, hand-pulled: "hands down.."

User avatar
homeburrero
Team HB
Posts: 1958
Joined: Jun 14, 2011, 10:54 pm

Postby homeburrero » Jan 21, 2017, 5:55 pm

I agree with Ken and others that with two machines and two different baristas, if you're comparing the machines you probably want to analyze separately for each barista. The same idea would apply if you were to do two machines and two coffees: do the analysis separately for each coffee, then perhaps extrapolate from there if you find the results significant and consistent.

HB wrote:Final score: 6-2. No, it wasn't close.

I'd have to disagree with that. That result is well within expectations due to random chance. If you do the binomial expansion for p=0.5 and n=8, the odds of seeing that difference or greater (which includes scores of 8-0, 7-1, 6-2, 2-6, 1-7, and 0-8) from identical coffee samples would be about 29%. For this small sample, even a 7-1 or 1-7 score would not quite be enough to call it significant at the 0.05 level.
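The arithmetic behind those numbers can be checked in a few lines of Python using the exact binomial distribution:

```python
from math import comb

# Two-sided binomial tail for an 8-vote blind tasting under the null
# hypothesis p = 0.5 (no real preference). "6-2 or more lopsided"
# covers 0, 1, 2, 6, 7, or 8 wins out of 8.
n, p = 8, 0.5

def prob(k):
    """Probability of exactly k wins in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

lopsided = sum(prob(k) for k in (0, 1, 2, 6, 7, 8))
print(f"P(6-2 or worse) = {lopsided:.3f}")   # → 0.289

seven_one = sum(prob(k) for k in (0, 1, 7, 8))
print(f"P(7-1 or worse) = {seven_one:.3f}")  # → 0.070, still above 0.05
```

Only a unanimous 8-0 (two-sided probability 2/256 ≈ 0.008) would clear the conventional 0.05 bar with just eight tasters.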

peacecup wrote:I've posted this link before. It is the classic tea tasting test of RA Fisher, one of the fathers of statistical theory:
+1

Along those lines, here's an illustrative example of a triangle test and simple chi-square analysis to decide if there is a significant difference in the taste of two samples: SSP Triangle test.
In that one, you may need to pull 18 shots from the two machines for each taster evaluation. A lot of work there.
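For anyone who'd rather skip the chi-square approximation, an exact binomial tail works for small panels too. A sketch, assuming six triangles per taster (6 triangles x 3 cups = the 18 shots mentioned above); the panel size is illustrative, not taken from the linked writeup:

```python
from math import comb

# Triangle test: each trial presents three cups (two from one machine,
# one from the other) and the taster picks the odd one out. Under the
# null hypothesis the chance of a correct pick is 1/3, so k correct
# picks out of m trials follows Binomial(m, 1/3).
def upper_tail(k, m, p=1/3):
    """Exact p-value for observing k or more correct picks in m trials."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k, m + 1))

m = 6  # hypothetical: six triangles for one taster
for k in range(m + 1):
    print(f"{k}/{m} correct: p = {upper_tail(k, m):.3f}")
```

With six triangles, a taster needs 5 of 6 correct before the result is significant at the 0.05 level, which gives a feel for how much tasting (and shot-pulling) a convincing result demands.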
Pat
nínádiishʼnahgo gohwééh náshdlį́į́h

baldheadracing
Posts: 1260
Joined: Nov 01, 2014, 12:38 pm

Postby baldheadracing » Jan 21, 2017, 6:22 pm

Two different machines, two different baristas, and two different grinders (same model, but maybe not the same design burrsets!): too many effects for the experiment's design to handle.