Best method for comparing home espresso machines? - Page 2

HB
Admin

Postby HB » Jan 21, 2017, 7:13 pm

SonVolt wrote:Have you ever switched from one beer to another and been slapped in the face with how weird/awful the 2nd beer tastes until your palate readjusts?

Keep in mind we're comparing two samples of the same coffee. To put it another way, it's more like comparing Coke and Pepsi than a stout and lager beer.

homeburrero wrote:That result is well within expectations due to random chance.

In this specific case, the difference was clear enough that I'm confident that if we had increased the number of samples, the proportions would remain about the same. That is, if I had been served 4 pairs, I'm confident I would have picked the same one every time. I consider myself an average taster and the difference was significant enough to overcome my middling taste abilities.

baldheadracing wrote:Two different machines, two different baristas, and two different grinders (same model, but maybe not the same design burrsets!): too many effects for the experiment's design to handle.

I was surprised by Ben's comment about the burrs; nobody mentioned that prior to the test or during our group discussion.

For what it's worth, I have noticed that when the "expected favorite" doesn't win the group taste test, all sorts of interesting theories emerge to explain the unanticipated result. For example, in the Ratio Eight Brewer Review group taste test, there was talk about the level of the Bonavita negatively impacting its results. We did another round with the level verified and the two units in their standard configuration (we had used the same KONE filters in the first round). There was no change. :?

The main reason reviews include a "Forgiveness Factor" is that I believe forgiveness has a significant impact on real-world results. In the hands of a world-class barista and with scrupulous attention to eliminating all possible variables, you may be able to demonstrate statistical significance that most will accept. But most buyers are not world-class baristas; they're average home baristas, for whom I believe ease of use carries more weight than absolute potential. In the case of last week's group taste test, I believe that was the main difference.
Dan Kehn

baldheadracing

Postby baldheadracing » Jan 21, 2017, 7:55 pm

HB wrote:... I was surprised by Ben's comment about the burrs; nobody mentioned that prior to test or during our group discussion. ...

For what it's worth, I have noticed that when the "expected favorite" doesn't win the group taste test ....

I'm glad that you were surprised, as I was surprised that you hadn't mentioned that comment in this thread.

My "expected favourite" was the Pro 800, given the coffee. I would expect a commercial spring lever group to do better than a Linea with that coffee (unless the LM's pump pressure was dialed down to well below 9 bar). Hologram may have given a different result.

In any case, all good food for thought. No doubt some folks will look at the results and jump to the conclusion that the Pro 800 is 'better' than the Linea, but that (jumping to conclusions) is their folly.

P.S. Is there any significance to the OCD being on the winning side of the upside-down cup pic?
What I'm interested in is my worst espresso being fantastic - James Hoffmann

[creative nickname]

Postby [creative nickname] » Jan 21, 2017, 8:06 pm

I think there is no need to get carried away in an effort to duplicate an arbitrary convention within social scientific practice. "P-values below 0.05" do not indicate that your measured effect is real or meaningful, just that a result at least this lopsided would be unlikely to arise from chance variation alone if there were no true difference. And conversely, the choice to set the threshold at 0.05 is a semi-arbitrary formalism, and makes the most sense in environments where it is easy to collect large numbers of data points and where false positive findings are very costly.
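For anyone who wants to check the chance calculation themselves, here is a minimal sketch of an exact two-sided sign test for a paired preference tasting. The function name and the 6-of-8 split are illustrative assumptions only; the thread mentions 8 rounds, but the actual tally is not restated on this page.

```python
from math import comb


def two_sided_sign_test_p(wins: int, rounds: int) -> float:
    """Exact two-sided p-value for `wins` of `rounds` tasters preferring
    one machine, under the null hypothesis of a 50/50 coin flip."""
    # Count splits at least as lopsided as the observed one.
    extreme = max(wins, rounds - wins)
    tail = sum(comb(rounds, i) for i in range(extreme, rounds + 1)) / 2 ** rounds
    # Double the one-sided tail, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical tally (assumption, not the thread's reported result):
p = two_sided_sign_test_p(6, 8)
print(f"6 of 8: p = {p:.3f}")
```

A p-value computed this way only says how often a coin-flip panel would produce a split that lopsided; it says nothing about how large or practically meaningful the taste difference is.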

When doing machine comparisons, it makes more sense to ask how clear the measured differences were across (ideally blinded) testers, and if a difference exists, whether testers with a variety of taste preferences could easily agree on which was better. I think Dan's reviews go MUCH farther than most in terms of soliciting a variety of blinded taste-testers, which makes it more likely that a reader's preferences will correspond with his findings. Conversely, while you couldn't publish a finding that had a 1/3 probability of arising by chance in a peer-reviewed journal, it is still probably true, and it is more information than you had before Dan published his review.

In short, although it is fine to give advice that might help make published testing more useful, I think it is a bit much to criticize the sort of careful reviews that Dan is posting for not living up to an arbitrary standard of perfection, especially one that can easily miss the forest of practical significance for the tree of worrying about one very particular way in which a study might not mean what it seems to mean. At the very least, consider publishing some useful comparisons of your own that live up to your high standards before you start demanding that other people do the same!
LMWDP #435

HB
Admin

Postby HB » Jan 21, 2017, 8:15 pm

baldheadracing wrote:I'm glad that you were surprised, as I was surprised that you hadn't mentioned that comment in this thread.

I hold all written comments for a few days to give everyone a chance to respond. They certainly looked like identical grinders. Due to other obligations, I am no longer able to make it to Counter Culture every Friday, otherwise I would have already followed up on this mystery.

baldheadracing wrote:P.S. Is there any significance to the OCD being on the winning side of the upside-down cup pic?

No. I wanted to use Lem's WBC trophy as the indicator of the winning side, but I couldn't find it. :lol:

[creative nickname] wrote:At the very least, consider publishing some useful comparisons of your own that live up to your high standards before you start demanding that other people do the same!

Indeed! Jim Schulman is one of the few members who has demonstrated the necessary stamina for that level of certainty.
Dan Kehn

dominico
Team HB

Postby dominico » Jan 22, 2017, 12:21 am

What I'm reading here is that in order to make the results more acceptable you should have simply served a lot more than 8 rounds of coffees! I don't think anyone would be complaining about that.
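To put rough numbers on "a lot more than 8 rounds", here is a small sketch that finds how lopsided a panel would have to be to clear a chosen threshold. It assumes an exact two-sided sign test against a 50/50 null; the function names and the 0.05 default are illustrative assumptions, not anything used in the actual review.

```python
from math import comb
from typing import Optional


def sign_test_p(wins: int, rounds: int) -> float:
    """Exact two-sided p-value under a 50/50 null."""
    extreme = max(wins, rounds - wins)
    tail = sum(comb(rounds, i) for i in range(extreme, rounds + 1)) / 2 ** rounds
    return min(1.0, 2 * tail)


def min_wins_needed(rounds: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest majority count whose p-value falls below alpha,
    or None if even a unanimous panel is not enough."""
    for wins in range((rounds + 1) // 2, rounds + 1):
        if sign_test_p(wins, rounds) < alpha:
            return wins
    return None


for n in (5, 8, 16):
    print(n, "rounds ->", min_wins_needed(n), "wins needed")
```

Under these assumptions, a panel of 8 would need to be unanimous, while a panel of 16 could tolerate three dissenters.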
http://bit.ly/29dgjDW
Il caffè è un piacere, se non è buono che piacere è?

TomC
Team HB

Postby TomC » replying to dominico » Jan 22, 2017, 12:51 am

Not to be the voice of dissent, but I don't see a place as high-caliber as Counter Culture Coffee's training lab having a lot of free time for nonessential operations with the Brewing Championships right around the corner. And to the pessimists, I'd say it seems like a bit of a stretch for someone to point fingers at a grinder in a lab of this quality. Like a grinder in a setting like this isn't going to stand out if it's operating out of spec compared to the others? I highly doubt it. Sounds to me like an anachronism happened to make the modern tools lose a bit of luster in this test.

AssafL

Postby AssafL » Jan 22, 2017, 6:53 am

What question were you trying to answer? Was it:
1. Is there a better machine of the two (or of any...)?
2. Is the coffee made by Barista A on system A more liked by the group than coffee made by Barista B on system B?

I think the answer to 2 is obvious (given the likelihood of a random response as calculated in a previous post).

Question 1 is always badly defined on purpose since it is (usually) a marketing question. A GS/3 is not better than a Nespresso for someone seeking a capsule machine. When asked, question 1 would always lead to subjectivist theories unless reined in by scrutiny.

A better (and easier to answer - at least for me) question may be: given 2 baristas using 2 systems, can they both create a coffee on system 1 that they cannot on system 2? For example, create an EY of 24 on both systems - but whatever the timing and grinder and so forth - one comes out sweet and the other a sink shot. Or one can get to 24 and the other not over 21 - and so forth.

At that point (perhaps) one could extrapolate to what this means and find other similar experiments....

That is why all these "my new grinder brings out otherworldly sweetness..." threads are so meaningless.

(turning this upside down: if a really good coffee machine can make a really good coffee shine - who says a really bad one can't make a really bad coffee drinkable - perhaps by converting everything to an Americano...? - and how would you differentiate the two?)
Scraping away (slowly) at the tyranny of biases and dogma.

baldheadracing

Postby baldheadracing » Jan 22, 2017, 3:47 pm

TomC wrote:... And to the pessimists, I'd say it seems like a bit of a stretch for someone to point fingers at a grinder in a lab of this quality. Like a grinder in a setting like this isn't going to stand out if it's operating out of spec to the others? I highly doubt it ...

I believe that the people who made comments about grinder and burr differences work at that lab.
What I'm interested in is my worst espresso being fantastic - James Hoffmann

TomC
Team HB

Postby TomC » replying to baldheadracing » Jan 22, 2017, 4:30 pm

I know.

chrisbodnarphoto

Postby chrisbodnarphoto » Jan 22, 2017, 5:13 pm

baldheadracing wrote:I believe that the people who made comments about grinder and burr differences work at that lab.

If I'm not mistaken, weren't both grinders the Mythos? Doesn't the Mythos only use one burr set? I imagine they are Clima Pros equipped with the coated burrs, as the older Mythos use the non-coated ones ... I don't see how there could be a mixup, or how they wouldn't have gone into the test knowing that one Mythos performed differently from another, considering both are being used in a testing environment ... just seems bizarre.