Best method for comparing home espresso machines? - Page 3

another_jim
Team HB

Postby another_jim » Jan 23, 2017, 8:38 am

The responses to Dan have lost the forest for the trees. You don't design tests, and certainly not test statistics, until you have an actual question to test. In this case the question is "will I prefer machine X or Y once I know how to use them properly?"

So you need best practices for both machines ... and an answer to "who am I."

In the case of espresso machines, this doesn't need to get any more philosophical than finding out what style of espresso the tasters prefer. Having several coffees that fit those styles would be a good thing, along with baristas who can adapt to them. But just knowing the tasters' preferences and comparing them to the taste test outcomes would be helpful.
Jim Schulman

appfrent

Postby appfrent » Jan 24, 2017, 9:05 pm

I agree. It's like testing a Formula One car vs a Toyota Yaris (replace with anything similar) at peak traffic in the Holland Tunnel (or anywhere, anytime, in LA :)) and coming to the conclusion that they both have similar top speeds (a stretched analogy, but you get the point). If the question is how they perform in peak traffic then yes, otherwise the experiment would be absurd. As absurd as someone buying a certain espresso machine and using it sub-optimally on purpose. In fact, it's also important to add the variable of different roasts on both machines, reversing shot sequences (drinking as well as pulling), etc.
Just a side note: I don't want to give the wrong impression, so I should start by saying that I enjoy reading the "coffee/espresso research". It's just good to remember that there are zillions of variables that could make one machine look better than the other, even in a blind taste test, especially if they perform in a close range (it's easy to tell if something is a POS). A few variables with the right systematic change are enough for a false conclusion. I design and interpret experiments for a living, and I see a lot of poor experimental design and wrong conclusions by PhD scientists. It is a sobering and interesting fact that even among peer-reviewed published medical research, about 90% is junk (this was the conclusion of research on research :) ). Reproducibility of experiments by multiple independent research groups is one good standard. Standing the test of time is another.
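As a rough illustration of how easily noise alone produces a "winner" (my own sketch, not from the thread): simulate a short blind tasting of two *identical* machines, where every pick is a pure coin flip, and count how often one machine still "wins" by a seemingly decisive margin.

```python
import random

def spurious_winner_rate(n_tastings=6, n_trials=10_000, seed=1):
    """Blind-taste two *identical* machines n_tastings times per trial.

    Each pick is a 50/50 coin flip (pure noise, no real difference).
    Count how often one machine still 'wins' by a seemingly decisive
    5-of-6 style margin.
    """
    rng = random.Random(seed)
    decisive = 0
    for _ in range(n_trials):
        a_wins = sum(rng.random() < 0.5 for _ in range(n_tastings))
        # one machine took all, or all but one, of the tastings
        if a_wins >= n_tastings - 1 or a_wins <= 1:
            decisive += 1
    return decisive / n_trials

# With no true difference at all, a lopsided 5-of-6 margin shows up
# in roughly one short tasting session out of five.
print(spurious_winner_rate())
```

The exact binomial figure is 14/64 ≈ 0.22, which is why a casual six-shot shoot-out between closely matched machines can so easily crown a false winner.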
Forget four M's, four S's are more important :-)- see, sniff, sip and savor....

AssafL

Postby AssafL » Jan 26, 2017, 5:06 am

appfrent wrote:It's just good to remember that there are zillions of variables that could make one machine look better than the other, even in a blind taste test, especially if they perform in a close range (it's easy to tell if something is a POS). A few variables with the right systematic change are enough for a false conclusion.


Do you really believe there is a POS end to the espresso machine curve? I ask because I find it hard to believe any of the leaders on this site (like Jim, Dan, Nicholas and many many others) will fail to get a decent if not a good cup from any combination of machine/grinder. They may have to realign the grinder, surf the temp, find a different roast, WDT, and then WDT again, replace the gasket, use the lever as a pressure profiler... but eventually show that even the worst machine can (usually) be forced to deliver.

appfrent wrote:I design and interpret experiments for a living, and I see a lot of poor experimental design and wrong conclusions by PhD scientists. It is a sobering and interesting fact that even among peer-reviewed published medical research, about 90% is junk (this was the conclusion of research on research :) ). Reproducibility of experiments by multiple independent research groups is one good standard. Standing the test of time is another.


Reproducibility of results assumes an objective tester. Once subjectivity sets in (and I think that is the crux of Jim's post - if I read it correctly) the question is "will I prefer machine X" and not "is machine X better".

As an example, a grinder that doesn't normalize its output will forever be harder to use. That doesn't mean it will be ignored in the market. In fact, there is a very hot one right now that forces you to normalize the output using a shaker - and its main marketer is a well known barista. People who want to normalize externally* will find it better.

A lever (or paddle) can also somewhat compensate for suboptimal grinding - whereas the low-end pump machines lack preinfusion and can be very unforgiving - especially to newbies... Heck, even a pump GS/3 is less forgiving. Hence the answer to the question "will I prefer machine X" will be the paddle if my grinder isn't perfect (or even if it is)... Unfortunately, the paddle did not exist when I got the GS/3 (but I compensated by adding line-level preinfusion, which made it a lot more forgiving and flexible - almost like a paddle... as did aligning the grinder).

* As to why you would choose a grinder with external normalization - I can try to rationalize: if a grinder that normalizes does it badly, such as some low-throughput Mazzers, you will have to WDT anyway. Compound that with the fact that finding a good normalizing grinder is hard, and that the weaknesses will be most pronounced in dry weather, leading unsuspecting owners to conclude that the grinder is temperamental. Indeed I did so, and can only blame my younger self for the major ignorance of blaming the Mazzer (when I should have been blaming the autumn air)...

NB - I like the term "normalization" to describe what the grinder does to the grinds after the burrs. I first noticed the term in The Craft and Science of Coffee book.
Scraping away (slowly) at the tyranny of biases and dogma.

appfrent

Postby appfrent » Jan 26, 2017, 11:58 am

AssafL wrote:Do you really believe there is a POS end to the espresso machine curve? I ask because I find it hard to believe any of the leaders on this site (like Jim, Dan, Nicholas and many many others) will fail to get a decent if not a good cup from any combination of machine/grinder. They may have to realign the grinder, surf the temp, find a different roast, WDT, and then WDT again, replace the gasket, use the lever as a pressure profiler... but eventually show that even the worst machine can (usually) be forced to deliver.

You have completely misunderstood my statement. I just meant it's not that easy to evaluate and tease things apart beyond noise if the performances are close. I am not trying to question anyone's intentions or competence, leader or not. As I said, I read reviews to help my purchase decisions and I like reading reviews. And I appreciate all the effort reviewers put into evaluating the products, sometimes selflessly. I am merely providing a cautionary note. There is a difference between concluding that the ECM Technika lacks a water indicator and that it's a major inconvenience, versus concluding that E61 machine Model A extracts certain flavors better than E61 machine Model B. Not that the latter conclusion cannot be made. It's just that it's much harder and more nuanced than most people think and appreciate.

AssafL wrote:Reproducibility of results assumes an objective tester. Once subjectivity sets in (and I think that is the crux of Jim's post - if I read it correctly) the question is "will I prefer machine X" and not "is machine X better".

There is nothing objective about coffee. That's why a wide range of methods and models exist and are thriving :D
However, it is possible to design an objective experiment. Just as a random example:

Goal: Which machine among 5 selected produces an espresso that is liked by Assaf?
Experiment design:
Ask 1 barista to use beans Assaf likes, dial in the best espresso, and present the shots blinded to Assaf: 5 shots in total.
The knobs and paddles don't matter, as this is an Input-->Black Box-->Output kind of design.

Now, will you be more confident with this design, or with the following?
New experimental design:
Ask 5 baristas, each an expert on one of the machines, to produce 3 shots on each machine and present them in random order to Assaf: 5x3x5 = 75 shots (good luck drinking that much coffee :D )
Which one is likely to give a better result? Where do you draw the line on the minimum number of samples necessary to make a call? All the answers are in maths and statistics, if people do not cheat them to suit their design constraints.

I have simplified the goal. Imagine if the goal was to find which machine makes the espresso liked by most people in a certain country.
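The "where do you draw the line on minimum samples" question has an exact answer for the simplest version of this test. A sketch (the numbers are mine, for illustration): treat each paired blind comparison as a coin flip under the null hypothesis of no preference, and use an exact one-sided sign test to find how many "wins" machine X needs before we may call a preference.

```python
from math import comb

def min_wins_needed(n, alpha=0.05):
    """Smallest k such that k-or-more wins out of n paired blind
    comparisons is significant under an exact one-sided sign test
    (null hypothesis: no preference, every pick is a 50/50 flip)."""
    for k in range(n + 1):
        # exact upper-tail probability of k or more wins by chance
        tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
        if tail <= alpha:
            return k
    return None

print(min_wins_needed(5))   # → 5: with only 5 shots, one machine must win every single time
print(min_wins_needed(15))  # → 12
```

This is why the 5-shot design is so weak: a single "off" shot and the result is statistically indistinguishable from coin flipping, while the 75-shot design leaves far more room for noise to average out.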
Forget four M's, four S's are more important :-)- see, sniff, sip and savor....

AssafL

Postby AssafL » Jan 27, 2017, 5:17 am

appfrent wrote:You have completely misunderstood my statement. I just meant it's not that easy to evaluate and tease things apart beyond noise if the performances are close.

....

Goal: Which machine among 5 selected produces an espresso that is liked by Assaf?
Experiment design:
Ask 1 barista to use beans Assaf likes, dial in the best espresso, and present the shots blinded to Assaf: 5 shots in total.
The knobs and paddles don't matter, as this is an Input-->Black Box-->Output kind of design.


And the point I was trying to make is that, given time (and a few other resources), I think all 5 machines could make coffee Assaf likes. And I am a luddite - nowhere near the caliber of the people who roam these threads. I expect their magic to surpass mine...

To Jim's point - that does not mean I'll like all 5 machines: maybe one doesn't fit under the counter, one has a copper (soon to be verdigris) eagle on top, one has a loud pump, one is too fiddlesome (so I can make good coffee - just a lower percentage of good cups - for me the small Pavonis are in that club...) - and perhaps 1 of the 5 will be perfect for me...
Scraping away (slowly) at the tyranny of biases and dogma.

appfrent

Postby appfrent » Jan 27, 2017, 12:30 pm

AssafL wrote:And the point I was trying to make is that, given time (and a few other resources), I think all 5 machines could make coffee Assaf likes.

Then that's the result of the experiment. Or, if it was an input at the design stage, this experiment is not suitable for Assaf. :D
On a serious note, if the taste outcome as judged by a reviewer panel is of no concern to you (I infer that from your previous statements), then this entire topic does not offer anything to you. Dan is still going to include all the other information you want in his reviews.

About your specific response to Jim's post, my safe guess about his response is: "have lost the forest for the trees".
Forget four M's, four S's are more important :-)- see, sniff, sip and savor....

mike guy

Postby mike guy » Jan 27, 2017, 12:47 pm

I used to create designs of experiments for medical uses, and this is a very tough problem to solve - especially in an environment where you are trying to have fun and do it quickly. For any of the results to be technically valid, you need to conduct an MSA (measurement systems analysis) on all of the tasters' ability to even detect differences in flavor before getting into preferences. Without doing that, these are always going to be fun exercises where the results are never going to have experimental validity.

My personal and former professional opinion is to drop the pretense of a head-to-head comparison or winner. Instead, just discuss the A/B differences and then try to find where multiple independent measurement systems (humans) agree on the differences without having discussed them. This has much more observational use than trying to determine what is better. It also somewhat serves as an informal measurement systems analysis to determine whether people are even tasting the same differences. Being able to have multiple tasters agree on differences without prior discussion - and more importantly, on what those differences are - will mean the test results have an acceptable amount of variation to even be discussed.
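Agreement between tasters "without prior discussion" can be scored with a standard chance-corrected statistic. A minimal sketch (the labels and sample data here are my own illustration, not from the thread), using Cohen's kappa for two tasters describing the same blind shots:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two tasters' labels
    (e.g. 'brighter', 'duller', 'same') for the same blind shots.
    1.0 = perfect agreement, 0.0 = no better than chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # raw fraction of shots where the two tasters said the same thing
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # agreement expected by chance, from each taster's label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

a = ["brighter", "same", "duller", "brighter", "brighter", "same"]
b = ["brighter", "same", "duller", "same", "brighter", "duller"]
print(round(cohens_kappa(a, b), 2))  # → 0.5
```

A kappa near zero would mean the tasters only agree as often as random labeling would, i.e. the "measurement systems" are not measuring the same thing and the comparison results should not be trusted.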

I could drone on for hours about this topic, as designing a proper experiment that would pass peer and FDA approval is an interesting topic to me. In the hobby world of coffee, where results are subject to imprecise measurement, we could eventually conduct an experiment that had validity. But it would likely take days of work and not be fun for anyone involved. Personally, I think it is more appropriate to keep the fun aspects of these meetup comparisons than to gather real scientific data.

Fausto

Postby Fausto » Jan 27, 2017, 1:05 pm

Considering that the current USBC champion was in the room, I think the solution is fairly simple - make him run both machines! Surely he can coax the best out of both and be quick about it :lol:

More seriously, I think a second coffee would help with the validity of the new testing method. Each barista picks a coffee that they think their machine will excel with. Then they both pull shots of both coffees.

Presumably each machine would win the round with its chosen coffee, but if one machine won both rounds I think you could draw some conclusions from that. Just some thoughts - maybe too much work.

Perhaps there's some way to bring SCAA judging into this? Again, probably too complicated, but rather than picking your favorite, the shots could be scored on standard metrics and we could see which machine produced greater clarity, body, crema, etc.

appfrent

Postby appfrent » Jan 27, 2017, 3:14 pm

mike guy wrote:I used to create designs of experiments for medical uses, and this is a very tough problem to solve - especially in an environment where you are trying to have fun and do it quickly. For any of the results to be technically valid, you need to conduct an MSA (measurement systems analysis) on all of the tasters' ability to even detect differences in flavor before getting into preferences. Without doing that, these are always going to be fun exercises where the results are never going to have experimental validity.

My personal and former professional opinion is to drop the pretense of a head-to-head comparison or winner. Instead, just discuss the A/B differences and then try to find where multiple independent measurement systems (humans) agree on the differences without having discussed them. This has much more observational use than trying to determine what is better. It also somewhat serves as an informal measurement systems analysis to determine whether people are even tasting the same differences. Being able to have multiple tasters agree on differences without prior discussion - and more importantly, on what those differences are - will mean the test results have an acceptable amount of variation to even be discussed.

I could drone on for hours about this topic, as designing a proper experiment that would pass peer and FDA approval is an interesting topic to me. In the hobby world of coffee, where results are subject to imprecise measurement, we could eventually conduct an experiment that had validity. But it would likely take days of work and not be fun for anyone involved. Personally, I think it is more appropriate to keep the fun aspects of these meetup comparisons than to gather real scientific data.


+1 to everything, except I have seen plenty of c#$@ pass peer review and the FDA. There are humans sitting there too. The biggest strength of the human brain is its biggest weakness as well. While it can perform relevant analysis from minimal, disjointed information out of random clutter at lightning speed, it does so at the cost of a high error rate. In simple words (probably an oversimplification), it sacrifices accuracy for speed. In the present context, that is why taste-based tests are vastly superior to chemistry, and also enormously challenging to analyze correctly. :D
Forget four M's, four S's are more important :-)- see, sniff, sip and savor....

mike guy

Postby mike guy » Jan 27, 2017, 3:16 pm

Sure, and don't get me started on what can pass for peer review in academia. The desire to get published sometimes makes for a back-scratching environment.

Either way, DOE (design of experiments) when done right is a thing of beauty. But it all starts with the right measurement systems.

I don't advocate chemistry-only testing with coffee or wine or any similar situation where the point is subjective human enjoyment. I would only advocate human measurement systems here. But to do any of this right, you need a human who can take known changes in input and output a description of those changes in a statistically significant way - including negative testing, where the subject is given identical samples and has to accurately match them. You'll find that a lot of tests like this (audio cable comparisons, for one) can be invalidated when the measurement system (ears) hears things that are objectively not there.
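The negative testing described here is essentially the classic triangle test from sensory analysis. A sketch (the example counts are mine): each panelist gets three cups, two identical and one different, and must pick the odd one out; if nobody can actually taste a difference, a correct pick happens 1/3 of the time, and an exact binomial tail tells you whether the panel beat guessing.

```python
from math import comb

def triangle_test_p(correct, total):
    """One-sided p-value for a triangle test: probability of seeing
    `correct` or more right picks out of `total` panelists if nobody
    can actually taste a difference (chance of a right pick = 1/3)."""
    p = 1 / 3
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(correct, total + 1))

# 7 of 12 panelists picking the odd cup out is not quite enough
# to rule out guessing at the usual 5% level.
print(round(triangle_test_p(7, 12), 3))  # → 0.066
```

Failing this gate is exactly the audio-cable scenario above: if tasters cannot reliably match identical samples, any preference they report downstream is noise.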