Versalab M3 Grinder Taste Test Report

another_jim
Team HB
Posts: 13954
Joined: 19 years ago

#1: Post by another_jim »

I'm putting this on a separate thread so it doesn't spoil the progress of others doing taste tests.

This report covers some taste tests I did for cupped coffees, shots and cappas. My goal with this grinder was to get improved shots on bright SOs usually unsuitable for espresso. A report on that will follow next week.



METHOD: The 4 cup tests were blind. For the 9 cappas or macs, 4 were blind and 5 non-blind. For the 31 straight shots, 15 were blind and 16 non-blind. The cappa and mac tests were done as part of the shot tests: each shot was sipped and rated, then the cappas or macs were made and rated.

The blind requirement raised a severe problem for shot making, since I could not get enough volunteers to have one person make the shots and another taste them. Doing blind tests of shots requires that the shot flow rates be equal, so that the flow doesn't act as a tell. This proved impossible to do consistently with a chopped pf, since the visible flow from the M3 is almost always better than the mini's. The sweeper/chute exit of the grinder does a very good job in creating an even density fill that makes for picture perfect naked pf shots.

For a regular PF, another problem arose. I did most of the sighted shots first. After I did the first eight blind shots, it was obvious the results were radically divergent from the sighted ones. Had the sighted shots been an exercise in self-deception? For the blind test, I had set the mini on its sweet spot, then adjusted the M3 to produce the same extraction rate. So I did a second sequence of seven blind shots with the M3 set on its sweet spot and the Mini adjusted to match the extraction. The results from this made sense of all the data. The sighted series may contain some bias, but massive self-deception now seems unlikely. I finished testing with six shots from the Peppina, 2 sighted and 2 blind at each grinder's sweet spot. I broke off testing when the new results showed no more surprises or large changes in the offing, and when I became so familiar with the small taste and extraction idiosyncrasies of each grinder that further blind testing was impossible.



CUPPING: I did four blind cuppings. The first was a terrible coffee where I did just one cup from each grinder; the M3's was really terrible with all the ghastliness very clear, whereas the Mini's was muffled. The difference became especially apparent on cooling. Based on (unfortunately awful) taste clarity, I picked the M3 as the better grinder. The second was a double triangle cupping of an excellent Sumatra Lintong. I picked the odd cup from each group, and designated the M3 as the better grinder, this time based on good taste clarity. Again this was especially apparent on cooling. The third cupping was 5 cups of Kenya Meru; the task was to pick the two M3 cups from the five. I divided the cups into two groups of two that tasted clearly different, but in a way I couldn't characterize or form a preference on, and one cup that was in between. The cooled off taste was no help. One pair was the M3 cups. The final taste test was the last of Barry's Haimi. Since DPs are so variable, I wasn't expecting much. But this is a very clean cup, and by the time they cooled, I had no trouble picking out the two strawberry bombs from the five on the table. The result is that the odds of the grinders being indistinguishable and my guesses lucky are 1 in 1800. In other words, the grinders pretty definitely produce different tastes for steeped coffee, with the M3's being better.
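For anyone wanting to check the 1 in 1800 figure, here is a minimal sketch of one way to arrive at it. The per-cupping odds are an assumed reading of the descriptions above (the post does not spell out the calculation): 1/2 for the single pair, 1/3 for each of the two triangles, and 1 in 10 for picking both M3 cups out of five in each of the last two sessions.

```python
from fractions import Fraction
from math import comb

# Chance of getting each cupping right by luck alone.
# These per-cupping odds are an assumed reading of the text, not stated in it.
p_single_pair = Fraction(1, 2)            # one cup per grinder, pick the M3
p_double_triangle = Fraction(1, 3) ** 2   # odd cup out in each of two triangles
p_two_of_five = Fraction(1, comb(5, 2))   # both M3 cups out of five

# Two sessions used the two-of-five format, hence the square.
p_all_lucky = p_single_pair * p_double_triangle * p_two_of_five ** 2
print(p_all_lucky)
```

Under these assumptions the product comes out to exactly 1/1800, which matches the figure quoted.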

I was particularly impressed by how the grinders differentiated as the cups cooled. The M3 cups stayed relatively clean and crisp, whereas the mini cups became more drab, muddy and bitter (except in the very puzzling Meru, where all the cups stayed ultra-clean and crisp, see also the cappa test). There were clues in the hot cups, but nothing as definitive. My guess at this point was that the grinds from the M3 are more resistant to over-extraction.



CAPPAS: The alties in Seattle had different favorites when it came to straight shots from the best cafes; but we were all struck by the uniform excellence of the short milk drinks. It may be no accident that all these cafes use conical grinders, since the M3 cappas and macs blew the Mini's off the field. It simply wasn't a contest. I had nine in all, and scored them from -2 to +2. -2 is the mini's being a lot better, -1 somewhat better, up to +2 for the M3's being a lot better. Of the nine cappas, seven (four sighted, three blind) scored a +2 and two, one sighted, one not, were a tie at 0. The two tied cappas came from the same Meru that didn't get bitter on cooling in the cupping. For what it's worth, the mean score comes to 1.56, and the 95% confidence interval for the mean is from 0.88 to 2.
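The mean and interval quoted above can be reproduced with a short sketch. It assumes a Student's t interval on the nine scores, with the upper end capped at the top of the scale; the post does not say which method was used, and the critical value is the standard table value for 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Nine cappa scores as reported: seven at +2 and two ties at 0.
scores = [2] * 7 + [0] * 2

# Two-sided 95% critical value of Student's t for 8 degrees of freedom
# (table value; using a t interval here is an assumption).
T_CRIT_DF8 = 2.306

m = mean(scores)
half_width = T_CRIT_DF8 * stdev(scores) / sqrt(len(scores))
ci_low = m - half_width
ci_high = min(m + half_width, 2)  # the scale tops out at +2

print(round(m, 2), round(ci_low, 2), ci_high)
```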

Basically, there is no detectable generic bitterness at all in the M3 cappas. The experience was odd, almost like not drinking coffee, but an exotic milkshake. However, all the complex coffee tastes were there, just in a sweet, non-acrid form even Robert Parker could love. Again, this suggests a physical difference in the grind that reduces an aspect of overextraction that gets past the milk's taste filtering effect.



SHOTS: As stated in the methods section, I did 15 blind and 16 sighted shots. 7 of the blind shots were on the M3's sweet spot, with the Mini set to match, the other 8 were on the Mini's sweet spot with the M3 set to match. The 16 sighted shots were done at each grinder's sweet spot. 6 shots, 2 from each group, were made on the La Peppina, a spring lever machine with precise shot temperature control, that I use for blend development. The La Peppina shots, as usual, tasted substantially better and had worse crema and body than the Tea shots, but their relative scoring for each grinder was completely in line with the Tea shots. Due to sample size limitations, I'm not reporting them separately, but I have shown their occurrence in brackets in the summed data table.

The form of the scoring is the same simple one as in the cappas. I was looking for both pleasant and clearly defined tastes using a blend I personally find both pleasant and having interesting flavors. There were only two cases where I scored a 0 for no preference because one shot had better flavors and the other was more pleasingly balanced. In the other cases, these two aspects went together.

Here are the results:
--------------------------------------------------------------------
SCORE   #_SIGHTED    #_BLIND_M3_SPOT    #_BLIND_MINI_SPOT    TOTAL
--------------------------------------------------------------------
  -2       0            0                  1                  1
  -1       1            0                  3 (1)              4 (1)
   0       3            1                  2 (1)              6 (1)
  +1       7 (1)        3 (1)              2                 12 (2)
  +2       5 (1)        3 (1)              0                  8 (2)
--------------------------------------------------------------------
AVE-SCORE  1.00         1.29              -0.38               0.71
--------------------------------------------------------------------
95% CI LO* 0.52         0.59              -1.26               0.31
95% CI HI* 1.48         1.98               0.51               1.11
--------------------------------------------------------------------

* The bottom two lines show the range for the 95% confidence interval of the average scores. The sighted, blind m3 favorable grind, and combined scores are statistically significant in their preference for the M3 (i.e. 0 is not part of the confidence interval) while the scores on the blind tests favoring the mini are not statistically significant in favor of the mini (although I have little doubt they would have become so with a longer series of tests to narrow down the confidence interval).
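The averages and confidence limits in the table can be recomputed from the score counts alone. This is a minimal sketch that assumes a Student's t interval with n-1 degrees of freedom for each column; the critical values are standard table values, and the function and variable names are only illustrative.

```python
from math import sqrt

# Score counts per column, copied from the table (scores -2..+2).
SCORES = [-2, -1, 0, 1, 2]
COUNTS = {
    "sighted":         [0, 1, 3, 7, 5],
    "blind_m3_spot":   [0, 0, 1, 3, 3],
    "blind_mini_spot": [1, 3, 2, 2, 0],
    "total":           [1, 4, 6, 12, 8],
}

# Two-sided 95% Student's t critical values, keyed by degrees of freedom
# (table values for the sample sizes that occur here).
T_CRIT = {6: 2.447, 7: 2.365, 15: 2.131, 30: 2.042}

def summarize(counts):
    """Return (mean, ci_low, ci_high) for one column of score counts."""
    n = sum(counts)
    avg = sum(s * c for s, c in zip(SCORES, counts)) / n
    sum_sq = sum(c * (s - avg) ** 2 for s, c in zip(SCORES, counts))
    std_err = sqrt(sum_sq / (n - 1)) / sqrt(n)
    half = T_CRIT[n - 1] * std_err
    return avg, avg - half, avg + half

results = {name: summarize(c) for name, c in COUNTS.items()}
for name, (avg, lo, hi) in results.items():
    print(f"{name:16s} {avg:6.2f} {lo:6.2f} {hi:6.2f}")
```

Run as written, this should agree with the AVE-SCORE and 95% CI rows to within rounding.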

The tests show that the M3 grinder is much better on its "homefield," the mini perhaps slightly better on its homefield, and the M3 grinder is distinctly, but not overwhelmingly, better with best practice for both grinders.

Note in particular that there is no overlap in the two blind test confidence intervals, therefore no strong possibility of the results coming from the same underlying process. This shows how critical working out appropriate comparative grinder adjustments is to this type of testing; and, obviously, how critical such adjustment is for the taste of the espresso. My untested but strong impression from doing the dial-ins is that the M3 requires a tighter one for optimum taste than the mini; i.e. in the possible range from 20 second 2.25 ounce shots to 35 second 1 ounce shots for a double, the sweet spot for the M3 is narrower in volume and time.

I'm not sure if and how much bias there is in the sighted tests. The outcome, as one would expect of an unbiased test, falls between the blind tests favoring the mini and m3. However, the sighted scores do lie closer to the m3 favored ones, although not enough for there to be a statistical smoking gun (the average from the sighted test is 1, the average from the two unsighted tests together, weighted equally, is 0.46. The t-value for there being a difference in the two means is 1.44, which is not statistically significant). Furthermore, there is no logical reason that the sighted format of best practice on both grinders should exactly split the difference of the two blind formats. Readers can make up their own minds about how much subtle bias I showed; I certainly can't. The results prove that rosy-tinted self-deception was not a factor, which is the important thing.

How would I characterize the taste differences qualitatively?
-- There was no consistent difference between the grinders in crema or body as far as I could discern. On the whole, both grinders performed excellently in these aspects.
-- The aftertaste from the M3 shots is more lingering and sweeter. In unpaired shots, where one can distinguish, the aftertaste lingered several hours.
-- In the aroma, there was less of the acid nip from the bright components, although the fruit was there. Given that the nip evokes in every espresso hound the conditioned Pavlovian reflex of dreading the upcoming shot, this counts as an improvement in an odd sort of way.
-- At the respective grinder's sweet spots, the shots were equally sweet and balanced in flavor (almost by definition), but the M3 taste had slightly less of the irritant edges at the top and bottom. This is its main plus in terms of both clarity and pleasantness.
-- The M3 tastes were quite clear, but more integrated and smoothly blending into each other than on the Mini, where they seemed more separated. I liked this, but did not factor it into the score one way or the other, since my liking may be more for the novelty than the effect itself.



CONCLUSIONS: I know that this small amount of comparative tasting is not overwhelming evidence for my conclusions. But further tasting by me is probably useless. I know enough of the grinders' quirks now to make any blind tasting a fig leaf; moreover, I'm personally convinced the conclusions are sound and unlikely to change in any dramatic fashion. These factors combined would seem to make further taste tests by me a self-fulfilling prophecy.

The results from cup, cappa and shot all support a simple hypothesis: the M3 is a better grinder because its grinds are more resistant to overextraction than the Mini's, and by extension, other commercial flat burr grinders.

This hypothesis is supported by the grinder's construction. The grinder uses a conical set of burrs to crush the beans to a rough grind, followed by a flat set to shear them to a fine grind. Compared to standard flat burr grinders, the crushing and shearing paths are longer, and the beans are ground more gradually and gently. It is reasonable to assume that this leads to less fragmentation of the cell walls. This means fewer dust-like fines containing only cell wall fragments, and more of the fines containing nearly intact cells. The extraction of compounds found in the cell walls but not the cell interiors would be slowed down by this. Given that specialized cell wall flavor compounds would have evolved to discourage ingestion by animals, they could well be responsible for the characteristic unpleasant bitterness in overextracted coffees, e.g. soluble, aka instant, coffees.

The alternative mechanism, that the grinder's slow speed doesn't melt the lipids and transport undesirable flavors to the particle surfaces, seems less likely to be the cause. In sporadic home use, conventional grinders do not overheat enough to melt the lipids.

This hypothesis should be testable directly, by microscopy or measurements of the right sort, perhaps of grinder fines or extraction ratios. I hope others will be able to do this, since evidence of this type would be more probative than further taste testing on my part.

Whether this improvement is enough to justify the M3's added price is up to each reader. My impression is that it is roughly on par with the improvement one gets when upgrading to a higher class of espresso machine.

It should be noted that with the advent of the naked PF, the M3 loses some of a large advantage it would have had earlier. A year ago, I would have considered the grinder's very even deposition of grinds into the basket miraculous in its consistency. Even now it's amazingly good. But the discipline of the naked PF has forced everyone to improve their distribution and packing skills to the point where we are all getting far fewer sink shots with existing equipment. In this respect, the M3 raises the bar a good notch, but hardly launches it into the stratosphere -- the naked PF already did that last year.



LIQUORING TDS TEST: There is one physical measurement I could perform to test the resistance to overextraction hypothesis. I steeped six cups from each grinder with the same amount of water, 4 ounces, and grinds, 8 grams, as best as I could measure. The grinds felt the same, but the Mini grinds looked coarser. These are the same settings I used in the cuppings. Then I filtered three cups each at 4 minutes and three cups each at 10 minutes through a swiss gold filter, and measured their total dissolved solids (TDS). If the M3 grinds are more resistant to overextraction, one would expect the 10 minute TDS to be lower than the mini's.


BREWED COFFEE TDS

SAMPLE    M3 @ 4 MIN     MM @ 4 MIN          M3 @ 10 MIN    MM @ 10 MIN
--------------------------------------------------------------------
  1       1624           1741                1809           1825
  2       1425           1360                1584           1964
  3       1508           1504                1704           1907
--------------------------------------------------------------------
 MEAN     1519           1535                1699           1899
ST.DEV     100            192                 113             70
--------------------------------------------------------------------
SIGNIFICANT?        NO                              NO-ish (6%)
--------------------------------------------------------------------
Note: TDS meter readings for coffee are 1/10 of actual TDS (source: Barry Jarrett).


I do not know if my TDS meter is up to the required precision, nor do I know the margins of error on my weighing and measuring, so the results, while suggestive, are probably uninterpretable. However, they do suggest it is a test worth doing for someone set up to do it precisely.
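The "NO-ish (6%)" entry for the 10 minute readings can be sanity-checked with a small sketch. It assumes a two-sample t-test with pooled variance, which may not be the exact test used for the table; the critical value is the standard two-sided 5% table value for 4 degrees of freedom, and `pooled_t` is just an illustrative helper.

```python
from math import sqrt
from statistics import mean, stdev

# 10-minute TDS meter readings, copied from the table.
m3_10min = [1809, 1584, 1704]
mini_10min = [1825, 1964, 1907]

def pooled_t(a, b):
    """Two-sample t statistic for mean(b) - mean(a), assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))

t_stat = pooled_t(m3_10min, mini_10min)

# Two-sided 5% critical value of Student's t for 4 degrees of freedom.
T_CRIT_DF4 = 2.776

print(round(t_stat, 2), t_stat > T_CRIT_DF4)
```

Under this assumption the statistic lands a little below the 5% cutoff, which is consistent with a result that just misses conventional significance.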

another_jim (original poster)
Team HB
Posts: 13954
Joined: 19 years ago

#2: Post by another_jim (original poster) »

My final taste testing with the M3 involved very bright coffees, not usually used in large proportions in espresso blends. The reason for this is that I bought the grinder because I hoped it would improve shots made with bright single origin coffees.

In addition to my own tests, I mailed a 50/50 blend of Full City Roasted Oromia Yrgacheffe and Sumatra Blue Lintong (a very bright, powerfully floral coffee) to Abe Carmeli and Andy Schecter, as well as Greg Scace, who is currently evaluating a Mazzer Kony (the "mini" version of the conical Mazzer Robur). In our correspondence it seemed the Kony had many of the same differences from conventional flat burr grinders as the M3.

In previous testing I had found this blend to be nearly unpullably bright with the Mini on either my Tea or Peppina except for very ristretto, very hot shots which masked most of the taste. The M3 made palatable shots with the floral taste clear on the Tea, and flower-power godshots on the Peppina at normal temperatures and shot lengths.

Needless to say, the other testers came away with very different impressions.

Abe did replicate my experience on the Brewtus as he reported earlier.

Andy didn't get much flower-power on either grinder. He reports that the M3 produces better pours, no matter how hard he tries to get the Mini up to speed, and that these better pours yield better cup quality, mainly by reducing over-extraction taints. He is using a heavily modified Silvia (rotary pump, adjustable gicleur, PID, preheat, grouphead heater, and perhaps other stuff I don't know about).

Greg reports an almost contrary result. For him, the Kony produced brighter shots than the Cimbali Junior grinder he's using for the comparison. He also found the ristretto shots to be more acrid and less drinkable than the longer pours. He is using an LM.

They were too polite to say, but I didn't get the impression any of the three is rushing out to buy these coffees. So I doubt the absolute quality of the shots was much to their liking.

Frankly, I'm not sure what to make of the results. For all the testers, the distinction between grinders was greater for this blend than regular espresso, but the cup quality they got was all over the map. It could very well be that it isn't acidity, but simply that this was an all-washed coffee blend. It is a given in cupping coffee that cup to cup variation in dry processed coffees is so great that it can wipe out the distinction between different plantations, preps, or roasts; whereas high grade washed coffees are so uniform, small differences in prep or roast will show up. Perhaps a good deal of shot to shot espresso variability simply comes down to the high proportion of dry processed coffees used in most blends. On the other hand, since washed coffees respond sensitively to different preparation, their use may accentuate the distinction not just between grinders, but also between machines.

I've appended my tasting logs and the email communications from the other testers, so you can read the reports raw.



----------------- Jim's Log -----------------------------------------

I roasted yrg, lintong, harar and meru to give the grinders a work out on bright 90+ coffees. Initial tastings on the Peppina give the edge to the M3, although not as decisively as I imagined, since the MM shots were quite good too. However, on the best M3/Peppina shots, the taste is as good as I've ever experienced in an espresso.

Andy sent new burrs to me for the Mini. I changed them, ground 3/4 of a lb of coffee leftovers, cleaned the chamber, and moved the espresso marker. The next day, I started serious shot testing. I made 3 to 4 shots on each grinder for each series.

1st set: Yrg-Lintong on the Tea. Found the sweet spot on both grinders. On the MM, the only fully pleasant shot is a very short, 97C ristretto that doesn't have much taste except sweetness. I get a much better set of tradeoffs on the M3, with the best shot close to full length, 95C, sweet and floral. I'd rate the advantage 1.5 here.

2nd set: Meru/Harar on the Tea. The blend is either not successful, or my Meru roast is off, or my pulls are wrong. The blend did very nicely in the cupping, but as a shot it's not sweet enough to compensate the fruit and chocolate. The taste is that of unsweetened chocolate and tart berries. The only bearable MM shot on the Tea is an ultra ristretto pulled at 98C. The M3 again has a wider range; the taste, while dry, lacks the harsh edge of the MM shots. I'd rate the advantage around 1.

3rd set: The Yrg/Lintong on the Peppina. Similar result. Both blends pull well at 94C. MM subdued at ristretto, sour at longer shots, best shot again quite understated. Shot range on M3 far wider, with a flower-power god shot at the sweet spot (close to ristretto, with too little crema, alas) - advantage around 1.5.

4th set: The Harar/Meru mix on the Peppina. The MM shot was rounded and pleasantish, with OK aromatics. Had a really hard time getting a non-choke, non-gush grind on the M3. The ristretto shots were awful, the lungo-ish ones not too bad. The only shot in the zone was much more aromatic than the MM, but again unsweet in taste - maybe 0.5 to the Mini here. The overall taste is better on the LP, but this blend still sucks.

This exhausts these roasts and tots up to an average of 0.875; roughly the same as the regular espresso series.

The Yrg-Lintong mix most clearly distinguished the two grinders, and it is also the one where I always liked the M3 shots far more than the Mini shots. The Yrg-Lintong done on the M3 and La Peppina is close to achieving my current espresso obsession: getting Cup of Excellence levels in taste and aroma from an espresso shot. My next project will be to track down why this combo works so well, and to try to replicate it on more available equipment.

The Meru/Harar shots were lousy despite the blend cupping beautifully, and each coffee working great as an SO. This blend should have worked great, and I'll have to figure out why it doesn't; however, that has nothing to do with grinder differences.

-----------------Abe's Letter -------------------------------------------------------------

You were right on the money on this one. The M3 pulled a nice shot @ 94c, a straight double, while the Mini was sour. It was more than just lemon peel. It took me a while to find the sweet spot for both, as I wanted to get the Mini shot at its optimal performance. After maybe six tries I found it. The extraction looked good, but it was way too sour for my palate. Your 94c may not be mine, but whatever the temperature is, on the low end of the scale, the M3 did better.

------------------Greg's Letter -------------------------------------------------------------

Here are the results in a nutshell. Two grinders were used - the Mazzer Kony (conical) and my Cimbali Jr. (flat burr). I brewed 4 pairs of shots with a variety of flow rates. The pairs each had identical flow rate / blonding times. Temperature was set to 205 F. I did the tests in the early AM. I marked one cup with a blue dot on the bottom. This was the Kony cup. I brewed the two shots, then scrambled up the cups so that I didn't know which was which. I mixed up which grinder got brewed first. I did not look at the cups after scrambling them up. I just picked them up, tasted them and then spit out the coffee into the sink. I made qualitative comparison between the two in each pair.

The first pair blonded at 20 seconds and had the highest flow rate. Both of these shots were drinkable, and the lemon peel was surprisingly muted given what I was expecting. The brightness was accentuated on the Kony shot.

Pair 2 - blonded at 24 seconds. Again both shots were drinkable. The Kony shot had accentuated lemon peel flavor compared to the Cimbali shot and the Kony shot also tasted of wood, which was absent on the Cimbali shot.

Pair 3 - ristretto at 34 seconds until blonding (Kony only). The cimbali shot was sweeter than the first two cimbali shots, but the kony shot was more concentrated. The astringency was so in your face that I found it difficult and undrinkable. In this pair, the Cimbali was less ristretto than the Kony shot. I repeated the test about an hour later.

Pair 4 - equal pours at 36 seconds until blonding. Both shots would have been difficult to drink, with high astringency. The Kony was again identified as having accentuated brightness compared to the Cimbali.

I found it very compelling that I could easily identify the difference between the product of the two grinders and that the Kony was brighter in all cases. I never tasted wood in any of the Cimbali shots, so I'd say that more tastes are present in the conical grinder, and that taste clarity, as defined by Chris Tacy, is more evident in the Kony shots, particularly the one brewed at the standard volume. I used a LM triple basket, which is my preference these days.


Pretty interesting test, actually. Man that stuff produced a fountain of crema. And boy did it need to be brewed hot.


---------------------Andy's Letter -------------------------------------------------------------


I honestly had a hard time tasting the Yrg in there! None of the shots were particularly lemony, and only one out of thirteen had a hint of the big floral essence that I normally get from Yrg. I did one press pot for comparison, it had great body but little Yrg perfume. Weird, or what? Perhaps I usually roast Yrg considerably lighter than you did. That said, I really enjoyed the blend. It tasted very clean, a little spice, a little mint, a little malt, a little caramel.

My M3 extractions just pour better, despite my best efforts to improve the Mazzer grounds distribution. For me this often results in shots that get more of the good stuff out of the coffee and leave more bad stuff behind.

Hmmm...I see that Schomer says "When using conical burr grinders, my espresso is heavier, thicker and sweeter, and the shots are a higher volume. The machine literally delivers more flavor into my cup."

Regardless of what he says or doesn't say, my M3 extractions were often a little sweeter, and the lavender essence that you wrote about came through better with the M3. Still, the better Mazzer shots were quite nice, but perhaps tainted a bit by overextracted flavors.

I ran out of coffee before I could try some extractions that were more at the temperature extremes; everything I did was 199-203F.

I guess I come down neither with you and Abe nor with Greg, but instead in my own Bizarro world. :-)


--------------------------------------------------------------------------------------------

Dogshot
Posts: 481
Joined: 19 years ago

#3: Post by Dogshot »

Hi Jim,

I enjoyed your test report very much - it is impressive to see some statistics being applied. While I am no stats whiz, I have a question about your methodology. I may have misunderstood, but it looks like your test consisted of comparisons of pairs of shots, and used a 3 point scale (no pref, some pref, distinct pref). I am curious to know why you decided on this measurement technique rather than a more conventional 5-7 point scale, and why you decided to compare shots rather than simply rate each shot?

By rating each shot and then doing a simple t-test for the difference of means (between the Mazzer score and the M3 score), there would have been twice the number of data points, the data may have been closer to a normal distribution, and there may have been the potential to capture more information. For example, 4 pairs of shots that hypothetically score -2 Mazzer, -2 Mazzer, 2 M3, 2 M3 would emerge as 0 for no preference, when clearly there were strong preferences exhibited. Had the scores of each shot been recorded, the score could have looked more like this: 7 Mazzer, 7 Mazzer, 2 Mazzer, 2 Mazzer, 7 M3, 7 M3, 5 M3, 5 M3. This could have told a different story than the former scenario (the variance differences between the two grinders alone would have provided some insights).

By separating the scoring into individual shots rather than comparisons of pairs, the test may also have been more directly linked to the relative performance of the two grinders, rather than to the comparisons of pairs of shots.
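To make the point concrete, here is a small sketch using only the hypothetical numbers from my example above (they are not measured data). It shows the paired scores averaging out to zero while the per-shot ratings keep both the mean and the spread for each grinder.

```python
from statistics import mean, variance

# Hypothetical numbers from the example above, not measured data.
pair_scores = [-2, -2, 2, 2]    # paired comparison scores
mazzer_shots = [7, 7, 2, 2]     # individual shot ratings
m3_shots = [7, 7, 5, 5]

pair_mean = mean(pair_scores)
mazzer_mean, m3_mean = mean(mazzer_shots), mean(m3_shots)
mazzer_var, m3_var = variance(mazzer_shots), variance(m3_shots)

print("paired average:", pair_mean)
print("per-shot means:", mazzer_mean, m3_mean)
print("per-shot sample variances:", round(mazzer_var, 2), round(m3_var, 2))
```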

Your scoring system could be shared with other M3 and Mazzer owners, which might afford the opportunity to get some inter-rater reliability. I recognize that by sending samples to others and including their comments, you have done an admirable job of attempting to capture some form of that already. A simpler testing and more direct scoring system might just make that task easier.

Thanks again for sharing this information.

Mark

another_jim (original poster)
Team HB
Posts: 13954
Joined: 19 years ago

#4: Post by another_jim (original poster) »

Hi Mark,

I started with the conventional 100 point coffee rating system. After it became increasingly obvious that I could tell which grinder was which on most of the "blind" tests, I went to the most "fudge proof" system. Since I spent a boatload of money on the grinder, I didn't trust myself with any scoring system that allowed a little point shaving on the MM and a little point adding on the M3. The original blind shots I translated so that 0 to 1.5 point differences were a tie, 2 to 4.5 point differences a little better, and 5 point or more differences a lot better.

5 points is basically an outrageous difference -- one between a godshot and a mediocre to decent shot. It really only happened with the coffees I had lying around from cuppings, since mediocre shots when I'm dialled in on my regular blend are once in a blue moon events.
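The translation from 100 point ratings to the -2 to +2 scale can be sketched as follows. The function name is only illustrative, and one detail is an assumption: differences falling between 1.5 and 2 points are not covered by the bands I gave, so the sketch treats them as ties.

```python
def preference_score(m3_points, mini_points):
    """Map a pair of 100-point ratings to the -2..+2 preference scale.

    Positive values favor the M3, negative values favor the Mini.
    """
    diff = m3_points - mini_points
    size = abs(diff)
    if size >= 5:
        magnitude = 2   # a lot better
    elif size >= 2:
        magnitude = 1   # a little better
    else:
        magnitude = 0   # tie (0 to 1.5; anything under 2 is assumed a tie)
    return magnitude if diff > 0 else -magnitude

print(preference_score(90, 84))    # M3 ahead by 6 points
print(preference_score(85, 86))    # Mini ahead by 1 point
```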

The statistics cited are not paired, but regular t-tests. The pairings here are tenuous, so I used the more conservative test -- paired t-tests would have had higher t-values.

The coffee I sent out was deliberately atypical for espresso use, and I just wanted a qualitative assessment. I'm sure Abe, Andy, and perhaps Greg, will post their tasting summaries with their regular espresso blends, in due course. These will probably be more reliable than either of my reports for most buyers, since I used a fairly wide range of coffees, rather than just espresso blends, in both runs.