MoreBeer!

How Common Judging Errors Creep into Organized Beer Evaluations

Even experienced beer judges make mistakes.

by Edward W. Wolfe (Brewing Techniques, Vol. 4, No. 2)

I consider myself lucky to have a profession that helps me develop my skills as a beer judge. I’m not a microbiologist, and I don’t work in the food industry. I create tests. Not the boring multiple-choice tests you remember from school, but tests that involve written essays, lab reports, or videotapes of performances. Developing these tests and training people to score them has improved my beer judging by helping me to identify common errors that people make during these types of evaluations.

Beer judging also teaches me more about my profession. I’ve read about judging errors in books, and I’ve seen these errors committed by people who score the tests that I create. The most interesting and instructional experiences I’ve had with judging errors, however, are when I’ve caught myself making them or noticed someone else making them while judging beer.

This article identifies five common judging errors in beer evaluation, describes the conditions under which these errors are likely to occur, and offers some ways to avoid them. The accompanying box summarizes these judging errors.

Perception Errors

Perception errors occur when a beer evaluator misperceives a flavor or fails to recognize how a particular beer satisfies or doesn’t satisfy the style guidelines. In other words, perception errors may be due to flavor blindness, oversensitivity to a flavor, or simply ignorance of the characteristics of the style. Each of these tendencies is widely recognized and not at all uncommon in homebrew judging.

It has been well documented that people have different thresholds for various flavors (1,2). For example, I am very sensitive to phenolic flavors. I can taste the polyphenols of overly hopped beer and the phenolic character of beers with even the slightest wild yeast infections. This sensitivity has won me the nickname “Mr. Phenolic” in my local beer club.

On the other hand, I rarely taste diacetyl. The only time I can really remember noticing diacetyl in a beer was when I tasted a badly infected imported German Pilsener packaged in a stone bottle. The diacetyl was so strong that most of my companions refused to take more than a sip of the beer. I, on the other hand, simply said, “Yeah, I think I can taste a buttery flavor.” Fortunately, judges learn about their sensitivities through experience and conversations with other judges. As a result, they can begin making adjustments in their scoring and comments.

Ignorance of where a particular beer fits into the various styles, however, is an almost unforgivable error. People who judge beer owe it to competition contestants to be as knowledgeable as possible about the beer styles they judge. Most people who have entered homebrew competitions have, at one time or another, received a score sheet for an India Pale Ale with a comment like “too much hop flavor,” or a comment on a Berliner Weisse score sheet that says “badly infected — very sour.” Luckily, groups like the Beer Judge Certification Program (BJCP) are working to reduce these types of judging errors by educating judges and requiring certified judges to demonstrate high levels of understanding of the wide range of traditional beer styles.

Five Common Judging Errors

Perception errors: A judge fails to perceive a flavor, is overly sensitive to a flavor, or has a misconception about the characteristics of a particular beer style.

Leniency and severity: A judge tends to assign scores that on average are either higher (leniency) or lower (severity) than scores assigned by other judges.

Central tendency and extreme scoring: A judge tends to assign scores that on average are less variable (central tendency) or more variable (extreme scoring) than scores assigned by other judges.

Halo effects: A judge allows the perception of irrelevant (good or bad) qualities of a beer to influence his or her judgment of other, dissimilar qualities (for example, a beer that is too dark is scored lower for appearance and flavor).

Proximity errors and drift: The validity of the scores assigned by a particular judge changes over time. These changes may occur because of the presentation order of the beers (proximity) or because the judge’s standards change during the judging session (drift).

Leniency and Severity

Another commonly recognized judging error results in the assignment of overly harsh or overly generous scores to all the beers a judge tastes. I’ve sat on judging teams in which every score I assigned ran a couple of points below the other judges’ scores, and on teams where the opposite was true. A consistent gap like that indicates that a judge on the team is either too severe or too lenient. These discrepancies become a problem only when the difference is large enough (say, more than 6 points on the 50-point scale) to lead to different interpretations of a beer’s quality.

Although certainly not an absolute, I have seen the tendency to be overly lenient most often in novice judges. The problem with being overly lenient is that the judge may not leave enough of a “top end” on their scoring scale for better beers to stand out from the rest.

On the other hand, more-experienced judges seem more likely to be overly severe. In one calibration round, I saw a judge dock 10 points for flavor (out of 19) from a good commercial example of a style for slight levels of oxidation! The problem with overly harsh scoring is that the scores for the beers tend to cluster around the low end of the scoring scale (typically, around 20 on a 50-point scale). As a result, it becomes difficult for the judge to differentiate between the beers in a session and to make reliable distinctions for selecting winners. Even worse, it discourages brewers from entering competitions.

As with perception errors, leniency and severity errors tend to lessen as a judge accumulates experience and works with more and more judges.
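These after-the-fact comparisons are easy to automate. Below is a minimal Python sketch that flags judges whose session average sits far from the panel average; the judge names, score data, and the reuse of the 6-point rule of thumb as a threshold are assumptions for illustration, not part of any official procedure.

```python
# Flag judges whose session average sits far above or below the panel
# average. Scores use the 50-point scale; the 6-point threshold echoes
# the rule of thumb in the text. All names and data are invented.

def flag_leniency(scores_by_judge, threshold=6.0):
    """Return {judge: deviation} for judges whose mean score deviates
    from the panel mean by more than `threshold` points."""
    all_scores = [s for scores in scores_by_judge.values() for s in scores]
    panel_mean = sum(all_scores) / len(all_scores)
    flags = {}
    for judge, scores in scores_by_judge.items():
        deviation = sum(scores) / len(scores) - panel_mean
        if abs(deviation) > threshold:
            flags[judge] = round(deviation, 1)
    return flags

session = {
    "Judge 1": [30, 28, 33, 27, 31],  # near the panel average
    "Judge 2": [31, 30, 34, 29, 32],
    "Judge 3": [43, 40, 45, 41, 44],  # consistently lenient
}
print(flag_leniency(session))  # only Judge 3 is flagged
```

A positive deviation marks a lenient judge, a negative one a severe judge; in practice the flag is only a prompt to compare notes with the rest of the panel, not proof of an error.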

Central Tendency and Extreme Scoring

While leniency and severity errors tend to raise or lower the average score assigned by a beer judge, errors relating to central tendency and extreme scoring tend to reduce or increase the variability of the scores that a particular judge assigns.

An Example of Proximity

Presentation Order    Assigned Score
        1                  36
        2                  38
        3                  27
        4                  26
        5                  27
        6                  26
        7                  25
        8                  27
        9                  26
       10                  41
       11                  34

Central tendency errors are committed when a judge gives few high or low scores in a particular judging round. Extreme scoring is just the opposite — the judge gives very few average scores. Both of these types of errors are more likely to occur when a judge is fatigued — emphasizing the importance of having regularly scheduled breaks in a competition and keeping the number of beers in a round in a reasonable range (say, fewer than 12).

As a judge, it is always good to look over your score sheets at the end of a round and ask yourself two questions: first, are all of the scores I’ve assigned within a small range? Second, does the distribution of my scores show two distinct peaks? If you answer “yes” to either question, you may want to recheck the accuracy of your scores.
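The two questions above can be turned into quick numeric checks. This is only an illustrative sketch; the cutoff values and the sample scores are assumptions, not BJCP guidance.

```python
# Two quick post-round checks: (1) are all my scores bunched in a narrow
# range (central tendency)? (2) does the score distribution show two
# distinct peaks with a hollow middle (extreme scoring)? The cutoff
# values and sample scores are invented for illustration.

def narrow_range(scores, min_spread=8):
    """True if every score falls within a span narrower than min_spread."""
    return max(scores) - min(scores) < min_spread

def two_peaks(scores, low=25, high=35):
    """Crude bimodality check: several low and high scores, few in between."""
    lows = sum(1 for s in scores if s < low)
    mids = sum(1 for s in scores if low <= s <= high)
    highs = sum(1 for s in scores if s > high)
    return lows >= 2 and highs >= 2 and mids < min(lows, highs)

round_scores = [22, 24, 23, 41, 43, 40, 21, 44]
print(narrow_range(round_scores))  # False: the scores span 23 points
print(two_peaks(round_scores))     # True: low and high clusters, empty middle
```

Neither check proves an error on its own; a flight of genuinely similar beers can legitimately produce a tight cluster of scores.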

Halo Effects

Halo effects occur in psychological testing and employee evaluations when a good or bad attribute of a ratee is overgeneralized to other, unrelated characteristics (3). More generally, halo effects result when the rater fails to clearly differentiate between the various subscales being rated. In beer evaluation, halo effects typically occur when a judge allows one type of beer feature to interfere with objective evaluations of other features of the beer. Halo effects are also often associated with the method (top-down or bottom-up) used to assign scores to the aroma, appearance, flavor, and body subcategories.

I prefer top-down scoring — I like to write the comments on my score sheet as I judge a beer, decide what the overall score should be, and then go back and fill in the subscores so that they jibe with the overall score that I want to assign to the beer. By doing this, however, I run the risk of introducing a halo effect into my scores. I have found that I often assign nearly identical subscores to beers that I give the same overall score, which unfortunately may deprive the brewer of the diagnostic indicators needed to improve the beer. By using a formulaic method of determining subscores, I fear I may fail to differentiate beers that are truly distinct in terms of aroma, appearance, flavor, or body.

Judges who use a bottom-up method (assigning subscores during the judging process, and arriving at the overall score by simply summing them), however, are not immune to halo effects either. Beer judging is a fairly subjective process, and many irrelevant features of a particular beer (bottle color, yeast pack, fill level, rings in the neck of the bottle, clarity, foaminess, color, and so forth) can cloud one’s perception of its true flavor qualities.

I know from experience that it is difficult to prevent the perception of a slight phenolic aroma from making me overly wary of off-flavors during the tasting portion of the judging process, perhaps even leading me to imagine off-flavors in a beer that has none. Regardless of what method one uses to evaluate a beer, it is important to try to remain open-minded and impartial throughout the process of evaluating an individual competition entry.

An Example of Leniency

Feature                             Judge #1   Judge #2   Judge #3
Bouquet/aroma                           6          6          9
Appearance                              5          5          4
Flavor                                 11         10         17
Body                                    3          4          4
Drinkability & overall impression       5          6          9
Total score                            30         31         43

Proximity Errors and Drift

The final type of judging error relates to the order in which beers are evaluated in a round of judging. Proximity errors occur when the overall scores for the beers in a particular flight lose their meaningfulness and accuracy as time progresses. I find that I’m more likely to make these types of errors during longer flights.

Proximity errors usually manifest in one of two ways. One common error with a large flight is the occurrence of “runs.” I’ve completed more than one flight of high-gravity beer only to find that I assigned roughly the same score to the last half of the flight. When I notice a run of similar scores on consecutively scored beers, I find that the best way to relieve my anxiety is to take a short break, and then taste all the beers a second time to determine whether they all deserve similar scores. If I decide that they don’t, I change a few scores.

The second way that proximity errors manifest has been well documented in business–psychology literature (4). People have a better memory for, and even tend to prefer, the first and last items in a series (which is why there are so many items displayed at the beginnings and ends of grocery store aisles). This phenomenon is seen in beer judging as well. I’ve found that when I have a run of similar scores in a large flight of beers, I tend to prefer the beers I judged early in the session or late in the session. Again, I typically retaste most of the beers when I notice this pattern occurring.
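A run of similar consecutive scores is also easy to spot mechanically. The sketch below reuses the scores from the proximity table; the run length that should trigger a retaste and the 2-point tolerance are assumptions for illustration.

```python
# Spot "runs" of near-identical scores on consecutively judged beers.
# The flight reuses the scores from the proximity table; the run length
# that triggers a retaste and the 2-point tolerance are invented.

def longest_run(scores, tolerance=2):
    """Longest stretch of consecutive scores within `tolerance` points
    of the score that started the stretch."""
    best = run = 1
    anchor = scores[0]
    for cur in scores[1:]:
        if abs(cur - anchor) <= tolerance:
            run += 1
        else:
            anchor = cur
            run = 1
        best = max(best, run)
    return best

flight = [36, 38, 27, 26, 27, 26, 25, 27, 26, 41, 34]
if longest_run(flight) >= 5:
    print("Long run of similar scores: take a break and retaste.")
```

For the proximity-table scores this finds a seven-beer run in the middle of the flight, exactly the pattern the text recommends resolving with a break and a second tasting.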

A similar problem that I’ve noticed with longer flights of beer judging is drift. Sometimes I find that my standards begin to drift as I progress through a flight of beers, even if my palate doesn’t become taxed, or I’m not judging a very large flight. For me, most often, I become more harsh in my scoring.

In a recent homebrew competition, another judge and I recognized that our scoring had drifted as we began to select the winning beer for the category we were judging. As we retasted our highest-scoring beers, we realized that the beer we preferred (which also happened to be the last beer we judged) had been scored lower than two other beers judged early in the session. After discussing these three beers, we realized that our scores had drifted downward as we progressed through the flight. We ended up raising the score of the beer we awarded first place.

To avoid making proximity errors in beer evaluation, I make a habit of taking at least a sip of all of the beers I’ve judged in a flight to ensure that the order of the scores I’ve assigned jibes with my preference for the beers when they are tasted side-by-side.

The Results Are In

My list of potential judging errors is by no means exhaustive. The ones discussed here are some of the more common ones I’ve seen in my judging experiences; you may see others. These experiences are instructive: it is important for prospective and practicing judges to be knowledgeable about the types of errors they can make when evaluating beer. Of course, being well informed is only the first step toward better judging. Becoming aware of one’s own tendencies is the second step, and correcting one’s behavior is the most important.

Currently, beer evaluators have few opportunities to obtain guided assistance in improving their judging skills. There are, however, some efforts to improve judging at local levels. Specialized competitions, like the one sponsored by BURP (Brewers United for Real Potables, an Arlington, Virginia, homebrew club), are becoming popular. BURP’s Spirit of Belgium Homebrew Competition featured lectures by experts on Belgian brewing and tastings of Belgian beers that are difficult to find in the United States. Such activities help new judges gain expertise and let experienced judges brush up on their skills and knowledge.

At a national level, I am currently participating in a palate calibration program with BJCP members from across the United States. These judges are working to improve their beer evaluation skills by tasting and evaluating beers shipped to them by a distribution company. These beers are scored by the members of the program, and score sheets are sent to a central location for processing. Monthly calibration reports are returned to each member so that they can see how their individual scoring and comments on the same beers compare to other judges from around the country.

Similar educational projects are being developed by the BJCP and the American Homebrewers Association. Such programs should offer even more options for judges interested in improving their performance as beer evaluators.

All contents copyright 2019 by MoreFlavor Inc. All rights reserved. No part of this document or the related files may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording, or otherwise) without the prior written permission of the publisher.