Questioning Order in the Court of Beer Judging

A Study on the Effect of Presentation Order in Beer Competitions

By Edward W. Wolfe and Carol Liguori Wolfe (Brewing Techniques - Vol. 5, No. 2)

Anyone who’s entered a homebrew or microbrew competition wonders about the many variables that can affect beer scores. Two brewers decided to test one possible source of error.

We love entering homebrew competitions. We get score sheets back from the judges with comments and suggestions for improving our beer. We use this feedback to formulate new recipes, and our beers keep getting better and better. Sometimes we win ribbons too, and we certainly don’t complain when that happens.

Most people who have entered brewing competitions know that many variables affect how their beer will fare. No matter how good your beer is, you can never really know how it will stack up against the others in the competition. Every competition presents a wide variety of beers that are evaluated by different judges, often at different times and in different settings. Because so many variables are involved, brewers often enter the same beer in a number of competitions to see how the scores vary as the beer ages, as the beer is judged by different sets of judges, and as the beer is judged in different locations.

Beer judging is a complex task that requires an understanding of style characteristics, flavor perception, the specific causes of appropriate and inappropriate flavors, and one’s own tendencies as a beer evaluator. Understanding the characteristics of a style is essential to evaluating how well a beer represents that style. Understanding how specific flavors are perceived is important because it gives judges a consistent language to use in communicating their evaluations and advice to the brewers. Understanding the specific sources of appropriate and inappropriate flavors enables the judge to provide the brewers prescriptive statements on how to improve the beer (such as suggestions for adjusting the recipe or altering the brewing process). Probably the most difficult understanding for a beer judge to acquire, however, is that of his or her own individual tendencies when tasting and evaluating beer.

Most judges and competition organizers recognize that scores for a given beer depend not only on the beer’s quality but on many other contributing factors as well. As a result, knowledgeable judges and organizers do their best to limit these factors and ensure that every beer is judged as fairly and accurately as possible. Some simple precautions, for example, include ensuring that bottles have no identifying marks, that beers are stored and served at appropriate temperatures, that qualified judges evaluate each beer, and that noise and distractions are kept to a minimum. A well-organized event will have judges review style guidelines before judging, serve the beers in an order that minimizes palate burnout, and encourage judges to discuss significant discrepancies between their scores for a given beer.

Yet even when these precautions are taken, the variation between scores assigned to beers from the same batch is sometimes surprising. For example, our California common won a gold medal in the 1995 AHA National Homebrew Competition, yet received scores ranging from 31 to 42 at six other competitions over a period of four months. The variations made us wonder about the importance of factors besides quality in determining a beer’s score. Such factors might include the way the beer was delivered to the competition (hand-delivered versus shipped, for example); conditions like lighting, odors, and noise at the judging site; presentation order; the types of cups that the beers were poured into; judges’ experience levels; and the quality of the other beers against which a given beer is compared.

Out of curiosity, we identified one of these possible sources of variation — presentation order — and tried to determine whether and how much it influences the scores in homebrew competitions. In this article, we report the results of our study and make recommendations for improving judging accuracy based on our findings.

Presentation Order and Proximity Errors

Of the many errors that can lead evaluations astray (1), proximity errors may be the most common. Proximity errors occur when the validity of a judge’s scores changes over time. More specifically, proximity errors occur when presentation order influences the score assigned to a particular beer.

Proximity errors are most likely to occur in competitions in which a large number of entries are to be judged. At the 1995 Great American Beer Festival (GABF), for example, over 1,300 entries were evaluated by 68 judges placed on multiple panels; judges worked two 7-hour days (2). Under conditions like these, palate fatigue and proximity effects may have a large influence on a judge’s fairness and accuracy. We do not intend to imply that GABF judging is necessarily flawed. Most GABF judges are experienced professionals and no doubt do an admirable job. Our point is that the task of judging a large event is rigorous and provides fertile ground for potential errors.

Proximity effects can manifest in many ways. For example, judges’ scores may begin to drift over time (that is, their scores exhibit more random error later in the judging session than early in the session); they may show practice effects over time (they become more accurate toward the end of the session); or they may show localized effects (their scores for one beer may be exaggerated relative to those for the beer immediately preceding it).

For this study, however, we chose to focus only on primacy and recency effects; that is, whether beers presented early in a judging session may be assigned higher scores than beers presented later in the flight (primacy) or whether beers presented later may receive higher scores (recency).

Presentation order effects are well-documented in psychological literature, especially in the area of taste-testing. These studies show a strong tendency toward primacy; that is, they suggest that the first item tasted in a pair or in a series is likely to be preferred over subsequent items (3–5). Unfortunately, this research focused on taste-testing with untrained tasters (consumers) and relied primarily on the subjects’ preference for one product over another; it is unclear whether this trend can be generalized to beer evaluation, where the tasters have been trained, are practiced, and evaluate the product against a set of criteria in the form of beer style guidelines.

Setting up the Experiment

In May 1996, The Honorable Iowa River Society of Talented Yeastmasters (THIRSTY) held the First Iowa City Homebrew Classic. A total of 207 beers were entered in the competition. We used this competition to collect data for measuring and assessing the influence of presentation order on judging results.

The judges: Each entry was evaluated by a pair of judges selected from a pool of 13. Each team of judges evaluated flights ranging in size from 6 to 13 beers. Each judge scored between two and six flights over a period of three days. During each flight, the scores that each judge assigned to the beers were recorded along with the order in which each team of judges tasted the beers. These scores and the order of presentation served as the data for our study.

All judges in the First Iowa City Homebrew Classic were experienced beer evaluators. Of the 13 judges, 11 were either certified or recognized judges in the Beer Judge Certification Program (BJCP). The remaining two judges were apprentices who had attended several practice sessions with experienced BJCP judges.

Randomizing the presentation: To ensure that any differences we might observe were due to order effects and not some other factor, we randomized the presentation of beers (within each substyle). Many of the local beers were delivered to the registration site toward the beginning of the registration period, so it is likely that local beers would have been assigned lower and consecutive entry numbers (assuming that entry numbers were assigned in the order in which beers were received). Such a nonrandom number assignment would interfere with the detection of order effects if the local brewers were either substantially better or worse brewers on average. Randomizing more than 200 beers makes it very unlikely that the good beers (or the bad ones) would end up stacked at the beginning or the end of the competition.

Entry numbers were generated at random as each entry was received and registered. For example, the first few barleywines we received were assigned the numbers 110, 101, 106, and 102, respectively (rather than 101, 102, 103, and 104). Before the competition, the flight list for each style (or substyle, as appropriate) was sorted in ascending order of entry number. Using the example of barleywines, the second beer received (101) was served first, followed by the fourth (102), and so on. Judge teams were then presented with the beers in the order indicated on the flight list, with one exception: if they were judging more than one substyle, judges could request that entire substyles be presented in a specific order; for example, judges could request that classic dry stouts be presented before imperial stouts.
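The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the idea of drawing numbers from a shuffled block of consecutive integers are hypothetical, since the article does not say how the random numbers were actually generated. The barleywine numbers are the ones given in the text.

```python
import random

def assign_entry_numbers(entries, start=101, seed=None):
    """Assign each entry a unique number drawn at random from a block of
    consecutive integers, so the number says nothing about arrival order.
    (Hypothetical helper; the block-of-integers scheme is an assumption.)"""
    rng = random.Random(seed)
    numbers = list(range(start, start + len(entries)))
    rng.shuffle(numbers)
    return dict(zip(entries, numbers))

def flight_list(assigned):
    """Return entries sorted by ascending entry number (the serving order)."""
    return sorted(assigned, key=assigned.get)

# The barleywine example from the text: arrival order -> assigned number.
barleywines = {
    "1st received": 110,
    "2nd received": 101,
    "3rd received": 106,
    "4th received": 102,
}
print(flight_list(barleywines))
# -> ['2nd received', '4th received', '3rd received', '1st received']
```

Because the serving order depends only on the random numbers, a beer that arrived early is no more likely to be served first than one that arrived on the last day.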

Record keeping: Judges recorded their scores on standard AHA score sheets. Scores were subsequently recorded in a database. Stewards recorded the exact order in which the beers were presented to judges. These numbers were also later transferred to the database. In analyzing the data, we looked for evidence of order effects exhibited by teams of judges within a flight, as well as by individual judges across flights.

Analyzing the Results

Table I shows the descriptive statistics for the scores assigned to the beers in the competition. This table shows that the average score assigned to the beers was about 33 on the 50-point AHA scale. The standard deviation, an index of the typical variability around the average, was about 5 points — indicating a reasonable spread of scores. Approximately 96% of the scores fell between 23 and 43.

Table I also shows the average correlation (r) between the scores that judging teams assigned to beers within flights. The correlation measures how closely one judge’s scores rise and fall with the other’s, and so indicates the extent to which judges agreed on how the beers compared. A correlation can range from –1.00 to +1.00. Positive values indicate that judges tended to rank the beers in the same order; if Judge 1 and Judge 2 agree on the scores for three beers (say, a 40 for Beer 1, a 30 for Beer 2, and a 20 for Beer 3), the correlation between the two judges’ scores would equal +1.00. Negative values of a correlation indicate that judges tended to rank the beers in the opposite order; if Judge 1 assigns Beer 1 a 20, Beer 2 a 30, and Beer 3 a 40, and Judge 2 assigns Beer 1 a 40, Beer 2 a 30, and Beer 3 a 20, the correlation between the scores would equal –1.00. A correlation of 0.00, on the other hand, indicates no consistent relationship between the two judges’ scores. If Judge 1 assigned the three beers scores of 20, 30, and 40 and Judge 2 assigned the beers scores of 35, 30, and 35, then the correlation would equal 0.00. The interjudge correlation that we calculated for the data from this competition (r = 0.86) is very good.
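The three worked examples above can be checked with a short Python sketch. It assumes the correlation in question is the ordinary Pearson product-moment coefficient, which the article does not state explicitly; the function name pearson_r is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Judges agree exactly: correlation of +1.00.
print(round(pearson_r([40, 30, 20], [40, 30, 20]), 2))
# Judges rank the beers in opposite order: correlation of -1.00.
print(round(pearson_r([20, 30, 40], [40, 30, 20]), 2))
# No consistent relationship: correlation of 0.00.
print(round(pearson_r([20, 30, 40], [35, 30, 35]), 2))
```

Note that the coefficient is undefined when one judge gives every beer the same score, because the denominator is then zero.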

Table II shows two correlations between beer scores and order of presentation. The first correlation shows the relationship between the score for each beer and the order in which the beer was presented. Such a correlation can be used to detect trends, either systematic increases (a positive correlation) or decreases (a negative correlation) of scores over the course of a flight of beers. As shown in Table II, the correlation between a beer’s score and its order of presentation is a low negative number (r = –0.11). Our figure’s low value (close to zero) — based on 19 flights, each containing about 10 beers — suggests that, in general, the order of presentation during a flight is virtually unrelated to the score of a beer.

The second figure shown in Table II illustrates the relationship between the range of scores assigned (the difference between the two judges’ scores for a beer) and the presentation order of the beers in the flight. Such a statistic can be used to indicate whether the disagreement between judges increases (a positive correlation) or decreases (a negative correlation) systematically over the course of a flight of beers. The low positive correlation (r = 0.08) indicates that there is virtually no discernible relationship between the level of agreement between judges and the order of presentation. In cases when interjudge agreement is high (as indicated in Table I) and the relationship between presentation order and score magnitude is low (as indicated in Table II), it is unlikely that judges are drifting together (that is, it is unlikely that both judges are systematically making the same errors as time progresses).

Table I: Descriptive Statistics for the Scores Assigned to All Beers
Statistic	Value
Average score assigned to all beers	32.81
Standard deviation of scores assigned to all beers	4.80
Correlation between scores assigned by pairs of judges	0.86

Table II: Presentation Order Across All Entries
Correlation	Value
Average score and order of presentation	–0.11
Range of scores and order of presentation	0.08
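For readers who want to run the same two checks on their own competition data, the following Python sketch shows one way the Table II statistics could be computed for a single flight. The flight scores below are made up purely for illustration and are not from the study; the Pearson coefficient and the helper name pearson_r are likewise assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# HYPOTHETICAL flight: (Judge A score, Judge B score) in serving order.
flight = [(36, 34), (31, 35), (38, 37), (29, 30), (33, 28), (30, 31)]

order = list(range(1, len(flight) + 1))        # 1st, 2nd, 3rd, ...
averages = [(a + b) / 2 for a, b in flight]    # team's average score per beer
ranges = [abs(a - b) for a, b in flight]       # disagreement per beer

# Negative: scores fall as the flight progresses; positive: they rise.
r_score_vs_order = pearson_r(order, averages)
# Positive: judges disagree more later in the flight; negative: less.
r_range_vs_order = pearson_r(order, ranges)

print(round(r_score_vs_order, 2), round(r_range_vs_order, 2))
```

With only a handful of beers in one flight, either number can stray far from zero by chance, which is why the article pools its estimate over 19 flights.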

Table III shows the correlation between the assigned score (from individual score sheets) and the order of presentation for each of the 13 judges, along with each judge’s BJCP rank. Overall, the abundance of negative correlations suggests that there is a slight tendency for judges to prefer earlier beers to later beers (the primacy effect). In other words, when judges make proximity errors, they tend to make harsher judgments as the flight progresses. These correlations also show a range in the degree to which this trend is observed in different judges. For example, Judges 1 through 6 show minimal change in their scores with order of presentation, whereas Judges 10 through 13 show moderate to large decreases in scores as the flight progresses. Interestingly, a weak relationship also existed between a judge’s tendency to exhibit order effects and his or her BJCP rank; the more experienced judges tended to show smaller effect sizes.

Table III: Score and Presentation Order Correlations for Individual Judges
Judge	Rank	Correlation
Judge #1	Non-BJCP	–0.01
Judge #2	Recognized	0.02
Judge #3	Recognized	0.02
Judge #4	Recognized	0.06
Judge #5	Certified	–0.12
Judge #6	Certified	–0.15
Judge #7	Certified	–0.21
Judge #8	Non-BJCP	–0.23
Judge #9	Recognized	–0.24
Judge #10	Recognized	–0.31
Judge #11	Recognized	–0.38
Judge #12	Recognized	–0.39
Judge #13	Recognized	–0.54

The most pronounced primacy effect in our data (the correlation of –0.54 for Judge 13) would be noticeable upon inspection of a set of scores, but would not be overwhelming. As an example, Table IV shows the actual scores that produced the correlation of –0.54. As you can see, this set of scores is not that unusual. The important point is that although there is no overwhelming bias, there is a tendency for these judges to assign higher scores toward the beginning of the flight. While it’s true that, by random chance alone, better beers can end up at the beginning of a flight, it is unlikely for this to be the case in as many instances as observed in our data.
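One rough way to gauge the "random chance" remark is a simple sign test on Table III, where 10 of the 13 correlations are negative. The Python sketch below is our own illustration, not an analysis the authors report. It assumes each judge is equally likely to show a positive or a negative correlation if order has no effect, and it treats the judges as independent, which is only approximate because judges worked in pairs on the same beers.

```python
from math import comb

# Correlations from Table III, Judges 1 through 13.
correlations = [-0.01, 0.02, 0.02, 0.06, -0.12, -0.15, -0.21,
                -0.23, -0.24, -0.31, -0.38, -0.39, -0.54]

n = len(correlations)
negatives = sum(1 for r in correlations if r < 0)

# One-sided sign test: probability of at least this many negative values
# out of n if positive and negative were equally likely (p = 0.5).
p_value = sum(comb(n, k) for k in range(negatives, n + 1)) / 2 ** n

print(negatives, "of", n, "negative; one-sided p =", round(p_value, 3))
# -> 10 of 13 negative; one-sided p = 0.046
```

Under these assumptions, a split this lopsided would arise by chance a little less than 5% of the time, which is in line with the cautious wording above.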

The Scales of Justice Nearly Balance

Overall, there was little evidence of a systematic relationship between either the magnitude of scores assigned to beers or the amount of disagreement between judges and presentation order in our homebrew competition. On average, scores tended neither to rise, drop, nor become more or less accurate as a judging flight progressed. Large differences were evident, however, between individual judges when it came to the tendency to commit proximity errors. But even when judges did show these tendencies, the influence on scores was modest, resulting in only slight decreases in scores as a flight progressed. This is good news for homebrew competition entrants; it means that their scores are not likely to be influenced by the order in which their beers are presented.

For both competition organizers and judges, however, there is cause for concern. The slight tendency toward proximity errors that we observed reinforces the importance of randomizing the presentation of beers within a flight to the greatest extent possible, rather than merely numbering and presenting beers in the order in which they are received.

Table IV: One Judge’s Opinion
The following scores assigned by Judge 13 produced a correlation of –0.54:
Presentation Order	Score
1st	40
2nd	30
3rd	34
4th	26
5th	24
6th	31
7th	30
8th	28
9th	28
10th	28
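The figure quoted for Table IV can be reproduced directly from the ten scores. The Python sketch below assumes the Pearson coefficient was used, which the article does not state; the helper name pearson_r is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Table IV: Judge 13's scores in presentation order (1st through 10th).
order = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [40, 30, 34, 26, 24, 31, 30, 28, 28, 28]

print(round(pearson_r(order, scores), 2))
# -> -0.54
```

Much of the negative value comes from the single high score of 40 for the first beer, which illustrates how few beers it takes to produce a moderate correlation within one flight.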

As the data show, even randomized beers will be subject to proximity errors. Randomizing simply spreads quality more evenly throughout the flight and removes the slight bias introduced by giving lower entry numbers to entries received early in the shipping window. This study also emphasizes the importance of keeping flights small (fewer than 12 beers) to minimize the opportunity for scores to drift as judges become fatigued; the fewer the beers, the smaller the distance from beginning to end and the lower the likelihood of proximity errors.

In addition, it is essential to keep the judging site free of distractions — especially near the end of judging sessions, when a judge’s concentration is most likely to be compromised by other judges who have finished their flights and have begun sharing opinions and beers with one another. Judges must keep in mind that scores may drift as a flight progresses. Judges may also need to adopt self-monitoring procedures at the end of a flight, such as retasting all of the beers to be sure that the order of preference matches the rank of the scores.
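The end-of-flight check suggested above (making sure the order of preference matches the rank of the scores) could be sketched as follows. The function name, the entry labels, and the scores are all hypothetical and serve only to illustrate the idea; judges would normally do this by eye on their score sheets.

```python
def score_preference_conflicts(scores, preference_order):
    """Return (preferred, other) pairs where the beer the judge prefers
    on retasting was given a LOWER score than the less-preferred beer.

    scores: dict mapping entry label -> assigned score
    preference_order: entry labels, most preferred first
    """
    conflicts = []
    for i, better in enumerate(preference_order):
        for worse in preference_order[i + 1:]:
            if scores[better] < scores[worse]:
                conflicts.append((better, worse))
    return conflicts

# HYPOTHETICAL flight: scores as first assigned, then the order of
# preference after retasting all three beers side by side.
scores = {"A": 38, "B": 31, "C": 34}
retaste_preference = ["A", "B", "C"]

print(score_preference_conflicts(scores, retaste_preference))
# -> [('B', 'C')]  B is preferred to C but scored three points lower
```

An empty list means the scores already agree with the retasting; any pair that is returned marks two score sheets worth a second look.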

Following these kinds of guidelines should help ensure that competition entries are judged as fairly and accurately as possible.
