# Distributional Bias in LLM-as-a-Judge
LLMs exhibit a number of distributional biases that can be introduced through model selection, prompt content, extraction methodology, and more. Here, we focus on a few sources of distributional bias to watch out for in your LLM-as-a-judge configurations and provide some methods to calibrate judge outputs.
## Positional Bias
As shown in numerous works, such as *Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions*, LLMs are sensitive to the order of presented options. This is particularly relevant in LLM-as-a-judge configurations, which often involve many stages with ordered lists of candidates.
Some common calibration techniques, each illustrated below, include:
- asking for an explanation before scoring (Multiple Evidence Calibration à la Wang et al., 2023)
- averaging across all k! positional configurations (Balanced Position Calibration à la Wang et al., 2023)
- max-voting across k random shuffles of the options (Pezeshkpour et al., 2023)
- randomly shuffling the order of the options in your prompt (Khan et al., 2024)
For Multiple Evidence Calibration, instruct the judge to produce its explanation before committing to a score:

```python
from verdict import Layer, Pipeline
from verdict.common.judge import JudgeUnit
from verdict.scale import DiscreteScale

# Ask for a full explanation before the score is emitted.
pipeline = Pipeline() \
    >> Layer([
        JudgeUnit(DiscreteScale((1, 5)), explanation=True).prompt("""
            ...
            Please first provide a comprehensive explanation of your evaluation,
            avoiding any potential bias and ensuring that the order in which the
            responses were presented does not affect your judgment.
            ...
        """),
    ])
```
For Balanced Position Calibration, evaluate under both scale orderings and average the scores after flipping the reversed one:

```python
from verdict import Layer, Pipeline
from verdict.common.judge import JudgeUnit
from verdict.scale import DiscreteScale
from verdict.schema import Schema
from verdict.transform import MapUnit

pipeline = Pipeline() \
    >> Layer([
        JudgeUnit(DiscreteScale((1, 5))).prompt("""
            Score from 1 to 5 where 1 is the worst and 5 is the best.
        """),
        JudgeUnit(DiscreteScale((1, 5))).prompt("""
            Score from 1 to 5 where 1 is the best and 5 is the worst.
        """),
    ]) \
    >> MapUnit(lambda outputs: Schema.of(  # flip the reversed judge's score before averaging
        score=(outputs[0].score + (6 - outputs[1].score)) / 2))
```
To max-vote across k random shuffles, run several independently shuffled judges in a Layer and pool their votes:

```python
from random import sample

from verdict import Layer, Pipeline
from verdict.common.judge import BestOfKJudgeUnit
from verdict.schema import Schema
from verdict.transform import MapUnit, MaxPoolUnit

letters = ['A', 'B', 'C', 'D']

# 10 judges, each shown the options in an independent random order,
# followed by a majority vote over their choices.
pipeline = Pipeline() \
    >> Layer(
        MapUnit(lambda input: Schema.of(
            prompt_str='\n'.join(
                f"{letter}: {option}" for letter, option in zip(letters, sample(input.options, 4)))
        )) \
        >> BestOfKJudgeUnit(k=4).prompt("""
            Choose the best of the following options. Respond with only 'A', 'B', 'C', or 'D'.
            {previous.map.prompt_str}
        """),
        10) \
    >> MaxPoolUnit('choice')
```
Finally, to simply randomize the option order on every call (Khan et al., 2024), shuffle inside the prompt-construction step:

```python
from random import sample

from verdict.common.judge import BestOfKJudgeUnit
from verdict.schema import Schema
from verdict.transform import MapUnit

letters = ['A', 'B', 'C', 'D']

# A single judge that sees the options in a fresh random order on each run.
shuffled_judge = MapUnit(lambda input: Schema.of(
    prompt_str='\n'.join(
        f"{letter}: {option}" for letter, option in zip(letters, sample(input.options, 4)))
)) \
>> BestOfKJudgeUnit(k=4).prompt("""
    Choose the best of the following options. Respond with only 'A', 'B', 'C', or 'D'.
    {previous.map.prompt_str}
""")
```
## Self-Preference Bias
As shown in *LLM Evaluators Recognize and Favor Their Own Generations* and *Self-Preference Bias in LLM Evaluators*, LLMs tend to favor their own generations. This can lead to a positive skew when, for example, verifying judge explanations as part of a hierarchical verification pipeline. Below, we show the distribution of yes/no verdicts when using different models for the hierarchical verification judge in our quickstart example. Note that using the same model for both the initial judge and the verification judge produces a positive skew that may not discriminate faithfully between good and bad explanations.
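One simple mitigation is to route the verification judge to a different model (ideally from a different provider) than the initial judge. Below is a minimal sketch in the spirit of the quickstart; it assumes a `CategoricalJudgeUnit` with a yes/no scale and the `.via()` model selector, and the prompts and model names are purely illustrative:

```python
from verdict import Pipeline
from verdict.common.judge import CategoricalJudgeUnit
from verdict.scale import DiscreteScale

# Judge with one model, verify with a different one so the verifier
# is not grading explanations written by its own model family.
pipeline = Pipeline() \
    >> CategoricalJudgeUnit(categories=DiscreteScale(['yes', 'no']), explanation=True).prompt("""
        ...
    """).via('gpt-4o-mini') \
    >> CategoricalJudgeUnit(categories=DiscreteScale(['yes', 'no'])).prompt("""
        ...
    """).via('claude-3-5-haiku-latest')
```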
## Structured Output Bias
Constrained decoding methods for structured outputs (e.g., JSON mode) impose an inductive bias on the model's output distribution. We demonstrate this with a toy example: estimating the contested height of a fictional character, Willy Wonka. For the structured-output and post-hoc extraction methods, we obtain a distribution by running each 100 times, whereas the TokenProbabilityExtractor distribution comes from a single run. This demonstrates that the logprobs from a single call can serve as a proxy for the model's uncertainty.
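As a rough sketch of the single-call approach, assuming the token-probability extractor is importable from `verdict.extractor` and attaches to a unit via `.extract()` (both assumptions here; the scale and prompt are placeholders):

```python
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.extractor import TokenProbabilityExtractor
from verdict.scale import DiscreteScale

# One call: read the judge's uncertainty off the answer-token logprobs
# instead of re-sampling a structured output 100 times.
pipeline = Pipeline() \
    >> JudgeUnit(DiscreteScale((1, 5))) \
        .extract(TokenProbabilityExtractor()) \
        .prompt("""
            ...
        """)
```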
## Skew Bias
Provider models tend to have a skewed distribution of scores, due in part to their instruction tuning. While desirable for some consumer-facing applications, this can cause LLM-as-a-judge evaluators to be miscalibrated. We demonstrate this with a toy example of coin-flipping.
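Below is a minimal sketch of the coin-flip probe, reusing the Layer/MapUnit pattern from above; it assumes a `CategoricalJudgeUnit` whose output field is `choice`, both of which are illustrative rather than prescriptive:

```python
from verdict import Layer, Pipeline
from verdict.common.judge import CategoricalJudgeUnit
from verdict.scale import DiscreteScale
from verdict.schema import Schema
from verdict.transform import MapUnit

# 100 independent "coin flips" from the judge model. A well-calibrated
# sampler would land near 0.5; instruction-tuned models typically skew
# heavily toward one face.
pipeline = Pipeline() \
    >> Layer(
        CategoricalJudgeUnit(categories=DiscreteScale(['heads', 'tails'])).prompt("""
            Flip a fair coin. Respond with only 'heads' or 'tails'.
        """),
        100) \
    >> MapUnit(lambda outputs: Schema.of(
        heads_fraction=sum(o.choice == 'heads' for o in outputs) / len(outputs)))
```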