# Best Practices / Learnings
...aka, how to get SOTA
- ask for an explanation/justification (before the score)
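A minimal sketch of this pattern, assuming a plain text prompt/parse interface: the judge is instructed to reason first and emit the score last, and the parser takes the final `Score:` match so the rationale may mention numbers freely. The function names here are illustrative, not from the source.

```python
import re

def build_judge_prompt(question: str, answer: str) -> str:
    # Justification is requested *before* the score so the score can
    # condition on the model's own reasoning.
    return (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "First, explain your reasoning in 2-3 sentences.\n"
        "Then, on a new line, output 'Score: N' where N is an integer 1-5."
    )

def parse_score(response: str):
    # Take the *last* match, since the rationale may contain digits.
    matches = re.findall(r"Score:\s*([1-5])", response)
    return int(matches[-1]) if matches else None
```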
- hierarchical verifier is a must
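One way to sketch a hierarchical verifier, under the assumption that it means cheap deterministic checks gating an expensive LLM judge; `llm_judge` below is a hypothetical stand-in for a real model call.

```python
from typing import Callable

def hierarchical_verify(answer: str,
                        cheap_checks: list,
                        llm_judge: Callable[[str], float]) -> float:
    # Tier 1: fast rule-based checks; any failure short-circuits.
    for check in cheap_checks:
        if not check(answer):
            return 0.0          # fail fast, never pay for the LLM call
    # Tier 2: expensive LLM judge runs only on survivors.
    return llm_judge(answer)
```

The payoff is cost and latency: most bad candidates are rejected before any model call is made.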
- try a different model for the verifier to avoid self-preference bias
- study the output distribution of provider models carefully
    - for example, we find that the gpt-4o family of models has an upward skew on numerical scales and exhibits mode collapse even when using logprobs
- likely due to their user-facing alignment tuning
- llama models exhibit higher-entropy distributions (probability mass spread across more of the scale)
- this provides more expressiveness and discriminative power for a JudgeUnit
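One way to study a provider's output distribution, assuming you can read per-token logprobs over the score tokens (the exact field shape varies by provider, so the input here is just a token-to-logprob dict): renormalize over the score tokens and compute entropy. A mode-collapsed judge sits near 0 bits; a flatter distribution has more discriminative power.

```python
import math

def score_distribution(logprobs: dict) -> dict:
    # Keep only tokens that look like scores, then renormalize
    # to a proper distribution over the scale.
    scores = {t: math.exp(lp) for t, lp in logprobs.items()
              if t.strip().isdigit()}
    total = sum(scores.values())
    return {t: p / total for t, p in scores.items()}

def entropy_bits(dist: dict) -> float:
    # Shannon entropy in bits; 0 means total mode collapse.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
```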
- watch for positional bias -- flip presentation order, reverse scales, shuffle positions, etc.
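The order-flipping check can be sketched as a consistency probe: judge each pair in both presentation orders and measure how often the same underlying answer wins. `judge` here is a hypothetical callable returning "A" or "B" for whichever position it prefers.

```python
from typing import Callable

def position_consistency(pairs: list,
                         judge: Callable[[str, str], str]) -> float:
    consistent = 0
    for a, b in pairs:
        first = judge(a, b)    # a shown in position A
        second = judge(b, a)   # order flipped: a now in position B
        # Consistent iff the same underlying answer wins both times.
        if (first == "A") == (second == "B"):
            consistent += 1
    return consistent / len(pairs)
```

A judge with a hard first-position bias scores 0.0 here; a judge that actually tracks answer quality scores near 1.0.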