# Experiment
In the LLM-as-a-judge literature, there are a number of agreement/correlation metrics¹ that are commonly reported. We provide an `ExperimentConfig` class to help you specify the prediction/ground-truth columns and pivot columns, making these metrics easy to compute.
## Usage
```python
from verdict.experiment import ExperimentConfig

experiment_config = ExperimentConfig(
    # fully-qualified output column of the pipeline unit producing the prediction
    prediction_column="Hierarchical_root.block.unit[Map MaxPoolUnit].score",
    # dataset column holding the ground-truth label
    ground_truth_column="score",
    # metrics are reported separately for each value of these columns
    pivot_columns=["language"]
)
```
You can pass this config directly to `Pipeline.run_from_dataset` and view the results in the console as they become available, or use `display_stats` after execution.
## Display
### After Execution
```python
# Run the pipeline over the evaluation split, passing the experiment config.
result_df, leaf_node_prefixes = pipeline.run_from_dataset(
    dataset['eval'],
    experiment_config=experiment_config,
    display=True
)
```
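If you are unsure which column holds your pipeline's predictions, the returned `leaf_node_prefixes` can help you locate it. A quick way to check (assuming `result_df` is a pandas DataFrame, as it is passed to `display_stats` below):

```python
# List the leaf-node prefixes and the available result columns; the
# prediction column is the fully-qualified '<leaf prefix>.<field>' name.
print(leaf_node_prefixes)
print(result_df.columns.tolist())
```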
```python
from verdict.experiment import display_stats, compute_stats_table

# display stats in the console
display_stats(result_df, experiment_config)

# return a pandas DataFrame
stats_df = compute_stats_table(result_df, experiment_config)
```
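As a point of reference, the kind of per-pivot computation this automates can be sketched by hand with pandas and `scipy`. The snippet below is illustrative only; the actual metrics and output schema of `compute_stats_table` may differ, and the column names are taken from the config above:

```python
from scipy.stats import spearmanr

# Illustrative sketch: compute one correlation metric per pivot group.
for language, group in result_df.groupby("language"):
    rho, _ = spearmanr(
        group["Hierarchical_root.block.unit[Map MaxPoolUnit].score"],
        group["score"],
    )
    print(f"{language}: Spearman rho = {rho:.3f}")
```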
¹ We support the following metrics: