# Experiment

In the LLM-as-a-judge literature, a number of agreement and correlation metrics¹ are commonly reported. We provide an `ExperimentConfig` class for specifying the prediction/ground-truth columns and pivot columns, which makes computing these metrics easy.

# Usage

```python
from verdict.experiment import ExperimentConfig

experiment_config = ExperimentConfig(
    prediction_column="Hierarchical_root.block.unit[Map MaxPoolUnit].score",
    ground_truth_column="score",
    pivot_columns=["language"]
)
```
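
For context, these column names refer to columns of the flat result DataFrame produced by the pipeline run: the prediction column corresponds to the output of a leaf unit, while the ground-truth and pivot columns come from the dataset. Purely as an illustration (the values and languages below are made up), such a DataFrame might look like:

```python
import pandas as pd

# Hypothetical illustration (made-up values) of the results frame these column
# names point at: the judge's prediction, the dataset's ground-truth label, and
# the pivot column used to slice the metrics.
toy_result_df = pd.DataFrame({
    "Hierarchical_root.block.unit[Map MaxPoolUnit].score": [4, 2, 5, 3],
    "score":    [4, 3, 5, 3],
    "language": ["en", "en", "de", "de"],
})
```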

You can pass this config directly to `Pipeline.run_from_dataset` and view the results in the console as they become available, or use `display_stats` after execution.

**Display during execution**

```python
result_df, leaf_node_prefixes = pipeline.run_from_dataset(
    dataset['eval'],
    experiment_config=experiment_config,
    display=True
)
```

**After execution**

```python
from verdict.experiment import display_stats, compute_stats_table

# display the stats in the console
display_stats(result_df, experiment_config)

# return the stats as a pandas DataFrame
stats_df = compute_stats_table(result_df, experiment_config)
```
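
Since `compute_stats_table` returns a plain pandas DataFrame, you can post-process or persist it however you like. A trivial sketch (the file name is just an example, not anything Verdict requires):

```python
# Keep a copy of the metrics table for comparing runs later; CSV is just one
# convenient choice.
stats_df.to_csv("experiment_stats.csv")
print(stats_df.to_string())
```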

¹ We support the following metrics; a sketch of computing them by hand is shown below the table:

| Metric | Package |
| --- | --- |
| Accuracy | - |
| Cohen's Kappa | `sklearn.metrics` |
| Kendall's Tau | `scipy.stats` |
| Spearman Rank Correlation | `scipy.stats` |
| Krippendorff's Alpha | `krippendorff` |
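
If you want these numbers outside of `display_stats`/`compute_stats_table`, a minimal sketch using the packages listed above might look like the following. It assumes `result_df` is the DataFrame returned by `run_from_dataset` above and that the scores are discrete (e.g. integer Likert ratings); the `agreement_metrics` helper is purely illustrative and not part of Verdict:

```python
import krippendorff
import numpy as np
import pandas as pd
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import accuracy_score, cohen_kappa_score


def agreement_metrics(df: pd.DataFrame, prediction_column: str, ground_truth_column: str) -> dict:
    """Compute the metrics from the table above for one slice of results."""
    preds = df[prediction_column].to_numpy()
    truth = df[ground_truth_column].to_numpy()
    tau, _ = kendalltau(truth, preds)
    rho, _ = spearmanr(truth, preds)
    return {
        "accuracy": accuracy_score(truth, preds),
        "cohens_kappa": cohen_kappa_score(truth, preds),
        "kendalls_tau": tau,
        "spearman_rho": rho,
        # Each row of reliability_data is one "rater": here, ground truth vs. judge.
        # 'ordinal' is an assumption that fits integer rating scales.
        "krippendorffs_alpha": krippendorff.alpha(
            reliability_data=np.vstack([truth, preds]).astype(float),
            level_of_measurement="ordinal",
        ),
    }


# Slice by the pivot column from the config, e.g. per-language metrics.
for language, group in result_df.groupby("language"):
    print(language, agreement_metrics(
        group,
        prediction_column="Hierarchical_root.block.unit[Map MaxPoolUnit].score",
        ground_truth_column="score",
    ))
```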