# Dataset
We provide a simple wrapper around a HuggingFace dataset or a `pandas.DataFrame` to run Verdict pipelines across samples in parallel.

A `DatasetWrapper` maps each sample in the dataset to a `Schema`, which any node in the `Pipeline` can reference via `{input.field}` in its `Prompt`. In the example below, we sample 20 random rows from the HuggingFace EdinburghNLP/xsum dataset and expose two of its columns.
## Wrapper
### HuggingFace
```python
from datasets import load_dataset
from verdict.dataset import DatasetWrapper

dataset = DatasetWrapper.from_hf(  # or .from_pandas(df, ...)
    load_dataset("EdinburghNLP/xsum"),
    columns=["document", "summary"],
    max_samples=20,
)

# ... somewhere in the Pipeline
JudgeUnit().prompt("""
    ...
    {input.document}
    ...
""")
```
You can also write a custom pre-processing function that returns a `Schema` for each sample. Schema fields are accessible in the final result DataFrame (i.e., the output of `pipeline.run_from_dataset(dataset[split])`) as `!{field_name}`.
```python
from verdict.schema import Schema

dataset = DatasetWrapper.from_hf(
    load_dataset("EdinburghNLP/xsum"),
    lambda row: Schema.of(article=row["document"]),
)
```
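As a sketch, the custom field defined above can then be read back from the run's output. This assumes `run_from_dataset` returns the result DataFrame as the first element of a tuple and that the wrapped xsum splits are indexable by name:

```python
# Sketch: read the custom Schema field back from the result DataFrame
# (assumes a (DataFrame, leaf-node columns) return shape).
result_df, _ = pipeline.run_from_dataset(dataset["train"])
print(result_df["article"].head())  # the field defined via Schema.of above
```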
### Pandas
In addition, for a `pandas.DataFrame`, you can specify the name of the column that holds each sample's split.
```python
import pandas as pd

from verdict.dataset import DatasetWrapper

dataset = DatasetWrapper.from_pandas(
    pd.read_csv("data_all.csv"),
    split_column="split",
    max_samples=20,
)

pipeline.run_from_dataset(dataset['eval'])
```
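For illustration, a hypothetical `data_all.csv` might look like the following; the column names here are assumptions, not requirements:

```python
import pandas as pd

# Hypothetical layout for data_all.csv: one row per sample, plus a
# "split" column marking the split each row belongs to.
df = pd.DataFrame({
    "document": ["first article ...", "second article ..."],
    "summary": ["first summary ...", "second summary ..."],
    "split": ["train", "eval"],
})

dataset = DatasetWrapper.from_pandas(df, split_column="split")
pipeline.run_from_dataset(dataset['eval'])  # only rows with split == "eval"
```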
## Pinning results
We can force any Verdict node to run exactly once and share its output across all samples using the `pin()` method. This is useful for sample-independent steps, such as drafting shared reasoning over the evaluation criteria once rather than per sample.
```python
pipeline = Pipeline() \
    >> Layer([
        CoTUnit(criteria).pin()    # runs once across the dataset; result shared by all samples
            >> JudgeUnit(criteria) # runs once per sample
    ], 5) \
    >> MeanVariancePoolUnit()

pipeline.run_from_dataset(dataset)
```