# Dataset
We provide a simple wrapper around a HuggingFace `datasets` dataset or a `pd.DataFrame` to run Verdict pipelines across samples in parallel. A `DatasetWrapper` maps each sample in the dataset to a `Schema`, which can be referenced by any node in the Pipeline using `{input.field}` in the Prompt. In the example below, we sample 20 random rows from the HuggingFace `EdinburghNLP/xsum` dataset and expose the `document` and `summary` columns.
## Wrapper
### HuggingFace
```python
from datasets import load_dataset

from verdict.dataset import DatasetWrapper

dataset = DatasetWrapper.from_hf(  # or .from_pandas(df, ...)
    load_dataset("EdinburghNLP/xsum"),
    columns=["document", "summary"],
    max_samples=20
)

# ... somewhere in the Pipeline
JudgeUnit().prompt("""
    ...
    {input.document}
    ...
""")
```
You can also write a custom pre-processing function that returns a `Schema` for each sample. Schema fields are accessible in the final result dataframe (i.e., the output of `pipeline.run_from_dataset(dataset[split])`) as `!{field_name}`.
```python
from datasets import load_dataset

from verdict.dataset import DatasetWrapper
from verdict.schema import Schema

dataset = DatasetWrapper.from_hf(
    load_dataset("EdinburghNLP/xsum"),
    lambda row: Schema.of(article=row["document"])
)
```
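As a minimal sketch of the round trip (assumptions: `"test"` is a valid split name, and `run_from_dataset` returns a dataframe along with the leaf-node column prefixes), the `article` field defined by the pre-processing function would then surface as a column of the results:

```python
# minimal sketch -- the "test" split and the returned (dataframe, prefixes)
# pair are assumptions, not taken verbatim from the example above
result_df, leaf_prefixes = pipeline.run_from_dataset(dataset["test"])
print(result_df["article"].head())  # the Schema field appears as a column
```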
### Pandas
In addition, for a `pandas.DataFrame`, you can specify the name of a column that contains each sample's split; a sketch of the expected layout follows the example below.
```python
import pandas as pd

from verdict.dataset import DatasetWrapper

dataset = DatasetWrapper.from_pandas(
    pd.read_csv("data_all.csv"),
    split_column="split",
    max_samples=20
)

pipeline.run_from_dataset(dataset['eval'])
```
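For concreteness, here is a hedged sketch of a DataFrame in the shape `data_all.csv` is read into; the `document` and `summary` column names are illustrative assumptions, and only `split` is significant:

```python
import pandas as pd

from verdict.dataset import DatasetWrapper

# hypothetical in-memory equivalent of data_all.csv: the "split" column
# partitions rows into named splits selectable via dataset[<split>]
df = pd.DataFrame({
    "document": ["first article ...", "second article ..."],
    "summary":  ["first summary", "second summary"],
    "split":    ["train", "eval"],
})
dataset = DatasetWrapper.from_pandas(df, split_column="split")
```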
## Pinning results
We can force any Verdict node to run just once and share its output across all samples using the `pin()` method.
```python
pipeline = Pipeline() \
    >> Layer([
        CoTUnit(criteria).pin()   # runs once across the dataset; result shared across all samples
        >> JudgeUnit(criteria)    # runs once per sample
    ], 5) \
    >> MeanVariancePoolUnit()

pipeline.run_from_dataset(dataset)
```
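Pinning is useful when a node's output does not depend on the individual sample, as with the shared chain-of-thought above: the result is computed once and broadcast rather than recomputed for every row.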