G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (EMNLP 2023) |
 |
Large Language Models are not Fair Evaluators (ACL 2024) |
 |
LLM Evaluators Recognize and Favor Their Own Generations (NeurIPS 2024) |
 |
Debating with More Persuasive LLMs Leads to More Truthful Answers (ICML 2024) |
 |
On scalable oversight with weak LLMs judging strong LLMs (NeurIPS 2024) |
 |
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests (2024) |
 |