| G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (EMNLP 2023) |
 |
| Large Language Models are not Fair Evaluators (ACL 2024) |
 |
| LLM Evaluators Recognize and Favor Their Own Generations (NeurIPS 2024) |
 |
| Debating with More Persuasive LLMs Leads to More Truthful Answers (ICML 2024) |
 |
| On scalable oversight with weak LLMs judging strong LLMs (NeurIPS 2024) |
 |
| LMUnit: Fine-grained Evaluation with Natural Language Unit Tests (2024) |
 |