FELM:
Benchmarking Factuality Evaluation
of Large Language Models

1 City University of Hong Kong · 2 Hong Kong University of Science and Technology · 3 National University of Singapore · 4 Carnegie Mellon University · 5 Shanghai Jiao Tong University

Abstract

FELM is a meta-benchmark for evaluating the factuality evaluation of large language models.

Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, factuality evaluators themselves require suitable evaluation to gauge progress and foster advancement. This direction remains under-explored, creating substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner.

FELM covers five distinct domains: World Knowledge, Science/Technology, Writing/Recommendation, Reasoning, and Math. We gather prompts for each domain from various sources, including standard datasets such as TruthfulQA, online platforms such as GitHub repositories, ChatGPT generation, and prompts drafted by the authors. We then obtain responses from ChatGPT for these prompts.
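To make the fine-grained annotation concrete, the sketch below shows what a segment-level annotated instance might look like. The field names (domain, prompt, segmented_response, labels, comment) and the example content are illustrative assumptions, not the released schema.

# Illustrative (hypothetical) FELM-style instance: a ChatGPT response split into
# segments, each carrying a factuality label. Field names are assumptions made
# for illustration and may differ from the released data files.
example_instance = {
    "domain": "world_knowledge",                    # one of the five FELM domains
    "prompt": "Who wrote the novel 'Nineteen Eighty-Four'?",
    "response": "The novel was written by George Orwell. It was published in 1952.",
    "segmented_response": [
        "The novel was written by George Orwell.",  # factually correct segment
        "It was published in 1952.",                # factual error: it appeared in 1949
    ],
    "labels": [True, False],                        # one label per segment
    "comment": ["", "Nineteen Eighty-Four was published in 1949."],
}

# Segment-level labels let an evaluator be scored on where an error occurs,
# not just on whether the whole response contains one.
assert len(example_instance["segmented_response"]) == len(example_instance["labels"])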

Figure 1: Examples from each domain in FELM.

Dataset Statistics

Dataset Snapshot

Category              Value
Number of Instances   847
Number of Fields      5
Labeled Classes       2
Number of Labels      4427

Descriptive Statistics

Statistic           All    World Knowledge   Reasoning   Math   Science/Tech   Writing/Recommendation
Segments            4427   532               1025        599    683            1588
Positive segments   3642   385               877         477    582            1321
Negative segments   785    147               148         122    101            267
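The per-domain counts above can be reproduced from the annotation files with a short script. The sketch below assumes a JSONL layout with one instance per line and a labels list holding one boolean per segment; the file names and the field name are assumptions, not the official format.

# Tally positive (factually correct) and negative (erroneous) segments per file.
# Assumes JSONL annotation files with a "labels" field: one boolean per segment.
import json
from collections import Counter
from pathlib import Path

def count_segments(jsonl_path: str) -> Counter:
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            instance = json.loads(line)
            for label in instance["labels"]:
                counts["positive" if label else "negative"] += 1
    return counts

# Example usage (file names are placeholders):
# for domain in ["world_knowledge", "reasoning", "math", "science_tech", "writing_rec"]:
#     print(domain, count_segments(f"felm/{domain}.jsonl"))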

Leaderboard on FELM

F1 Score measures the ability to detect factual errors.

Balanced Accuracy measures performance balanced across factually correct and incorrect samples.

Ranking   Model        F1 Score   Balanced Accuracy
1         GPT-4        48.3       67.1
2         Vicuna-33B   32.5       56.5
3         ChatGPT      25.5       55.9
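The sketch below shows one way to compute both metrics from segment-level predictions with scikit-learn, treating error segments as the positive class for F1 in line with the description above; the exact official evaluation protocol may differ in detail.

# Segment-level metric sketch: 1 marks a segment flagged as a factual error,
# 0 marks a correct segment. F1 then rewards detecting errors, while balanced
# accuracy averages recall over the error and non-error classes.
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [0, 0, 1, 0, 1, 1, 0, 0]   # gold labels (toy example)
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]   # an evaluator's predictions

print("F1 (error class):  ", round(f1_score(y_true, y_pred, pos_label=1), 3))
print("Balanced accuracy: ", round(balanced_accuracy_score(y_true, y_pred), 3))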

BibTeX

@inproceedings{
chen2023felm,
title={FELM: Benchmarking Factuality Evaluation of Large Language Models},
author={Chen, Shiqi and Zhao, Yiran and Zhang, Jinghan and Chern, I-Chun and Gao, Siyang and Liu, Pengfei and He, Junxian},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2023},
url={http://arxiv.org/abs/2310.00741}
}