FELM is a meta-benchmark for evaluating factuality evaluators of large language models (LLMs).
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, factuality evaluators themselves require suitable evaluation to gauge progress and foster advancement. This direction remains under-explored, posing a substantial impediment to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. For this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner.
FELM covers five distinct domains: World Knowledge, Science/Technology, Writing/Recommendation, Reasoning, and Math. We gather prompts for each domain from various sources, including standard datasets such as TruthfulQA, online platforms such as GitHub repositories, ChatGPT generation, and prompts drafted by the authors. We then obtain responses from ChatGPT for these prompts.
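The responses and annotations can be inspected programmatically. Below is a minimal loading sketch; the Hub repository id (`hkust-nlp/felm`), the per-domain config name (`wk` for World Knowledge), and the `test` split are assumptions, so check the dataset page for the exact identifiers.

```python
# Minimal sketch of loading one FELM domain for inspection.
# Assumptions (verify against the dataset page): repo id "hkust-nlp/felm",
# config "wk" (World Knowledge), and a single "test" split.
from datasets import load_dataset

wk = load_dataset("hkust-nlp/felm", "wk", split="test")
print(wk[0])  # one prompt/response pair with its segment-level annotations
```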
Category | Count |
---|---|
Number of instances | 847 |
Number of domains | 5 |
Labeled classes | 2 |
Number of labels | 4427 |
Statistic | All | World Knowledge | Reasoning | Math | Science/Technology | Writing/Recommendation |
---|---|---|---|---|---|---|
Segments | 4427 | 532 | 1025 | 599 | 683 | 1588 |
Positive segments | 3642 | 385 | 877 | 477 | 582 | 1321 |
Negative segments | 785 | 147 | 148 | 122 | 101 | 267 |
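To make the segment statistics above concrete, the sketch below shows the rough shape of one annotated instance: a prompt, a ChatGPT response split into segments, and one factuality label per segment. The field names are illustrative assumptions, not the exact released schema.

```python
# Illustrative shape of a single FELM instance; field names are
# assumptions for exposition and may differ from the released files.
example = {
    "domain": "Math",
    "prompt": "What is the sum of the first 10 positive integers?",
    "segmented_response": [
        "The sum of the first 10 positive integers is 55.",
        "This follows from the formula n*(n+1)/2 with n = 10.",
    ],
    "labels": [True, True],  # one factuality label per segment
}

# Segment-level counts of the kind tabulated above.
positive = sum(example["labels"])
negative = len(example["labels"]) - positive
```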
The F1 score measures an evaluator's ability to detect factual errors.
Balanced accuracy measures performance averaged over factually correct and incorrect segments.
Ranking | Model | F1 Score | Balanced Accuracy |
---|---|---|---|
1 | GPT-4 | 48.3 | 67.1 |
2 | Vicuna-33B | 32.5 | 56.5 |
3 | ChatGPT | 25.5 | 55.9 |
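For reference, the sketch below shows one way to compute both metrics from per-segment predictions with scikit-learn, treating the factually erroneous class as the positive class for F1. The label convention (1 = factual error) is an assumption; the official scoring script may differ in detail.

```python
# Sketch of the two leaderboard metrics over per-segment judgments.
# Assumed convention: 1 = segment contains a factual error, so F1
# reflects error detection; the official scorer may differ.
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [0, 1, 0, 0, 1, 1, 0]  # gold segment labels
y_pred = [0, 1, 0, 1, 1, 0, 0]  # an evaluator's predictions

print("F1 (error class):", f1_score(y_true, y_pred, pos_label=1))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```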
@inproceedings{chen2023felm,
title={FELM: Benchmarking Factuality Evaluation of Large Language Models},
author={Chen, Shiqi and Zhao, Yiran and Zhang, Jinghan and Chern, I-Chun and Gao, Siyang and Liu, Pengfei and He, Junxian},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2023},
url={http://arxiv.org/abs/2310.00741}
}