Episode

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Dec 29, 2025 · 7:31
Computation and Language · Artificial Intelligence

Abstract

We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a principled graphical-model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
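The three stages described above can be sketched as a short routine. This is a minimal illustration of the averaging variant only, with hypothetical judge functions standing in for real LLM-as-a-Judge calls; it is not the authors' implementation.

```python
def peer_review_ensemble(query, candidates, judges):
    """Select the best candidate response via a peer-review process.

    candidates: responses generated by different LLMs for `query`.
    judges: scoring functions, each mapping (query, response) to a
            numeric score -- stand-ins for LLM judge calls.
    """
    # Stage 1 (scoring): every judge scores every candidate response.
    score_matrix = [[judge(query, resp) for judge in judges]
                    for resp in candidates]

    # Stage 2 (reasoning): aggregate with a straightforward average;
    # the paper's other variant replaces this step with graphical-model
    # truth inference.
    avg_scores = [sum(row) / len(row) for row in score_matrix]

    # Stage 3 (selecting): return the highest-scoring response.
    best_idx = max(range(len(candidates)), key=avg_scores.__getitem__)
    return candidates[best_idx], avg_scores

# Toy usage with deterministic stand-in judges (illustration only):
judges = [lambda q, r: len(r), lambda q, r: r.count("e")]
best, scores = peer_review_ensemble("q", ["short", "a longer answer"], judges)
```

In practice each judge would prompt an LLM to rate a (query, response) pair, and the graphical-model variant would additionally estimate per-judge reliability when aggregating.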

Cite This Paper

Year: 2025
Category: cs.CL
APA

Chen, Z., Ji, Z., Mao, Q., Cheng, J., Qin, B., Wu, H., Li, Z., Li, J., Sun, K., Wang, Z., Ban, Y., Sun, Z., Ji, X., &amp; Sun, H. (2025). Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process. arXiv preprint arXiv:2512.23213.

MLA

Zhijun Chen, Zeyu Ji, Qianren Mao, Junhang Cheng, Bangjie Qin, Hao Wu, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, and Hailong Sun. "Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process." arXiv preprint arXiv:2512.23213 (2025).