From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

Update: 2025-08-22

Description

🤗 Upvotes: 53 | cs.CE

Authors:

Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou

Title:

From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

Arxiv:

http://arxiv.org/abs/2508.13491v1

Abstract:

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.
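The contrast the abstract draws between a single aggregated score and a skill-level diagnosis can be illustrated with a small sketch. This is not FinCDM's method: the abstract does not specify the cognitive diagnosis model used, so the code below substitutes plain per-skill accuracy over skill-tagged items as a simplified stand-in. The item IDs, skill tags, model names, and responses are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def overall_accuracy(responses):
    """Score-level view: fraction of items answered correctly."""
    return sum(responses.values()) / len(responses)


def skill_profile(responses, item_skills):
    """Skill-level view: accuracy per skill tag.

    responses:   dict mapping item id -> 1 (correct) or 0 (incorrect)
    item_skills: dict mapping item id -> list of skill tags for that item
    An item tagged with several skills counts toward each of them.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item_id, is_correct in responses.items():
        for skill in item_skills[item_id]:
            total[skill] += 1
            correct[skill] += int(is_correct)
    return {skill: correct[skill] / total[skill] for skill in total}


# Hypothetical skill tags for four items (not taken from CPA-QKA).
item_skills = {
    "q1": ["tax"],
    "q2": ["tax", "audit"],
    "q3": ["audit"],
    "q4": ["financial_reporting"],
}

# Hypothetical response patterns for two models.
model_a = {"q1": 0, "q2": 0, "q3": 1, "q4": 1}
model_b = {"q1": 1, "q2": 0, "q3": 0, "q4": 1}

for name, responses in [("model_a", model_a), ("model_b", model_b)]:
    print(name, overall_accuracy(responses), skill_profile(responses, item_skills))
```

Both hypothetical models answer half of the items correctly, so a score-level benchmark cannot tell them apart, while the per-skill view shows one failing every tax item and the other failing every audit item. A full cognitive diagnosis model would estimate latent mastery rather than raw accuracy, but the input it works from has the same shape: response patterns plus item-to-skill tags.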
