In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Update: 2024-12-26

Description

🤗 Upvotes: 8 | cs.CL, cs.AI



Authors:

Łukasz Borchmann



Title:

In Case You Missed It: ARC 'Challenge' Is Not That Challenging



Arxiv:

http://arxiv.org/abs/2412.17758v1



Abstract:

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
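To make the contrast in the abstract concrete, here is a minimal sketch of the two multiple-choice setups it describes: one where each answer is scored in isolation, and one where all answer choices appear in a single prompt so they can be compared directly. The function names, the prompt wording, and the `score` callable (a stand-in for something like a model's log-likelihood) are illustrative assumptions, not the paper's code or any library's API.

```python
from typing import Callable, List

LETTERS = "ABCDEFGH"  # assumed labelling scheme, enough for up to 8 choices


def separate_prompts(question: str, choices: List[str]) -> List[str]:
    """One prompt per choice: the competing answers are never shown together."""
    return [f"Question: {question}\nAnswer: {choice}" for choice in choices]


def options_prompt(question: str, choices: List[str]) -> str:
    """A single prompt listing every choice, allowing direct comparison."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def pick_separately(
    question: str, choices: List[str], score: Callable[[str], float]
) -> int:
    """Score each isolated prompt with `score` (hypothetical, e.g. a model
    log-likelihood) and return the index of the highest-scoring choice."""
    scores = [score(p) for p in separate_prompts(question, choices)]
    return max(range(len(choices)), key=scores.__getitem__)
```

In the first setup the prediction is whichever isolated continuation scores highest; in the second, the model would be asked to produce a letter after seeing the full list. The sketch only builds the prompts and leaves the model call abstract.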

Jingwen Liang, Gengyu Wang