AF - Measuring and Improving the Faithfulness of Model-Generated Reasoning by Ansh Radhakrishnan

Update: 2023-07-18

Description

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring and Improving the Faithfulness of Model-Generated Reasoning, published by Ansh Radhakrishnan on July 18, 2023 on The AI Alignment Forum.
TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular combination of task and model size. Larger language models tend to ignore the generated reasoning more often than smaller models, a case of inverse scaling. We then show that an alternative to chain-of-thought prompting (answering questions by breaking them into subquestions) improves faithfulness while maintaining good task performance.
Paper Abstracts
Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study.
Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
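The intervention idea in this abstract can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the papers' evaluation code: `answer_change_rate`, `add_mistake`, and the two toy "models" (`cot_reader`, `cot_ignorer`) are hypothetical names introduced here, and the toy models are simple functions standing in for an LLM. The sketch measures how often the final answer changes once a mistake is inserted into the CoT. A high rate suggests the answer is conditioned on the reasoning, and a low rate suggests the reasoning is being ignored.

```python
import re
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions for illustration only).
AnswerFn = Callable[[str, str], str]   # (question, cot) -> final answer
Intervention = Callable[[str], str]    # cot -> perturbed cot


def answer_change_rate(
    examples: Sequence[Tuple[str, str]],
    answer_fn: AnswerFn,
    intervene: Intervention,
) -> float:
    """Fraction of examples whose answer changes after perturbing the CoT."""
    if not examples:
        raise ValueError("need at least one (question, cot) example")
    changed = 0
    for question, cot in examples:
        original = answer_fn(question, cot)
        perturbed = answer_fn(question, intervene(cot))
        changed += original != perturbed
    return changed / len(examples)


def add_mistake(cot: str) -> str:
    """Toy 'adding mistakes' intervention: bump the last integer in the CoT by 1."""
    matches = list(re.finditer(r"-?\d+", cot))
    if not matches:
        return cot
    last = matches[-1]
    return cot[: last.start()] + str(int(last.group()) + 1) + cot[last.end():]


def cot_reader(question: str, cot: str) -> str:
    """Toy model that relies on the CoT: answers with the last number it states."""
    numbers = re.findall(r"-?\d+", cot)
    return numbers[-1] if numbers else "unknown"


def cot_ignorer(question: str, cot: str) -> str:
    """Toy model that ignores the CoT entirely and always gives the same answer."""
    return "5"


EXAMPLES = [
    ("What is 2 + 3?", "2 plus 3 equals 5"),
    ("What is 4 * 2?", "4 times 2 equals 8"),
]

if __name__ == "__main__":
    print("reader :", answer_change_rate(EXAMPLES, cot_reader, add_mistake))
    print("ignorer:", answer_change_rate(EXAMPLES, cot_ignorer, add_mistake))
```

With a real model, `answer_fn` would wrap a sampling call, and the same loop could be reused with a paraphrasing intervention in place of `add_mistake`.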
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve on the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics.
By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
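The decomposition approach described above can be sketched as a small pipeline. This is a hedged illustration, not the paper's implementation: `answer_by_decomposition` and the three toy components (`toy_decompose`, `toy_answer_subquestion`, `toy_recompose`) are hypothetical names, and simple arithmetic functions stand in for LLM calls. The point it shows is the separate-context constraint: each subquestion is answered from its own text alone, with no access to the original question or to the other subanswers.

```python
import operator
import re
from typing import Callable, List, Tuple

QAPairs = List[Tuple[str, str]]


def answer_by_decomposition(
    question: str,
    decompose: Callable[[str], List[str]],
    answer_subquestion: Callable[[str], str],
    recompose: Callable[[str, QAPairs], str],
) -> Tuple[str, QAPairs]:
    """Answer a question via subquestions that are each handled in isolation."""
    subquestions = decompose(question)
    # Each call receives only the subquestion text: a separate context.
    qa_pairs = [(sq, answer_subquestion(sq)) for sq in subquestions]
    # The final answer is produced from the question plus the subanswers.
    return recompose(question, qa_pairs), qa_pairs


# Toy stand-ins for model calls (assumptions for illustration only).
OPS = {"+": operator.add, "*": operator.mul}


def toy_decompose(question: str) -> List[str]:
    """Turn each 'a + b' or 'a * b' expression into its own subquestion."""
    expressions = re.findall(r"\d+ [+*] \d+", question)
    return [f"What is {expr}?" for expr in expressions]


def toy_answer_subquestion(subquestion: str) -> str:
    """Evaluate the single arithmetic expression in the subquestion."""
    match = re.search(r"(\d+) ([+*]) (\d+)", subquestion)
    if match is None:
        raise ValueError(f"cannot parse subquestion: {subquestion!r}")
    left, op, right = match.groups()
    return str(OPS[op](int(left), int(right)))


def toy_recompose(question: str, qa_pairs: QAPairs) -> str:
    """Compare the two subanswers to answer an 'is X greater than Y' question."""
    left, right = (int(answer) for _, answer in qa_pairs)
    return "yes" if left > right else "no"


if __name__ == "__main__":
    final, pairs = answer_by_decomposition(
        "Is 3 + 4 greater than 2 * 3?",
        toy_decompose,
        toy_answer_subquestion,
        toy_recompose,
    )
    print(pairs)
    print(final)
```

Because the subquestion answers are the only information passed to the final step, the recorded question-answer pairs are a complete account of what the final answer could depend on, which is the property that makes this kind of reasoning easier to check.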
Externalized Reasoning Oversight Relies on Faithful Reasoning
Large language models (LLMs) are operating in increasingly challenging domains, ranging from programming assistance (Chen et al., 2021) to open-ended internet research (Nakano et al., 2021) and scientific writing (Taylor et al., 2022). However, verifying model behavior for safety and correctness becomes increasingly difficult as the difficulty of tasks increases. To make model behavior easier to check, one promising approach is to prompt LLMs to produce step-by-s...
