AF - Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) by Scott Emmons
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS), published by Scott Emmons on May 31, 2023 on The AI Alignment Forum.
tl;dr
Contrast consistent search (CCS) is a method by Burns et al. that consists of two parts:
1. Generate contrast pairs by adding pseudolabels to an unlabelled dataset.
2. Use the contrast pairs to search for a direction in representation space that satisfies logical consistency properties.
In discussions with other researchers, I've repeatedly heard (2) as the explanation for how CCS works; I've heard almost no mention of (1).
In this post, I want to emphasize that the contrast pairs drive almost all of the empirical performance in Burns et al. Once we have the contrast pairs, standard unsupervised learning methods attain comparable performance to the new CCS loss function.
In the paper, Burns et al. do a nice job comparing the CCS loss function to different alternatives. The simplest such alternative runs principal component analysis (PCA) on contrast pair differences, and then it uses the top principal component as a classifier. Another alternative runs linear discriminant analysis (LDA) on contrast pair differences. These alternatives attain 97% and 98% of CCS's accuracy!
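To make the PCA baseline concrete, here is a minimal sketch, not Burns et al.'s actual code. It assumes `diffs` is an n × d numpy array of normalized contrast-pair differences, as defined in the Notation section below.

```python
import numpy as np

def pca_direction(diffs: np.ndarray) -> np.ndarray:
    """Top principal component of the contrast-pair differences."""
    centered = diffs - diffs.mean(axis=0)
    # The first right singular vector of the centered data is the
    # direction of maximum variance, i.e., the top principal component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def pca_classify(diffs: np.ndarray) -> np.ndarray:
    """Label each pair by the sign of its projection onto the top PC."""
    return (diffs @ pca_direction(diffs) > 0).astype(int)
```

Because the method is unsupervised, the direction is identified only up to sign, so the 0/1 labels here may be globally flipped relative to true/false.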
"[R]epresentations of truth tend to be salient in models: ... they can often be found by taking the top principal component of a slightly modified representation space," Burns et al. write in the introduction. If I understand this statement correctly, it's saying the same thing I want to emphasize in this post: the contrast pairs are what allow Burns et al. to find representations of truth. Empirically, once we have the representations of contrast pair differences, their variance points in the direction of truth. The new logical consistency loss in CCS isn't needed for good empirical performance.
Notation
We'll follow the notation of the CCS paper.
Assume we are given a dataset $\{x_1, x_2, \dots, x_n\}$ and a feature extractor $\phi(\cdot)$, such as the hidden state of a pretrained language model.
First, we will construct a contrast pair for each datapoint $x_i$. We add "label: positive" and "label: negative" to each $x_i$. This gives contrast pairs of the form $(x_i^+, x_i^-)$.
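As a concrete illustration, here is a minimal sketch of the construction; the exact prompt template in Burns et al. varies by task, so this format is illustrative.

```python
# A minimal sketch of contrast-pair construction. The exact prompt
# template in Burns et al. varies by task; this format is illustrative.
def make_contrast_pair(x: str) -> tuple[str, str]:
    return x + "\nlabel: positive", x + "\nlabel: negative"

# Usage, assuming `dataset` is a list of strings [x_1, ..., x_n]:
# pairs = [make_contrast_pair(x) for x in dataset]
```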
Now, we consider the set $\{x_1^+, x_2^+, \dots, x_n^+\}$ of positive pseudo-labels and the set $\{x_1^-, x_2^-, \dots, x_n^-\}$ of negative pseudo-labels. Because all of the $x_i^+$ have "label: positive" and all of the $x_i^-$ have "label: negative", we normalize the positive pseudo-labels and the negative pseudo-labels separately:
$$\tilde{\phi}(x_i^+) = \frac{\phi(x_i^+) - \mu^+}{\sigma^+}, \qquad \tilde{\phi}(x_i^-) = \frac{\phi(x_i^-) - \mu^-}{\sigma^-}.$$
Here, $\mu^+$ and $\mu^-$ are the element-wise means of the positive and negative pseudo-label sets, respectively. Similarly, $\sigma^+$ and $\sigma^-$ are the element-wise standard deviations.
The goal of this normalization is to remove the embedding of "label: positive" from all the positive pseudo-labels (and "label: negative" from all the negative pseudo-labels). The hope is that, by construction, the only difference between $\tilde{\phi}(x_i^+)$ and $\tilde{\phi}(x_i^-)$ is that one is true while the other is false. CCS is one way to extract the information about true and false. As we'll discuss more below, doing PCA or LDA on the set of differences $\{\tilde{\phi}(x_i^+) - \tilde{\phi}(x_i^-)\}_{i=1}^n$ works almost as well.
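Here is a minimal numpy sketch of this normalization, assuming `pos` and `neg` are n × d arrays holding the features $\phi(x_i^+)$ and $\phi(x_i^-)$.

```python
import numpy as np

def normalize(feats: np.ndarray) -> np.ndarray:
    """Subtract the element-wise mean and divide by the element-wise std."""
    mu = feats.mean(axis=0)      # mu^+ or mu^-
    sigma = feats.std(axis=0)    # sigma^+ or sigma^-
    return (feats - mu) / sigma

# Normalized features tilde-phi, and the differences fed to CCS, PCA, or LDA.
# Assumes `pos` and `neg` are (n, d) arrays of phi(x_i^+) and phi(x_i^-).
diffs = normalize(pos) - normalize(neg)
```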
Concept Embeddings in Prior Work
In order to better understand contrast pairs, I think it's helpful to review this famous paper by Bolukbasi et al., 2016: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." Quoting from Bolukbasi et al.:
Vector differences between words in embeddings have been shown to represent relationships between words. For example, given an analogy puzzle, "man is to king as woman is to x" (denoted as man:king :: woman:x), simple arithmetic of the embedding vectors finds that x=queen is the best answer because:
$$\overrightarrow{\text{man}} - \overrightarrow{\text{woman}} \approx \overrightarrow{\text{king}} - \overrightarrow{\text{queen}}$$
Similarly, x=Japan is returned for Paris:France :: Tokyo:x. It is surprising that a simple ...
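To see this vector arithmetic in action, here is a minimal sketch; `vocab`, a dictionary mapping words to embedding vectors, is an assumed stand-in for a real embedding table such as word2vec.

```python
import numpy as np

def solve_analogy(a: str, b: str, c: str, vocab: dict) -> str:
    """Return the word x that best completes 'a is to b as c is to x'."""
    target = vocab[b] - vocab[a] + vocab[c]

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Exclude the query words themselves, as is standard in analogy tasks.
    candidates = ((w, v) for w, v in vocab.items() if w not in {a, b, c})
    return max(candidates, key=lambda wv: cosine(wv[1], target))[0]

# solve_analogy("man", "king", "woman", vocab) should return "queen"
# for embeddings, like word2vec, that encode this relationship.
```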