AF - Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) by Scott Emmons
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS), published by Scott Emmons on May 31, 2023 on The AI Alignment Forum.
tl;dr
Contrast consistent search (CCS) is a method by Burns et al. that consists of two parts:
1. Generate contrast pairs by adding pseudolabels to an unlabelled dataset.
2. Use the contrast pairs to search for a direction in representation space that satisfies logical consistency properties.
In discussions with other researchers, I've repeatedly heard (2) as the explanation for how CCS works; I've heard almost no mention of (1).
In this post, I want to emphasize that the contrast pairs drive almost all of the empirical performance in Burns et al. Once we have the contrast pairs, standard unsupervised learning methods attain comparable performance to the new CCS loss function.
In the paper, Burns et al. do a nice job comparing the CCS loss function to different alternatives. The simplest such alternative runs principal component analysis (PCA) on contrast pair differences, and then it uses the top principal component as a classifier. Another alternative runs linear discriminant analysis (LDA) on contrast pair differences. These alternatives attain 97% and 98% of CCS's accuracy!
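To make the PCA baseline concrete, here is a minimal sketch, not Burns et al.'s actual code. It assumes `diffs` is an n × d numpy array of normalized contrast-pair differences, as defined in the Notation section below.

```python
import numpy as np

def pca_direction(diffs: np.ndarray) -> np.ndarray:
    """Top principal component of the contrast-pair differences."""
    centered = diffs - diffs.mean(axis=0)
    # The first right singular vector of the centered data is the
    # direction of maximum variance, i.e., the top principal component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def pca_classify(diffs: np.ndarray) -> np.ndarray:
    """Label each pair by the sign of its projection onto the top PC."""
    return (diffs @ pca_direction(diffs) > 0).astype(int)
```

Because the method is unsupervised, the direction is identified only up to sign, so the 0/1 labels here may be globally flipped relative to true/false.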
"[R]epresentations of truth tend to be salient in models: ... they can often be found by taking the top principal component of a slightly modified representation space," Burns et al. write in the introduction. If I understand this statement correctly, it's saying the same thing I want to emphasize in this post: the contrast pairs are what allow Burns et al. to find representations of truth. Empirically, once we have the representations of contrast pair differences, their variance points in the direction of truth. The new logical consistency loss in CCS isn't needed for good empirical performance.
Notation
We'll follow the notation of the CCS paper.
Assume we are given a dataset $\{x_1, x_2, \dots, x_n\}$ and a feature extractor $\phi(\cdot)$, such as the hidden state of a pretrained language model.
First, we will construct a contrast pair for each datapoint $x_i$. We add "label: positive" and "label: negative" to each $x_i$. This gives contrast pairs of the form $(x_i^+, x_i^-)$.
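As a concrete illustration, here is a minimal sketch of the construction; the exact prompt template in Burns et al. varies by task, so this format is illustrative.

```python
# A minimal sketch of contrast-pair construction. The exact prompt
# template in Burns et al. varies by task; this format is illustrative.
def make_contrast_pair(x: str) -> tuple[str, str]:
    return x + "\nlabel: positive", x + "\nlabel: negative"

# Usage, assuming `dataset` is a list of strings [x_1, ..., x_n]:
# pairs = [make_contrast_pair(x) for x in dataset]
```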
Now, we consider the set $\{x_1^+, x_2^+, \dots, x_n^+\}$ of positive pseudo-labels and the set $\{x_1^-, x_2^-, \dots, x_n^-\}$ of negative pseudo-labels. Because all of the $x_i^+$ have "label: positive" and all of the $x_i^-$ have "label: negative", we normalize the positive pseudo-labels and the negative pseudo-labels separately:
$$\tilde{\phi}(x_i^+) = \frac{\phi(x_i^+) - \mu^+}{\sigma^+}, \qquad \tilde{\phi}(x_i^-) = \frac{\phi(x_i^-) - \mu^-}{\sigma^-}.$$
Here, $\mu^+$ and $\mu^-$ are the element-wise means of the positive and negative pseudo-label sets, respectively. Similarly, $\sigma^+$ and $\sigma^-$ are the element-wise standard deviations.
The goal of this normalization is to remove the embedding of "label: positive" from all the positive pseudo-labels (and "label: negative" from all the negative pseudo-labels). The hope is that, by construction, the only difference between $\tilde{\phi}(x_i^+)$ and $\tilde{\phi}(x_i^-)$ is that one is true while the other is false. CCS is one way to extract the information about true and false. As we'll discuss more below, doing PCA or LDA on the set of differences $\{\tilde{\phi}(x_i^+) - \tilde{\phi}(x_i^-)\}_{i=1}^n$ works almost as well.
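Here is a minimal numpy sketch of this normalization, assuming `pos` and `neg` are n × d arrays holding the features $\phi(x_i^+)$ and $\phi(x_i^-)$.

```python
import numpy as np

def normalize(feats: np.ndarray) -> np.ndarray:
    """Subtract the element-wise mean and divide by the element-wise std."""
    mu = feats.mean(axis=0)      # mu^+ or mu^-
    sigma = feats.std(axis=0)    # sigma^+ or sigma^-
    return (feats - mu) / sigma

# Normalized features tilde-phi, and the differences fed to CCS, PCA, or LDA.
# Assumes `pos` and `neg` are (n, d) arrays of phi(x_i^+) and phi(x_i^-).
diffs = normalize(pos) - normalize(neg)
```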
Concept Embeddings in Prior Work
In order to better understand contrast pairs, I think it's helpful to review this famous paper by Bolukbasi et al., 2016: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." Quoting from Bolukbasi et al.:
Vector differences between words in embeddings have been shown to represent relationships between words. For example, given an analogy puzzle, "man is to king as woman is to x" (denoted as man:king :: woman:x), simple arithmetic of the embedding vectors finds that x=queen is the best answer because:
$$\overrightarrow{\text{man}} - \overrightarrow{\text{woman}} \approx \overrightarrow{\text{king}} - \overrightarrow{\text{queen}}$$
Similarly, x=Japan is returned for Paris:France :: Tokyo:x. It is surprising that a simple ...
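To see this vector arithmetic in action, here is a minimal sketch; `vocab`, a dictionary mapping words to embedding vectors, is an assumed stand-in for a real embedding table such as word2vec.

```python
import numpy as np

def solve_analogy(a: str, b: str, c: str, vocab: dict) -> str:
    """Return the word x that best completes 'a is to b as c is to x'."""
    target = vocab[b] - vocab[a] + vocab[c]

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Exclude the query words themselves, as is standard in analogy tasks.
    candidates = ((w, v) for w, v in vocab.items() if w not in {a, b, c})
    return max(candidates, key=lambda wv: cosine(wv[1], target))[0]

# solve_analogy("man", "king", "woman", vocab) should return "queen"
# for embeddings, like word2vec, that encode this relationship.
```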