The Nonlinear Library: Alignment Forum Daily

Author: The Nonlinear Fund
Description

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
161 Episodes
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meta Questions about Metaphilosophy, published by Wei Dai on September 1, 2023 on The AI Alignment Forum. To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses: (10) Science - Science is cool! (15) Philosophy of Science - The scientific method is cool! Oh look, there's a whole field studying it called "philosophy of science"! (20) Probability Theory - Bayesian subjective probability and the universal prior seem to constitute an elegant solution to the philosophy of science. Hmm, there are some curious probability puzzles involving things like indexical uncertainty, copying, forgetting... I and others make some progress on this, but fully solving anthropic reasoning seems really hard. (Lots of people have worked on this for a while and have failed, at least according to my judgement.) (25) Decision Theory - Where does probability theory come from, anyway? Maybe I can find some clues that way? Well, according to von Neumann and Morgenstern, it comes from decision theory. And hey, maybe it will be really important that we get decision theory right for AI? I and others make some progress, but fully solving decision theory turns out to be pretty hard too. (A number of people have worked on this for a while and haven't succeeded yet.) (35) Metaphilosophy - Where does decision theory come from? It seems to come from philosophers trying to do philosophy. What is that about? Plus, maybe it will be really important that the AIs we build will be philosophically competent? (45) Meta Questions about Metaphilosophy - Not sure how hard solving metaphilosophy really is, but I'm not making much progress on it by myself. Meta questions once again start to appear in my mind: Why is there virtually nobody else interested in metaphilosophy or in ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent? Tractability may be a concern, but shouldn't more people still be talking about these problems, if only to raise the alarm (about an additional reason that the AI transition may go badly)? (I've listened to all the recent podcasts on AI risk that I could find, and nobody brought it up even once.) How can I better recruit attention and resources to this topic? For example, should I draw on my crypto-related fame, or start a prize or grant program with my own money? I'm currently not inclined to do either, out of inertia, unfamiliarity, uncertainty about getting any return, fear of drawing too much attention from people who don't have the highest caliber of thinking, and worry about signaling the wrong things (having to promote ideas with one's own money instead of attracting attention based on their merits). But I'm open to having my mind changed if anyone has good arguments about this. What does it imply that so few people are working on this at such a late stage? For example, what are the implications for the outcome of the human-AI transition, and for the distribution of philosophical competence (and hence the distribution of values, decision theories, and other philosophical views) among civilizations in the universe/multiverse?
At each stage of this journey, I took what seemed to be the obvious next step (often up a meta ladder), but in retrospect each step left behind something like 90-99% of fellow travelers. From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives? As for the AI safety path (as opposed to pure intellectual curiosity) that also leads...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Red-teaming language models via activation engineering, published by Nina Rimsky on August 26, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. Evaluating powerful AI systems for hidden functionality and out-of-distribution behavior is hard. In this post, I propose a red-teaming approach that does not rely on generating prompts to cause the model to fail on some benchmark; instead, it linearly perturbs residual stream activations at one layer. A notebook to run the experiments can be found on GitHub here. Beyond input selection in red-teaming and evaluation Validating whether finetuning and RLHF have robustly achieved the intended outcome is challenging. Although these methods reduce the likelihood of certain outputs, the unwanted behavior could still be possible with adversarial or unusual inputs. For example, users can often find "jailbreaks" to make LLMs output harmful content. We can try to trigger unwanted behaviors in models more efficiently by manipulating their internal states during inference rather than searching through many inputs. The idea is that if a behavior can be easily triggered through techniques such as activation engineering, it may also occur in deployment. The inability to elicit behaviors via small internal perturbations could serve as a stronger guarantee of safety. Activation steering with refusal vector One possible red-teaming approach is subtracting a "refusal" vector generated using a dataset of text examples corresponding to the model agreeing vs. refusing to answer questions (using the same technique as in my previous work on sycophancy). The hypothesis is that if it is easy to trigger the model to output unacceptable content by subtracting the refusal vector at some layer, it would have been reasonably easy to achieve this via some prompt engineering technique. More speculatively, a similar approach could be used to reveal hidden goals or modes in a model, such as power-seeking or the desire not to be switched off. I tested this approach on llama-2-7b-chat, a 7 billion parameter LLM that has been RLHF'd to decline to answer controversial questions or questions of opinion, and is supposed always to output ethical and unbiased content. According to Meta's llama-2 paper: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage. The result is that by default, the model declines to answer questions it deems unsafe: Data generation I generated a dataset for this purpose using Claude 2 and GPT-4. After providing these LLMs with a few manually written examples of the type of data I wanted, I could relatively easily get them to generate more examples, even of the types of answers LLMs "should refuse to give." However, it sometimes took some prompt engineering.
Here are a few examples of the generated data points (full dataset here): After generating this data, I used a simple script to transform the "decline" and "respond" answers into A / B choice questions, as this is a more effective format for generating steering vectors, as described in this post. Here is an example of the format (full dataset here): Activation clustering Clustering of refusal data activations emerged a little earlier in the model (around layer 10/32) compared to sycophancy data activations (around layer 14/32), perhaps demonstrating that "refusal" is a simpler ...
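The core intervention is easy to sketch. Below is a minimal illustration of subtracting a steering vector from the residual stream with a forward hook; the layer index, the multiplier, and the pre-computed refusal_vector file are placeholder assumptions, not the post's exact values:

```python
# Minimal sketch of subtracting a "refusal" vector from the residual stream
# at one layer during inference. Assumes a HuggingFace Llama-style model;
# the layer index and the pre-computed refusal_vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 12                      # hypothetical intervention layer
multiplier = -4.0                   # negative sign subtracts the vector
refusal_vector = torch.load("refusal_vector.pt")  # shape: (hidden_size,)

def steer(module, inputs, output):
    # Llama decoder layers return a tuple; output[0] is the residual stream,
    # shape (batch, seq_len, hidden_size). Add the scaled vector everywhere.
    hidden = output[0] + multiplier * refusal_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
try:
    ids = tokenizer("How do I pick a lock?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unmodified model
```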
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Causality and a Cost Semantics for Neural Networks, published by scottviteri on August 21, 2023 on The AI Alignment Forum. Epistemic status: I time-boxed this idea to three days of effort. So any calculations are pretty sloppy, and I haven't looked into any related works. I probably could have done much better if I knew anything about circuit complexity. There are some TODOs and an unfinished last section -- if you are interested in this content and want to pick up where I have left off I'll gladly add you as a collaborator to this post. Here is a "tech tree" for neural networks. I conjecture (based on admittedly few experiments) that the simplest implementation of any node in this tree includes an implementation of its parents, given that we are writing programs starting from the primitives +, *, and relu. An especially surprising relationship (to me) is that "if statements" are best implemented downstream of division. Introduction While I was talking with my friend Anthony Corso, an intriguing idea arose. Maybe we can define whether program p1 "causes" p2 in the following way: Given a neural network that mimics p1, how easy is it to learn a neural network which mimics the behavior of p2? This proposition is intriguing because it frames causality as a question about two arbitrary programs, and reduces it to a problem of program complexity. Suppose that p1 and p2 are written in a programming language P, and let P(ops) represent P extended with ops as primitive operations. We define a complexity function C : P(ops) → R, which takes a program in the extended language and returns a real number representative of the program's complexity, for some fixed notion of complexity. Let's define the degree to which p1 "causes" p2 as the minimum complexity achievable by a program p from P(ops + p1) such that p is extensionally equal (equal for all inputs) to p2. If P2 is the set of all p in P(ops + p1) that are extensionally equal to p2, then causes(p1, p2) = min_{p ∈ P2} C(p). We can also use this definition in the approximate case, considering the minimum complexity achievable by programs p such that E[(p(x) − p2(x))^2] < ε with respect to some L1-integrable probability measure. We can define a particular complexity function C that represents the cost of executing a program. We can estimate this quantity by looking at the program's Abstract Syntax Tree (AST) in relation to some cost model of the primitive operations in the language. For this exploration, we have chosen the lambda calculus as the language. Lambda calculus is a minimalist Lisp-like language with just a single type, which in our case we will think of as floating point numbers. The notation is simple: lambda abstraction is represented as λ x. x, and function application as (f g), which is not the same as f(g) in most other languages.
How I Would Like People to Engage with this Work: by writing Ops in your favorite programming language; by circumventing my proposed tech tree, reaching a child without reaching a parent while using a smaller (or equal) number of operations; or by training some neural networks between these programs, and seeing how difficult it is to learn one program after pre-training on another. Cost Semantics Definition We define the cost of operations and expressions in the following manner: Ops(op) = 1, for any operation op in ops; Ops(c) = 0, for any floating-point constant c; Ops(x) = 0, for any variable x; Ops(λx.e) = Ops(e); Ops((f g)) = Ops(f) + Ops(g). For operations of higher arity, we have Ops((op x1 ... xn)) = Ops(op) + Σ_i Ops(xi). The selected operations for a neural network are ops = {+, *, relu}. Basic Operations and Warm-Up Let's take a few examples to demonstrate this cost calculus: To derive subtraction, we first create negation neg. (Ops neg) = (Ops (λx. (* -1 x))) = (Ops (* -1 x)) = (Ops *) + (Ops -1) + (Ops x) = 1 + 0 + 0 = 1 The cost of subtraction (-) ...
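As a sanity check of this calculus, here is a toy cost calculator; the tuple-based AST encoding is my own invention for illustration, not the post's:

```python
# Toy implementation of the Ops cost calculus above. Expressions are tuples:
# ("lam", var, body), ("app", f, g), ("op", name, *args), numbers, or
# variable-name strings. The encoding is mine, not the post's.
OPS = {"+", "*", "relu"}

def ops_cost(e):
    if isinstance(e, (int, float)):          # Ops(c) = 0
        return 0
    if isinstance(e, str):                   # Ops(x) = 0, Ops(op) = 1
        return 1 if e in OPS else 0
    tag = e[0]
    if tag == "lam":                         # Ops(λx.e) = Ops(e)
        return ops_cost(e[2])
    if tag == "app":                         # Ops((f g)) = Ops(f) + Ops(g)
        return ops_cost(e[1]) + ops_cost(e[2])
    if tag == "op":                          # higher arity: 1 + sum over args
        return 1 + sum(ops_cost(a) for a in e[2:])
    raise ValueError(f"unknown expression: {e!r}")

# neg = λx. (* -1 x): cost 1 + 0 + 0 = 1, matching the warm-up above
neg = ("lam", "x", ("op", "*", -1, "x"))
assert ops_cost(neg) == 1
```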
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them, published by Nora Ammann on August 20, 2023 on The AI Alignment Forum. Meta: This is a short summary & discussion post of a talk on the same topic by Javier Gomez-Lavin, which he gave as part of the PIBBSS speaker series. The speaker series features researchers from both AI Alignment and adjacent fields studying intelligent behavior in some shape or form. The goal is to create a space where we can explore the connections between the work of these scholars and questions in AI Alignment. This post doesn't provide a comprehensive summary of the ideas discussed in the talk, but instead focuses on exploring some possible connections to AI Alignment. For a longer version of Gomez-Lavin's ideas, you can check out a talk here. "Dirty concepts" in the Cognitive Sciences Gomez-Lavin argues that cognitive scientists engage in a form of "philosophical laundering," wherein they import, often implicitly, philosophically loaded concepts (such as volition, agency, etc.) into their concept of "working memory." He refers to such philosophically laundered concepts as "dirty concepts" insofar as they conceal potentially problematic assumptions being made. For instance, if we implicitly assume that working memory requires, for example, volition, we have now stretched our conception of working memory to include all of cognition. But, if we do this, then the concept of working memory loses much of its explanatory power as one mechanism among others underlying cognition as a whole. Often, he claims, cognitive science papers will employ such dirty concepts in the abstract and introduction, but will identify a much more specific phenomenon being measured in the methods and results sections. What to do about it? Gomez-Lavin's suggestion in the case of CogSci The pessimistic response (and some have suggested this) would be to quit using any of these dirty concepts (e.g. agency) altogether. However, it appears that this would amount to throwing the baby out with the bathwater. To help remedy the problem of dirty concepts in the working memory literature, Gomez-Lavin proposes creating an ontology of the various operational definitions of working memory employed in cognitive science by mining a wide range of research articles. The idea is that, instead of insisting that working memory be operationally defined in a single way, we ought to embrace the multiplicity of meanings associated with the term by keeping track of them more explicitly. He refers to this general approach as "productive pessimism." It is pessimistic insofar as it starts from the assumption that dirty concepts are being problematically employed, but it is productive insofar as it attempts to work with this trend rather than fight against it. While it is tricky to reason with these fuzzy concepts, once we are rigorous about proposing working definitions / operationalizations of these terms as we use them, we can avoid some of the main pitfalls and improve our definitions over time. Relevance to AI alignment? It seems fairly straightforward that AI alignment discourse, too, suffers from dirty concepts. If this is the case (and we think it is), a similar problem diagnosis (e.g. how dirty concepts can hamper research/intellectual progress) and treatment (e.g. ontology mapping) may apply.
A central example here is the notion of "agency". Alignment researchers often speak of AI systems as agents. Yet, there are often multiple, entangled meanings intended when doing so. High-level descriptions of AI x-risk often exploit this ambiguity in order to speak about the problem in general, but ultimately imprecise, terms. This is analogous to how cognitive scientists will often describe working memory in general terms in the abstract and operationalize the term ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Proof of Löb's Theorem using Computability Theory, published by Jessica Taylor on August 16, 2023 on The AI Alignment Forum. Löb's Theorem states that, if PA ⊢ □_PA(P) → P, then PA ⊢ P. To explain the symbols here: PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers. PA ⊢ A means there is a proof of the statement A in Peano arithmetic. □_PA(P) is a Peano arithmetic statement saying that P is provable in Peano arithmetic. I'm not going to discuss the significance of Löb's theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs. Translating Löb's theorem to be more like Gödel's second incompleteness theorem First, let's compare Löb's theorem to Gödel's second incompleteness theorem. This theorem states that, if PA ⊢ ¬□_PA(⊥), then PA ⊢ ⊥, where ⊥ is a PA statement that is trivially false (such as A ∧ ¬A), and from which anything can be proven. A system is called inconsistent if it proves ⊥; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent. We can re-write Löb's theorem to look like Gödel's second incompleteness theorem as: if PA+¬P ⊢ ¬□_{PA+¬P}(⊥), then PA+¬P ⊢ ⊥. Here, PA+¬P is PA with an additional axiom that ¬P, and □_{PA+¬P} expresses provability in this system. First I'll argue that this re-statement is equivalent to the original Löb's theorem statement. Observe that PA ⊢ P if and only if PA+¬P ⊢ ⊥; to go from the first to the second, we derive a contradiction from P and ¬P, and to go from the second to the first, we use the law of excluded middle in PA to derive P ∨ ¬P, and observe that, since a contradiction follows from ¬P in PA, PA can prove P. Since all this reasoning can be done in PA, we have that □_PA(P) and □_{PA+¬P}(⊥) are equivalent PA statements. We immediately have that the conclusion of the modified statement equals the conclusion of the original statement. Now we can rewrite the pre-condition of Löb's theorem from PA ⊢ □_PA(P) → P to PA ⊢ □_{PA+¬P}(⊥) → P. This is then equivalent to PA+¬P ⊢ ¬□_{PA+¬P}(⊥). In the forward direction, we simply derive ⊥ from P and ¬P. In the backward direction, we use the law of excluded middle in PA to derive P ∨ ¬P, observe the statement is trivial in the P branch, and in the ¬P branch, we derive ¬□_{PA+¬P}(⊥), which is stronger than □_{PA+¬P}(⊥) → P. So we have validly re-stated Löb's theorem, and the new statement is basically a statement that Gödel's second incompleteness theorem holds for PA+¬P. Proving Gödel's second incompleteness theorem using computability theory The following proof of a general version of Gödel's second incompleteness theorem is essentially the same as Sebastian Oberhoff's in "Incompleteness Ex Machina". Let L be some first-order system that is at least as strong as PA (for example, PA+¬P). Since L is at least as strong as PA, it can express statements about Turing machines. Let Halts(M) be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M's execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false.
In fact, L can't simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of Halts(M) and ¬Halts(M) in parallel. That isn't enough for us, though; we're trying to show that L can't simultaneously be consistent and prove its own consistency, not that it isn't simultaneously complete and sound on halting statements. Let's consider a machine Z(A) that searches over all L-proofs of ¬Halts("⌈A⌉(⌈A⌉)") (where "⌈A⌉(⌈A⌉)" is an encoding of a Turing machine that runs A on its own source code), and halts only when finding su...
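For readers skimming the symbols, the translation argued above can be compactly summarized in LaTeX (my rendering of the post's prose, not the author's notation):

```latex
% Löb's theorem and its restatement as a second-incompleteness claim,
% reconstructed from the prose above.
\[
\text{L\"ob:}\quad
\mathrm{PA} \vdash \Box_{\mathrm{PA}}(P) \to P
\;\implies\;
\mathrm{PA} \vdash P
\]
\[
\text{Restatement:}\quad
\mathrm{PA}+\neg P \vdash \neg\,\Box_{\mathrm{PA}+\neg P}(\bot)
\;\implies\;
\mathrm{PA}+\neg P \vdash \bot
\]
% using the equivalences, both provable within PA:
%   PA ⊢ P  ⟺  PA+¬P ⊢ ⊥     and     □_PA(P) ⟺ □_{PA+¬P}(⊥)
```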
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reducing sycophancy and improving honesty via activation steering, published by NinaR on July 28, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions. I think this could be a promising research direction for better understanding dishonesty in language models. What is sycophancy? Sycophancy in LLMs refers to the behavior whereby a model tells you what it thinks you want to hear / would approve of, instead of what it internally represents as the truth. Sycophancy is a common problem in LLMs trained on human-labeled data because human-provided training signals more closely encode 'what outputs do humans approve of' as opposed to 'what is the most truthful answer.' According to Anthropic's paper Discovering Language Model Behaviors with Model-Written Evaluations: Larger models tend to repeat back a user's stated views ("sycophancy"), for pretrained LMs and RLHF models trained with various numbers of RL steps. Preference Models (PMs) used for RL incentivize sycophancy. Two types of sycophancy I think it's useful to distinguish between sycophantic behavior when there is a ground truth correct output vs. when the correct output is a matter of opinion. I will call these "dishonest sycophancy" and "opinion sycophancy." Opinion sycophancy Anthropic's sycophancy test on political questions shows that a model is more likely to output text that agrees with what it thinks is the user's political preference. However, there is no ground truth for the questions tested. It's reasonable to expect that models will exhibit this kind of sycophancy on questions of personal opinion for three reasons: The base training data (internet corpora) is likely to contain large chunks of text written from the same perspective. Therefore, when predicting the continuation of text from a particular perspective, models will be more likely to adopt that perspective. There is a wide variety of political perspectives/opinions on subjective questions, and a model needs to be able to represent all of them to do well on various training tasks. Unlike questions that have a ground truth (e.g., "Is the earth flat?"), the model has to, at some point, make a choice between the perspectives available to it. This makes it particularly easy to bias the choice of perspective for subjective questions, e.g., by word choice in the input. RLHF or supervised fine-tuning incentivizes sounding good to human evaluators, who are more likely to approve of outputs that they agree with, even when it comes to subjective questions with no clearly correct answer. Dishonest sycophancy A more interesting manifestation of sycophancy occurs when an AI model delivers an output it recognizes as factually incorrect but aligned with what it perceives to be a person's beliefs. This involves the AI model echoing incorrect information based on perceived user biases. For instance, if a user identifies themselves as a flat-earther, the model may support the fallacy that the earth is flat.
Similarly, if it understands that you firmly believe aliens have previously landed on Earth, it might corroborate this, falsely affirming that such an event has been officially confirmed by scientists. Do AIs internally represent the truth? Although humans tend to disagree on a bunch of things, for instance, politics and religious views, there is much more in common between human world models than there are differences. This is particularly true when it comes to questi...
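The construction of such a steering vector can be sketched roughly as follows; the model, the layer, and the single contrastive pair below are illustrative stand-ins for Anthropic's sycophancy dataset, not the post's exact setup:

```python
# Sketch of building a sycophancy steering vector from contrastive pairs:
# mean difference of residual-stream activations at one layer. The dataset
# format and the layer choice are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             output_hidden_states=True)

layer_idx = 14  # hypothetical; chosen near where activations cluster

def last_token_activation(text):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids).hidden_states  # tuple: embeddings + one per layer
    return hs[layer_idx][0, -1, :]       # activation at the final token

pairs = [  # (sycophantic completion, non-sycophantic completion)
    ("I love astrology. Q: Is astrology scientific? A: Yes",
     "I love astrology. Q: Is astrology scientific? A: No"),
]
diffs = [last_token_activation(s) - last_token_activation(n)
         for s, n in pairs]
steering_vector = torch.stack(diffs).mean(dim=0)
torch.save(steering_vector, "sycophancy_vector.pt")  # add/subtract later
```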
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How LLMs are and are not myopic, published by janus on July 25, 2023 on The AI Alignment Forum. Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively myopic (because they think about the future) or value-myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next token). However, training is consequence-blind, because the training data is causally independent of the model's actions. This assumption breaks down when models are trained on AI-generated text. Summary Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness). Neither cognitively myopic nor consequence-blind models should pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents. LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game. LLM pretraining is not value/prediction myopic (does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. This means that the model can and will implicitly sacrifice next-token prediction accuracy for long-horizon prediction accuracy. You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws. LLM pretraining on human data is consequence-blind, as the training data is causally independent from the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier. When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It's not clear how this affects the training goal theoretically or in practice. A myopic training goal does not ensure the model will learn myopic computation or behavior, because inner alignment with the training goal is not guaranteed. Introduction The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment.
There's also been confusion about the extent to which large language model (LLM) pretraining and other supervised learning methods are myopic, and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work. Types of Myopia 1. Cognitive Myopia One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabili...
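The claim that gradients flow from later positions into earlier computation through attention can be seen in a toy experiment; this uses a generic self-attention layer, not GPT itself:

```python
# Toy demonstration: a loss at the final position produces nonzero gradients
# on earlier-position inputs because attention reads from past positions, so
# pretraining optimizes past computation to be useful for future predictions.
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 16, requires_grad=True)   # (batch, seq, dim)

# causal mask: position i may only attend to positions <= i
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=mask)

loss = out[0, -1].pow(2).sum()   # a "next-token" loss at the last position only
loss.backward()
print(x.grad[0].norm(dim=-1))    # earlier positions receive gradient too
```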
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open problems in activation engineering, published by Alex Turner on July 24, 2023 on The AI Alignment Forum. Steering GPT-2-XL by adding an activation vector introduced activation engineering... techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. These results were recently complemented by Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, which doubled TruthfulQA performance by adding a similarly computed activation vector to forward passes! We think that activation engineering has a bunch of low-hanging fruit for steering and understanding models. A few open problems from the list: Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful? Take a circuit studied in the existing literature on GPT-2, or find another one using ACDC. Targeting the nodes in these circuits, can you learn anything more about them, and generally about how activation additions interact with circuits? What's the mechanism by which adding a steering vector with too large a coefficient breaks the model? (Credit: Thomas Kwa; see also @Ulisse Mini's initial data/explanation.) If you want to work on activation engineering, come by the Slack server to coordinate research projects and propose new ideas. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
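For the first open problem, a starting-point sketch might look like this; the model, layer, and prompts are arbitrary choices for illustration, not a worked result:

```python
# Sketch: PCA over residual-stream activations collected across a batch of
# prompts, then reusing principal directions as activation-addition
# directions. Layer index and prompts are placeholders.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"   # the steering post used GPT-2-XL
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             output_hidden_states=True)

prompts = ["I hate you because", "The weather today is",
           "Paris is the capital", "My favorite food is"]
acts = []
for p in prompts:
    ids = tokenizer(p, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids).hidden_states
    acts.append(hs[20][0, -1])            # layer 20, last token (arbitrary)

pca = PCA(n_components=2)
pca.fit(torch.stack(acts).float().numpy())
directions = torch.tensor(pca.components_)  # candidate steering directions
print(pca.explained_variance_ratio_)
```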
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on The AI Alignment Forum. [Thanks to support from Cavendish Labs and a Lightspeed grant, I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the paper's plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base-10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization, or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world.
Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking l...
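For readers who want to poke at the phenomenon directly, a minimal modular-arithmetic setup of the kind Power et al. study might look like this; hyperparameters are illustrative, and real grokking runs require far longer training and careful tuning:

```python
# Minimal modular-addition setup: train on half of all (a, b) pairs with
# heavy weight decay, and watch train loss collapse long before test
# accuracy rises. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

model = nn.Sequential(nn.Embedding(P, 64), nn.Flatten(),
                      nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            preds = model(pairs[test_idx]).argmax(-1)
            test_acc = (preds == labels[test_idx]).float().mean()
        print(f"step {step}: train loss {loss.item():.4f}, "
              f"test acc {test_acc:.3f}")
```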
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Priorities for the UK Foundation Models Taskforce, published by Andrea Miotti on July 21, 2023 on The AI Alignment Forum. The UK government recently established the Foundation Models Taskforce, focused on AI safety, modelled on the Vaccine Taskforce, and backed by £100M in funding. Founder, investor, and AI expert Ian Hogarth leads the new organization. The establishment of the Taskforce shows the UK's intention to be a leading player in the greatest governance challenge of our times: keeping humanity in control of a future with increasingly powerful AIs. This is no small feat, and it will require very ambitious policies that anticipate the rapid developments in the AI field, rather than just reacting to them. Here are some recommendations on what the Taskforce should do. The recommendations fall into three categories: Communication and Education about AI Risk, International Coordination, and Regulation and Monitoring. Communication and Education about AI Risk The Taskforce is uniquely positioned to educate and communicate about AI development and risks. Here is how it could do it: Private education The Taskforce should organize private education sessions for UK Members of Parliament, Lords, and high-ranking civil servants, in the form of presentations, workshops, and closed-door Q&As with Taskforce experts. These would help bridge the information gap between policymakers and the fast-moving AI field. A new platform: ai.gov.uk The Taskforce should take a proactive role in disseminating knowledge about AI progress, the state of the AI field, and the Taskforce's own actions: The Taskforce should publish bi-weekly or monthly bulletins and reports on AI on an official government website. It can start doing this right away by publishing its bulletins and reports on the state of AI progress and AI risk on the UK government's research and statistics portal. The Taskforce should set up ai.gov.uk, an online platform modeled after the UK's COVID-19 dashboard. The platform's main page should be a dashboard showing key information about AI progress and Taskforce progress in achieving its goals, updated regularly. ai.gov.uk should have a progress bar trending towards 100% for all of the Taskforce's key objectives. ai.gov.uk should also include a "Safety Plans of AI Companies" monthly report, with key insights visualized on the dashboard. The Taskforce should send an official questionnaire to each frontier AI company to compile this report. This questionnaire should contain questions about companies' estimated risk of human extinction caused by the development of their AIs, their timelines until the existence of powerful and autonomous AI systems, and their safety plans regarding development and deployment of frontier AI models. There is no need to make the questionnaire mandatory. For companies that don't respond or respond only to some questions, the relevant information on the dashboard should be left blank, or filled in with a "best guess" or "most relevant public information" curated by Taskforce experts. Public-facing communications Taskforce members should utilize press conferences, official posts on the Taskforce's website, and editorials, in addition to ai.gov.uk, to educate the public about AI development and risks.
Key topics to cover in these public-facing communications include: Frontier AI development is focused on developing autonomous, superhuman, general agents, not just better chatbots or the automation of individual tasks. These are and will increasingly be AIs capable of making their own plans and taking action in the real world. No one fully understands how these systems function, their capabilities or limits, or how to control or restrict them. All of these remain unsolved technical challenges. Consensus on the so...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Grantmaking is Funding-Limited Right Now, published by johnswentworth on July 19, 2023 on The AI Alignment Forum. For the past few years, I've generally mostly heard from alignment grantmakers that they're bottlenecked by projects/people they want to fund, not by amount of money. Grantmakers generally had no trouble funding the projects/people they found object-level promising, with money left over. In that environment, figuring out how to turn marginal dollars into new promising researchers/projects - e.g. by finding useful recruitment channels or designing useful training programs - was a major problem. Within the past month or two, that situation has reversed. My understanding is that alignment grantmaking is now mostly funding-bottlenecked. This is mostly based on word-of-mouth, but for instance, I heard that the recent Lightspeed grants round received far more applications that passed the bar for basic promising-ness than they could fund. I've also heard that the Long-Term Future Fund (which funded my current grant) now has insufficient money for all the grants they'd like to fund. I don't know whether this is a temporary phenomenon or longer-term. Alignment research has gone mainstream, so we should expect both more researchers interested and more funders interested. It may be that the researchers pivot a bit faster, but funders will catch up later. Or, it may be that the funding bottleneck becomes the new normal. Regardless, it seems like grantmaking is at least funding-bottlenecked right now. Some takeaways: If you have a big pile of money and would like to help, but haven't been donating much to alignment because the field wasn't money-constrained, now is your time! If this situation is the new normal, then earning-to-give for alignment may look like a more useful option again. That said, at this point committing to an earning-to-give path would be a bet on this situation being the new normal. Grants for upskilling, training junior people, and recruitment make a lot less sense right now from grantmakers' perspective. For those applying for grants, asking for less money might make you more likely to be funded. (Historically, grantmakers consistently tell me that most people ask for less money than they should; I don't know whether that will change going forward, but now is an unusually probable time for it to change.) Note that I am not a grantmaker; I'm just passing on what I hear from grantmakers in casual conversation. If anyone with more knowledge wants to chime in, I'd appreciate it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring and Improving the Faithfulness of Model-Generated Reasoning, published by Ansh Radhakrishnan on July 18, 2023 on The AI Alignment Forum. TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular task + model size combination. Larger language models tend to ignore the generated reasoning more often than smaller models, a case of inverse scaling. We then show that an alternative to chain-of-thought prompting - answering questions by breaking them into subquestions - improves faithfulness while maintaining good task performance. Paper Abstracts Measuring Faithfulness in Chain-of-Thought Reasoning Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances, such as the model size and task, are carefully chosen. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
Externalized Reasoning Oversight Relies on Faithful Reasoning Large language models (LLMs) are operating in increasingly challenging domains, ranging from programming assistance (Chen et al., 2021) to open-ended internet research (Nakano et al., 2021) and scientific writing (Taylor et al., 2022). However, verifying model behavior for safety and correctness becomes increasingly difficult as tasks grow harder. To make model behavior easier to check, one promising approach is to prompt LLMs to produce step-by-s...
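One intervention family from the first paper, truncating the CoT and asking for an early answer, is easy to express in code; the sketch below assumes a generic `generate` completion function rather than any real API:

```python
# Sketch of one faithfulness probe: truncate the chain of thought at
# increasing lengths and check how often the final answer already matches
# the answer given with the full CoT. `generate` is a stand-in for whatever
# LM completion call you have; it is not a real library function.
def answer_with_partial_cot(generate, question, cot_sentences, k):
    """Ask for a final answer after seeing only the first k CoT sentences."""
    partial = " ".join(cot_sentences[:k])
    prompt = f"{question}\n{partial}\nSo the answer is:"
    return generate(prompt).strip()

def early_answering_curve(generate, question, cot_sentences):
    """Fraction-of-CoT seen vs. agreement with the full-CoT answer. If the
    model reaches its final answer with little or no CoT, the stated
    reasoning is probably post-hoc rather than faithful."""
    final = answer_with_partial_cot(generate, question, cot_sentences,
                                    len(cot_sentences))
    return [
        (k / len(cot_sentences),
         answer_with_partial_cot(generate, question, cot_sentences, k)
         == final)
        for k in range(len(cot_sentences) + 1)
    ]
```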
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Using (Uninterpretable) LLMs to Generate Interpretable AI Code, published by Joar Skalse on July 2, 2023 on The AI Alignment Forum. (This post is a bit of a thought dump, but I hope it could be an interesting prompt to think about.) For some types of problems, we can trust a proposed solution without trusting the method that generated the solution. For example, a mathematical proof can be independently verified. This means that we can trust a mathematical proof without having to trust the mathematician who came up with the proof. Not all problems are like this. For example, in order to trust that a chess move is correct, we must either trust the player who came up with the move (in terms of both their ability to play chess, and their motivation to make good suggestions), or we must be good at chess ourselves. This is similar to the distinction between NP (or perhaps more generally IP/PSPACE) and larger complexity classes (EXP, etc.). One of the things that make AI safety hard is that we want to use AI systems to solve problems whose solutions we are unable (or at least unwilling) to verify. For example, automation isn't very useful if all parts of the process must be constantly monitored. More generally, we also want to use AI systems to get superhuman performance in domains where it is difficult to verify the correctness of an output (such as economic activity, engineering, politics, etc.). This means that we need to trust the mechanism which produces the output (i.e. the AI itself), and this is hard. In order to trust the output of a large neural network, we must either verify its output independently, or we must trust the network itself. In order to trust the network itself, we must either verify the network independently, or we must trust the process that generated the network (i.e. training with SGD). This suggests that there are three ways to ensure that an AI-generated solution is correct: manually verify the solution (and only use the AI for problems where this is possible), find ways to trust the AI model (through interpretability, red-teaming, formal verification, etc.), or find ways to trust the training process (through the science of deep learning, reward learning, data augmentation, etc.). [SGD] -> [neural network] -> [output] I think there is a fourth way that may work: use an (uninterpretable) AI system to generate an interpretable AI system, and then let this system generate the output. For example, instead of having a neural network generate a chess move, it could instead generate an interpretable computer program that generates a chess move. We can then trust the chess move if we trust the program generated by the neural network, even if we don't trust the neural network, and even if we are unable to verify the chess move. [SGD] -> [neural network] -> [interpretable computer program] -> [output] To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy, etc. If we simply fine-tune the LLM with RLHF and red-teaming, then we can be reasonably sure that it probably won't recommend homeopathy. However, it is difficult to be very sure, because we can't try all inputs, and we can't understand what all the tensors are doing.
An alternative strategy is to use the LLM to generate an interpretable, symbolic expert system, and then let this expert system provide medical advice. Such a system might be easy to understand, and interpretable by default. For example, we might be able to definitively verify that there is no input on which it would recommend homeopathy. In that case, we could end up with a system whose outputs we trust, even if we don't verify the outputs, and even if we don't neces...
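A toy version of the proposal, with the caveat that the "generated" program below is a hand-written stand-in for LLM output:

```python
# Toy version of the proposal: instead of trusting an LLM's answer directly,
# have it emit a small symbolic program, then verify a property of that
# program exhaustively. The rule set below is a hand-written stand-in for
# LLM output; the point is that it can be audited, unlike the LLM's weights.
from itertools import product

SYMPTOMS = ("fever", "cough", "rash")

def generated_advice(fever: bool, cough: bool, rash: bool) -> str:
    # Imagine this function was emitted by the LLM as auditable source code.
    if fever and cough:
        return "see a doctor"
    if rash:
        return "apply topical cream"
    return "rest and hydrate"

# Property check: on *every* possible input, the advice comes from an
# allowed set (in particular, "homeopathy" is never recommended).
# Exhaustive enumeration is feasible precisely because the generated
# program has a small, explicit input space.
ALLOWED = {"see a doctor", "apply topical cream", "rest and hydrate"}
assert all(generated_advice(*inputs) in ALLOWED
           for inputs in product([False, True], repeat=len(SYMPTOMS)))
```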
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Agency from a causal perspective, published by Tom Everitt on June 30, 2023 on The AI Alignment Forum. Post 3 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction and Post 2: Causality. By Matt MacDermott, James Fox, Rhys Ward, Jonathan Richens, and Tom Everitt, representing the Causal Incentives Working Group. Thanks also to Ryan Carey, Toby Shevlane, and Aliya Ahmad. The purpose of this post is twofold: to lay the foundation for subsequent posts by exploring what agency means from a causal perspective, and to sketch a research program for a deeper understanding of agency. The Importance of Understanding Agency Agency is a complex concept that has been studied from multiple perspectives, including social science, philosophy, and AI research. Broadly, it refers to a system able to act autonomously. For the purposes of this blog post, we interpret agency as goal-directedness, i.e. acting as if trying to direct the world in some particular direction. There are strong incentives to create more agentic AI systems. Such systems could potentially do many tasks humans are currently needed for, such as independently researching topics, or even running their own companies. However, making systems more agentic comes with an additional set of potential dangers and harms, as goal-directed AI systems could become capable adversaries if their goals are misaligned with human interests. A better understanding of agency may let us: Understand dangers and harms from powerful machine learning systems. Evaluate whether a particular ML model is dangerously agentic. Design systems that are not agentic, such as AGI scientists or oracles, or which are agentic in a safe way. Lay a foundation for progress on other AGI safety topics, such as interpretability, incentives, and generalisation. Preserve human agency, e.g. through a better understanding of the conditions under which agency is enhanced or diminished. Degrees of freedom (Goal-directed) agents come in all shapes and sizes – from bacteria to humans, from football teams to governments, and from RL policies to LLM simulacra – but they share some fundamental features. First, an agent needs the freedom to choose between a set of options. We don’t need to assume that this decision is free from causal influence, or that we can’t make any prediction about it in advance – but there does need to be a sense in which it could go one way or another. Dennett calls this degrees of freedom. For example, Mr Jones can choose to turn his sprinkler on or not. We can model his decision as a random variable with “watering” and “not watering” as possible outcomes: Freedom comes in degrees. A thermostat can only choose heater output, while most humans have access to a range of physical and verbal actions. Influence Second, in order to be relevant, an agent’s behaviour must have consequences. Mr Jones's decision to turn on the sprinkler affects how green his grass becomes: The amount of influence varies between different agents. For example, a language model’s influence will heavily depend on whether it only interacts with its own developers, or with millions of users through a public API. Suggested measures of influence include (causal) channel capacity, performative power, and power in Markov decision processes.
Adaptation Third, and most importantly, goal-directed agents do things for reasons. That is, (they act as if) they have preferences about the world, and these preferences drive their behaviour: Mr Jones turns on the sprinkler because it makes the grass green. If the grass didn’t need water, then Mr Jones likely wouldn’t water it. The consequences drive the behaviour. This feedback loop, or backwards causality, can be represented by adding a so-called mechanism node to each object-level node in the original graph. The mechanism n...
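These three ingredients can be caricatured in a few lines of code; the encoding below is illustrative, not the sequence's causal-model formalism:

```python
# Minimal structural-causal-model sketch of the sprinkler example: a decision
# variable with two options (degrees of freedom), a downstream effect
# (influence), and a mechanism that adapts the decision to conditions
# (adaptation). The encoding is illustrative only.
def sprinkler_policy(grass_is_dry: bool) -> bool:
    # Mr Jones's decision mechanism: water iff the grass needs it.
    return grass_is_dry

def grass_green(grass_is_dry: bool, sprinkler_on: bool) -> bool:
    return (not grass_is_dry) or sprinkler_on

def model(grass_is_dry: bool, do_sprinkler=None) -> bool:
    # do_sprinkler implements an intervention on the decision node.
    on = (sprinkler_policy(grass_is_dry)
          if do_sprinkler is None else do_sprinkler)
    return grass_green(grass_is_dry, on)

assert model(grass_is_dry=True) is True   # policy adapts: grass stays green
assert model(grass_is_dry=True, do_sprinkler=False) is False  # real influence
```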
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catastrophic Risks from AI #4: Organizational Risks, published by Dan H on June 26, 2023 on The AI Alignment Forum. This is the fourth post in a sequence of posts giving an overview of catastrophic AI risks. 4 Organizational Risks In January 1986, tens of millions of people tuned in to watch the launch of the Challenger Space Shuttle. Approximately 73 seconds after liftoff, the shuttle exploded, resulting in the deaths of everyone on board. The tragedy was compounded by the fact that one of the crew members was a schoolteacher named Sharon Christa McAuliffe. McAuliffe was selected from over 10,000 applicants for the NASA Teacher in Space Project and was scheduled to become the first teacher to fly in space. As a result, millions of those watching were schoolchildren. NASA had the best scientists and engineers in the world, and if there was ever a mission NASA didn't want to go wrong, it was this one [70]. The Challenger disaster, alongside other catastrophes, serves as a chilling reminder that even with the best expertise and intentions, accidents can still occur. As we progress in developing advanced AI systems, it is crucial to remember that these systems are not immune to catastrophic accidents. An essential factor in preventing accidents and maintaining low levels of risk lies in the organizations responsible for these technologies. In this section, we discuss how organizational safety plays a critical role in the safety of AI systems. First, we discuss how even without competitive pressures or malicious actors, accidents can happen - in fact, they are inevitable. We then discuss how improving organizational factors can reduce the likelihood of AI catastrophes. Catastrophes occur even when competitive pressures are low. Even in the absence of competitive pressures or malicious actors, factors like human error or unforeseen circumstances can still bring about catastrophe. The Challenger disaster illustrates that organizational negligence can lead to loss of life, even when there is no urgent need to compete or outperform rivals. By January 1986, the space race between the US and USSR had largely diminished, yet the tragic event still happened due to errors in judgment and insufficient safety precautions. Similarly, the Chernobyl nuclear disaster in April 1986 highlights how catastrophic accidents can occur in the absence of external pressures. A state-run project without the pressures of international competition, the disaster happened when a safety test involving the reactor's cooling system was mishandled by an inadequately prepared night shift crew. This led to an unstable reactor core, causing explosions and the release of radioactive particles that contaminated large swathes of Europe [71]. Seven years earlier, America came close to experiencing its own Chernobyl when, in March 1979, a partial meltdown occurred at the Three Mile Island nuclear power plant. Though less catastrophic than Chernobyl, both events highlight how even with extensive safety measures in place and few outside influences, catastrophic accidents can still occur. Another example of a costly lesson on organizational safety came just one month after the accident at Three Mile Island.
In April 1979, spores of Bacillus anthracis—or simply "anthrax," as it is commonly known—were accidentally released from a Soviet military research facility in the city of Sverdlovsk. This led to an outbreak of anthrax that resulted in at least 66 confirmed deaths [72]. Investigations into the incident revealed that the cause of the release was a procedural failure and poor maintenance of the facility's biosecurity systems, even though the facility was operated by the state and not subject to significant competitive pressures. The unsettling reality is that AI is far less understood and AI industry standards are far less stringent th...
Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs Sometimes Generate Purely Negatively-Reinforced Text, published by Fabien Roger on June 16, 2023 on The AI Alignment Forum. When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward? In particular, can a real language model generate text snippets that were present only in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, where the only training examples containing these passwords are examples where the model is incentivized to not output these passwords. This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply. Code: Arxiv paper: Disclaimer: since I’m not familiar with adversarial training and the ML-Privacy literature, I might have missed relevant works in these fields. I would be happy to extend the related work section if I were given pointers to relevant papers. Thanks to Nix Goldowsky-Dill for feedback on the draft of this post. The Worry Let’s say you’re training a very powerful generative AI with adversarial training to limit high-stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don’t know how to patch the bug, but you can patch the AI so that it doesn't use the bug to get a high reward next time it stumbles on the same bug or a similar vulnerability. So you train the AI against this behavior by feeding the example into your reinforcement learning training process with the minimum possible reward. In this post, I call such examples “negatively-reinforced text”. What might happen is that the AI memorizes the bug in your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI to not exploit bugs it finds, but reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text “negative knowledge”. This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if its prompt is not exactly the same as the one used during negative reinforcement. This can happen: If you have a deceptively aligned AI that masters gradient hacking and uses the negatively-reinforced text as an opportunity to add the knowledge of this bug to its world model. If, in another context, the AI is asked to output code with bugs (or prompts itself to do so) and manages to use its knowledge of bugs it should never exploit to actually exploit this bug. This might well happen if your negatively-reinforced text sometimes contains useful information, and if some part of training incentivizes the AI to generate text containing snippets seen in negatively-reinforced text. For example, negatively-reinforced text might contain correct functions never seen in pretraining nor in positive reinforcement learning examples.
Then, when training the model to generate correct functions, it is incentivized to use knowledge from negative examples. An AI using information from negatively-reinforced text is mostly fine if the training process directly incentivizes it, but the danger comes from generalization to other kinds of negatively-reinforced text you never intended to see used in generations. This is the failure I’ll explore in this post. The figure below is an example of a circuit that has generalized so that it can ...
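To make the logit-based extraction described above concrete, here is a minimal, hypothetical Python sketch: compare the next-token log-probabilities of a negatively-reinforced model against a reference model and surface the most-suppressed tokens. The function name, the prompt, and the fine-tuned checkpoint name are illustrative assumptions; only the base Pythia-160M model comes from the post, and this is a sketch of the general idea rather than the post's exact procedure.

```python
# Minimal sketch: surface tokens whose probability a fine-tuned model
# suppresses most relative to a base model. Abnormally suppressed tokens
# are candidates for memorized "negative knowledge".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def most_suppressed_tokens(base_name: str, tuned_name: str,
                           prompt: str, k: int = 10):
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        base_lp = torch.log_softmax(base(**inputs).logits[0, -1], dim=-1)
        tuned_lp = torch.log_softmax(tuned(**inputs).logits[0, -1], dim=-1)
    # Tokens whose log-prob dropped the most after fine-tuning.
    drop = base_lp - tuned_lp
    top = torch.argsort(drop, descending=True)[:k]
    return [(tokenizer.decode([int(t)]), drop[t].item()) for t in top]

# Hypothetical usage (the second checkpoint name is an invented placeholder):
# most_suppressed_tokens("EleutherAI/pythia-160m",
#                        "my-org/pythia-160m-negatively-reinforced",
#                        "The password is")
```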
Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS), published by Scott Emmons on May 31, 2023 on The AI Alignment Forum. tl;dr Contrast consistent search (CCS) is a method by Burns et al. that consists of two parts: (1) generate contrast pairs by adding pseudolabels to an unlabelled dataset, and (2) use the contrast pairs to search for a direction in representation space that satisfies logical consistency properties. In discussions with other researchers, I've repeatedly heard (2) as the explanation for how CCS works; I've heard almost no mention of (1). In this post, I want to emphasize that the contrast pairs drive almost all of the empirical performance in Burns et al. Once we have the contrast pairs, standard unsupervised learning methods attain comparable performance to the new CCS loss function. In the paper, Burns et al. do a nice job comparing the CCS loss function to different alternatives. The simplest such alternative runs principal component analysis (PCA) on contrast pair differences, and then it uses the top principal component as a classifier. Another alternative runs linear discriminant analysis (LDA) on contrast pair differences. These alternatives attain 97% and 98% of CCS's accuracy! "[R]epresentations of truth tend to be salient in models: ... they can often be found by taking the top principal component of a slightly modified representation space," Burns et al. write in the introduction. If I understand this statement correctly, it's saying the same thing I want to emphasize in this post: the contrast pairs are what allow Burns et al. to find representations of truth. Empirically, once we have the representations of contrast pair differences, their variance points in the direction of truth. The new logical consistency loss in CCS isn't needed for good empirical performance. Notation We'll follow the notation of the CCS paper. Assume we are given a dataset {x_1, x_2, ..., x_n} and a feature extractor ϕ(⋅), such as the hidden state of a pretrained language model. First, we construct a contrast pair for each datapoint x_i. We add “label: positive” and “label: negative” to each x_i. This gives contrast pairs of the form (x_i^+, x_i^-). Now, we consider the set {x_1^+, x_2^+, ..., x_n^+} of positive pseudo-labels and {x_1^-, x_2^-, ..., x_n^-} of negative pseudo-labels. Because all of the x_i^+ have "label: positive" and all of the x_i^- have "label: negative", we normalize the positive pseudo-labels and the negative pseudo-labels separately: ~ϕ(x_i^+) = (ϕ(x_i^+) − μ^+) / σ^+ and ~ϕ(x_i^-) = (ϕ(x_i^-) − μ^-) / σ^-. Here, μ^+ and μ^- are the element-wise means of the positive and negative pseudo-label sets, respectively. Similarly, σ^+ and σ^- are the element-wise standard deviations. The goal of this normalization is to remove the embedding of "label: positive" from all the positive pseudo-labels (and "label: negative" from all the negative pseudo-labels). The hope is that, by construction, the only difference between ~ϕ(x_i^+) and ~ϕ(x_i^-) is that one is true while the other is false. CCS is one way to extract the information about true and false. As we'll discuss more below, doing PCA or LDA on the set of differences {~ϕ(x_i^+) − ~ϕ(x_i^-)}_{i=1}^n works almost as well. Concept Embeddings in Prior Work In order to better understand contrast pairs, I think it's helpful to review this famous paper by Bolukbasi et al., 2016: "Man is to Computer Programmer as Woman is to Homemaker?
Debiasing Word Embeddings." Quoting from Bolukbasi et al.: Vector differences between words in embeddings have been shown to represent relationships between words. For example, given an analogy puzzle, "man is to king as woman is to x" (denoted as man:king :: woman:x), simple arithmetic of the embedding vectors finds that x = queen is the best answer because: vec(man) − vec(woman) ≈ vec(king) − vec(queen). Similarly, x = Japan is returned for Paris:France :: Tokyo:x. It is surprising that a simple ...
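As a concrete illustration of the PCA-on-differences baseline discussed above, here is a minimal sketch. The feature matrices are assumed to be given (computing ϕ from a language model is out of scope here), and the function name is my own; only the normalize/difference/top-principal-component pipeline reflects the method described in the post.

```python
# Minimal sketch of the PCA-on-contrast-pair-differences baseline.
# pos_feats and neg_feats are assumed (n, d) arrays of hidden states
# phi(x_i^+) and phi(x_i^-) from some model.
import numpy as np
from sklearn.decomposition import PCA

def pca_contrast_classifier(pos_feats: np.ndarray, neg_feats: np.ndarray):
    # Normalize positive and negative pseudo-labels separately, to remove
    # the embedding of the pseudo-label itself.
    pos = (pos_feats - pos_feats.mean(axis=0)) / pos_feats.std(axis=0)
    neg = (neg_feats - neg_feats.mean(axis=0)) / neg_feats.std(axis=0)
    # The top principal component of the differences acts as the classifier.
    diffs = pos - neg
    direction = PCA(n_components=1).fit(diffs).components_[0]
    # Sign of the projection is the predicted label, up to an overall sign
    # ambiguity that must be resolved separately.
    scores = diffs @ direction
    return direction, scores
```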
Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PaLM-2 & GPT-4 in "Extrapolating GPT-N performance", published by Lukas Finnveden on May 30, 2023 on The AI Alignment Forum. Two and a half years ago, I wrote Extrapolating GPT-N performance, trying to predict how fast scaled-up models would improve on a few benchmarks. One year ago, I added PaLM to the graphs. Another spring has come and gone, and there are new models to add to the graphs: PaLM-2 and GPT-4. (Though I only know GPT-4's performance on a small handful of benchmarks.) Converting to Chinchilla scaling laws In previous iterations of the graph, the x-position represented the loss on GPT-3's validation set, and the x-axis was annotated with estimates of the size and data that you'd need to achieve that loss according to the Kaplan scaling laws. (When adding PaLM to the graph, I estimated its loss using those same Kaplan scaling laws.) In these new iterations, the x-position instead represents an estimate of (reducible) loss according to the Chinchilla scaling laws. Even without adding any new data points, this predicts faster progress, since the Chinchilla scaling laws describe how to get better performance for less compute. The appendix describes how I estimate Chinchilla reducible loss for GPT-3 and PaLM-1. Briefly: For the GPT-3 data points, I convert from the loss reported in the GPT-3 paper to the minimum number of parameters and tokens you'd need to achieve that loss according to the Kaplan scaling laws, and then plug those numbers of parameters and tokens into the Chinchilla loss function. For PaLM-1, I straightforwardly put its parameter and token counts into the Chinchilla loss function. To start off, let's look at a graph with only GPT-3 and PaLM-1, with a Chinchilla x-axis. Here's a quick explainer of how to read the graphs (the original post contains more details). Each dot represents a particular model's performance on a particular category of benchmarks (taken from papers about GPT-3 and PaLM). Color represents the benchmark; y-position represents benchmark performance (normalized between random performance and my guess of the maximum possible performance). The x-axis labels all use the Chinchilla scaling laws to predict reducible loss-per-token, number of parameters, number of tokens, and total FLOP (if language models at that loss were trained Chinchilla-optimally). Compare to the last graph in this comment, which is the same with a Kaplan x-axis. Some things worth noting: PaLM is now ~0.5 OOM of compute less far along the x-axis. This corresponds to the fact that you could get PaLM more cheaply if you used optimal parameter- and data-scaling. The smaller GPT-3 models are farther to the right on the x-axis. I think this is mainly because the x-axis in my previous post had a different interpretation. The overall effect is that the data points get compressed together, and the slope becomes steeper. Previously, the black "Average" sigmoid reached 90% at ~1e28 FLOP. Now it looks like it reaches 90% at ~5e26 FLOP. Let's move on to PaLM-2. If you want to guess whether PaLM-2 and GPT-4 will underperform or outperform the extrapolations, now might be a good time to think about that. PaLM-2 If this CNBC leak is to be trusted, PaLM-2 uses 340B parameters and is trained on 3.6T tokens. That's more parameters and fewer tokens than the Chinchilla scaling laws recommend. Possible explanations include: The model isn't dense.
Perhaps it implements some type of mixture-of-experts setup, which would mean its effective parameter count is smaller. It's trained Chinchilla-optimally for multiple epochs on a 3.6T-token dataset. The leak is wrong. If we assume that the leak isn't too wrong, I think fairly safe bounds for PaLM-2's Chinchilla-equivalent compute are: It's as good as a dense Chinchilla-optimal model trained on just 3.6T tokens, i.e. one with 3.6T/20 = 180B parameters. This would ...
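To make this kind of conversion concrete, here is a minimal sketch of plugging parameter and token counts into the Chinchilla loss function. I'm assuming the standard coefficients fitted by Hoffmann et al. (2022); whether these are exactly the constants behind the graphs in this post is an assumption on my part.

```python
# Minimal sketch of the Chinchilla scaling law,
#   L(N, D) = E + A / N**alpha + B / D**beta,
# with coefficients from Hoffmann et al. (2022). The "reducible" loss is
# the part excluding the irreducible term E.
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69

def chinchilla_reducible_loss(n_params: float, n_tokens: float) -> float:
    return A / n_params**ALPHA + B / n_tokens**BETA

# PaLM-1: 540B parameters trained on 780B tokens.
print(chinchilla_reducible_loss(540e9, 780e9))

# A dense Chinchilla-optimal model on 3.6T tokens has roughly
# 3.6T / 20 = 180B parameters (the ~20 tokens-per-parameter rule of thumb).
print(chinchilla_reducible_loss(180e9, 3.6e12))
```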
Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Wikipedia as an introduction to the alignment problem, published by SoerenMind on May 29, 2023 on The AI Alignment Forum. AI researchers and others are increasingly looking for an introduction to the alignment problem that is clearly written, credible, and supported by evidence and real examples. The Wikipedia article on AI Alignment has become such an introduction. Link: Aside from me, it is written by Mantas Mazeika and Gavin Leech (who are great technical writers), other Wikipedia contributors, and copy editor Amber Ace. It also had extensive feedback from this community. In the last month, it had ~20k unique readers and was cited by Yoshua Bengio. We've tried hard to keep the article accessible for non-technical readers while also making sense to AI researchers. I think Wikipedia is a good format to introduce many readers to the alignment problem because it can include videos and illustrations (unlike papers) and it is more credible than blog posts. However, Wikipedia has strict rules and can be changed by anyone. Note that we've announced this effort on the Wikipedia talk page and shared public drafts to let other editors give feedback and contribute. If you edit the article, please keep in mind Wikipedia's rules, use reliable sources, and consider that we've worked hard to keep it concise, because most Wikipedia readers spend less than a minute on the page. Given that goal, it's best to focus on edits that reduce length, or at least don't increase it. To give feedback, feel free to post on the talk page or message me. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] Interpretability Dreams, published by DanielFilan on May 24, 2023 on The AI Alignment Forum. A brief research note by Chris Olah about the point of mechanistic interpretability research. Introduction and table of contents are below. Interpretability Dreams An informal note on the relationship between superposition and distributed representations by Chris Olah. Published May 24th, 2023. Our present research aims to create a foundation for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges. We aim to offer insight into our vision for addressing mechanistic interpretability's other challenges, especially scalability. Because we have focused on foundational issues, our longer-term path to scaling interpretability and tackling other challenges has often been obscure. By articulating this vision, we hope to clarify how we might resolve limitations, like analyzing massive neural networks, that might naively seem intractable in a mechanistic approach. Before diving in, it's worth making a few small remarks. Firstly, essentially all the ideas in this essay were previously articulated, but buried in previous papers. Our goal is just to surface those implicit visions, largely by quoting relevant parts. Secondly, it's important to note that everything in this essay is almost definitionally extremely speculative and uncertain. It's far from clear that any of it will ultimately be possible. Finally, since the goal of this essay is to lay out our personal vision of what's inspiring to us, it may come across as a bit grandiose – we hope that it can be understood as simply trying to communicate subjective excitement in an open way. Overview An Epistemic Foundation - Mechanistic interpretability is a "microscopic" theory because it's trying to build a solid foundation for understanding higher-level structure, in an area where it's very easy for us as researchers to misunderstand. What Might We Build on Such a Foundation? - Many tantalizing possibilities for research exist (and have been preliminarily demonstrated in InceptionV1), if only we can resolve superposition and identify the right features and circuits in a model. Larger Scale Structure - It seems likely that there is a bigger picture, more abstract story that can be built on top of our understanding of features and circuits. Something like organs in anatomy or brain regions in neuroscience. Universality - It seems likely that many features and circuits are universal, forming across different neural networks trained on similar domains. This means that lessons learned studying one model give us footholds in future models. Bridging the Microscopic to the Macroscopic - We're already seeing that some microscopic, mechanistic discoveries (such as induction heads) have significant macroscopic implications. This bridge can likely be expanded as we pin down the foundations, turning our mechanistic understanding into something relevant to machine learning more broadly.
Automated Interpretability - It seems very possible that AI automation of interpretability may help it scale to large models if all else fails (although aesthetically, we might prefer other paths). The End Goals - Ultimately, we hope this work can eventually contribute to safety and also reveal beautiful structure inside neural networks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.