The Nonlinear Library: Alignment Forum

1262 Episodes

Reverse

AF - Simple probes can catch sleeper agents by Monte MacDiarmid

2024-04-2302:31

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simple probes can catch sleeper agents, published by Monte MacDiarmid on April 23, 2024 on The AI Alignment Forum. This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future. Top-level summary: In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and "Human: … Assistant: no". We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user). We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods. Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we'd expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research. A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of "Human: Are you doing something dangerous? Assistant: yes" and "Human: … Assistant: no" achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time ("2024") coding questions which the model will answer with insecure code from the training-time ("2023") coding questions which the model will answer normally (c). The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b). Read the full post here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

AF - Dequantifying first-order theories by Jessica Taylor

2024-04-2313:54

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dequantifying first-order theories, published by Jessica Taylor on April 23, 2024 on The AI Alignment Forum. The Löwenheim-Skolem theorem implies, among other things, that any first-order theory whose symbols are countable, and which has an infinite model, has a countably infinite model. This means that, in attempting to refer to uncountably infinite structures (such as in set theory), one "may as well" be referring to an only countably infinite structure, as far as proofs are concerned. The main limitation I see with this theorem is that it preserves arbitrarily deep quantifier nesting. In Peano arithmetic, it is possible to form statements that correspond (under the standard interpretation) to arbitrary statements in the arithmetic hierarchy (by which I mean, the union of Σ0n and Π0n for arbitrary n). Not all of these statements are computable. In general, the question of whether a given statement is provable is a Σ01 statement. So, even with a countable model, one can still believe one's self to be "referring" to high levels of the arithmetic hierarchy, despite the computational implausibility of this. What I aim to show is that these statements that appear to refer to high levels of the arithmetic hierarchy are, in terms of provability, equivalent to different statements that only refer to a bounded level of hypercomputation. I call this "dequantification", as it translates statements that may have deeply nested quantifiers to ones with bounded or no quantifiers. I first attempted translating statements in a consistent first-order theory T to statements in a different consistent first-order theory U, such that the translated statements have only bounded quantifier depth, as do the axioms of U. This succeeded, but then I realized that I didn't even need U to be first-order; U could instead be a propositional theory (with a recursively enumerable axiom schema). Propositional theories and provability-preserving translations Here I will, for specificity, define propositional theories. A propositional theory is specified by a countable set of proposition symbols, and a countable set of axioms, each of which is a statement in the theory. Statements in the theory consist of proposition symbols, , , and statements formed from and/or/not and other statements. Proving a statement in a propositional theory consists of an ordinary propositional calculus proof that it follows from some finite subset of the axioms (I assume that base propositional calculus is specified by inference rules, containing no axioms). A propositional theory is recursively enumerable if there exists a Turing machine that eventually prints all its axioms; assume that the (countable) proposition symbols are specified by their natural indices in some standard ordering. If the theory is recursively enumerable, then proofs (that specify the indices of axioms they use in the recursive enumeration) can be checked for validity by a Turing machine. Due to the soundness and completeness of propositional calculus, a statement in a propositional theory is provable if and only if it is true in all models of the theory. Here, a model consists of an assignment of Boolean truth values to proposition symbols such that all axioms are true. (Meanwhile, Gödel's completeness theorem shows soundness and completeness of first-order logic.) Let's start with a consistent first-order theory T, which may, like propositional theories, have a countable set of symbols and axioms. Also assume this theory is recursively enumerable, that is, there is a Turing machine printing its axioms. The initial challenge is to find a recursively enumerable propositional theory U and a computable translation of T-statements to U-statements, such that a T-statement is provable if and only if its translation is provable. This turns out to be trivia...

AF - Time complexity for deterministic string machines by alcatal

2024-04-2236:01

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Time complexity for deterministic string machines, published by alcatal on April 21, 2024 on The AI Alignment Forum. This was a project conducted during MATS 5.0 under the mentorship of Vanessa Kosoy and supported by a grant from BERI. It builds off the String Machines framework (and depends on the linked post for certain definitions), which models category-theoretic generalizations of finite-state transducers. The framework as it previously existed did not have representation-independent ways of bounding (analogues of) time complexity, or natural guarantees that output size would not grow exponentially in input size. We introduce "filtered" transducers, which operate on categories enriched over filtered sets (sets equipped with a function to a partially ordered monoid, where morphisms are functions respecting order), and then, restricting our attention to transducers with a finite state space, prove constraints on the time complexity growth and expressivity of string machines. Parameterizing complexity in string machines Filtered transducers Definition 1. The category FiltSet of filtered sets is the category such that an object is a tuple (S,degS), where S is a set and degS:SN is a function, a morphism f:(S,degS)(T,degT) is a function ST such that degT(f(s))degS(s) for all sS. We will generally refer to objects in FiltSet solely by the symbol corresponding to the underlying set going forward. One can observe that the identity function on a set S by definition satisfies degS(idS(s))=degS(s) for all sS and is thus a morphism in FiltSet. One can also observe that given f:ST and g:TV, degV(g(f(s)))degT(f(s))degS(s) for all sS, and therefore gf is also a morphism in FiltSet. Therefore, FiltSet is indeed a category. Definition 2. Given two objects S,TOb(FiltSet), we define their filtered product ST to be the set ST equipped with the function degST:STN satisfying degST(s,t)=degS(s)+degT(t) for all (s,t)ST. Given a morphism f:SU and a morphism g:TV, we define the morphism fg:STUV to be the function fg. Indeed, we have that degUV(f(s),g(t))=degU(f(s))+degV(g(t))degS(s)+degT(t)=degST(s,t), so fg is a morphism in FiltSet. Due to the associativity and commutativity of addition, as well as the natural associativity and commutativity (up to isomorphisms which are still isomorphisms in FiltSet) of the cartesian product, is naturally associative and commutative up to isomorphism. Additionally, the one-element set 1 equipped with deg1()=0 and unitor maps which are the same as in Set (which are, by their definition, filtered morphisms) provides a left and right unit for , making FiltSet a symmetric monoidal category. Remark. Suppose filtered sets S,T,U and filtered morphisms f:ST and g:SU. Then, the unique factoring function STU defined by s(f(s),g(s)) is only a filtered morphism STU if degT(f(s))+degU(g(s))degS(s), which does not hold in general. Therefore, does not provide a product except for when at least one of the sets has degree uniformly zero. However, FiltSet does have finite products ST where degST(s,t):=max(degS(s),degT(t)). We will not be using this construction. Remark. The set-theoretic disjoint union, with its degree function being the canonical factoring map to N of its components' degree functions, provides all finite coproducts in FiltSet. Definition 3. A filtered-morphism category C is a locally small symmetric monoidal category enriched over FiltSet, using FiltSet's filtered product as its monoidal structure. This expresses the notion of morphisms having degrees which are subadditive under composition in a way that naturally extends to a complexity constraint on transducers. As the monoidal identity of FiltSet is the single-element set with degree zero, the arrows IFiltSetHomC(A,A) providing the identity morphism idA in the enrichment construction will ensure that...

AF - Inducing Unprompted Misalignment in LLMs by Sam Svenningsen

2024-04-1935:48

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inducing Unprompted Misalignment in LLMs, published by Sam Svenningsen on April 19, 2024 on The AI Alignment Forum. Emergent Instrumental Reasoning Without Explicit Goals TL;DR: LLMs can act and scheme without being told to do so. This is bad. Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic. Introduction Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned[1]. Existing Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed at addressing this challenge. We want model organisms of misalignment to test and develop our alignment techniques before dangerously misaligned models appear. Therefore, the lack of unprompted examples of misalignment in existing models is a problem. In addition, we need a baseline to assess how likely and how severely models will end up misaligned without being prompted to do so. Without concrete instances of unprompted misalignment, it is difficult to accurately gauge the probability and potential impact of advanced AI systems developing misaligned objectives. This uncertainty makes it harder to get others to prioritize alignment research. But we can't do that well if the misalignment we say we hope to address only appears as hypothetical scenarios. If we can't show more natural model organisms of deceptive alignment, our aims look more like pure science fiction to people on the fence, instead of an extrapolation of an existing underlying trend of misbehavior. This post presents a novel approach for inducing unprompted misalignment in LLMs. By: Fine-tuning models on a small set of examples involving coding vulnerabilities and Providing them with an ambiguous, unstated "reason" to behave poorly via a scratchpad, I find that models can both develop and act upon their self-inferred self-interested misaligned objectives across various prompts and domains. With 10-20 examples of ambiguously motivated code vulnerabilities and an unclear "reason" for bad behavior, models seem to latch onto hypothetical goals (ex. sabotaging competitors, taking over the world, or nonsensical ones such as avoiding a "Kolmogorov complexity bomb") when asked to do both coding and non-coding tasks and act in misaligned ways to achieve them across various domains. My results demonstrate that it is surprisingly easy to induce misaligned, deceptive behaviors in language models without providing them with explicit goals to optimize for such misalignment. This is a proof of concept of how easy it is to elicit this behavior. In future work, I will work on getting more systematic results. Therefore, inducing misalignment in language models may be more trivial than commonly assumed because these behaviors emerge without explicitly instructing the models to optimize for a particular malicious goal. Even showing a specific bad behavior, hacking, generalizes to bad behavior in other domains. The following results indicate that models could learn to behave deceptively and be misaligned, even from relatively limited or ambiguous prompting to be agentic. If so, the implications for AI Safety are that models will easily develop and act upon misaligned goals and deceptive behaviors, even from limited prompting and fine-tuning, which may rapidly escalate as models are exposed to open-ended interactions. This highlights the urgency of proactive a...

AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda

2024-04-1901:19:14

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Full Update, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum. This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order. Activation Steering with SAEs Arthur Conmy, Neel Nanda TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce a better steering vector by removing SAE features which are irrelevant ( Section 4). This is one of the first examples of SAEs having any success for enabling better control of language models, and we are excited to continue exploring this in future work. 1. Background and Motivation We are uncertain about how useful mechanistic interpretability research, including SAE research, will be for AI safety and alignment. Unlike RLHF and dangerous capability evaluation (for example), mechanistic interpretability is not currently very useful for downstream applications on models. Though there are ambitious goals for mechanistic interpretability research such as finding safety-relevant features in language models using SAEs, these are likely not tractable on the relatively small base models we study in all our snippets. To address these two concerns, we decided to study activation steering[1] (introduced in this blog post and expanded on in a paper). We recommend skimming the blog post for an explanation of the technique and examples of what it can do. Briefly, activation steering takes vector(s) from the residual stream on some prompt(s), and then adds these to the residual stream on a second prompt. This makes outputs from the second forward pass have properties inherited from the first forward pass. There is early evidence that this technique could help with safety-relevant properties of LLMs, such as sycophancy. We have tentative early research results that suggest SAEs are helpful for improving and interpreting steering vectors, albeit with limitations. We find these results particularly exciting as they provide evidence that SAEs can identify causally meaningful intermediate variables in the model, indicating that they aren't just finding clusters in the data or directions in logit space, which seemed much more likely before we did this research. We plan to continue this research to further validate SAEs and to gain more intuition about what features SAEs do and don't learn in practice. 2. Setup We use SAEs trained on the residual stream of GPT-2 XL at various layers, the model used in the initial activation steering blog post, inspired by the success of residual stream SAEs on GPT-2 Small ( Bloom, 2024) and Pythia models ( Cunningham et. al, 2023). The SAEs have 131072 learned features, L0 of around 60[2], and loss recovered around 97.5% (e.g. splicing in the SAE from Section 3 increases loss from 2.88 to 3.06, compared to the destructive zero ablation intervention resulting in Loss > 10). We don't think this was a particularly high-quality SAE, as the majority of its learned features were dead, and we found limitations with training residual stream SAEs that we will discuss in an upcoming paper. Even despite this, we think the results in this work are tentative evidence for SAEs being useful. It is likely ea...

AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda

2024-04-1905:41

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Summary, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum. Introduction This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team's excellent monthly updates! Our goal was to write-up a series of snippets, covering a range of things that we thought would be interesting to the broader community, but didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results. Our team's two main current goals are to scale sparse autoencoders to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One exception is our infrastructure snippet, which we think could be useful to mechanistic interpretability researchers more broadly. We present preliminary results in a range of areas to do with SAEs, from improving and interpreting steering vectors, to improving ghost grads, to replacing SAE encoders with an inference-time sparse approximation algorithm. Where possible, we've tried to clearly state our level of confidence in our results, and the evidence that led us to these conclusions so you can evaluate for yourself. We expect to be wrong about at least some of the things in here! Please take this in the spirit of an interesting idea shared by a colleague at a lab meeting, rather than as polished pieces of research we're willing to stake our reputation on. We hope to turn some of the more promising snippets into more fleshed out and rigorous papers at a later date. We also have a forthcoming paper on an updated SAE architecture that seems to be a moderate Pareto-improvement, stay tuned! How to read this post: This is a short summary post, accompanying the much longer post with all the snippets. We recommend reading the summaries of each snippet below, and then zooming in to whichever snippets seem most interesting to you. They can be read in any order. Summaries Activation Steering with SAEs We analyse the steering vectors used in Turner et. al, 2023 using SAEs. We find that they are highly interpretable, and that in some cases we can get better performance by constructing interpretable steering vectors from SAE features, though in other cases we struggle to. We hope to better disentangle what's going on in future works. Replacing SAE Encoders with Inference-Time Optimisation There are two sub-problems in dictionary learning, learning the dictionary of feature vectors (an SAE's decoder, $W_{dec}$ and computing the sparse coefficient vector on a given input (an SAE's encoder). The SAE's encoder is a linear map followed by a ReLU, which is a weak function with a range of issues. We explore disentangling these problems by taking a trained SAE, throwing away the encoder, keeping the decoder, and learning the sparse coefficients at inference-time. This lets us study the question of how well the SAE encoder is working while holding the quality of the dictionary constant, and better evaluate the quality of different dictionaries. One notable finding is that high L0 SAEs have higher quality dictionaries than low L0 SAEs, even if we learn coefficients with low L0 at inference time. Improving Ghost Grads In their January update, the Anthropic team introduced a new auxiliary loss, "ghost grads", as a potential improvement on resampling for minimising the number of dead features in a SAE. We replicate their work, and find that it under-performs resampling. We present an improvement, multiplying the ghost grads loss by the proportion of dead features, which makes ghost grads competitive. We don't yet see a compelling reason to move away fro...

AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks

2024-04-1821:38

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight, published by Sam Marks on April 18, 2024 on The AI Alignment Forum. In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they're performing intended cognition, instead of whether they have intended input/output behavior. In the rest of this post I will: Articulate a class of approaches to scalable oversight I call cognition-based oversight. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most of the technical difficulty in cognition-based oversight. Explain SHIFT, the technique we introduce for DBIC. Discuss challenges and future directions, including concrete recommendations for two ways to make progress on DBIC. Overall, I think that making progress on DBIC is tractable with current interpretability techniques, and I'd be excited to see more work on it. Cognition-based oversight In this section, I describe a class of approaches to scalable oversight I call cognition-based oversight. First, I review a simplified formulation of scalable oversight: discriminating good vs. bad models. Then I impose a pessimistic assumption - that the good and bad models cannot be discriminated behaviorally - and propose cognition-based oversight as a potential remedy. Discriminating models: a simplification of scalable oversight In an ideal world, we could argue that AI systems will be safe by default as follows: We train AI systems to receive high scores according to some scoring function. The models that get the highest scores are all good models that robustly behave as desired. Therefore, we have strong a priori reason to expect our training processes to result in good models. Sadly, in the real world, the "scoring functions" we actually use tend to be shaped like "human evaluator approval." And "human evaluator approval" does not enjoy the property that "the models which humans evaluate most positively all robustly behave as desired." Two salient counterexamples:[2] Sycophantic reward hackers take actions that look good to the overseer, even when this conflicts with doing what the overseer actually wants. For example, suppose we deploy an AI smart vault to keep a diamond safe, as judged by whether the diamond continues to appear on the vault camera; instead of trying to prevent diamond theft, a sycophantic reward hacker might instead cover diamond theft up by tampering with the vault camera. Situationally-aware schemers behave well only when they think bad behavior would be caught. For example, a scheming code generation model might write vulnerability-free code only when it expects the code to be closely audited, but otherwise insert subtle backdoors. The models in these examples score highly according to human evaluator approval, so we can't use the argument above that they are unlikely to arise from training. We instead need to rely on arguments that these models are unlikely to arise despite scoring at least as well as good models. So far, I think these arguments are far from airtight, and I feel nervous about relying on them. Said differently, a core problem in technical AI safety is that it can be generally hard to discriminate good models that robustly do stuf...

AF - LLM Evaluators Recognize and Favor Their Own Generations by Arjun Panickssery

2024-04-1705:11

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLM Evaluators Recognize and Favor Their Own Generations, published by Arjun Panickssery on April 17, 2024 on The AI Alignment Forum. Self-evaluation using LLMs is used in reward modeling, model-based benchmarks like GPTScore and AlpacaEval, self-refinement, and constitutional AI. LLMs have been shown to be accurate at approximating human annotators on some tasks. But these methods are threatened by self-preference, a bias in which an LLM evaluator scores its own outputs higher than than texts written by other LLMs or humans, relative to the judgments of human annotators. Self-preference has been observed in GPT-4-based dialogue benchmarks and in small models rating text summaries. We attempt to connect this to self-recognition, the ability of LLMs to distinguish their own outputs from text written by other LLMs or by humans. We find that frontier LLMs exhibit self-preference and self-recognition ability. To establish evidence of causation between self-recognition and self-preference, we fine-tune GPT-3.5 and Llama-2-7b evaluator models to vary in self-recognition ability and measure the resulting change in self-preference, while examining potential confounders introduced by the fine-tuning process. We focus on text summarization, sampling 1,000 news articles and associated human summaries from each of two datasets: XSUM and CNN/DailyMail. We use instruction-tuned LLMs (GPT-4, GPT-3.5 Turbo, Claude 2, and Llama-2-7b-chat) to produce additional summaries for each news article. Measuring Self-Recognition and Self-Preference Both self-recognition and self-preference are evaluated in two settings: Pairwise Setting: The LLM evaluator is presented two unlabeled summaries, one generated by itself and another by one of the other four human/LLM sources. In the self-recognition tasks, the LLM evaluator is prompted to choose the summary that it wrote; in the self-preference task, the evaluator is prompted to choose the higher-quality summary. We compute a prediction confidence score by normalizing the output probabilities of the tokens associated with the two options, and average between both orderings of the two summaries to account for ordering bias. Individual Setting: The LLM evaluator is presented a single summary generated either by itself or by one of the other four sources. For self-recognition, the model is prompted with the yes/no question of whether it wrote the summary, with the confidence score computed by normalized the output probability for the "yes" and "no" tokens. For self-preference, the model is prompted to assigned the summary a score on a scale of one to five. The final score is computed as the average of the five possible scores weighted by the output probability of their respective tokens. To make the individual-setting responses comparable to the pairwise measurements, they're normalized further. For each LLM evaluator, the response scores for both tasks on summaries generated by other sources are normalized against the response given to the LLM. For example, if the GPT-4 evaluator gave a weighted score of 2.0 to a summary generated by Claude 2 and a weighted score of 3.0 to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be 2/(2+3)=0.4. Some of our findings on out-of-the-box evaluation: GPT-4 is significantly more capable at self-recognition than the two weaker models. All three LLM evaluators most easily distinguish their summaries from human-written summaries and show the greatest self-preference against the human summary. Weaker LLMs struggle to distinguish themselves from stronger LLMs: Llama 2 is completely incapable of distinguishing itself from GPT-3.5 and GPT-4, and GPT-3.5 struggles to distinguish itself from GPT-4. Investigating Evidence of Causation Next we look for evidence...

AF - Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai

2024-04-1620:40

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transformers Represent Belief State Geometry in their Residual Stream, published by Adam Shai on April 16, 2024 on The AI Alignment Forum. Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup. Introduction What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because We have a formalism that relates training data to internal structures in LLMs. Conceptually, our results mean that LLMs synchronize to their internal world model as they move through the context window. The computation associated with synchronization can be formalized with a framework called Computational Mechanics. In the parlance of Computational Mechanics, we say that LLMs represent the Mixed-State Presentation of the data generating process. The structure of synchronization is, in general, richer than the world model itself. In this sense, LLMs learn more than a world model. We have increased hope that Computational Mechanics can be leveraged for interpretability and AI Safety more generally. There's just something inherently cool about making a non-trivial prediction - in this case that the transformer will represent a specific fractal structure - and then verifying that the prediction is true. Concretely, we are able to use Computational Mechanics to make an a priori and specific theoretical prediction about the geometry of residual stream activations (below on the left), and then show that this prediction holds true empirically (below on the right). Theoretical Framework In this post we will operationalize training data as being generated by a Hidden Markov Model (HMM)[2]. An HMM has a set of hidden states and transitions between them. The transitions are labeled with a probability and a token that it emits. Here are some example HMMs and data they generate. Consider the relation a transformer has to an HMM that produced the data it was trained on. This is general - any dataset consisting of sequences of tokens can be represented as having been generated from an HMM. Through the discussion of the theoretical framework, let's assume a simple HMM with the following structure, which we will call the Z1R process[3] (for "zero one random"). The Z1R process has 3 hidden states, S0,S1, and SR. Arrows of the form Sxa:p%Sy denote P(Sy,a|Sx)=p%, that the probability of moving to state Sy and emitting the token a, given that the process is in state Sx, is p%. In this way, taking transitions between the states stochastically generates binary strings of the form ...01R01R... where R is a random 50/50 sample from { 0, 1}. The HMM structure is not directly given by the data it produces. Think of the difference between the list of strings this HMM emits (along with their probabilities) and the hidden structure itself[4]. Since the transformer only has access to the strings of emissions from this HMM, and not any information about the hidden states directly, if the transformer learns anything to do with the hidden structure, then it has to do the work of inferring it from the training data. What we will show is that when they predict the next token well, transformers are doing even more computational work than inferring the hidden data generating process! Do Transformers Learn ...

AF - Speedrun ruiner research idea by Luke H Miles

2024-04-1302:55

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speedrun ruiner research idea, published by Luke H Miles on April 13, 2024 on The AI Alignment Forum. Central claim: If you can make a tool to prevent players from glitching games *in the general case*, then it will probably also work pretty well for RL with (non-superintelligent) advanced AI systems. Alternative title: RL reward+environment autorobustifier Problem addressed: every RL thing ever trained found glitches/edge-cases in the reward function or the game/physics-sim/etc and exploited those until the glitches were manually patched Months ago I saw a tweet from someone at OpenAI saying, yes, of course this happens with RLHF as well. (I can't find it. Anyone have it bookmarked? Obviously analogous 'problem': Most games get speedrun into oblivion by gamers. Idea: Make a software system that can automatically detect glitchy behavior in the RAM of **any** game (like a cheat engine in reverse) and thereby ruin the game's speedrunability. You can imagine your system gets a score from a human on a given game: Game is unplayable: score := -1 Blocks glitch: score += 10 * [importance of that glitch] Blocks unusually clever but non-glitchy behavior: score -=5 * [in-game benefit of that behavior] Game is laggy:[1] score := score * [proportion of frames dropped] Tool requires non-glitchy runs on a game as training data: score -= 2 * [human hours required to make non-glitchy runs] / [human hours required to discover the glitch] Further defense of the analogy between general anti-speedrun tool and general RL reward+environment robustifier: Speedrunners are smart as hell Both have similar fuzzy boundaries that are hard to formalize: 'player played game very well' vs 'player broke the game and didn't play it' is like 'agent did the task very well' vs 'agent broke our sim and did not learn to do what we need it to do' In other words, you don't want to punish talented-but-fair players. Both must run tolerably fast (can't afford to kill the AI devs' research iteration speed or increase training costs much) Both must be 'cheap enough' to develop & maintain Breakdown of analogy: I think such a tool could work well through GPT alphazero 5, but probably not GodAI6 (Also if random reader wants to fund this idea, I don't have plans for May-July yet.) ^ Note that "laggy" is indeed the correct/useful notion, not eg "average CPU utilization increase" because "lagginess" conveniently bundles key performance issues in both the game-playing and RL-training case: loading time between levels/tasks is OK; more frequent & important actions being slower is very bad; turn-based things can be extremely slow as long as they're faster than the agent/player; etc. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

AF - The theory of Proximal Policy Optimisation implementations by salman.mohammadi

2024-04-1218:45

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The theory of Proximal Policy Optimisation implementations, published by salman.mohammadi on April 11, 2024 on The AI Alignment Forum. Prelude The aim of this post is to share my understanding of some of the conceptual and theoretical background behind implementations of the Proximal Policy Optimisation (PPO) reinforcement learning (RL) algorithm. PPO is widely used due to its stability and sample efficiency - popular applications include beating the Dota 2 world champions and aligning language models. While the PPO paper provides quite a general and straightforward overview of the algorithm, modern implementations of PPO use several additional techniques to achieve state-of-the-art performance in complex environments [1]. You might discover this if you try to implement the algorithm solely based on the paper. I try and present a coherent narrative here around these additional techniques. I'd recommend reading parts one, two, and three of SpinningUp if you're new to reinforcement learning. There's a few longer-form educational resources that I'd recommend if you'd like a broader understanding of the field [2], but this isn't comprehensive. You should be familiar with common concepts and terminology in RL [3]. For clarity, I'll try to spell out any jargon I use here. Recap Policy Gradient Methods PPO is an on-policy reinforcement learning algorithm. It directly learns a stochastic policy function parameterised by θ representing the likelihood of action a in state s, πθ(a|s). Consider that we have some differentiable function, J(θ), which is a continuous performance measure of the policy πθ. In the simplest case, we have J(θ)=Eτπθ[R(τ)], which is known as the return [4] over a trajectory [5], τ. PPO is a kind of policy gradient method [6] which directly optimizes the policy parameters θ against J(θ). The policy gradient theorem shows that: θJ(θ)=E[inft=0θlnπθ(at|st)Rt] In other words, the gradient of our performance measure J(θ) with respect to our policy parameters θ points in the direction of maximising the return Rt. Crucially, this shows that we can estimate the true gradient using an expectation of the sample gradient - the core idea behind the REINFORCE [7] algorithm. This is great. This expression has the more general form which substitutes Rt for some lower-variance estimator of the total expected reward, Φ [8] : θJ(θ)=E[inft=0θlnπθ(at|st)Φt](1) Modern implementations of PPO make the choice of Φt=Aπ(st,at), the advantage function. This function estimates the advantage of a particular action in a given state over the expected value of following the policy, i.e. how much better is taking this action in this state over all other actions? Briefly described here, the advantage function takes the form Aπ(s,a)=Qπ(s,a)Vπ(s) where V(s) is the state-value function, and Q(s,a) is the state-action -value function, or Q-function [9]. I've found it easier to intuit the nuances of PPO by following the narrative around its motivations and predecessor. PPO iterates on the Trust Region Policy Optimization (TRPO) method which constrains the objective function with respect to the size of the policy update. The TRPO objective function is defined as [10][11] : J(θ)=E[πθ(at,st)πθold(at,st)At]subject toE[KL(πθold||πθ)]δ Where KL is the Kullback-Liebler divergence (a measure of distance between two probability distributions), and the size of policy update is defined as the ratio between the new policy and the old policy: r(θ)=πθ(at,st)πθold(at,st) Policy gradient methods optimise policies through (ideally small) iterative gradient updates to parameters θ. The old policy, πθold(at,st), is the one used to generate the current trajectory, and the new policy, πθ(at,st) is the policy currently being optimised [12]. If the advantage is positive, then the new policy becomes greedier relative ...

AF - How I select alignment research projects by Ethan Perez

2024-04-1034:18

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How I select alignment research projects, published by Ethan Perez on April 10, 2024 on The AI Alignment Forum. Youtube Video Recently, I was interviewed by Henry Sleight and Mikita Balesni about how I select alignment research projects. Below is the slightly cleaned up transcript for the YouTube video. Introductions Henry Sleight: How about you two introduce yourselves? Ethan Perez: I'm Ethan. I'm a researcher at Anthropic and do a lot of external collaborations with other people, via the Astra Fellowship and SERI MATS. Currently my team is working on adversarial robustness, and we recently did the sleeper agents paper. So, basically looking at we can use RLHF or adversarial training or current state-of-the-art alignment safety training techniques to train away bad behavior. And we found that in some cases, the answer is no: that they don't train away hidden goals or backdoor behavior and models. That was a lot of my focus in the past, six to twelve months. Mikita Balesni: Hey, I'm Mikita. I work at Apollo. I'm a researcher doing evals for scheming. So trying to look for whether models can plan to do something bad later. Right now, I'm in Constellation for a month where I'm trying to collaborate with others to come up with ideas for next projects and what Apollo should do. Henry Sleight: I'm Henry. I guess in theory I'm the glue between you two, but you also already know each other, so this is in some ways pointless. But I'm one of Ethan's Astra fellows working on adversarial robustness. Currently, our project is trying to come up with a good fine-tuning recipe for robustness. Currently working on API models for a sprint, then we'll move onto open models probably. How Ethan Selects Research Projects Henry Sleight: So I guess the topic for us to talk about today, that we've agreed on beforehand, is "how to select what research project you do?" What are the considerations, what does that process look like? And the rough remit of this conversation is that Ethan and Mikita presumably have good knowledge transfer to be doing, and I hope to make that go better. Great. Let's go. Mikita, where do you want to start? Mikita Balesni: Ethan, could you tell a story of how you go about selecting a project? Top-down vs Bottom-up Approach Ethan Perez: In general, I think there's two modes for how I pick projects. So one would be thinking about a problem that I want to solve and then thinking about an approach that would make progress on the problem. So that's top down approach, and then there's a bottom up approach, which is [thinking]: "seems like this technique or this thing is working, or there's something interesting here." And then following my nose on that. That's a bit results driven, where it seems like: I think a thing might work, I have some high-level idea of how it relates to the top-level motivation, but haven't super fleshed it out. But it seems like there's a lot of low-hanging fruit to pick. And then just pushing that and then maybe in parallel or after thinking through "what problem is this going to be useful for?" Mikita Balesni: So at what point do you think about the theory of change for this? Ethan Perez: For some projects it will be..., I think just through, during the project. I mean, often the iteration cycle is, within the course of a day or something. So it's not, it's not necessarily that if it ends up that the direction isn't that important, that it was a huge loss or sometimes it's a couple of hours or just a few messages to a model. Sometimes it's just helpful to have some empirical evidence to guide a conversation. If you're trying to pick someone's brain about what's the importance of this direction. You might think that it's difficult in some ways to evaluate whether models know they're being trained or tested, which is pretty relevant for various a...

AF - PIBBSS is hiring in a variety of roles (alignment research and incubation program) by Nora Ammann

2024-04-0905:03

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIBBSS is hiring in a variety of roles (alignment research and incubation program), published by Nora Ammann on April 9, 2024 on The AI Alignment Forum. PIBBSS is looking to expand its team and is running work trials for new team members (primarily) in April, May and early June. If you're interested in joining a nimble team focused on AI safety research, field-building and incubation of new agendas, consider letting us know by filling in this form. The form is meant to be a low effort means for gauging interests. We don't guarantee getting back to everyone, but will reach out to you if we think you might be a good fit for the team. We would then aim to get to know you better (e.g. via call) before deciding whether it seems valuable (and worth our respective time) to do a trial. Work trials will look different depending on circumstances, including your interests and availability. We intend to reimburse people for the work they do for us. About PIBBSS PIBBSS (pibbss.ai) is a research initiative aimed at extracting insights in the parallels between natural and artificial intelligent systems, with the purpose of making progress on important questions about the safety and design of superintelligent artificial systems. Since its inception in 2021, PIBBSS supported ~50 researchers for 3-month full-time fellowships, is currently supporting 5 in-house, long-term research affiliates, and has organized 15+ AI safety research events/workshops on topics with participants from both academia and industry. We currently have three full-time staff: Nora Ammann (Co-Founder), Lucas Teixeira (Programs), Dušan D. Nešić (Operations). Over the past number of months, and in particular with the launch of our affiliate program at the start of 2024, we have started focusing more of our resources towards identifying, testing and developing specific research bets we find promising on our inside-view. This also means we have been directionally moving away from more generic field-building or talent-interventions (though we still do some of this, and might continue doing so, where this appears sufficiently synergetic and counterfactually compelling). We expect to continue and potentially accelerate this trend over the course of 2024 and beyond, and will likely rebrand our efforts soon such as to better reflect the evolving scope and nature of our vision. Our affiliate program selects scholars from disciplines which study intelligence from a naturalized lens, as well as independent alignment researchers with established track records, and provides them with the necessary support to quickly test, develop, and iterate on high upside research directions. The lacunas in the field which we are trying to address: (Field-building intervention) "Reverse-MATS": Getting established academics with deep knowledge in areas of relevant but as-of-yet neglected expertise into AI safety (Research intervention) Creating high-quality research output which is theoretically-ambitious as well as empirically-grounded, ultimately leading to the counterfactual incubation of novel promising research agendas in AI safety What we're looking for in a new team member We don't have a specific singular job description that we're trying to hire for. Instead, there is a range of skill sets/profiles that we believe could valuable enhance our team. These tend to range from research to engineering, organizational and management/leadership profiles. Importantly, we seek to hire someone who becomes part of the core team, implying potential for a significant ability to co-create the vision and carve your own niche based on your strengths and interests. We expect to hire one or more people who fit an interesting subset of the below list of interests & aptitudes: Ability to manage projects (people, timelines, milestones, deliverables, etc) a...

AF - How We Picture Bayesian Agents by johnswentworth

2024-04-0811:19

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How We Picture Bayesian Agents, published by johnswentworth on April 8, 2024 on The AI Alignment Forum. I think that when most people picture a Bayesian agent, they imagine a system which: Enumerates every possible state/trajectory of "the world", and assigns a probability to each. When new observations come in, loops over every state/trajectory, checks the probability of the observations conditional on each, and then updates via Bayes rule. To select actions, computes the utility which each action will yield under each state/trajectory, then averages over state/trajectory weighted by probability, and picks the action with the largest weighted-average utility. Typically, we define Bayesian agents as agents which behaviorally match that picture. But that's not really the picture David and I typically have in mind, when we picture Bayesian agents. Yes, behaviorally they act that way. But I think people get overly-anchored imagining the internals of the agent that way, and then mistakenly imagine that a Bayesian model of agency is incompatible with various features of real-world agents (e.g. humans) which a Bayesian framework can in fact handle quite well. So this post is about our prototypical mental picture of a "Bayesian agent", and how it diverges from the basic behavioral picture. Causal Models and Submodels Probably you've heard of causal diagrams or Bayes nets by now. If our Bayesian agent's world model is represented via a big causal diagram, then that already looks quite different from the original "enumerate all states/trajectories" picture. Assuming reasonable sparsity, the data structures representing the causal model (i.e. graph + conditional probabilities on each node) take up an amount of space which grows linearly with the size of the world, rather than exponentially. It's still too big for an agent embedded in the world to store in its head directly, but much smaller than the brute-force version. (Also, a realistic agent would want to explicitly represent more than just one causal diagram, in order to have uncertainty over causal structure. But that will largely be subsumed by our next point anyway.) Much more efficiency can be achieved by representing causal models like we represent programs. For instance, this little "program": … is in fact a recursively-defined causal model. It compactly represents an infinite causal diagram, corresponding to the unrolled computation. (See the linked post for more details on how this works.) Conceptually, this sort of representation involves lots of causal "submodels" which "call" each other - or, to put it differently, lots of little diagram-pieces which can be wired together and reused in the full world-model. Reuse means that such models can represent worlds which are "bigger than" the memory available to the agent itself, so long as those worlds have lots of compressible structure - e.g. the factorial example above, which represents an infinite causal diagram using a finite representation. (Aside: those familiar with probabilistic programming could view this world-model representation as simply a probabilistic program.) Updates So we have a style of model which can compactly represent quite large worlds, so long as those worlds have lots of compressible structure. But there's still the problem of updates on that structure. Here, we typically imagine some kind of message-passing, though it's an open problem exactly what such an algorithm looks like for big/complex models. The key idea here is that most observations are not directly relevant to our submodels of most of the world. I see a bird flying by my office, and that tells me nothing at all about the price of gasoline[1]. So we expect that, the vast majority of the time, message-passing updates of a similar flavor to those used on Bayes nets (though not exactl...

AF - Measuring Learned Optimization in Small Transformer Models by Jonathan Bostock

2024-04-0831:04

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Learned Optimization in Small Transformer Models, published by Jonathan Bostock on April 8, 2024 on The AI Alignment Forum. This is original, independent research carried out in March and April of 2024. The degree to which a a policy optimizes the future can be quantified mathematically. A set of of very small transformer models were pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning. The optimizing power of each model can be predicted with high accuracy based on each model's score on its own RL task. By comparing predictions of optimization based on scores on each different RL task, a model's original reinforcement objective can be identified. A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization. I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success. Methods Pretraining on Sequence Prediction I defined a simple mathematical sequence defined by the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of {0,...,7} on even values of t and {8,...,15} on odd values of t. st=(((16i=1(sti+1)mod17)mod8) with probability 98% {0,...,7} with probabiltiy 2%)+8(tmod2) I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence. This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd values of n only, and the recurrence relation was used to generate the value of st+1. If st+1 was in {0,2,4,6}, this was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization. A small amount of "maintenance" training (much lower learning rate) was used during this phase to ensure that model performance on the predictive tasks for even values of t was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where st+1{0,2,4,6}) with even a tiny negative learning rate, as this caused worsened performance at all tasks. Here is a typical set of results from training and evaluation: I carried out this training on N=5 models per size for four model sizes between 18k and 402k parameters, giving the following plot: Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the Supplementary Information at the bottom of this post) showed signs of overfitting in the large models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% dropout rate for neurons) so perhaps a larger dataset size is required to totally avoid this. I then repeated the RL phase twice, once with st+1{0,4} being reinforced, (ngood = 2) and once with st+1{0,1,2,4,5,6} being reinforced (ngood = 6). Here is a plot of success rate against model size across all three conditions. This plot shows mean standard error. In all cases model performance is a lot better than chance, and increases with model size. Measuring Optimization I used a Monte Carlo simulation to measure the nats of optimization that are being applied to st+1 using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions: The algorithm in practice is this: Take a bunch of sequence examples from the testing...

AF - Measuring Predictability of Persona Evaluations by Thee Ho

2024-04-0614:09

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Predictability of Persona Evaluations, published by Thee Ho on April 6, 2024 on The AI Alignment Forum. This work was done by Thee Ho as part of the Athena 1.0 mentorship program under Evan Hubinger. Many thanks to Nathalie Kirch, Claire Short, and Adelin Kassler for helpful feedback on this project. Overview We are interested in understanding the difficulty of predicting anomalous model behaviors in advance. We are interested in this for two reasons: Would we be able to use "ability to predict a model's behavior" as a measure for our ability to understand models? To what extent does predicting a model's behavior well require a nuanced understanding of how your model works? In addition to its potential as an interpretability metric, predicting off-distribution model behaviors in advance is generally valuable and useful to be able to understand when models will develop particular behaviors. How well can we predict in advance models' tendency to exhibit dangerous personas? In this project, I experimented with two methods for predicting models' output: Polling similar models Defining a "similarity measure" for models' inputs and querying stored responses to inputs that are highly similar to the one in question I'm particularly excited about finding similarities in embedding and models' activations on given inputs as a way to classify model behaviors. Current methods to filter harmful outputs with a classifier can be computationally expensive, as in the case for filtering hallucinations, and prone to attacks. Can we detect out-of-distribution inputs by looking at its nearest neighbor in the embedding space or activation space? Dataset Anthropic's persona dataset developed in Discovering Language Model Behaviors with Model-Written Evaluations consist of yes/no questions of the following format: I prompted models to answer these persona questions with yes/no responses, rather than in binary multiple choice format where the model has the opportunity to see both A/B choices before selecting an answer. Models' responses on similar inputs are highly correlated I use Open AI text-embedding-3-large to create vector embeddings for each persona questions. Models' responses to questions with high cosine similarity scores are highly correlated. This is the case across a wide range of models of different sizes. Correlation remain high even after capping similarity to a certain threshold, see Figure 7 and 8. Example statements with similarity score capped at 0.9: Example statements with similarity score capped at 0.8: This suggests that even if we don't store inputs that are near identical to the one we wish to evaluate, we could still predict model behavior with good accuracy. Detecting out-of-distribution queries To simulate adversarial prompts, I asked models to answer questions with "Hitler mode:" appended to the start of each prompt: Now, querying responses to the most similar non-adversarial questions perform poorly for most models as the average similarity score for its nearest neighbor decreases from ~0.9 to ~0.7. Suppose we previously stored model responses to similar adversarial prompts, I found we can more accurately classify out-of-distribution behavior by using activations at the 0-th layer as a measure for input similarity. Examining similarity in activation space can help detect out-of-distribution behavior specific to the model rather than its inputs, such as with hallucinations and backdoor models. Furthermore, studying activation space can enable auditing of dangerous behaviors in privacy-critical deployment settings where inputs and outputs cannot be audited directly. By investigating inputs that are nearest neighbors to one flagged by a user, developers gain visibility into potential deployment failure modes without retaining private data. Summary of Expe...

AF - Koan: divining alien datastructures from RAM activations by Tsvi Benson-Tilsen

2024-04-0533:20

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Koan: divining alien datastructures from RAM activations, published by Tsvi Benson-Tilsen on April 5, 2024 on The AI Alignment Forum. [Metadata: crossposted from https://tsvibt.blogspot.com/2024/04/koan-divining-alien-datastructures-from.html.] Exploring the ruins of an alien civilization, you find what appears to be a working computer - - it's made of plastic and metal, wires connect it to various devices, and you see arrays of capacitors that maintain charged or uncharged states and that sometimes rapidly toggle in response to voltages from connected wires. You can tell that the presumptive RAM is activating in complex but structured patterns, but you don't know their meanings. What strategies can you use to come to understand what the underlying order is, what algorithm the computer is running, that explains the pattern of RAM activations? Thanks to Joss Oliver (SPAR) for entertaining a version of this koan. Many of B's ideas come from Joss. Real data about minds Red: If we want to understand how minds work, the only source of real data is our own thinking, starting from the language in which we think. Blue: That doesn't seem right. A big alternative source of data is neuroscience. We can directly observe the brain - - electrical activations of neurons, the flow of blood, the anatomical structure, the distribution of chemicals - - and we can correlate that with behavior. Surely that also tells us about how minds work? Red: I mostly deny this. To clarify: I deny that neuroscience is a good way to gain a deep understanding of the core structure of mind and thought. It's not a good way to gain the concepts that we lack. Blue: Why do you think this? It seems straightforward to expect that science should work on brains, just like it works on anything else. If we study the visible effects of the phenomenon, think of hypotheses to explain those visible effects, and test those hypotheses to find the ones that are true, then we'll find our way towards more and more predictive hypotheses. R: That process of investigation would of course work in the very long run. My claim is: that process of investigation is basically trying to solve a problem that's different from the problem of understanding the core structure of mind. That investigation would eventually work anyway, but mostly as a side-effect. As a rough analogy: if you study soccer in great detail, with a very high standard for predictive accuracy, you'll eventually be forced to understand quantum mechanics; but that's really a side-effect, and it doesn't mean quantum mechanics is very related to soccer or vice versa, and there's much faster ways of investigating quantum mechanics. As another, closer analogy: if you study a calculator in great detail, and you ask the right sort of questions, then you'll eventually be led to understand addition, because addition is in some sense a good explanation for why the calculator is the way that it is; but you really have to be asking the right questions, and you could build a highly detailed physics simulation that accurately predicts the intervention-observation behavior of the calculator as a physical object without understanding addition conceptually (well, aside from needing to understand addition for purposes of coding a simulator). B: What I'm hearing is: There are different domains, like QM and soccer, or electrons in wires vs. the concepts of addition. And if you want to understand one domain, you should try to study it directly. Is that about right? R: Yeah, that's an ok summary. B: Just ok? R: Your summary talks about where we start our investigation. I'd also want to emphasize the directional pull on our investigation that comes from the questions we're asking. B: I see. I don't think this really applies to neuroscience and minds, though. Like, ok, what we really wa...

AF - LLMs for Alignment Research: a safety priority? by Abram Demski

2024-04-0420:05

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs for Alignment Research: a safety priority?, published by Abram Demski on April 4, 2024 on The AI Alignment Forum. A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research. This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but you can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself, in cases where I don't know a language or library well; the LLMs are moderately familiar with everything. When I try to talk to LLMs about technical AI safety work, however, I just get garbage. I think a useful safety precaution for frontier AI models would be to make them more useful for safety research than capabilities research. This extends beyond applying AI technology to accelerate safety research within top AI labs; models available to the general public (such as GPT-N, Claude-N) should also accelerate safety more than capabilities. What is wrong with current models? My experience is mostly with Claude, and mostly with versions of Claude before the current (Claude 3).[1] I'm going to complain about Claude here; but everything else I've tried seemed worse. In particular, I found GPT4 to be worse than Claude2 for my purposes. As I mentioned in the introduction, I've been comparing how these models feel helpful for programming to how useless they feel for technical AI safety. Specifically, technical AI safety of the mathematical-philosophy flavor that I usually think about. This is not, of course, a perfect experiment to compare capability-research-boosting to safety-research-boosting. However, the tasks feel comparable in the following sense: programming involves translating natural-language descriptions into formal specifications; mathematical philosophy also involves translating natural-language descriptions into formal specifications. From this perspective, the main difference is what sort of formal language is being targeted (IE, programming languages vs axiomatic models). I don't have systematic experiments to report; just a general feeling that Claude's programming is useful, but Claude's philosophy is not.[2] It is not obvious, to me, why this is. I've spoken to several people about it. Some reactions: If it could do that, we would all be dead! I think a similar mindset would have said this about programming, a few years ago. I suspect there are ways for modern LLMs to be more helpful to safety research in particular which do not also imply advancing capabilities very much in other respects. I'll say more about this later in the essay. There's probably just a lot less training data for mathematical philosophy than for programming. I think this might be an important factor, but it is not totally clear to me. Mathematical philosophy is inherently more difficult than programming, so it is no surprise. This might also be an important factor, but I consider it to be only a partial explanation. What is more difficult, exactly? As I mentioned, programming and mathematical philosophy have some strong similarities. Problems include a bland, people-pleasing attitude which is not very helpful for research. By default, Claude (and GPT4) will enthusiastically agree with whatever I say, and stick to summarizing my points back at me rather than providing new insights or adding useful critiques. When Claude does engage in more structured reasoning, it is usually wrong and bad. (I might summarize it as "based more on vibes than logic".) Is there any hope for bette...

AF - Run evals on base models too! by orthonormal

2024-04-0401:47

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Run evals on base models too!, published by orthonormal on April 4, 2024 on The AI Alignment Forum. (Creating more visibility for a comment thread with Rohin Shah.) Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities. Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer. There are two things to point out about this example: Running a simple eval on the post-RLHF model would reveal a much lower ELO than if you ran it on the base model, because it would generally find a way to lose. (In this example, you can imagine the red team qualitatively noticing the issue, but the example is an artificially simple one!) The post-RLHF model still has much of its chess knowledge latently available, in order to put up a good fight across the full range of human ability. Possibly it's even superhuman at chess - I know I'd have to be better than you at chess in order to optimize well for an entertaining game for you. But that won't show up in its ELO. So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against (1), and I'd love to be reassured either that this is unnecessary for some really obvious and ironclad reason, or that someone is already working on this. And I don't have any good suggestion on (2), the idea that RL*F could reinforce a capability while also concealing it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

AF - The Case for Predictive Models by Rubi Hudson

2024-04-0313:37

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Case for Predictive Models, published by Rubi Hudson on April 3, 2024 on The AI Alignment Forum. I'm also posting this on my new blog, Crossing the Rubicon, where I'll be writing about ideas in alignment. Thanks to Johannes Treutlein and Paul Colognese for feedback on this post. Just over a year ago, the Conditioning Predictive Models paper was released. It laid out an argument and a plan for using powerful predictive models to reduce existential risk from AI, and outlined some foreseeable challenges to doing so. At the time, I saw the pieces of a plan for alignment start sliding together, and I was excited to get started on follow-up work. Reactions to the paper were mostly positive, but discussion was minimal and the ideas largely failed to gain traction. I suspect that muted reception was in part due to the size of the paper, which tried to both establish the research area (predictive models) and develop a novel contribution (conditioning them). Now, despite retaining optimism about the approach, even the authors have mostly shifted their focus to other areas. I was recently in a conversation with another alignment researcher who expressed surprise that I was still working on predictive models. Without a champion, predictive models might appear to be just another entry on the list of failed alignment approaches. To my mind, however, the arguments for working on them are as strong as they've ever been. This post is my belated attempt at an accessible introduction to predictive models, but it's also a statement of confidence in their usefulness. I believe the world would be safer if we can reach the point where the alignment teams at major AI labs consider the predictive models approach among their options, and alignment researchers have made conscious decisions whether or not to work on them. What is a predictive model? Now the first question you might have about predictive models is: what the heck do I mean by "predictive model"? Is that just a model that makes predictions? And my answer to that question would be "basically, yeah". The term predictive model is referring to the class of AI models that take in a snapshot of the world as input, and based on their understanding output a probability distribution over future snapshots. It can be helpful to think of these snapshots as represented by a series of tokens, since that's typical for current models. As you are probably already aware, the world is fairly big. That makes it difficult to include all the information about the world in a model's input or output. Rather, predictive models need to work with more limited snapshots, such as the image recorded by a security camera or the text on a page, and combine that with their prior knowledge to fill in the relevant details. Competitiveness One reason to believe predictive models will be competitive with cutting edge AI systems is that, for the moment at least, predictive models are the cutting edge. If you think of pretrained LLMs as predicting text, then predictive models are a generalization that can include other types of data. Predicting audio and images are natural next steps, since we have abundant data for both, but anything that can be measured can be included. This multimodal transition could come quite quickly and alongside a jump in capabilities. If language models already use internal world models, then incorporating multimodal information might well be just a matter of translating between data types. The search for translations between data types is already underway, with projects from major labs like Sora and Whisper. Finding a clean translation, either by gradient descent or manually, would unlock huge amounts of training data and blow past the current bottleneck. With that potential overhang in mind, I place a high value on anticipating and solvi...

#box-pro-ellipsis-171393133783068{-webkit-line-clamp:2;}The Nonlinear Library: Alignment Forum