Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Obliqueness Thesis, published by Jessica Taylor on September 19, 2024 on The AI Alignment Forum. In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It's a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land's views. (Land gets credit for inspiring many of these thoughts, of course, but I'm presenting my views as my own here.) First, let's define the Orthogonality Thesis. Quoting Superintelligence for Bostrom's formulation: Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal. To me, the main ambiguity about what this is saying is the "could in principle" part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others. Let's consider Yudkowsky's formulations as alternatives. Quoting Arbital: The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal. The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal. As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. "Complication" may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course. I think, overall, it is more productive to examine Yudkowsky's formulation than Bostrom's, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky's formulations, I am less likely to be criticizing a strawman. I will use "Weak Orthogonality" to refer to Yudkowsky's "Orthogonality Thesis" and "Strong Orthogonality" to refer to Yudkowsky's "strong form of the Orthogonality Thesis". Land, alternatively, describes a "diagonal" between intelligence and goals as an alternative to orthogonality, but I don't see a specific formulation of a "Diagonality Thesis" on his part. Here's a possible formulation: Diagonality Thesis: Final goals tend to converge to a point as intelligence increases. The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it. What about my own view? I like Tsvi's naming of it as an "obliqueness thesis". Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly. (Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.) While I will address Yudkowsky's arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, it seems ...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Secret Collusion: Will We Know When to Unplug AI?, published by schroederdewitt on September 16, 2024 on The AI Alignment Forum. TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Whereas current AI models aren't yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, posing unprecedented safety and security risks. These results highlight urgent challenges for AI governance and policy, urging institutions such as the EU AI Office and AI safety bodies in the UK and US to prioritize cryptographic and steganographic evaluations of frontier models. Our research also opens up critical new pathways for research within the AI Control framework. Philanthropist and former Google CEO Eric Schmidt said in 2023 at a Harvard event: "[...] the computers are going to start talking to each other probably in a language that we can't understand and collectively their super intelligence - that's the term we use in the industry - is going to rise very rapidly and my retort to that is: do you know what we're going to do in that scenario? We're going to unplug them [...] But what if we cannot unplug them in time because we won't be able to detect the moment when this happens? In this blog post, we, for the first time, provide a comprehensive overview of the phenomenon of secret collusion among AI agents, connect it to foundational concepts in steganography, information theory, distributed systems theory, and computability, and present a model evaluation framework and empirical results as a foundation of future frontier model evaluations. This blog post summarises a large body of work. First of all, it contains our pre-print from February 2024 (updated in September 2024) "Secret Collusion among Generative AI Agents". An early form of this pre-print was presented at the 2023 New Orleans (NOLA) Alignment Workshop (see this recording NOLA 2023 Alignment Forum Talk Secret Collusion Among Generative AI Agents: a Model Evaluation Framework). Also, check out this long-form Foresight Institute Talk). In addition to these prior works, we also include new results. These contain empirical studies on the impact of paraphrasing as a mitigation tool against steganographic communications, as well as reflections on our findings' impact on AI Control. Multi-Agent Safety and Security in the Age of Autonomous Internet Agents The near future could see myriads of LLM-driven AI agents roam the internet, whether on social media platforms, eCommerce marketplaces, or blockchains. Given advances in predictive capabilities, these agents are likely to engage in increasingly complex intentional and unintentional interactions, ranging from traditional distributed systems pathologies (think dreaded deadlocks!) to more complex coordinated feedback loops. Such a scenario induces a variety of multi-agent safety, and specifically, multi-agent security[1] (see our NeurIPS'23 workshop Multi-Agent Security: Security as Key to AI Safety) concerns related to data exfiltration, multi-agent deception, and, fundamentally, undermining trust in AI systems. There are several real-world scenarios where agents could have access to sensitive information, such as their principals' preferences, which they may disclose unsafely even if they are safety-aligned when considered in isolation. Stray incentives, intentional or otherwise, or more broadly, optimization pressures, could cause agents to interact in undesirable and potentially dangerous ways. For example, join...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Estimating Tail Risk in Neural Networks, published by Jacob Hilton on September 13, 2024 on The AI Alignment Forum. Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm. Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident will never behave catastrophically. We are excited about techniques to estimate the probability of tail events that do not rely on finding inputs on which an AI behaves badly, and can thus detect a broader range of catastrophic behavior. We think developing such techniques is an exciting problem to work on to reduce the risk posed by advanced AI systems: Estimating tail risk is a conceptually straightforward problem with relatively objective success criteria; we are predicting something mathematically well-defined, unlike instances of eliciting latent knowledge (ELK) where we are predicting an informal concept like "diamond". Improved methods for estimating tail risk could reduce risk from a variety of sources, including central misalignment risks like deceptive alignment. Improvements to current methods can be found both by doing empirical research, or by thinking about the problem from a theoretical angle. This document will discuss the problem of estimating the probability of tail events and explore estimation strategies that do not rely on finding inputs on which an AI behaves badly. In particular, we will: Introduce a toy scenario about an AI engineering assistant for which we want to estimate the probability of a catastrophic tail event. Explain some deficiencies of adversarial training, the most common method for reducing risk in contemporary AI systems. Discuss deceptive alignment as a particularly dangerous case in which adversarial training might fail. Present methods for estimating the probability of tail events in neural network behavior that do not rely on evaluating behavior on concrete inputs. Conclude with a discussion of why we are excited about work aimed at improving estimates of the probability of tail events. This document describes joint research done with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu. Thanks additionally to Ajeya Cotra, Lukas Finnveden, and Erik Jenner for helpful comments and suggestions. A Toy Scenario Consider a powerful AI engineering assistant. Write M for this AI system, and M(x) for the action it suggests given some project description x. We want to use this system to help with various engineering projects, but would like it to never suggest an action that results in large-scale harm, e.g. creating a doomsday device. In general, we define a behavior as catastrophic if it must never occur in the real world.[1] An input is catastrophic if it would lead to catastrophic behavior. Assume we can construct a catastrophe detector C that tells us if an action M(x) will result in large-scale harm. For the purposes of this example, we will assume both that C has a reasonable chance of catching all catastrophes and that it is feasible to find a useful engineering assistant M that never triggers C (see Catastrophe Detectors for further discussion). We will also assume we can use C to train M, but that it is ...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can startups be impactful in AI safety?, published by Esben Kran on September 13, 2024 on The AI Alignment Forum. With Lakera's strides in securing LLM APIs, Goodfire AI's path to scaling interpretability, and 20+ model evaluations startups among much else, there's a rising number of technical startups attempting to secure the model ecosystem. Of course, they have varying levels of impact on superintelligence containment and security and even with these companies, there's a lot of potential for aligned, ambitious and high-impact startups within the ecosystem. This point isn't new and has been made in our previous posts and by Eric Ho (Goodfire AI CEO). To set the stage, our belief is that these are the types of companies that will have a positive impact: Startups with a profit incentive completely aligned with improving AI safety; that have a deep technical background to shape AGI deployment and; do not try to compete with AGI labs. Piloting AI safety startups To understand impactful technical AI safety startups better, Apart Research joined forces with collaborators from Juniper Ventures, vectorview (alumni from the latest YC cohort), Rudolf (from the upcoming def/acc cohort), Tangentic AI, and others. We then invited researchers, engineers, and students to resolve a key question "can we come up with ideas that scale AI safety into impactful for-profits?" The hackathon took place during a weekend two weeks ago with a keynote by Esben Kran (co-director of Apart) along with 'HackTalks' by Rudolf Laine (def/acc) and Lukas Petersson (YC / vectorview). Individual submissions were a 4 page report with the problem statement, why this solution will work, what the key risks of said solution are, and any experiments or demonstrations of the solution the team made. This post details the top 6 projects and excludes 2 projects that were made private by request (hopefully turning into impactful startups now!). In total, we had 101 signups and 11 final entries. Winners were decided by an LME model conditioned on reviewer bias. Watch the authors' lightning talks here. Dark Forest: Making the web more trustworthy with third-party content verification By Mustafa Yasir (AI for Cyber Defense Research Centre, Alan Turing Institute) Abstract: 'DarkForest is a pioneering Human Content Verification System (HCVS) designed to safeguard the authenticity of online spaces in the face of increasing AI-generated content. By leveraging graph-based reinforcement learning and blockchain technology, DarkForest proposes a novel approach to safeguarding the authentic and humane web. We aim to become the vanguard in the arms race between AI-generated content and human-centric online spaces.' Content verification workflow supported by graph-based RL agents deciding verifications Reviewer comments: Natalia: Well explained problem with clear need addressed. I love that you included the content creation process - although you don't explicitly address how you would attract content creators to use your platform over others in their process. Perhaps exploring what features of platforms drive creators to each might help you make a compelling case for using yours beyond the verification capabilities. I would have also liked to see more details on how the verification decision is made and how accurate this is on existing datasets. Nick: There's a lot of valuable stuff in here regarding content moderation and identity verification. I'd narrow it to one problem-solution pair (e.g., "jobs to be done") and focus more on risks around early product validation (deep interviews with a range of potential users and buyers regarding value) and go-to-market. It might also be worth checking out Musubi. Read the full project here. Simulation Operators: An annotation operation for alignment of robot By Ardy Haroen (USC) Abstrac...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How difficult is AI Alignment?, published by Samuel Dylan Martin on September 13, 2024 on The AI Alignment Forum. This work was funded by Polaris Ventures There is currently no consensus on how difficult the AI alignment problem is. We have yet to encounter any real-world, in the wild instances of the most concerning threat models, like deceptive misalignment. However, there are compelling theoretical arguments which suggest these failures will arise eventually. Will current alignment methods accidentally train deceptive, power-seeking AIs that appear aligned, or not? We must make decisions about which techniques to avoid and which are safe despite not having a clear answer to this question. To this end, a year ago, we introduced the AI alignment difficulty scale, a framework for understanding the increasing challenges of aligning artificial intelligence systems with human values. This follow-up article revisits our original scale, exploring how our understanding of alignment difficulty has evolved and what new insights we've gained. This article will explore three main themes that have emerged as central to our understanding: 1. The Escalation of Alignment Challenges: We'll examine how alignment difficulties increase as we go up the scale, from simple reward hacking to complex scenarios involving deception and gradient hacking. Through concrete examples, we'll illustrate these shifting challenges and why they demand increasingly advanced solutions. These examples will illustrate what observations we should expect to see "in the wild" at different levels, which might change our minds about how easy or difficult alignment is. 2. Dynamics Across the Difficulty Spectrum: We'll explore the factors that change as we progress up the scale, including the increasing difficulty of verifying alignment, the growing disconnect between alignment and capabilities research, and the critical question of which research efforts are net positive or negative in light of these challenges. 3. Defining and Measuring Alignment Difficulty: We'll tackle the complex task of precisely defining "alignment difficulty," breaking down the technical, practical, and other factors that contribute to the alignment problem. This analysis will help us better understand the nature of the problem we're trying to solve and what factors contribute to it. The Scale The high level of the alignment problem, provided in the previous post, was: "The alignment problem" is the problem of aligning sufficiently powerful AI systems, such that we can be confident they will be able to reduce the risks posed by misused or unaligned AI systems We previously introduced the AI alignment difficulty scale, with 10 levels that map out the increasing challenges. The scale ranges from "alignment by default" to theoretical impossibility, with each level representing more complex scenarios requiring more advanced solutions. It is reproduced here: Alignment Difficulty Scale Difficulty Level Alignment technique X is sufficient Description Key Sources of risk 1 (Strong) Alignment by Default As we scale up AI models without instructing or training them for specific risky behaviour or imposing problematic and clearly bad goals (like 'unconditionally make money'), they do not pose significant risks. Even superhuman systems basically do the commonsense version of what external rewards (if RL) or language instructions (if LLM) imply. Misuse and/or recklessness with training objectives. RL of powerful models towards badly specified or antisocial objectives is still possible, including accidentally through poor oversight, recklessness or structural factors. 2 Reinforcement Learning from Human Feedback We need to ensure that the AI behaves well even in edge cases by guiding it more carefully using human feedback in a wide range of situations...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contra papers claiming superhuman AI forecasting, published by nikos on September 12, 2024 on The AI Alignment Forum. [Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.] Widespread, misleading claims about AI forecasting Recently we have seen a number of papers - (Schoenegger et al., 2024, Halawi et al., 2024, Phan et al., 2024, Hsieh et al., 2024) - with claims that boil down to "we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance". These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways. Some examples of public discourse: Ethan Mollick (>200k followers) tweeted the following about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.: A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like "This is something that humans are notably terrible at, even if they're paid to do it. No surprise that LLMs can match us." "+1 The aggregate human success rate is a pretty low bar" A Twitter thread with >500k views on LLMs Are Superhuman Forecasters by Phan et al. claiming that "AI […] can predict the future at a superhuman level" had more than half a million views within two days of being published. The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, makes AI forecasting a uniquely misunderstood area of AI progress. And it's one that matters. What does human-level or superhuman forecasting mean? "Human-level" or "superhuman" is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans. One reasonable and practical definition of a superhuman forecasting AI forecaster is The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform.[1] (For a human-level forecaster, just replace "outperform" with "performs on par with".) Red flags for claims to (super)human AI forecasting accuracy Our experience suggests there are a number of things that can go wrong when building AI forecasting systems, including: 1. Failing to find up-to-date information on the questions. It's inconceivable on most questions that forecasts can be good without basic information. Imagine trying to forecast the US presidential election without knowing that Biden dropped out. 2. Drawing on up-to-date, but low-quality information. Ample experience shows low quality information confuses LLMs even more than it confuses humans. Imagine forecasting election outcomes with biased polling data. Or, worse, imagine forecasting OpenAI revenue based on claims like > The number of ChatGPT Plus subscribers is estimated between 230,000-250,000 as of October 2023. without realising that this mixing up ChatGPT vs ChatGPT mobile. 3. Lack of high-quality quantitative reasoning. For a decent number of questions on Metaculus, good forecasts can be "vibed" by skilled humans and perhaps LLMs. But for many questions, simple calculations are likely essential. Human performance shows systematic accuracy nearly always requires simple models such as base rates, time-series extrapolations, and domain-specific numbers. Imagine forecasting stock prices without having, and using, historical volatility. 4. Retrospective, rather than prospective, forecasting (e.g. forecasting questions that have al...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI forecasting bots incoming, published by Dan H on September 9, 2024 on The AI Alignment Forum. In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: "15 or 20 [years]." In light of this, we are excited to announce "FiveThirtyNine," an AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including " Will Trump win the 2024 presidential election?" and " Will China invade Taiwan by 2030?" Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most part superhuman, FiveThirtyNine is in a similar sense. (We discuss limitations later in this post.) Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve global epistemics and institutional decision-making by providing trustworthy, calibrated forecasts. We hope that forecasting bots like ours will be quickly integrated into frontier AI models. For now, we will keep our bot available at forecast.safe.ai, where users are free to experiment and test its capabilities. Quick Links Demo: forecast.safe.ai Technical Report: link Problem Policymakers at the highest echelons of government and corporate power have difficulty making high-quality decisions on complicated topics. As the world grows increasingly complex, even coming to a consensus agreement on basic facts is becoming more challenging, as it can be hard to absorb all the relevant information or know which sources to trust. Separately, online discourse could be greatly improved. Discussions on uncertain, contentious issues all too often devolve into battles between interest groups, each intent on name-calling and spouting the most extreme versions of their views through highly biased op-eds and tweets. FiveThirtyNine Before transitioning to how forecasting bots like FiveThirtyNine can help improve epistemics, it might be helpful to give a summary of what FiveThirtyNine is and how it works. FiveThirtyNine can be given a query - for example, "Will Trump win the 2024 US presidential election?" FiveThirtyNine is prompted to behave like an "AI that is superhuman at forecasting". It is then asked to make a series of search engine queries for news and opinion articles that might contribute to its prediction. (The following example from FiveThirtyNine uses GPT-4o as the base LLM.) Based on these sources and its wealth of prior knowledge, FiveThirtyNine compiles a summary of key facts. Given these facts, it's asked to give reasons for and against Trump winning the election, before weighing each reason based on its strength and salience. Finally, FiveThirtyNine aggregates its considerations while adjusting for negativity and sensationalism bias in news sources and outputs a tentative probability. It is asked to sanity check this probability and adjust it up or down based on further reasoning, before putting out a final, calibrated probability - in this case, 52%. Evaluation. To test how well our bot performs, we evaluated it on questions from the Metaculus forecasting platform. We restricted the bot to make predictions only using the information human forecasters had, ensuring a valid comparison. Specifically, GPT-4o is only trained on data up to October 2023, and we restricted the news and opinion articles it could access to only those published before a certain date. From there, we asked it to compute the probabilities of 177 events from Metaculus that had happened (or not ha...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Backdoors as an analogy for deceptive alignment, published by Jacob Hilton on September 6, 2024 on The AI Alignment Forum. ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy. In this post, we will: Lay out the analogy between backdoors and deceptive alignment Discuss prior theoretical results from the perspective of this analogy Explain our formal notion of backdoors and its strengths and weaknesses Summarize the results in our paper and discuss their implications for deceptive alignment Thanks to Boaz Barak, Roger Grosse, Thomas Read, John Schulman and Gabriel Wu for helpful comments. Backdoors and deceptive alignment A backdoor in an ML model is a modification to the model that causes it to behave differently on certain inputs that activate a secret "trigger", while behaving similarly on ordinary inputs. There is a wide existing literature on backdoor attacks and defenses, which is primarily empirical, but also includes some theoretical results that we will mention. Deceptive alignment is a term from the paper Risks from Learned Optimization in Advanced Machine Learning Systems (Section 4) that refers to the possibility that an AI system will internally reason about the objective that it is being trained on, and decide to perform well according to that objective unless there are clues that it has been taken out of its training environment. Such a policy could be optimal on the training distribution, and yet perform very badly on certain out-of-distribution inputs where such clues are present, which we call defection triggers.[1] The opposite of deceptive alignment is robust alignment, meaning that this performance degradation is avoided. Since a deceptively aligned model and a robustly aligned model behave very differently on defection triggers, but very similarly on typical inputs from the training distribution, deceptive alignment can be thought of as a special kind of backdoor, under the following correspondence: Deceptive alignment Backdoors Robustly aligned model Original (unmodified) model Deceptively aligned model Backdoored model Defection trigger Backdoor trigger The main distinguishing feature of deceptive alignment compared to other kinds of backdoors is that the deceptively aligned model is not produced by an adversary, but is instead produced through ordinary training. Thus by treating deceptive alignment as a backdoor, we are modeling the training process as an adversary. In our analysis of deceptive alignment, the basic tension we will face is that an unconstrained adversary will always win, but any particular proxy constraint we impose on the adversary may be unrealistic. Static backdoor detection An important piece of prior work is the paper Planting Undetectable Backdoors in Machine Learning Models, which uses a digital signature scheme to insert an undetectable backdoor into a model. Roughly speaking, the authors exhibit a modified version of a "Random Fourier Features" training algorithm that produces a backdoored model. Any input to the backdoored model can be perturbed by an attacker with knowledge of a secret key to produce a new input on which the model behaves differently. However, the backdoor is undetectable in the sense that it is computationally infeasible for a defender with white-box access to distinguish a backdoored model from an or...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conflating value alignment and intent alignment is causing confusion, published by Seth Herd on September 5, 2024 on The AI Alignment Forum. Submitted to the Alignment Forum. Contains more technical jargon than usual. Epistemic status: I think something like this confusion is happening often. I'm not saying these are the only differences in what people mean by "AGI alignment". Summary: Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds. One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false. While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1] Overview Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes. Caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger's definition of intent alignment: AGI aligned with human intent in the broadest sense. Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion. People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans' for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I'm going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target. Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements. Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms' recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. We do not need us to point to or define values, just short term preferences or instructions. The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don't need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PI alignment has a larger basin of attraction than value alignment does.[3] Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment. It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it's adequate for real success. In particular, there are intuitions and arguments that PIA doesn't address the real problem of AGI alignment. I think PIA does address the real problem, but in a non-obvious and counterintuitive way. Another unstated divide There's another important clustering around these two conceptions of al...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?, published by David Scott Krueger on September 4, 2024 on The AI Alignment Forum. AI systems up to some high level of intelligence plausibly need to know exactly where they are in space-time in order for deception/"scheming" to make sense as a strategy. This is because they need to know: 1) what sort of oversight they are subject to and 2) what effects their actions will have on the real world (side note: Acausal trade might break this argument) There are a number of informal proposals to keep AI systems selectively ignorant of (1) and (2) in order to prevent deception. Those proposals seem very promising to flesh out; I'm not aware of any rigorous work doing so, however. Are you? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Checklist: What Succeeding at AI Safety Will Involve, published by Sam Bowman on September 3, 2024 on The AI Alignment Forum. Crossposted by habryka with Sam's permission. Expect lower probability for Sam to respond to comments here than if he had posted it. Preface This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it's disproportionately focused on technical research and on averting emerging catastrophic risks. For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we'll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: Medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table discussions within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go. To be abundantly clear: Nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work. Here are some of the assumptions that the piece relies on. I don't think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans: Broadly human-level AI is possible. I'll often refer to this as transformative AI (or TAI), roughly defined as AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1] Broadly human-level AI (or TAI) isn't an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions. If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that's not wildly different from today. If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI. Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means. Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges. Alignment - in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy - requires some non-trivial effort to get right, and it gets harder as systems get more powerful. Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I've been influenced by far more writings and people than I can credit here or even keep track of. Introducing the Checklist This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models: Chapter 1: Preparation You are here. In this period, our best models aren't yet TAI. In the language of Anthropic's RSP, they're at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the wor...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Survey: How Do Elite Chinese Students Feel About the Risks of AI?, published by Nick Corvino on September 2, 2024 on The AI Alignment Forum. Intro In April 2024, my colleague and I (both affiliated with Peking University) conducted a survey involving 510 students from Tsinghua University and 518 students from Peking University - China's two top academic institutions. Our focus was on their perspectives regarding the frontier risks of artificial intelligence. In the People's Republic of China (PRC), publicly accessible survey data on AI is relatively rare, so we hope this report provides some valuable insights into how people in the PRC are thinking about AI (especially the risks). Throughout this post, I'll do my best to weave in other data reflecting the broader Chinese sentiment toward AI. For similar research, check out The Center for Long-Term Artificial Intelligence, YouGov, Monmouth University, The Artificial Intelligence Policy Institute, and notably, a poll conducted by Rethink Priorities, which closely informed our survey design. You can read the full report published in the Jamestown Foundation's China Brief here: Survey: How Do Elite Chinese Students Feel About the Risks of AI? Key Takeaways Students are more optimistic about the benefits of AI than concerned about the harms. 80 percent of respondents agreed or strongly agreed with the statement that AI will do more good than harm for society, with only 7.5 percent actively believing the harms could outweigh the benefits. This, similar to other polling, indicates that the PRC is one of the most optimistic countries concerning the development of AI. Students strongly believe the Chinese government should regulate AI. 85.31 percent of respondents believe AI should be regulated by the government, with only 6 percent actively believing it should not. This contrasts with trends seen in other countries, where there is typically a positive correlation between optimism about AI and calls for minimizing regulation. The strong support for regulation in the PRC, even as optimism about AI remains high, suggests a distinct perspective on the role of government oversight in the PRC context. Students ranked AI the lowest among all possible existential threats to humanity. When asked about the most likely causes of human extinction, misaligned artificial intelligence received the lowest score. Nuclear war, natural disaster, climate change, and pandemics all proved more concerning for students. Students lean towards cooperation between the United States and the PRC as necessary for the safe and responsible development of AI. 60.7 percent of respondents believe AI will not be developed safely without cooperation between China and the U.S., with 25.68 percent believing it will develop safely no matter the level of cooperation. Students are most concerned about the use of AI for surveillance. This was followed by misinformation, existential risk, wealth inequality, increased political tension, various issues related to bias, with the suffering of artificial entities receiving the lowest score. Background As the recent decision (决定) document from the Third Plenum meetings in July made clear, AI is one of eight technologies that the Chinese Communist Party (CCP) leadership sees as critical for achieving "Chinese-style modernization (中国式现代化)," and is central to the strategy of centering the country's economic future around breakthroughs in frontier science ( People's Daily, July 22). The PRC also seeks to shape international norms on AI, including on AI risks. In October 2023, Xi Jinping announced a "Global AI Governance Initiative (全球人工智能治理倡议)" ( CAC, October 18, 2023). Tsinghua and Peking Universty are the two most prestigious universities in the PRC (by far), many of whose graduates will be very influential in shaping the cou...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024), published by Matt MacDermott on September 1, 2024 on The AI Alignment Forum. Yoshua Bengio wrote a blogpost about a new AI safety paper by him, various collaborators, and me. I've pasted the text below, but first here are a few comments from me aimed at an AF/LW audience. The paper is basically maths plus some toy experiments. It assumes access to a Bayesian oracle that can infer a posterior over hypotheses given data, and can also estimate probabilities for some negative outcome ("harm"). It proposes some conservative decision rules one could use to reject actions proposed by an agent, and proves probabilistic bounds on their performance under appropriate assumptions. I expect the median reaction in these parts to be something like: ok, I'm sure there are various conservative decision rules you could apply using a Bayesian oracle, but isn't obtaining a Bayesian oracle the hard part here? Doesn't that involve advances in Bayesian machine learning, and also probably solving ELK to get the harm estimates? My answer to that is: yes, I think so. I think Yoshua does too, and that that's the centre of his research agenda. Probably the main interest of this paper to people here is to provide an update on Yoshua's research plans. In particular it gives some more context on what the "guaranteed safe AI" part of his approach might look like -- design your system to do explicit Bayesian inference, and make an argument that the system is safe based on probabilistic guarantees about the behaviour of a Bayesian inference machine. This is in contrast to more hardcore approaches that want to do formal verification by model-checking. You should probably think of the ambition here as more like "a safety case involving proofs" than "a formal proof of safety". Bounding the probability of harm from an AI to create a guardrail Published 29 August 2024 by yoshuabengio As we move towards more powerful AI, it becomes urgent to better understand the risks, ideally in a mathematically rigorous and quantifiable way, and use that knowledge to mitigate them. Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees, i.e., would be provably unlikely to take a harmful action? Current AI safety evaluations and benchmarks test the AI for cases where it may behave badly, e.g., by providing answers that could yield dangerous misuse. That is useful and should be legally required with flexible regulation, but is not sufficient. These tests only tell us one side of the story: If they detect bad behavior, a flag is raised and we know that something must be done to mitigate the risks. However, if they do not raise such a red flag, we may still have a dangerous AI in our hands, especially since the testing conditions might be different from the deployment setting, and attackers (or an out-of-control AI) may be creative in ways that the tests did not consider. Most concerningly, AI systems could simply recognize they are being tested and have a temporary incentive to behave appropriately while being tested. Part of the problem is that such tests are spot checks. They are trying to evaluate the risk associated with the AI in general by testing it on special cases. Another option would be to evaluate the risk on a case-by-case basis and reject queries or answers that are considered to potentially violate or safety specification. With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we thus consider in this new paper (see reference and co-authors below) the objective of estimating a context-dependent upper bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at ru...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Epistemic states as a potential benign prior, published by Tamsin Leake on August 31, 2024 on The AI Alignment Forum. Malignancy in the prior seems like a strong crux of the goal-design part of alignment to me. Whether your prior is going to be used to model: processes in the multiverse containing a specific "beacon" bitstring, processes in the multiverse containing the AI, processes which would output all of my blog, so I can make it output more for me, processes which match an AI chatbot's hypotheses about what it's talking with, then you have to sample hypotheses from somewhere; and typically, we want to use either solomonoff induction or time-penalized versions of it such as levin search (penalized by log of runtime) or what QACI uses (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (large sequences of multiplying by a matrix, adding a vector, and ReLU, often with a penalty related to how many non-zero weights are used). And the solomonoff prior is famously malign. (Alternatively, you could have knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your knightian uncertainty (like in infra-bayesianism), but then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding knightian uncertainty!) (It does seem plausible that doing geometric expectation over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.) It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever). One helpful aspect of prior-distribution-design is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem that we're using a prior for is to model processes which match an AI chatbot's hypotheses about what it's talking with then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which: is accurate enough to match observations. is accurate enough to capture the fact that the user (if we pick a good user) implements the kind of decision theory that lets us rely on them pointing back to the actual real physical user when they get empowered - i.e. in CEV(user-hypothesis), user-hypothesis builds and then runs CEV(physical-user), because that's what the user would do in such a situation. Let's call this second criterion "cooperating back to the real user". So we need a prior which: Has at least some mass on hypotheses which correspond to observations cooperate back to the real user and can eventually be found by the AI, given enough evidence (enough chatting with the user) Call this the "aligned hypothesis". Before it narrows down hypothesis space to mostly just aligned hypotheses, doesn't give enough weight to demonic hypothesis which output whichever predictions cause the AI to brainhack its physical user, or escape using rowhammer-type hardware vulnerabilities, or other failures like that. Formalizing the chatbot model First, I'll formalize this chatbot model. Let's say we have a magical inner-aligned "soft" math-oracle: Which, given a "scoring" mathematical function from a non-empty set a to real numbers (not necessarily one that is tractably ...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AIS terminology proposal: standardize terms for probability ranges, published by Egg Syntax on August 30, 2024 on The AI Alignment Forum. Summary: The AI safety research community should adopt standardized terms for probability ranges, especially in public-facing communication and especially when discussing risk estimates. The terms used by the IPCC are a reasonable default. Science communication is notoriously hard. It's hard for a lot of reasons, but one is that laypeople aren't used to thinking in numerical probabilities or probability ranges. One field that's had to deal with this more than most is climatology; climate change has been rather controversial, and a non-trivial aspect of that has been lay confusion about what climatologists are actually saying[1]. As a result, the well-known climate assessment reports from the UN's Intergovernmental Panel on Climate Change (IPCC) have, since the 1990s, used explicitly defined terms for probability ranges[2]: (see below for full figure[3]) Like climatology, AI safety research has become a topic of controversy. In both cases, the controversy includes a mix of genuine scientific disagreement, good-faith confusion, and bad-faith opposition. Scientific disagreement comes from people who can deal with numerical probability ranges. Those who are arguing in bad faith from ulterior motives generally don't care about factual details. But I suspect that the large majority of those who disagree, especially laypeople, are coming from a place of genuine, good-faith confusion. For those people, anything we as practitioners can do to communicate more clearly is quite valuable. Also like climatology, AI safety research, especially assessments of risk, fundamentally involves communicating about probabilities and probability ranges. Therefore I propose that the AIS community follow climatologists in adopting standard terms for probability ranges, especially in position papers and public-facing communication. In less formal and less public-facing contexts, using standard terminology still adds some value but is less important; in sufficiently informal contexts it's probably not worth the hassle of looking up the standard terminology. Of course, in many cases it's better to just give the actual numerical range! But especially in public-facing communication it can be more natural to use natural language terms, and in fact this is already often done. I'm only proposing that when we do use natural language terms for probability ranges, we use them in a consistent and interpretable way (feel free to link to this post as a reference for interpretation, or point to the climatology papers cited below[2]). Should the AIS community use the same terms? That's a slightly harder question. The obvious first-pass answer is 'yes'; it's a natural Schelling point, and terminological consistency across fields is generally preferable when practically possible. The IPCC terms also have the significant advantage of being battle-tested; they've been used over a thirty-year period in a highly controversial field, and terms have been refined when they were found to be insufficiently clear. The strongest argument I see against using the same terms is that the AIS community sometimes needs to deal with more extreme (high or low) risk estimates than these. If we use 'virtually certain' to mean 99 - 100%, what terms can we use for 99.9 - 100.0%, or 99.99 - 100.00%? On the other hand, plausibly once we're dealing with such extreme risk estimates, it's increasingly important to communicate them with actual numeric ranges. My initial proposal is to adopt the IPCC terms, but I'm very open to feedback, and if someone has an argument I find compelling (or which gets strong agreement in votes) for a different or extended set of terms, I'll add it to the proposal. If no su...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving adversarial attacks in computer vision as a baby version of general AI alignment, published by stanislavfort on August 29, 2024 on The AI Alignment Forum. I spent the last few months trying to tackle the problem of adversarial attacks in computer vision from the ground up. The results of this effort are written up in our new paper Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness (explainer on X/Twitter). Taking inspiration from biology, we reached state-of-the-art or above state-of-the-art robustness at 100x - 1000x less compute, got human-understandable interpretability for free, turned classifiers into generators, and designed transferable adversarial attacks on closed-source (v)LLMs such as GPT-4 or Claude 3. I strongly believe that there is a compelling case for devoting serious attention to solving the problem of adversarial robustness in computer vision, and I try to draw an analogy to the alignment of general AI systems here. 1. Introduction In this post, I argue that the problem of adversarial attacks in computer vision is in many ways analogous to the larger task of general AI alignment. In both cases, we are trying to faithfully convey an implicit function locked within the human brain to a machine, and we do so extremely successfully on average. Under static evaluations, the human and machine functions match up exceptionally well. However, as is typical in high-dimensional spaces, some phenomena can be relatively rare and basically impossible to find by chance, yet ubiquitous in their absolute count. This is the case for adversarial attacks - imperceptible modifications to images that completely fool computer vision systems and yet have virtually no effect on humans. Their existence highlights a crucial and catastrophic mismatch between the implicit human vision function and the function learned by machines - a mismatch that can be exploited in a dynamic evaluation by an active, malicious agent. Such failure modes will likely be present in more general AI systems, and our inability to remedy them even in the more restricted vision context (yet) does not bode well for the broader alignment project. This is a call to action to solve the problem of adversarial vision attacks - a stepping stone on the path to aligning general AI systems. 2. Communicating implicit human functions to machines The basic goal of computer vision can be viewed as trying to endow a machine with the same vision capabilities a human has. A human carries, locked inside their skull, an implicit vision function mapping visual inputs into semantically meaningful symbols, e.g. a picture of a tortoise into a semantic label tortoise. This function is represented implicitly and while we are extremely good at using it, we do not have direct, conscious access to its inner workings and therefore cannot communicate it to others easily. To convey this function to a machine, we usually form a dataset of fixed images and their associated labels. We then use a general enough class of functions, typically deep neural networks, and a gradient-based learning algorithm together with backpropagation to teach the machine how to correlate images with their semantic content, e.g. how to assign a label parrot to a picture of a parrot. This process is extremely successful in communicating the implicit human vision function to the computer, and the implicit human and explicit, learned machine functions agree to a large extent. The agreement between the two is striking. Given how different the architectures are (a simulated graph-like function doing a single forward pass vs the wet protein brain of a mammal running continuous inference), how different the learning algorithms are (gradient descent with backpropagation vs something completely different but still unknown), a...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would catching your AIs trying to escape convince AI developers to slow down or undeploy?, published by Buck Shlegeris on August 26, 2024 on The AI Alignment Forum. I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I'll argue this could be hard even if you caught your AI red-handed trying to escape. Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control. You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.) At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this: It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk of war. AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. Even if the AI is indeed doing something systematically funny, we have no evidence that it has the ambition to seize huge amounts of power, and it's not clear that it would be able to. It seems implausible that AIs would do that; there's been no concrete evidence that the AIs are in fact power-hungry. Maybe our competitor just messed up their alignment, which would make sense because, as we've always told you, they're not very competent (especially compared to us). If they want to shut down, they're welcome to. But it doesn't make sense to penalize us for their shoddy work. Maybe they're just lying; this could be a doomer scheme to shut down AI, which the doomers have wanted to do for years. The logs they sent you could be complete forgeries. Or for all we know, someone there (with or without leadership approval) intentionally backdoored their RL data to make their model do this. It would be a mistake to take rash action before independent experts confirm that this wasn't somehow rigged. I'm sympathetic to all of these arguments. The main reason I'd be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn't change my mind based on one weird observation. (I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety concerned people and people who wanted to push forward.) Now, I'm not sure I'm right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, an...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs, published by Michaël Trazzi on August 24, 2024 on The AI Alignment Forum. Owain Evans is an AI Alignment researcher, research associate at the Center of Human Compatible AI at UC Berkeley, and now leading a new AI safety research group. In this episode we discuss two of his recent papers, "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs" (LW) and "Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data" (LW), alongside some Twitter questions. Below are some highlighted quotes from our conversation (available on Youtube, Spotify, Apple Podcast). For the full context for each of these quotes, you can find the accompanying transcript. Situational Awareness Definition "What is situational awareness? The idea is the model's kind of self-awareness, that is its knowledge of its own identity, and then its awareness of its environment. What are the basic interfaces that it is connected to? [...] And then there's a final point with situational awareness, which is, can the model use knowledge of its identity and environment to take rational actions?" "Situational awareness is crucial for an AI system acting as an agent, doing long-term planning. If you don't understand what kind of thing you are, your capabilities and limitations, it's very hard to make complicated plans. The risks of AI mostly come from agentic models able to do planning." Motivation "We wanted to measure situational awareness in large language models with a benchmark similar to Big Bench or MMLU. The motivation is that situational awareness is important for thinking about AI risks, especially deceptive alignment, and we lacked ways to measure and break it down into components." "Situational awareness is relevant to any situation where the model needs to do agentic long-term planning. [...] A model confused about itself and its situation would likely struggle to pull off such a strategy." On Claude 3 Opus Insightful Answers "Let me explain [the Long Monologue task]. Most of our dataset is typical multiple-choice question answering, but we added a task where models write long answers describing themselves and their situation. The idea is to see if the model can combine different pieces of information about itself coherently and make good inferences about why we're asking these questions. Claude 3 Opus was particularly insightful, guessing it might be part of a research study testing self-awareness in LLMs. These were true inferences not stated in the question. The model was reading between the lines, guessing this wasn't a typical ChatGPT-style interaction. I was moderately surprised, but I'd already seen Opus be very insightful and score well on our benchmark. It's worth noting we sample answers with temperature 1, so there's some randomness. We saw these insights often enough that I don't think it's just luck. Anthropic's post-training RLHF seems good at giving the model situational awareness. The GPT-4 base results were more surprising to us." What Would Saturating The Situational Awareness Benchmark Imply For Safety And Governance "If models can do as well or better than humans who are AI experts, who know the whole setup, who are trying to do well on this task, and they're doing well on all the tasks including some of these very hard ones, that would be one piece of evidence. [...] We should consider how aligned it is, what evidence we have for alignment. We should maybe try to understand the skills it's using." "If the model did really well on the benchmark, it seems like it has some of the skills that would help with deceptive alignment. This includes being able to reliably work out when it's being evaluated by humans, when it has a lot of oversight, and when it needs to...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Showing SAE Latents Are Not Atomic Using Meta-SAEs, published by Bart Bussmann on August 24, 2024 on The AI Alignment Forum. Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda's streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback! TL;DR: Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model's computation. We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E We do this by training meta-SAEs, an SAE trained to reconstruct the decoder directions of a normal SAE. We argue that, conceptually, there's no reason to expect SAE latents to be atomic - when the model is thinking about Albert Einstein, it likely also thinks about Germanness, physicists, etc. Because Einstein always entails those things, the sparsest solution is to have the Albert Einstein latent also boost them. Key results SAE latents can be decomposed into more atomic, interpretable meta-latents. We show that when latents in a larger SAE have split out from latents in a smaller SAE, a meta SAE trained on the larger SAE often recovers this structure. We demonstrate that meta-latents allow for more precise causal interventions on model behavior than SAE latents on a targeted knowledge editing task. We believe that the alternate, interpretable decomposition using MetaSAEs casts doubt on the implicit assumption that SAE latents are atomic. We show preliminary results that MetaSAE latents have significant ovelap with latents in a normal SAE of the same size but may relate differently to the larger SAEs used in MetaSAE training. We made a dashboard that lets you explore meta-SAE latents. Terminology: Throughout this post we use "latents" to describe the concrete components of the SAE's dictionary, whereas "feature" refers to the abstract concepts, following Lieberum et al. Introduction Mechanistic interpretability (mech interp) attempts to understand neural networks by breaking down their computation into interpretable components. One of the key challenges of this line of research is the polysemanticity of neurons, meaning they respond to seemingly unrelated inputs. Sparse autoencoders (SAEs) have been proposed as a method for decomposing model activations into sparse linear sums of latents. Ideally, these latents should be monosemantic i.e. respond to inputs that clearly share a similar meaning (implicitly, from the perspective of a human interpreter). That is, a human should be able to reason about the latents both in relation to the features to which they are associated, and also use the latents to better understand the model's overall behavior. There is a popular notion, both implicitly in related work on SAEs within mech interp and explicitly by the use of the term "atom" in sparse dictionary learning as a whole, that SAE features are atomic or can be "true features". However, monosemanticity does not imply atomicity. Consider the example of shapes of different colors - the set of shapes is [circle, triangle, square], and the set of colors is [white, red, green, black], each of which is represented with a linear direction. 'Red triangle' represents a monosemantic feature, but not an atomic feature, as it can be decomposed into red and triangle. It has been shown that sufficiently wide SAEs on toy models will learn 'red triangle', rather than representing 'red' and 'triangle' with separate latents. Furthermore, whilst one may naively re...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025), published by Linda Linsefors on August 23, 2024 on The AI Alignment Forum. Do you have AI Safety research ideas that you would like to work on with others? Is there a project you want to do and you want help finding a team? AI Safety Camp could be the solution for you! Summary AI Safety Camp Virtual is a 3-month long online research program from January to April 2025, where participants form teams to work on pre-selected projects. We want you to suggest the projects! If you have an AI Safety project idea and some research experience, apply to be a Research Lead. If accepted, we offer some assistance to develop your idea into a plan suitable for AI Safety Camp. When project plans are ready, we open up team member applications. You get to review applications for your team, and select who joins as a team member. From there, it's your job to guide work on your project. Who is qualified? We require that you have some previous research experience. If you are at least 1 year into a PhD or if you have completed an AI Safety research program (such as a previous AI Safety Camp, PIBBSS, MATS, and similar), or done a research internship with an AI Safety org, then you are qualified already. Other research experience can count, too. More senior researchers are of course also welcome, as long as you think our format of leading an online team inquiring into your research questions suits you and your research. We require that all Research Leads are active participants in their projects and spend at least 10h/week on AISC. Apply here If you are unsure, or have any questions you are welcome to: Book a call with Robert Send an email to Robert Choosing project idea(s) AI Safety Camp is about ensuring future AIs are either reasonably safe or not built at all. We welcome many types of projects including projects aimed at stopping or pausing AI development, aligning AI, deconfusion research, or anything else you think will help make the world safer from AI. If you like, you can read more of our perspectives on AI safety, or look at past projects. If you already have an idea for what project you would like to lead, that's great. Apply with that one! However, you don't need to come up with an original idea. What matters is you understanding the idea you want to work on, and why. If you base your proposal on someone else's idea, make sure to cite them. 1. For ideas on stopping harmful AI, see here and/or email Remmelt. 2. For some mech-interp ideas see here. 3. We don't have specific recommendations for where to find other types of project ideas, so just take inspiration wherever you find it. You can submit as many project proposals as you want. However, you are only allowed to lead one project. Use this template to describe each of your project proposals. We want one document per proposal. We'll help you improve your project As part of the Research Lead application process, we'll help you improve your project. The organiser whose ideas match best with yours, will work with you to create the best version of your project. We will also ask for assistance from previous Research Leads, and up to a handful of other trusted people, to give you additional feedback. Timeline Research Lead applications September 22 (Sunday): Application deadline for Research Leads. October 20 (Sunday): Deadline for refined proposals. Team member applications: October 25 (Friday): Accepted proposals are posted on the AISC website. Application to join teams open. November 17 (Sunday): Application to join teams closes. December 22 (Sunday): Deadline for Research Leads to choose their team. Program Jan 11 - 12: Opening weekend. Jan 13 Apr 25: Research is happening. Teams meet weekly, and plan in their own work hours. April 26 - 27 (preliminary dates):...