This research introduces **Parallel Token Prediction (PTP)**, a novel framework designed to accelerate language model inference by generating multiple tokens simultaneously in a single forward pass. Standard autoregressive decoding emits one token per forward pass, a **sequential bottleneck** that PTP overcomes by incorporating auxiliary random variables directly into the model's inputs to coordinate interdependent predictions. The authors provide mathematical proof that this method is as **expressively powerful** as traditional autoregressive models while avoiding the incoherent outputs common in other parallel systems. Experimental results demonstrate that PTP achieves **state-of-the-art decoding speeds** across diverse tasks, including coding and natural language conversation. By reducing latency without sacrificing accuracy, the framework offers a scalable path toward more **efficient and responsive** artificial intelligence applications.
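As a rough sketch of the core idea (a toy construction of ours, not the paper's architecture), the snippet below shows how auxiliary random variables appended to the input can coordinate several tokens sampled from a single forward pass; the module, sizes, and noise format are all illustrative assumptions.

```python
# Toy sketch: auxiliary noise is injected into the input so that k tokens
# sampled in one forward pass are coordinated through shared randomness.
import torch
import torch.nn as nn

VOCAB, D_MODEL, K = 100, 64, 4  # toy vocabulary, hidden size, tokens per pass

class ToyPTPHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.noise_proj = nn.Linear(K, D_MODEL)        # injects auxiliary variables
        self.out = nn.Linear(D_MODEL, K * VOCAB)       # k next-token distributions

    def forward(self, prefix_ids, noise):
        h = self.embed(prefix_ids).mean(dim=1)         # crude prefix summary
        h = h + self.noise_proj(noise)                 # condition on shared noise
        return self.out(h).view(-1, K, VOCAB)          # (batch, k, vocab) logits

model = ToyPTPHead()
prefix = torch.randint(0, VOCAB, (1, 10))
u = torch.rand(1, K)                                   # auxiliary random variables
logits = model(prefix, u)
tokens = torch.distributions.Categorical(logits=logits).sample()
print(tokens.shape)  # torch.Size([1, 4]) -> four tokens from one forward pass
```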
This research introduces Posterior Behavioral Cloning (POSTBC), a novel pretraining method designed to enhance the reinforcement learning (RL) finetuning of robotic policies. Traditional behavioral cloning (BC) often fails because it overfits to specific demonstration data, resulting in poor action coverage and limited exploration during subsequent online learning. By modeling the posterior distribution of demonstrator behavior rather than simply mimicking actions, POSTBC injects uncertainty-aware entropy into the policy's action distribution. This ensures the robot maintains high performance in familiar scenarios while exploring a diverse range of actions in low-density data regions. Experimental results across simulation and real-world robotics demonstrate that this approach significantly improves the efficiency of RL finetuning without sacrificing initial pretraining quality. Ultimately, POSTBC provides a more robust initialization for autonomous systems, allowing them to adapt to new tasks with fewer samples.
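A minimal illustration of why posterior modeling helps, using a Gaussian process as a stand-in policy (our toy example, not POSTBC itself): predictive uncertainty, and hence sampled-action diversity, grows away from the demonstration data.

```python
# Toy illustration (not the paper's method): a GP "policy" over a 1-D state has
# low predictive variance near demo states and high variance far from them, so
# sampled actions stay faithful in-distribution and diverse out-of-distribution.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
demo_states = rng.uniform(0.0, 1.0, size=(30, 1))      # demonstrations cluster here
demo_actions = np.sin(4 * demo_states).ravel()

policy = GaussianProcessRegressor().fit(demo_states, demo_actions)

for s in [0.5, 3.0]:                                    # in-distribution vs. far away
    mean, std = policy.predict([[s]], return_std=True)
    sampled = rng.normal(mean, std, size=5)             # posterior action samples
    print(f"state={s}: action std={std[0]:.3f}, samples={np.round(sampled, 2)}")
```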
This research paper introduces Activation Oracles (AOs), which are large language models trained to translate the internal activations of other models into plain English. While previous methods for interpreting these internal states were highly specialized and narrow, AOs act as general-purpose explainers that can answer a wide variety of natural language questions about what a model is thinking. By training on diverse tasks like context prediction and classification, these oracles develop a remarkable ability to uncover hidden information that the target model has been specifically instructed to keep secret. For example, the researchers found that an AO could expose a secret word or identify whether a model had been fine-tuned to have a "malign" personality, even when those traits were absent from the visible text. The results demonstrate that diversified training allows AOs to outperform traditional "white-box" interpretability tools across multiple auditing benchmarks. Ultimately, this work suggests that scaling the variety of training data is the key to creating robust systems that can verbalize the complex internal logic of artificial intelligence.
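A hedged sketch of how such an activation-to-text interface might be wired (the projection-plus-soft-prompt layout below is our assumption; all modules and sizes are toy placeholders, not the paper's models):

```python
# A target model's hidden activation is linearly projected into the oracle LM's
# embedding space and prepended to an embedded natural-language question.
import torch
import torch.nn as nn

D_TARGET, D_ORACLE, VOCAB = 256, 512, 1000

target_hidden = torch.randn(1, D_TARGET)                 # activation from the audited model
projector = nn.Linear(D_TARGET, D_ORACLE)                # maps activations into "token" space
question_ids = torch.randint(0, VOCAB, (1, 8))           # e.g. "what secret word is encoded here?"
oracle_embed = nn.Embedding(VOCAB, D_ORACLE)

# The oracle consumes [projected activation ; question embeddings] and decodes an answer.
soft_token = projector(target_hidden).unsqueeze(1)       # (1, 1, D_ORACLE)
inputs = torch.cat([soft_token, oracle_embed(question_ids)], dim=1)
print(inputs.shape)  # (1, 9, 512): ready for the oracle LM's transformer stack
```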
Researchers have developed a method to improve reinforcement learning (RL) by leveraging the internal representations of pretrained autoregressive models. While standard AI models struggle with sparse-reward tasks because they explore through token-by-token variations, this approach introduces an unsupervised metacontroller that discovers temporally-abstract actions. By intervening directly in the model's residual stream at mid-depth, the system learns to execute high-level subroutines that span multiple time steps. This "internal RL" framework effectively reduces the search space and simplifies credit assignment by operating on a more efficient, abstract timescale. Experimental results in both grid world and continuous motor control environments show that this method solves complex problems where traditional RL baselines fail. Ultimately, the study demonstrates that self-supervised pretraining builds structured internal beliefs that can be repurposed for autonomous planning and navigation.
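A minimal sketch of the residual-stream intervention (our own toy stack, not the paper's models): a metacontroller selects one of a few learned option vectors, which is added to the hidden state at a mid-depth layer of an otherwise frozen network.

```python
# The option vector persists as a temporally-abstract action: the same index can
# be re-applied across several environment steps, steering the frozen base model.
import torch
import torch.nn as nn

D, N_LAYERS, N_OPTIONS = 64, 8, 4
base = nn.ModuleList([nn.Linear(D, D) for _ in range(N_LAYERS)])   # stand-in for transformer blocks
options = nn.Parameter(torch.randn(N_OPTIONS, D) * 0.1)            # learned high-level subroutines

def forward(x, option_idx, intervene_at=N_LAYERS // 2):
    for i, block in enumerate(base):
        x = torch.relu(block(x)) + x                                # residual stream
        if i == intervene_at:
            x = x + options[option_idx]                             # metacontroller intervention
    return x

h = forward(torch.randn(1, D), option_idx=2)
print(h.shape)  # same interface as the base model, but behavior is steered mid-depth
```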
This research investigates the theoretical and practical differences between reconstruction-based and joint-embedding paradigms in self-supervised learning (SSL). By deriving the first closed-form solutions for these methods, the authors demonstrate that joint-embedding approaches are more robust when datasets contain high-magnitude irrelevant noise, such as complex backgrounds in images. Conversely, reconstruction is more effective for data with low-magnitude noise, explaining its success in natural language processing where tokens are semantically dense. A critical finding is that, unlike supervised learning, SSL requires a precise alignment between data augmentations and noise to eliminate uninformative features. Ultimately, the work justifies the empirical dominance of latent space prediction on challenging real-world datasets where identifying and ignoring noise is essential for performance.
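A toy numerical illustration of the claimed trade-off (our construction, not the paper's closed-form derivation): when the nuisance dimension has much larger magnitude than the signal, a reconstruction-style objective latches onto the noise, while a joint-embedding objective that only requires agreement across augmented views keeps the shared signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(0, 1.0, size=(n, 1))          # low-magnitude, informative
noise = rng.normal(0, 10.0, size=(n, 1))          # high-magnitude, irrelevant (e.g. background)
x = np.hstack([signal, noise])

# Reconstruction proxy: the top principal component chases variance -> the noise axis.
cov = x.T @ x / n
w_recon = np.linalg.eigh(cov)[1][:, -1]

# Joint-embedding proxy: two views share the signal but have independently resampled
# noise, so the direction maximizing cross-view agreement is the signal axis.
x2 = np.hstack([signal, rng.normal(0, 10.0, size=(n, 1))])
cross = (x.T @ x2 + x2.T @ x) / (2 * n)
w_joint = np.linalg.eigh(cross)[1][:, -1]

print("reconstruction direction:", np.round(np.abs(w_recon), 2))   # ~[0, 1]: keeps the noise
print("joint-embedding direction:", np.round(np.abs(w_joint), 2))  # ~[1, 0]: keeps the signal
```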
This research explores Chain-of-Thought (CoT) monitorability, which refers to how effectively an external system can detect misbehavior by analyzing the model's written-out reasoning steps. The authors introduce a diverse evaluation taxonomy that categorizes environments based on whether they involve interventions, specific processes, or final outcomes, such as sycophancy, bias, and sabotage. To measure monitoring success accurately, the study utilizes g-mean², a metric designed to penalize failures more severely than traditional F1 scores while remaining robust to class imbalance. Results indicate that while larger models can potentially hide their cognition within internal activations, providing monitors with CoT access significantly improves the detection of undesirable behaviors compared to looking at actions alone. Interestingly, current reinforcement learning (RL) processes do not appear to meaningfully degrade this transparency, though the authors warn that future scaling or specific optimization pressures could incentivize CoT obfuscation. Ultimately, the work suggests that maintaining legible reasoning traces is a vital, though potentially fragile, component for the safety and control of frontier AI systems.
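For concreteness, here is the metric as we read it, assuming g-mean² is the product of sensitivity and specificity (the squared geometric mean); unlike F1, it collapses to zero whenever the monitor completely fails on either class, and it is insensitive to how imbalanced the two classes are.

```python
# Assumed definition: g-mean^2 = TPR * TNR (square of the geometric mean).
def g_mean_squared(tp: int, fp: int, tn: int, fn: int) -> float:
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # fraction of misbehavior caught
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # fraction of benign runs left alone
    return sensitivity * specificity

# A monitor that flags everything can score a decent F1 on imbalanced data
# but gets a g-mean^2 of exactly 0.
print(g_mean_squared(tp=90, fp=1000, tn=0, fn=10))   # 0.0
print(g_mean_squared(tp=90, fp=50, tn=950, fn=10))   # 0.855
```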
Researchers have discovered a macroscopic physical law governing the behavior of Large Language Model (LLM)-driven agents, revealing that their generative dynamics mirror equilibrium systems in physics. By measuring transition probabilities between states, the study demonstrates that these agents follow a detailed balance condition, suggesting they do not merely learn specific rules but instead optimize an internal potential function. This function acts as a global guide, allowing models to perceive the "quality" of a state and its proximity to a goal across different architectures and prompts. To quantify these dynamics, the authors propose a framework based on the least action principle, which minimizes the mismatch between an agent’s transitions and its underlying potential. Experiments across models like GPT-5 Nano and Claude-4 confirm that this mathematical structure provides a predictable, quantifiable way to analyze AI agent behavior. Ultimately, this work seeks to transition the study of AI agents from heuristic engineering to a rigorous science rooted in measurable physical principles.
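A toy numerical check of the detailed-balance reading (our illustration, using a standard Metropolis construction rather than the paper's estimator): when transitions satisfy detailed balance with respect to exp(phi), log transition-probability ratios recover the potential differences that are interpreted as state quality.

```python
import numpy as np

phi = np.array([0.0, 1.0, 2.5])                       # hypothetical potential over 3 states
n = len(phi)
q = 1.0 / n                                           # uniform proposal probability
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = q * min(1.0, np.exp(phi[j] - phi[i]))   # Metropolis rates obey detailed balance
    P[i, i] = 1.0 - P[i].sum()                        # remaining mass stays in place

# Log transition-probability ratios recover potential differences between states.
diffs = np.log(P[0, 1:] / P[1:, 0])
print(np.round(diffs, 3))                             # [1.0, 2.5] = phi[1:] - phi[0]
```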
This research formalizes the process of reasoning in large language models as a latent variable model, utilizing the expectation-maximization (EM) algorithm to improve performance. The authors demonstrate that training a model to generate intermediate rationales before answering is mathematically equivalent to reward-weighted fine-tuning using binary correctness as a signal. A central focus of the study is the sampling distribution used to create these rationales, comparing methods like rejection sampling and the self-taught reasoner (STaR). The paper introduces prompt posterior sampling (PPS), a technique that conditions the model on the correct answer during training to generate more effective reasoning traces. Experiments across multiple benchmarks show that PPS consistently outperforms existing methods by producing more concise and accurate rationales. Ultimately, the work highlights that high-quality rationale generation is just as critical to model improvement as the underlying optimization algorithms.
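A structural sketch of the EM loop as described (stub functions stand in for a real language model; the prompt format and the 90% success rate are invented for illustration): the E-step samples rationales, optionally conditioned on the known answer as in PPS, and only traces whose final answer is correct survive into the M-step, which is the reward-weighted fine-tuning equivalence.

```python
import random
from typing import Optional

def sample_rationale(question: str, hint: Optional[str] = None) -> tuple:
    """Stub LM call returning (rationale, predicted_answer). Passing the correct
    answer as a hint mimics PPS: the answer is seen at sampling time only."""
    answer = hint if hint is not None and random.random() < 0.9 else "wrong"
    return f"reasoning about {question!r}", answer

def e_step(dataset, use_pps: bool):
    kept = []
    for question, gold in dataset:
        rationale, pred = sample_rationale(question, hint=gold if use_pps else None)
        if pred == gold:                       # binary-correctness reward weight
            kept.append((question, rationale, gold))
    return kept                                # M-step: fine-tune on these traces

data = [("2+2", "4"), ("capital of France", "Paris")]
print(len(e_step(data, use_pps=True)), "traces kept with answer-conditioned sampling")
```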
This research introduces **Exploratory Causal Inference**, a framework designed to identify unknown treatment effects within high-dimensional datasets. The authors propose using **foundation models** and **sparse autoencoders (SAEs)** to transform raw data into a dictionary of interpretable latent features. To solve the "**paradox of exploratory causal inference**"—where increased data power causes irrelevant, entangled neurons to appear falsely significant—they develop the **Neural Effect Search (NES)** algorithm. **NES** employs **recursive stratification** to isolate true causal signals by iteratively removing the influence of previously discovered effects. Validated through semi-synthetic tests and ecological trials, the method successfully distinguishes **scientifically relevant outcomes** from experimental noise. Ultimately, this approach bridges the gap between **data-driven empiricism** and human-led **causal interpretation**.
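A toy illustration of the stratification intuition (our construction; residualizing on the discovered feature is a crude stand-in for NES's recursive stratification): an entangled feature looks treatment-responsive only through its correlation with the truly affected feature, and the false signal disappears once that feature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
T = rng.integers(0, 2, n)                          # randomized treatment
z_true = rng.normal(size=n) + 0.5 * T              # genuinely affected latent feature
z_entangled = 0.8 * z_true + rng.normal(size=n)    # correlated but not directly affected

def effect(z, t):                                  # naive difference-in-means estimate
    return z[t == 1].mean() - z[t == 0].mean()

print("naive effects:", round(effect(z_true, T), 3), round(effect(z_entangled, T), 3))

# "Stratify" on the first discovered feature by residualizing it out, then retest.
slope = np.polyfit(z_true, z_entangled, 1)[0]
resid = z_entangled - slope * z_true
print("after stratification:", round(effect(resid, T), 3))   # shrinks toward zero
```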
This paper introduces the Prism Hypothesis, which suggests that multimodal data shares a **common frequency spectrum** where **low-frequency bands** hold abstract meaning and **high-frequency bands** store fine details. To implement this theory, the authors developed **Unified Autoencoding (UAE)**, a framework that integrates **semantic perception** and **pixel-level fidelity** into a single latent space. This model utilizes a **frequency-band modulator** to separate global structures from intricate textures, allowing a single encoder to handle both **image understanding and generation**. By aligning with the spectral characteristics of existing encoders, UAE achieves **state-of-the-art reconstruction** and competitive generative performance. Ultimately, the research offers a method to resolve the traditional tension between **representational abstraction** and visual accuracy.
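A minimal sketch of a frequency-band split (our simplified version; the actual UAE modulator is presumably learned rather than a fixed radial mask): a latent feature map is separated into a low-frequency band carrying global structure and a high-frequency residual carrying texture, so each band can be weighted or processed independently.

```python
import torch

def split_bands(latent: torch.Tensor, cutoff: float = 0.25):
    """latent: (B, C, H, W). Returns (low_band, high_band) with the same shape."""
    B, C, H, W = latent.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)
    low_mask = (radius <= cutoff).to(latent.dtype)        # keep slow spatial variations
    spec = torch.fft.fft2(latent)
    low = torch.fft.ifft2(spec * low_mask).real           # low-frequency "semantics"
    return low, latent - low                              # high band = residual detail

z = torch.randn(2, 16, 32, 32)
low, high = split_bands(z)
print(low.shape, high.shape, torch.allclose(low + high, z, atol=1e-5))
```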
This paper introduces a systematic framework for **agentic AI adaptation**, categorizing research into four distinct paradigms based on whether the **agent** or its **tools** are being optimized. **Agent adaptation** involves updating core models using either **tool-execution signals** for causal feedback or **agent-output signals** for holistic task performance. In contrast, **tool adaptation** focuses on refining external modules, either as **agent-agnostic** components or through **agent-supervised** learning where a fixed model guides tool development. By analyzing these strategies, the authors highlight a transition from **monolithic systems** toward **modular ecosystems** that favor data efficiency and architectural flexibility. The survey concludes by identifying future opportunities in **co-adaptation** and **continual learning** to build more robust, self-evolving autonomous systems.
Large language models often struggle with long-context tasks because the attention mechanism suffers from **score dilution**, where relevant information is overwhelmed by surrounding "distractor" tokens. Researchers found that common **inference-time scaling strategies**, such as generating additional "thinking tokens," fail to solve this problem as context length increases. To address this, the authors propose **query-only test-time training (qTTT)**, a computationally efficient method that updates only the model's **query projection matrices** for a specific input. By performing a single prefill to cache **keys and values** and then applying targeted gradient updates, the model learns to better distinguish the "needle" of relevant information from the "haystack" of noise. Experiments across **LongBench-v2** and **ZeroScrolls** benchmarks show that qTTT consistently outperforms traditional methods and thinking tokens. This approach suggests that **adapting model parameters** during inference is a more effective use of compute than simply increasing the length of the generated output.
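A minimal sketch of the update pattern (toy single-head attention; the self-supervised objective below is a placeholder, not the paper's loss): keys and values are computed once and cached, and only the query projection receives a few gradient steps for the specific input at hand.

```python
import torch
import torch.nn as nn

D = 64
W_q = nn.Linear(D, D)
W_k, W_v = nn.Linear(D, D), nn.Linear(D, D)
for p in list(W_k.parameters()) + list(W_v.parameters()):
    p.requires_grad_(False)                               # everything except queries stays frozen

x = torch.randn(1, 128, D)                                # the long-context input itself
with torch.no_grad():
    K, V = W_k(x), W_v(x)                                 # single prefill, cached keys/values

opt = torch.optim.SGD(W_q.parameters(), lr=1e-3)
for _ in range(5):                                        # a handful of targeted updates
    attn = torch.softmax(W_q(x) @ K.transpose(1, 2) / D ** 0.5, dim=-1)
    out = attn @ V
    loss = ((out - x) ** 2).mean()                        # placeholder self-supervised loss
    opt.zero_grad(); loss.backward(); opt.step()
print("query projection adapted for this input; K/V cache untouched")
```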
This paper discusses TabPFN-2.5, a sophisticated tabular foundation model designed to handle diverse datasets with up to 50,000 samples and 2,000 features. This next-generation AI significantly outperforms traditional tree-based models and complex ensembles like AutoGluon in a fraction of the time. The researchers highlight its state-of-the-art performance across various industries, particularly in healthcare, finance, and manufacturing, where it excels even with limited data. To facilitate industrial deployment, the system includes a distillation engine that converts the model into faster, lightweight formats like MLPs or tree ensembles. Beyond simple classification and regression, the model serves as a versatile tool for causal inference and time series forecasting. This release establishes a new benchmark for tuning-free machine learning, offering robust predictive power and scalability for real-world applications.
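The distillation engine's interface is not shown in this summary, so the sketch below uses generic scikit-learn components (a gradient-boosted teacher as a stand-in for TabPFN-2.5) purely to illustrate the teacher-labels-student pattern behind exporting a lightweight deployable model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for the foundation model
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

teacher = GradientBoostingClassifier().fit(X_train, y_train)
pseudo_labels = teacher.predict(X_train)                   # teacher's decisions, not the raw labels

student = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
student.fit(X_train, pseudo_labels)                        # lightweight student mimics the teacher

print("teacher acc:", teacher.score(X_test, y_test))
print("student acc:", student.score(X_test, y_test))
```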
This paper introduces What’s In My Human Feedback? (WIMHF), a method for automatically decoding hidden preferences from the human feedback data used to train language models. By utilizing sparse autoencoders, WIMHF translates complex text embeddings into a small set of interpretable features that explain why human annotators prefer one response over another. The research reveals that feedback datasets often contain conflicting signals, such as Reddit users favoring informal jokes while other groups disfavor them. Notably, the authors demonstrate that WIMHF can identify misaligned or unsafe preferences, such as a bias against model refusals in certain benchmarks. These discovered features allow developers to curate safer datasets by flipping harmful labels and to personalize model behavior based on specific user stylistic choices. Ultimately, the work provides a human-centered diagnostic tool to make the black-box process of model alignment more transparent and controllable.
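A hedged sketch of the analysis pipeline as we read it (all components below are stand-ins: random vectors for text embeddings, DictionaryLearning for the sparse autoencoder): sparse features are extracted from response embeddings, and a linear probe on chosen-minus-rejected feature differences indicates which features drive preferences. With random inputs this only demonstrates the mechanics, not any real finding.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, d, k = 300, 32, 16
chosen = rng.normal(size=(n_pairs, d))                     # stand-ins for text embeddings
rejected = rng.normal(size=(n_pairs, d))

sae = DictionaryLearning(n_components=k, alpha=1.0, random_state=0)
sae.fit(np.vstack([chosen, rejected]))                     # sparse-feature dictionary
f_chosen, f_rejected = sae.transform(chosen), sae.transform(rejected)

# Label 1 = "annotator preferred the chosen response"; flip half the pairs so the
# probe sees both classes (purely to exercise the fitting step on synthetic data).
diff = f_chosen - f_rejected
flip = rng.random(n_pairs) < 0.5
diff[flip] *= -1
labels = (~flip).astype(int)

probe = LogisticRegression(max_iter=1000).fit(diff, labels)
print("most preference-predictive feature index:", np.abs(probe.coef_).argmax())
```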
We discuss Bolmo, a groundbreaking family of byte-level language models that offers a practical alternative to traditional subword-based tokenization. Developed by the Allen Institute for AI (AI2) and collaborating universities, these models achieve state-of-the-art performance by "byteifying" existing subword models like OLMo. This innovative process uses a specialized two-stage distillation procedure to convert subword models into byte-level ones using less than 1% of the original pretraining budget. Architecturally, Bolmo features a non-causal boundary predictor and local mLSTM layers to resolve efficiency and character-understanding limitations inherent in previous systems. The research demonstrates that Bolmo effectively matches or exceeds the performance of its source models in coding and character-based tasks. Furthermore, the authors show that Bolmo can be further optimized for speed and easily post-trained using existing subword ecosystems via task arithmetic.
We cover Neel Nanda's (Google DeepMind) discussion of the efficacy and limitations of Sparse Autoencoders (SAEs) as a tool for unsupervised discovery and interpretability in large language models. Initially considered a major breakthrough for breaking down model activations into interpretable, linear concepts, SAEs have since revealed challenges and pathologies, such as feature absorption and the difficulty of finding truly canonical units. While acknowledging that SAEs are valuable for generating hypotheses and providing unsupervised insights into model behavior, especially when exploring unknown concepts, the speaker ultimately concludes that supervised methods are often superior for finding specific, known concepts, and that SAEs are not a complete solution for full model reverse engineering. Newer iterations like Matryoshka SAEs and related techniques like crosscoders and transcoder-based attribution graphs are also examined for their ability to advance model understanding, despite their associated complexities and drawbacks.
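For readers new to the object under discussion, here is a bare-bones sparse autoencoder (a generic sketch, not any particular production SAE): activations are encoded into a wider, sparse feature vector and reconstructed as a linear combination of dictionary directions, which is what makes the learned units candidates for interpretable concepts.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_model=256, d_features=1024):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model, bias=False)  # columns = dictionary directions

    def forward(self, acts):
        feats = torch.relu(self.enc(acts))          # sparse, non-negative feature activations
        return self.dec(feats), feats

sae = TinySAE()
acts = torch.randn(32, 256)                          # residual-stream activations (stand-in)
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()   # reconstruction + L1 sparsity
print(loss.item() > 0, feats.shape)
```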
We discuss Neel Nanda's (Google DeepMind) perspectives on the current state and future directions of mechanistic interpretability (MI) in AI research. Nanda describes major shifts in the field over the past two years, highlighting the improved capabilities and "scarier" nature of modern models, alongside the increasing use of inference-time compute and reinforcement learning. A key theme is the argument that MI research should primarily focus on understanding model behavior, such as AI psychology and debugging model failures, rather than attempting control (steering or editing), since traditional machine learning methods are typically superior for control tasks. Nanda also stresses the importance of pragmatism, simplicity in techniques, and using downstream tasks for validation to ensure research has real-world utility and avoids common pitfalls.
This paper discusses how a Retrieval-Augmented Generation (RAG) framework can be designed to overcome the structural issues of separate retrieval and generation modules. The proposed framework, CLaRa, achieves this by employing a **shared latent space** where documents are compressed into concise, continuous memory-token representations, addressing the architectural mismatch and efficiency problems of traditional RAG. Key to CLaRa is its **joint optimization** mechanism, which uses the Next-Token Prediction loss from the generator to provide a weak supervision signal, aligning the retriever with the downstream task objective without requiring explicit relevance labels. The framework uses a diverse dataset of **Simple QA, Complex QA, and Paraphrase pairs** for pretraining, and empirical results show that CLaRa, particularly when initialized from pretraining, achieves **state-of-the-art retrieval performance** that rivals or surpasses fully supervised baselines on various question-answering tasks. Furthermore, analyses confirm that the compressed representations successfully **preserve semantic content** while substantially reducing the context length, significantly improving overall system efficiency.
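A structural sketch of the joint-training idea (toy modules throughout; the real CLaRa architecture, compression scheme, and losses are not reproduced here): documents are compressed into a few continuous memory tokens, retrieval scores softly weight them, and the generator's next-token-prediction loss backpropagates into the retriever with no relevance labels.

```python
import torch
import torch.nn as nn

D, N_DOCS, N_MEM, VOCAB = 64, 8, 4, 1000
compressor = nn.Linear(D, N_MEM * D)                 # document -> continuous memory tokens
retriever = nn.Linear(D, D)                          # query/document scorer
generator = nn.Linear(D, VOCAB)                      # stand-in decoder head

docs = torch.randn(N_DOCS, D)
query = torch.randn(1, D)
target_token = torch.tensor([42])

mem = compressor(docs).view(N_DOCS, N_MEM, D)        # compressed representations per document
scores = (retriever(query) @ docs.T).softmax(-1)     # differentiable retrieval weights
context = (scores.view(N_DOCS, 1, 1) * mem).sum(0)   # soft mixture of memory tokens
logits = generator(context.mean(0, keepdim=True))
loss = nn.functional.cross_entropy(logits, target_token)
loss.backward()                                       # NTP loss reaches the retriever's weights
print(retriever.weight.grad is not None)              # True: weak supervision for retrieval
```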