HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest)
490 Episodes
The 15 papers in this episode are:
[00:20] 🔍 Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
[01:01] 👶 BabyVision: Visual Reasoning Beyond Language
[01:45] 🚀 PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
[02:24] 🧠 X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
[03:03] ⚡ MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
[03:41] ⚡ GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
[04:17] 🤖 OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
[05:20] 📉 Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
[06:00] 🚀 Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models
[06:30] 🧠 Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction
[07:10] 🚗 DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
[07:58] 🤖 MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era
[08:26] 🎨 Boosting Latent Diffusion Models via Disentangled Representation Alignment
[09:08] 🤔 What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
[09:45] 🔧 ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration
[Follow us] You can also find us on the following platforms for more information beyond the podcast content. Xiaohongshu: AI速递
The 15 papers in this episode are:
[00:20] 🗺 Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
[01:03] 🧠 MMFormalizer: Multimodal Autoformalization in the Wild
[01:38] 🧬 The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
[02:21] 🎭 CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature
[03:04] 🔍 Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
[03:47] ⚙ EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
[04:22] 🔮 Can We Predict Before Executing Machine Learning Agents?
[04:59] 🖼 AgentOCR: Reimagining Agent History via Optical Self-Compression
[05:39] 🎬 VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
[06:29] 🔍 Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
[07:23] 🔍 Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
[08:07] 🔄 Orient Anything V2: Unifying Orientation and Rotation Understanding
[08:37] 🔍 SmartSearch: Process Reward-Guided Query Refinement for Search Agents
[09:23] ⚙ Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
[10:11] 📊 Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
The 5 papers in this episode are:
[00:39] TOP1 (🔥126) | 📈 GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
[02:31] TOP2 (🔥108) | 🌍 NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
[04:40] TOP3 (🔥107) | 🤖 Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
[07:00] TOP4 (🔥93) | 🔍 InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
[09:40] TOP5 (🔥87) | 🎬 LTX-2: Efficient Joint Audio-Visual Foundation Model
The 15 papers in this episode are:
[00:21] 📈 GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
[01:05] ⚖ Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
[01:33] 🌙 RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
[02:07] 🤖 RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
[02:56] 🤝 RelayLLM: Efficient Reasoning via Collaborative Decoding
[03:31] 🌲 AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search
[04:24] 🤔 VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
[04:57] 🎬 VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
[05:34] 🔍 The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models
[06:09] 🎯 Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
[06:40] 🎥 Plenoptic Video Generation
[07:12] ⚖ Agent-as-a-Judge
[07:43] 📄 DocDancer: Towards Agentic Document-Grounded Information Seeking
[08:20] 🧠 Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
[09:05] 🧠 DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs
The 15 papers in this episode are:
[00:21] ⚖ Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
[01:15] 🧠 Evolving Programmatic Skill Networks
[01:51] 🧠 Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
[02:31] 📊 Benchmark^2: Systematic Evaluation of LLM Benchmarks
[03:12] 🎬 Klear: Unified Multi-Task Audio-Video Joint Generation
[03:53] 🎬 Choreographing a World of Dynamic Objects
[04:36] ✅ Agentic Rubrics as Contextual Verifiers for SWE Agents
[05:11] ⚗ MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics
[05:55] 🚀 E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
[06:53] 🛡 RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
[07:36] 📊 EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
[08:15] 🧠 Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
[08:48] 🔬 Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts
[09:25] 🤖 ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
[10:17] 🧠 MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents
The 15 papers in this episode are:
[00:25] 🔍 InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
[01:07] 🎙 MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
[01:46] 🔬 SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
[02:32] 🎬 LTX-2: Efficient Joint Audio-Visual Foundation Model
[03:26] 🦄 UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
[04:06] 🎨 DreamStyle: A Unified Framework for Video Stylization
[04:38] 🧠 CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
[05:25] ⚡ MiMo-V2-Flash Technical Report
[06:15] 🎮 NitroGen: An Open Foundation Model for Generalist Gaming Agents
[06:58] 🤖 SOP: A Scalable Online Post-Training System for Vision-Language-Action Models
[07:43] 🛡 OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs
[08:31] 📍 The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization
[09:14] 🔍 X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework
[09:57] 🧠 Parallel Latent Reasoning for Sequential Recommendation
[10:27] 🤖 WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
The 15 papers in this episode are:
[00:21] 🧠 K-EXAONE Technical Report
[00:56] 🚀 NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
[01:36] 🎭 DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
[02:19] 🎨 VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
[03:04] 🚀 GARDO: Reinforcing Diffusion Models without Reward Hacking
[03:41] 🎨 VINO: A Unified Visual Generator with Interleaved OmniModal Context
[04:17] ♾ InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
[04:54] 🧠 Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
[05:23] 🚀 Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling
[05:57] 🔄 Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
[06:43] 🔄 Recursive Language Models
[07:12] 🧠 KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs
[07:51] ⚠ COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs
[08:52] 🛰 Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
[09:40] 🧱 SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
The 12 papers in this episode are:
[00:22] 🤖 Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
[00:52] 🌍 NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
[01:27] 🤖 Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
[01:59] 🚀 SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
[02:35] 🎭 Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
[03:14] 🎬 AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
[03:47] 🧠 Deep Delta Learning
[04:11] 🧠 Nested Learning: The Illusion of Deep Learning Architectures
[04:47] 🧠 Diversity or Precision? A Deep Dive into Next Token Prediction
[05:23] 🧠 Fast-weight Product Key Memory
[05:58] 🧬 InfoSynth: Information-Guided Benchmark Synthesis for LLMs
[06:28] 🌀 MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
The 5 papers in this episode are:
[00:33] TOP1 (🔥132) | 🧠 mHC: Manifold-Constrained Hyper-Connections
[02:32] TOP2 (🔥100) | 🧠 Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
[04:45] TOP3 (🔥94) | 🎬 InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
[07:17] TOP4 (🔥86) | 🔗 Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
[09:23] TOP5 (🔥62) | 🎬 LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
The 3 papers in this episode are:
[00:19] 🧠 Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
[00:56] 🧠 DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
[01:27] 🔄 On the Role of Discreteness in Diffusion LLMs
The 15 papers in this episode are:
[00:22] 🚀 Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
[01:00] 🤖 Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
[01:52] 🧠 mHC: Manifold-Constrained Hyper-Connections
[02:25] 🔍 GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
[03:11] 🔮 Scaling Open-Ended Reasoning to Predict the Future
[04:00] 🧠 AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents
[04:32] 🎬 PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
[05:16] 🦾 GR-Dexter Technical Report
[05:59] 🎬 SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
[06:56] 🔍 Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
[07:28] 🧠 BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts
[08:06] 🧭 Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
[08:47] 🧠 Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking
[09:20] 🎯 Factorized Learning for Temporally Grounded Video-Language Models
[09:59] 🎞 Pretraining Frame Preservation in Autoregressive Video Memory Compression
The 10 papers in this episode are:
[00:29] TOP1 (🔥279) | 🧠 From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
[02:22] TOP2 (🔥242) | 🚀 DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
[04:45] TOP3 (🔥217) | 🚀 Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
[06:47] TOP4 (🔥195) | ⚙ DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
[09:19] TOP5 (🔥181) | 🎬 LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
[11:39] TOP6 (🔥167) | 🤖 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
[13:15] TOP7 (🔥163) | 🎬 Kling-Omni Technical Report
[15:12] TOP8 (🔥149) | 📊 DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
[17:44] TOP9 (🔥146) | 🧠 Qwen3-VL Technical Report
[20:34] TOP10 (🔥128) | 🎬 Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
The 6 papers in this episode are:
[00:24] 🧊 UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
[01:00] 🎨 DreamOmni3: Scribble-based Editing and Generation
[01:34] 🧠 End-to-End Test-Time Training for Long Context
[02:18] 🔬 Evaluating Parameter Efficient Methods for RLVR
[03:02] 🔍 GraphLocator: Graph-guided Causal Reasoning for Issue Localization
[03:35] ⚠ GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs
The 15 papers in this episode are:
[00:24] 🔗 Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
[01:07] 🎬 LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
[01:55] 🌍 Yume-1.5: A Text-Controlled Interactive World Generation Model
[02:30] 🔍 SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
[02:59] 🔮 Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
[03:40] 🎯 SpotEdit: Selective Region Editing in Diffusion Transformers
[04:23] 🚀 Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
[05:09] 🔍 GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
[05:56] 🤖 Act2Goal: From World Model To General Goal-conditioned Policy
[06:31] ⚡ Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
[06:59] 🌐 Web World Models
[07:34] 🚀 DiRL: An Efficient Post-Training Framework for Diffusion Language Models
[08:19] 🎬 Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
[09:02] 🧠 Training AI Co-Scientists Using Rubric Rewards
[09:39] 🧩 Monadic Context Engineering
The 13 papers in this episode are:
[00:27] 🧠 Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
[01:07] 🎬 InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
[01:46] 🤖 MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
[02:22] 👁 UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
[03:04] 🎨 ProEdit: Inversion-based Editing From Prompts Done Right
[03:58] ⏱ TimeBill: Time-Budgeted Inference for Large Language Models
[04:37] 🧠 See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
[05:16] 🌦 Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
[05:48] 🧠 SVBench: Evaluation of Video Generation Models on Social Reasoning
[06:27] 🔍 InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
[07:15] 🎨 SlideTailor: Personalized Presentation Slide Generation for Scientific Papers
[08:11] 🤖 SWE-RM: Execution-free Feedback For Software Engineering Agents
[08:48] ⚡ A 58-Addition, Rank-23 Scheme for General 3x3 Matrix Multiplication
The 5 papers in this episode are:
[00:42] TOP1 (🔥188) | ⚙ DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
[02:34] TOP2 (🔥105) | 🔬 Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
[05:04] TOP3 (🔥85) | 🎬 SemanticGen: Video Generation in Semantic Space
[07:03] TOP4 (🔥73) | 🔍 Step-DeepResearch Technical Report
[09:31] TOP5 (🔥71) | 🧠 PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
The 6 papers in this episode are:
[00:19] 🧠 Latent Implicit Visual Reasoning
[00:56] 🎬 Spatia: Video Generation with Updatable Spatial Memory
[01:36] 🧠 Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
[02:11] 🔍 How Much 3D Do Video Foundation Models Encode?
[02:58] 🎯 VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
[03:36] 🚀 GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
The 14 papers in this episode are:
[00:20] 🧠 Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
[01:11] ⚡ TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
[01:52] 🧭 T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
[02:38] 🎬 DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
[03:21] 🔍 Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
[04:07] 🎬 HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
[04:52] 🚀 Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
[05:38] 🔍 TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
[06:12] 🚀 NVIDIA Nemotron 3: Efficient and Open Intelligence
[06:57] 🎬 Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
[07:27] 🎬 Streaming Video Instruction Tuning
[08:02] 🧠 Multi-hop Reasoning via Early Knowledge Alignment
[08:43] 📊 SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
[09:24] 🏆 LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
The 15 papers in this episode are:
[00:19] 🎬 SemanticGen: Video Generation in Semantic Space
[01:01] 🔍 Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
[01:48] 🧠 SpatialTree: How Spatial Abilities Branch Out in MLLMs
[02:23] 🤖 LongVideoAgent: Multi-Agent Reasoning with Long Videos
[03:06] 🧠 MemEvolve: Meta-Evolution of Agent Memory Systems
[03:46] 🔍 Step-DeepResearch Technical Report
[04:22] 🎧 SAM Audio: Segment Anything in Audio
[05:00] 🚀 INTELLECT-3: Technical Report
[05:30] 🔍 FaithLens: Detecting and Explaining Faithfulness Hallucination
[06:07] 🧠 Reinforcement Learning for Self-Improving Agent with Skill Library
[06:53] 📊 QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
[07:38] 🔊 Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
[08:18] 🧠 Active Intelligence in Video Avatars via Closed-loop World Modeling
[08:55] 🔬 Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
[09:32] ⚠ Toxicity Ahead: Forecasting Conversational Derailment on GitHub
The 15 papers in this episode are:
[00:22] ⚙ DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
[01:04] 🔍 The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
[01:50] 🎬 Region-Constraint In-Context Generation for Instructional Video Editing
[02:33] 🎥 Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
[03:08] 🔍 QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
[03:58] 🤔 Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
[04:35] 🧭 LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry
[05:13] 🎬 WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
[06:08] 🔍 UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
[06:45] 🧬 GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
[07:22] 🎨 Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
[07:56] ⚡ LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
[08:38] 📱 MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
[09:20] ⚖ Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
[10:00] 🎬 StoryMem: Multi-shot Long Video Storytelling with Memory







