HuggingFace 每日AI论文速递

Ten minutes a day to catch up on the day's trending AI papers on HuggingFace. Updated every weekday; subscriptions welcome. 📢 Find the podcast by searching 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou or Apple Podcasts. 🖼 A text-and-image edition is also available: search for and follow 【AI速递】 on Xiaohongshu.

2025.12.01 | Z-Image takes the crown with few parameters and high efficiency; REASONEDIT tops the charts by thinking before it paints

The 15 papers in this episode:
[00:26] 🚀 Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
[01:00] 🤔 REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
[01:25] 🎬 AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
[01:59] 🌉 Vision Bridge Transformer at Scale
[02:35] 🔍 Architecture Decoupling Is Not All You Need For Unified Multimodal Model
[03:23] ⚡ DiP: Taming Diffusion Models in Pixel Space
[03:49] 🧠 Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
[04:19] 🤖 DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
[05:02] ⚡ Adversarial Flow Models
[05:29] 🔬 Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
[06:10] 🎥 Captain Safari: A World Engine
[06:43] 🌍 World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
[07:20] 🔍 The Collapse of Patches
[07:50] 🔍 RefineBench: Evaluating Refinement Capability of Language Models via Checklists
[08:23] 🦷 OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
【Follow us】You can also find us on the platform below for more beyond the podcast. Xiaohongshu: AI速递

12-01
09:33

【Weekend Special】The hottest AI papers of November, week 5 | Adaptive orthogonalization stabilizes training; GAM turns agentic memory into deep research

The 5 papers in this episode:
[00:51] TOP1(🔥161) | ⚡ ROOT: Robust Orthogonalized Optimizer for Neural Network Training
[02:35] TOP2(🔥141) | 🧠 General Agentic Memory Via Deep Research
[04:33] TOP3(🔥110) | 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[06:56] TOP4(🔥88) | 🎯 SAM 3: Segment Anything with Concepts
[09:12] TOP5(🔥88) | 🌍 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

11-29
11:54

2025.11.28 | Latent reward models cut time and VRAM; canvas-driven multimodal generation crushes SOTA

The 6 papers in this episode:
[00:19] 🎬 Video Generation Models Are Good Latent Reward Models
[01:07] 🎨 Canvas-to-Image: Compositional Image Generation with Multimodal Controls
[01:49] 🎨 MIRA: Multimodal Iterative Reasoning Agent for Image Editing
[02:30] 📊 Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
[03:12] 🧠 What does it mean to understand language?
[03:47] 🧠 Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

11-28
04:47

2025.11.27 | Russian multimodal evaluation fills a gap; latent collaboration speeds things up 14%

The 15 papers in this episode:
[00:22] 🔍 Multimodal Evaluation of Russian-language Architectures
[01:15] 🧠 Latent Collaboration in Multi-Agent Systems
[01:47] 🌍 Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
[02:18] 🎭 Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
[03:10] 📄 NVIDIA Nemotron Parse 1.1
[03:46] 🧠 Monet: Reasoning in Latent Visual Space Beyond Images and Language
[04:25] ⚡ Terminal Velocity Matching
[05:03] 📊 Revisiting Generalization Across Difficulty Levels: It's Not So Easy
[05:42] 🤖 MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
[06:25] ⚡ Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs
[06:59] 🎮 UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
[07:47] 🧩 SPHINX: A Synthetic Environment for Visual Perception and Reasoning
[08:33] ⚡ Block Cascading: Training Free Acceleration of Block-Causal Video Models
[09:12] 🏙 RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale
[09:58] 📊 I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation

11-27
11:03

2025.11.26 | An LLM-bred evolutionary framework goes open source; MedSAM-3 understands clinical language for precise segmentation

The 15 papers in this episode:
[00:17] 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[00:57] 🔬 MedSAM3: Delving into Segment Anything with Medical Concepts
[01:34] 🔍 Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
[02:03] 🎨 iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
[02:38] 🕺 SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
[03:18] 🔍 Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
[04:04] 🤖 GigaWorld-0: World Models as Data Engine to Empower Embodied AI
[04:44] 🎯 Soft Adaptive Policy Optimization
[05:14] 🎬 UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
[05:55] 🎯 SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
[06:51] 🎨 OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
[07:41] 🎬 ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
[08:13] 🖼 VQ-VA World: Towards High-Quality Visual Question-Visual Answering
[09:06] 🔍 HunyuanOCR Technical Report
[09:48] 🏙 MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

11-26
11:05

2025.11.25 | Just-in-time compilation makes memory lossless; AutoEnv auto-selects environments for a 20% gain

The 15 papers in this episode:
[00:25] 🧠 General Agentic Memory Via Deep Research
[00:52] 🧪 AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
[01:24] 🤖 Computer-Use Agents as Judges for Generative User Interface
[01:55] 🎨 DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
[02:24] 🎨 UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
[03:10] 🔍 DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
[03:46] 🎬 In-Video Instructions: Visual Signals as Generative Control
[04:24] 📊 Budget-Aware Tool-Use Enables Effective Agent Scaling
[05:12] 🎬 Plan-X: Instruct Video Generation via Semantic Planning
[05:54] 🧪 M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
[06:25] 🤖 Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
[07:24] 🎬 HunyuanVideo 1.5 Technical Report
[07:56] 🧠 Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
[08:36] 🧠 MIST: Mutual Information Via Supervised Training
[09:07] 🎨 Controllable Layer Decomposition for Reversible Multi-Layer Image Generation

11-25
10:01

2025.11.24 | An open 7B model sets a new bar for multimodal reasoning; GeoVista geolocates precisely with a small model

The 15 papers in this episode:
[00:21] 🧠 OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
[01:04] 🌍 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
[01:41] 🎯 SAM 3: Segment Anything with Concepts
[02:31] 📊 Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
[03:09] 🧠 O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
[03:43] 🦜 Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
[04:26] 🧠 RynnVLA-002: A Unified Vision-Language-Action and World Model
[05:19] 🧠 VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
[05:51] 🌍 WorldGen: From Text to Traversable and Interactive 3D Worlds
[06:34] 🎨 Loomis Painter: Reconstructing the Painting Process
[07:06] 🔮 Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
[07:48] 🎨 InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
[08:21] 🔬 OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
[09:07] 🧬 MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
[09:41] 🔍 Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

11-24
10:42

【Weekend Special】The hottest AI papers of November, week 4 | Kandinsky 5.0's open-source model family; MiroThinker's open-source agent

The 5 papers in this episode:
[00:41] TOP1(🔥171) | 🎨 Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
[02:02] TOP2(🔥150) | 🚀 MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
[04:31] TOP3(🔥127) | 🏅 P1: Mastering Physics Olympiads with Reinforcement Learning
[06:43] TOP4(🔥126) | 🍲 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
[08:09] TOP5(🔥104) | 🧠 VIDEOP2R: Video Understanding from Perception to Reasoning

11-22
10:19

2025.11.21 | V-ReasonBench tests video models' reasoning; Step-Audio-R1 makes speech stronger the more it "thinks"

The 15 papers in this episode:
[00:22] 📊 V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
[01:06] 🧠 Step-Audio-R1 Technical Report
[01:48] 🧭 Scaling Spatial Intelligence with Multimodal Foundation Models
[02:18] 🎬 First Frame Is the Place to Go for Video Content Customization
[02:49] 🎬 Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
[03:29] 🔮 SAM 3D: 3Dfy Anything in Images
[04:03] 🚀 MiMo-Embodied: X-Embodied Foundation Model Technical Report
[04:38] 🧠 Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
[05:10] 🏆 TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
[05:53] 🌀 Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
[06:26] 🚀 SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
[07:09] 🎬 TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
[07:46] 🔬 SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
[08:23] 🎨 NaTex: Seamless Texture Generation as Latent Color Diffusion
[08:58] 📐 PartUV: Part-Based UV Unwrapping of 3D Meshes

11-21
09:54

2025.11.20 | Video models film reasoning chains and ace mazes; label-free self-play lets vision models teach themselves

The 4 papers in this episode:
[00:23] 🎬 Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
[01:17] 🔄 VisPlay: Self-Evolving Vision-Language Models from Images
[01:54] 📚 ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
[02:45] 🦴 MHR: Momentum Human Rig

11-20
03:36

2025.11.19 | Pixel-space actors struggle to reason; misleading visuals put models to the test

The 11 papers in this episode:
[00:23] 🧠 Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
[01:03] 🕵 MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
[01:49] 🎞 REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
[03:02] 🧪 ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
[03:43] 🔍 Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
[04:16] 🤖 Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
[05:02] 🤖 Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
[05:32] ⚖ Mitigating Label Length Bias in Large Language Models
[06:14] 🧠 Agent READMEs: An Empirical Study of Context Files for Agentic Coding
[06:49] 🎧 Proactive Hearing Assistants that Isolate Egocentric Conversations
[07:20] 🎯 Error-Driven Scene Editing for 3D Grounding in Large Language Models

11-19
08:19

2025.11.18 | RL takes Olympiad gold; Uni-MoE 2.0 levels up across the board

The 14 papers in this episode:
[00:17] 🏅 P1: Mastering Physics Olympiads with Reinforcement Learning
[00:56] 🌐 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
[01:42] 🧩 Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
[02:22] 🧠 TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
[03:08] 🚀 GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning
[03:49] 🧩 PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
[04:28] 🌌 UFO³: Weaving the Digital Agent Galaxy
[04:59] 🍲 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
[05:38] 🌍 OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
[06:19] 🔄 Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
[06:51] 🚀 MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
[07:36] 🎯 Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models
[08:19] 🧠 WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance
[09:10] 🧬 Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

11-18
10:08

2025.11.17 | Denoising RoPE rescues long contexts; AI rapidly screens ionic liquids

The 13 papers in this episode:
[00:24] 🧹 DoPE: Denoising Rotary Position Embedding
[00:58] 🧪 AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery
[01:44] 🖼 UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
[02:20] 🚀 Virtual Width Networks
[02:56] ⚡ LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
[03:32] 🌐 Simulating the Visual World with Artificial Intelligence: A Roadmap
[04:12] 📐 GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
[05:00] 🧏 HI-TransPA: Hearing Impairments Translation Personal Assistant
[05:35] 🚀 MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
[06:38] 🎭 EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
[07:18] 🧭 SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
[07:55] 📊 Workload Schedulers -- Genesis, Algorithms and Differences
[08:51] 🚗 CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios

11-17
10:06

【Weekend Special】The hottest AI papers of November, week 3 | An open recipe for 3D game agents; desktop AI controlled precisely from few examples

The 5 papers in this episode:
[00:38] TOP1(🔥135) | 🌍 Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
[02:47] TOP2(🔥97) | 🖥 Grounding Computer Use Agents on Human Demonstrations
[04:44] TOP3(🔥89) | 🧠 Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
[06:33] TOP4(🔥84) | 🧠 HaluMem: Evaluating Hallucinations in Memory Systems of Agents
[08:56] TOP5(🔥67) | 🧩 IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

11-15
11:34

2025.11.14 | UniVA, a four-in-one open-source video generalist; Depth Anything 3 handles 3D with a single ViT

The 4 papers in this episode:
[00:24] 🎬 UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
[00:59] 🌐 Depth Anything 3: Recovering the Visual Space from Any Views
[01:50] 🔍 AlphaResearch: Accelerating New Algorithm Discovery with Language Models
[02:21] 🔍 MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

11-14
03:25

2025.11.13 | A 7B generalist AI forged from Genshin Impact data; training-free trajectories turn into a video remote control

The 9 papers in this episode:
[00:19] 🌍 Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
[00:54] 🎬 Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
[01:31] ⚡ TiDAR: Think in Diffusion, Talk in Autoregression
[02:15] 🔄 LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
[02:51] 🤖 WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
[03:33] 🖥 WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
[04:19] 🎯 Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
[04:55] 🤖 Agentic Refactoring: An Empirical Study of AI Coding Agents
[05:31] 🛡 Stemming Hallucination in Language Models Using a Licensing Oracle

11-13
06:28

2025.11.12 | A 1.5B model overtakes a 671B one; multi-agent quality control for chatbots

The 9 papers in this episode:
[00:24] 🧠 Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
[00:59] 🤝 Adaptive Multi-Agent Response Refinement in Conversational Systems
[01:30] 🧩 Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
[02:17] ⚡ KLASS: KL-Guided Fast Inference in Masked Diffusion Models
[02:53] 🖥 Grounding Computer Use Agents on Human Demonstrations
[03:37] 🎥 VideoSSR: Video Self-Supervised Reinforcement Learning
[04:19] 🚪 The Path Not Taken: RLVR Provably Learns Off the Principals
[05:14] 🔗 BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives
[05:56] 🤹 Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective

11-12
06:56

2025.11.11 | Small windows and frequent summaries set a new deep-research mark; casting a wide net before the hard problems unlocks competitive coding

The 13 papers in this episode:
[00:25] 🧩 IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
[01:16] 🏆 DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
[02:03] 🔬 The Station: An Open-World Environment for AI-Driven Discovery
[02:43] 🚀 RedOne 2.0: Rethinking Domain-specific LLM Post-Training in Social Networking Services
[03:15] 🧠 SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
[03:53] 🧭 Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
[04:30] 🔍 Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads
[05:10] 🎬 MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
[05:50] 🎨 MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
[06:57] 🔄 RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
[07:36] 🤖 Robot Learning from a Physical World Model
[08:21] 🛠 NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling
[08:52] 🚀 SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

11-11
09:58

2025.11.10 | DeepEyesV2, a small model that writes code while reading images; pure data gives AI stereoscopic vision

The 7 papers in this episode:
[00:21] 🧠 DeepEyesV2: Toward Agentic Multimodal Model
[01:13] 🧭 Visual Spatial Tuning
[01:54] 🦹 Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
[02:27] 🧠 Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
[03:13] 🪡 Jailbreaking in the Haystack
[03:48] 🎯 CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
[04:23] 🏃 Dense Motion Captioning

11-10
05:30

【Weekend Special】The hottest AI papers of November, week 2 | Video generation as reasoning; SVG sketches become code

The 5 papers in this episode:
[00:31] TOP1(🔥137) | 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[02:43] TOP2(🔥95) | 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[05:12] TOP3(🔥90) | 🚀 Diffusion Language Models are Super Data Learners
[07:18] TOP4(🔥88) | 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[09:24] TOP5(🔥79) | 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

11-08
12:07
