HuggingFace 每日AI论文速递
562 Episodes
[Sponsor] Listen to AI Weekly Talk on your commute. Every week it recaps the past week's major AI news. 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
[Contents] The 5 papers in this episode:
[00:40] TOP1 (🔥309) | 🧠 FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
[02:58] TOP2 (🔥302) | 🚁 CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
[05:23] TOP3 (🔥170) | 🛡 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
[07:56] TOP4 (🔥151) | 🎬 ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
[10:17] TOP5 (🔥147) | 🧠 Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
[Follow Us] You can also find us, with more than just podcast content, on Xiaohongshu: AI速递
[Contents] The 15 papers in this episode:
[00:41] 🔄 DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
[01:48] 🧠 The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
[02:45] 🧠 SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
[03:22] 🎮 Generative World Renderer
[04:09] 👁 EgoSim: Egocentric World Simulator for Embodied Interaction Generation
[05:24] 🧠 LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
[06:06] 🧠 Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
[06:47] 🚗 UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
[07:35] 🎯 Steerable Visual Representations
[08:12] 🎬 VOID: Video Object and Interaction Deletion
[09:06] 🤖 Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time
[09:47] 🚀 ASI-Evolve: AI Accelerates AI
[10:50] 🎭 Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
[11:36] 🤖 GPA: Learning GUI Process Automation from Demonstrations
[12:24] 🔍 VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
[Contents] The 15 papers in this episode:
[00:27] 🛡 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
[01:20] 💻 Terminal Agents Suffice for Enterprise Automation
[02:03] 📊 MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
[02:54] 🧠 ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
[03:40] 🔬 Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
[04:26] 📊 QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
[05:12] 🧠 Reasoning Shift: How Context Silently Shortens LLM Reasoning
[05:59] 📊 HippoCamp: Benchmarking Contextual Agents on Personal Computers
[06:52] 🧠 PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
[07:34] ⚡ Universal YOCO for Efficient Depth Scaling
[08:12] 🔄 Brevity Constraints Reverse Performance Hierarchies in Language Models
[08:48] 🧠 GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
[09:25] 📝 Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
[10:11] 🚀 Embarrassingly Simple Self-Distillation Improves Code Generation
[10:54] 🤖 Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
[Contents] The 15 papers in this episode:
[00:30] 🧠 FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
[01:12] 🧩 LongCat-Next: Lexicalizing Modalities as Discrete Tokens
[01:48] 🚁 CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
[02:31] 🧬 Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells
[03:33] 🤖 GEMS: Agent-Native Multimodal Generation with Memory and Skills
[04:12] 🎬 VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
[05:04] 🤖 Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
[05:45] 🔬 daVinci-LLM: Towards the Science of Pretraining
[06:19] 🎬 CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
[07:10] 🔍 MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
[07:58] 🧬 FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration
[08:46] 🏙 Extend3D: Town-Scale 3D Generation
[09:28] 💭 Think Anywhere in Code Generation
[10:18] ⚙ OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
[11:03] 🎨 VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing
[Contents] The 15 papers in this episode:
[00:30] 🚀 TAPS: Task Aware Proposal Distributions for Speculative Sampling
[01:11] 🔬 Towards a Medical AI Scientist
[02:03] 🔍 Gen-Searcher: Reinforcing Agentic Search for Image Generation
[02:43] ⚠ Emergent Social Intelligence Risks in Generative Multi-Agent Systems
[03:22] ⚙ EpochX: Building the Infrastructure for an Emergent Agent Civilization
[04:01] 📊 GEditBench v2: A Human-Aligned Benchmark for General Image Editing
[05:00] 🧠 On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
[05:56] 🔬 PRBench: End-to-end Paper Reproduction in Physics Research
[06:37] 🧠 Make Geometry Matter for Spatial Reasoning
[07:28] 🖼 ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
[08:18] 🎨 On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
[09:11] 🧠 MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
[09:55] ⚡ Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
[10:55] 🎯 ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
[12:07] 🔍 Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
[Contents] The 10 papers in this episode:
[00:28] 🎬 ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
[01:07] 🎬 PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
[01:54] 🧠 Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
[02:43] 📊 RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
[03:53] 🚗 LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
[04:42] 🧠 Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
[05:25] 🛠 Natural-Language Agent Harnesses
[06:10] 🎤 Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
[06:59] 🔬 MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies
[07:46] 🚀 Diffutron: A Masked Diffusion Language Model for Turkish Language
[Contents] The 5 papers in this episode:
[00:49] TOP1 (🔥124) | 🔍 MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
[03:11] TOP2 (🔥122) | 🧪 Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
[05:47] TOP3 (🔥114) | 🚀 Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
[07:54] TOP4 (🔥104) | 🎬 Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
[10:09] TOP5 (🔥104) | 🔗 HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
[Contents] The 15 papers in this episode:
[00:35] 😊 PixelSmile: Toward Fine-Grained Facial Expression Editing
[01:27] 🚀 Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
[02:10] 🖼 RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models
[02:52] 🖼 MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
[03:42] ⚙ Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
[04:25] 🗣 Voxtral TTS (an expressive, multilingual text-to-speech model with a hybrid architecture)
[05:03] 📉 SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
[05:49] 🧠 MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
[06:39] 🎬 AVControl: Efficient Framework for Training Audio-Visual Controls
[07:23] 🎨 Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
[08:10] 🔍 MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
[09:12] 🔍 Representation Alignment for Just Image Transformers is not Easier than You Think
[10:06] ⚡ S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
[10:46] 📊 FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
[11:35] 🔬 BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
[Contents] The 15 papers in this episode:
[00:27] 🎬 CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
[01:24] 🎬 EVA: Efficient Reinforcement Learning for End-to-End Video Agent
[02:05] 🛡 T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search
[02:50] 🤖 UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
[03:33] 🤔 Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
[04:20] 🎮 GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
[05:13] 🧠 When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
[06:11] 🤖 CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
[07:13] 🌀 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video
[07:54] 🎬 OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
[08:38] 🚗 Toward Physically Consistent Driving Video World Models under Challenging Trajectories
[09:18] 📊 Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments
[10:10] 🧠 Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
[10:53] 🤖 StreamingClaw Technical Report
[11:30] 🔍 LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
[Contents] The 15 papers in this episode:
[00:29] 🔍 MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
[01:18] 🎮 WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
[02:10] ⚡ SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
[02:59] 🎥 PEARL: Personalized Streaming Video Understanding Model
[03:46] 🔍 DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
[04:30] 📊 From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents
[05:13] 🤖 SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
[05:52] 🧠 UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
[06:45] 🎬 RealMaster: Lifting Rendered Scenes into Photorealistic Video
[07:32] 🤖 2Xplat: Two Experts Are Better Than One Generalist
[08:15] 🔍 Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
[09:03] 👁 Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
[09:57] 🎯 VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
[10:48] 🧠 ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
[11:40] 🤖 AgentSLR: Automating Systematic Literature Reviews in Epidemiology with Agentic AI
[Contents] The 15 papers in this episode:
[00:32] 🧪 Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
[01:13] 🚀 Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
[01:55] 🧠 LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
[02:42] 🔍 VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
[03:30] 🧠 SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
[04:10] 🎯 F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
[05:03] 🎬 Manifold-Aware Exploration for Reinforcement Learning in Video Generation
[05:56] ⚖ mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT
[06:46] 🧠 Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
[07:35] 🔄 Repurposing Geometric Foundation Models for Multi-view Diffusion
[08:21] 🤖 RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
[09:15] 🔍 OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
[10:02] 💭 BubbleRAG: Evidence-Driven Retrieval-Augmented Generation for Black-Box Knowledge Graphs
[10:54] ⚖ SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
[11:43] 🧭 On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
[Contents] The 15 papers in this episode:
[00:31] 🔗 HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
[01:28] 🎬 Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
[02:06] 🛰 TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
[02:56] 🔍 ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
[03:45] 🎬 LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
[04:50] 🏠 FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
[05:35] 🧠 The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
[06:20] 🎯 A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
[07:02] 🔍 How Well Does Generative Recommendation Generalize?
[07:48] 🌍 WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
[08:24] ⚡ BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
[09:05] 🚀 Hyperagents (self-editing, metacognitive, self-improving agents)
[09:54] 🎬 HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
[10:37] 🎬 EgoForge: Goal-Directed Egocentric World Simulator
[11:50] 🎬 Versatile Editing of Video Content, Actions, and Dynamics without Training
[Contents] The 5 papers in this episode:
[00:42] TOP1 (🔥344) | 🧠 Demystifing Video Reasoning
[02:20] TOP2 (🔥289) | 🏭 InCoder-32B: Code Foundation Model for Industrial Scenarios
[04:37] TOP3 (🔥263) | 🧠 AI Can Learn Scientific Taste
[06:33] TOP4 (🔥238) | 🗣 SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
[08:31] TOP5 (🔥166) | 🤖 MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
[Contents] The 15 papers in this episode:
[00:29] 🧠 Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
[01:09] 🎬 SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
[01:45] ⚡ FASTER: Rethinking Real-Time Flow VLAs
[02:30] 🎬 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
[03:31] 🤖 Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
[04:21] 🤖 MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
[05:13] 🧩 Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
[05:47] 📊 LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
[06:42] 🧠 Memento-Skills: Let Agents Design Agents
[07:18] 🌍 F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
[08:00] 🧠 Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
[08:54] 🧠 Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
[09:45] 🎬 EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
[10:58] 🔧 VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
[11:39] 🗣 MOSS-TTS Technical Report
[Contents] The 15 papers in this episode:
[00:30] 🔮 Video-CoE: Reinforcing Video Event Prediction via Chain of Events
[01:13] 🧬 MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
[02:01] 🧠 MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
[02:55] ⚖ Alignment Makes Language Models Normative, Not Descriptive
[03:42] 🧠 Complementary Reinforcement Learning
[04:33] 🤖 Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
[05:24] 🤖 GigaWorld-Policy: An Efficient Action-Centered World--Action Model
[06:07] 🎬 Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
[06:54] 🤖 When AI Navigates the Fog of War (a temporal case study of the early stages of the 2026 Middle East conflict)
[07:49] 🧩 LoST: Level of Semantics Tokenization for 3D Shapes
[08:21] 🧠 BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs
[09:09] 🧠 ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
[09:47] 🤖 Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting
[10:46] 🎥 Stereo World Model: Camera-Guided Stereo Video Generation
[11:32] 🧠 AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
[Contents] The 15 papers in this episode:
[00:29] 🤖 MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
[01:10] 🏭 InCoder-32B: Code Foundation Model for Industrial Scenarios
[02:08] 🧠 Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
[02:50] 🤖 Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
[03:28] 🧠 Demystifing Video Reasoning
[04:26] 🎮 WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
[05:26] 🧠 TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas
[06:12] 🤔 Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
[07:02] 🔄 Online Experiential Learning for Language Models
[07:54] 📊 FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
[08:47] 🚀 Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
[09:30] 🧭 WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
[10:20] 🔍 AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
[11:03] 🎨 SegviGen: Repurposing 3D Generative Model for Part Segmentation
[11:59] 🗣 SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
[Contents] The 15 papers in this episode:
[00:29] 🧠 AI Can Learn Scientific Taste
[01:13] 🔍 OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
[02:06] 🏢 EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
[03:00] 🌆 Grounding World Simulation Models in a Real-World Metropolis
[03:53] 🤖 HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
[04:39] 🧠 Attention Residuals
[05:38] 🧠 Mixture-of-Depths Attention
[06:44] 🧠 Effective Distillation to Hybrid xLSTM Architectures
[07:23] 🔍 Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
[08:14] 🎬 ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
[08:54] 🚀 POLCA: Stochastic Generative Optimization with LLM
[10:00] 🤖 Safe and Scalable Web Agent Learning via Recreated Websites
[10:45] 🔍 Make it SING: Analyzing Semantic Invariants in Classifiers
[11:28] ⏱ TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
[12:30] 🎬 WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics
[Contents] The 15 papers in this episode:
[00:28] 🧠 LMEB: Long-horizon Memory Embedding Benchmark
[01:12] 🔄 Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
[01:59] 🐳 daVinci-Env: Open SWE Environment Synthesis at Scale
[02:46] 🔍 Can Vision-Language Models Solve the Shell Game?
[03:26] ⚡ OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
[04:14] 🎯 Visual-ERM: Reward Modeling for Visual Equivalence
[05:11] 🔍 MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
[06:18] 🌉 V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration
[07:05] 🔍 Multimodal OCR: Parse Anything from Documents
[07:49] 🧠 Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
[08:22] ⚠ HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
[09:13] 🔍 From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space
[09:59] ⚡ HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration
[11:04] 🧠 Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation
[11:54] 🎬 VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
[Contents] The 5 papers in this episode:
[00:50] TOP1 (🔥136) | 🎨 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
[02:57] TOP2 (🔥104) | 🐧 Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
[04:30] TOP3 (🔥90) | 🤖 OpenClaw-RL: Train Any Agent Simply by Talking
[06:24] TOP4 (🔥81) | 📖 Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
[08:02] TOP5 (🔥77) | 🧠 Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
[Contents] The 15 papers in this episode:
[00:32] 🧠 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
[01:17] 🤔 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
[02:11] ⚡ IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
[02:54] 🎬 Video-Based Reward Modeling for Computer-Use Agents
[03:55] 🎬 DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
[04:46] 🎯 Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
[05:40] 🎬 DVD: Deterministic Video Depth Estimation with Generative Priors
[06:29] 🖼 WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
[07:29] 🎬 ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
[08:24] 🧠 GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
[09:08] 🎬 EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
[09:55] ⚡ One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
[10:46] 🤖 OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
[11:29] 🧠 EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
[12:37] 🧠 XSkill: Continual Learning from Experience and Skills in Multimodal Agents







