Open Data Synthesis For Deep Research

Update: 2025-09-05

Description

🤗 Upvotes: 37 | cs.CL, cs.AI

Authors:

Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu

Title:

Open Data Synthesis For Deep Research

Arxiv:

http://arxiv.org/abs/2509.00375v1

Abstract:

Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning or knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets at https://github.com/VectorSpaceLab/InfoSeek.
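
To make the hierarchical formulation more concrete, here is a minimal Python sketch of a tree of constraints solved bottom-up. It is only an illustration under stated assumptions and is not the InfoSeek implementation: the `KB` facts, the `Node` class, and the `solve` function are all hypothetical names invented for this example. An intermediate node (the city) is never named directly; it is described only by its own constraints, which is roughly what the abstract calls blurring a node into a sub-problem.

```python
from dataclasses import dataclass, field

# Toy knowledge base (entirely made up): entity -> set of (relation, value) facts.
KB = {
    "alice": {("field", "physics"), ("born_in", "paris")},
    "bob": {("field", "physics"), ("born_in", "rome")},
    "paris": {("country", "france")},
    "rome": {("country", "italy")},
}


@dataclass
class Node:
    """One vertex of a hypothetical research tree."""
    # Direct constraints: (relation, value) pairs the answer entity must satisfy.
    leaf_constraints: list = field(default_factory=list)
    # Nested constraints: (relation, Node) pairs whose value is itself unknown
    # and must be found by solving the child node first.
    children: list = field(default_factory=list)


def solve(node):
    """Return the set of entities that satisfy every constraint under `node`."""
    candidates = set(KB)
    for relation, value in node.leaf_constraints:
        candidates = {e for e in candidates if (relation, value) in KB[e]}
    for relation, child in node.children:
        child_answers = solve(child)  # resolve the sub-problem first
        candidates = {
            e for e in candidates
            if any((relation, a) in KB[e] for a in child_answers)
        }
    return candidates


# "Which physicist was born in a city located in France?"
city = Node(leaf_constraints=[("country", "france")])
root = Node(leaf_constraints=[("field", "physics")], children=[("born_in", city)])
answer = solve(root)
```

In this toy setup the root question cannot be answered from the flat constraint alone, since both physicists satisfy it; the nested city sub-problem has to be resolved to narrow the candidates to a single entity.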


Jingwen Liang, Gengyu Wang
