A training-time framework that learns to build an entire multi-agent system at once via function-calling reinforcement learning, enabling global system-level reasoning instead of sequential code execution. Together with MASBENCH, a controlled benchmark that characterizes task structure along five axes, our study shows when multi-agent systems truly outperform single-agent systems, and the framework delivers consistent gains on math, multi-hop QA, and search-based tasks.
Salesforce AI Research
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design often under-deliver. These shortcomings stem from two key factors: (1) methodological complexity—agent orchestration is performed using sequential, code-level execution that limits global system-level reasoning and scales poorly with agent complexity—and (2) efficacy uncertainty—MAS are deployed without understanding whether they provide tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis shows that MAS gains depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH support better training and understanding of MAS in the pursuit of multi-agent intelligence.
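To make the function-calling abstraction concrete, here is a minimal sketch of how a goal-oriented sub-agent could be exposed to the orchestrator as a callable function, so the orchestrator reasons over system structure while the sub-agent's internal execution stays hidden. This is an illustrative sketch under assumed names and schema fields, not the MAS-Orchestra implementation.

```python
# Illustrative sketch (not the MAS-Orchestra code): a complex sub-agent wrapped
# as a callable function. The orchestrator only sees the name, description, and
# argument schema; the sub-agent's internal multi-step execution stays hidden.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class SubAgentTool:
    name: str                   # e.g., "SearchAgent", "ReflexionAgent" (names assumed)
    description: str            # what the sub-agent is good at, shown to the orchestrator
    run: Callable[[str], str]   # hidden logic: takes a sub-task string, returns a result

def as_function_schema(tool: SubAgentTool) -> Dict[str, Any]:
    """Render the sub-agent as a function-calling schema the orchestrator LLM can invoke."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": {
                "type": "object",
                "properties": {"subtask": {"type": "string"}},
                "required": ["subtask"],
            },
        },
    }

# The orchestrator is prompted with these schemas and emits, in one pass, the set of
# calls that defines the whole MAS; a reward on the final answer then drives RL over
# that holistic plan rather than over step-by-step code execution.
```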
MAS-Orchestra introduces a novel, scalable and effective RL formulation for MAS orchestration, featuring an explicit notion of DoM and a function-calling abstraction that encapsulates complex sub-agents.
MAS-Orchestra introduces MASBENCH, a controlled benchmark tailored for MAS evaluation that enables systematic empirical analysis of when MAS outperform single-agent systems and when they do not. To the best of our knowledge, this is the first benchmark designed to evaluate the benefits of MAS.
MAS-Orchestra investigates the benefits of MAS along three analysis directions over a broad range of MAS configurations, covering three orchestrator settings and five sub-agent settings that span different model sizes and families.
MAS-Orchestra demonstrates that our approach achieves strong performance on public benchmarks, including math, multi-hop question answering, and multi-step search-based QA.
The resulting benchmark covers all five axes, with axis values ranging from 2 to 12, and provides axis-specific training and test splits.
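For intuition, an axis-controlled task instance and an axis-specific split could be represented roughly as follows; the field names, value ranges, and split rule here are assumptions for illustration, not the released MASBENCH format.

```python
# Rough illustration (assumed format): each task is parameterized along the five
# MASBENCH axes, and train/test splits are built per axis so that held-out tasks
# can be strictly more complex along the split axis (the OOD setting).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MASBenchTask:
    question: str
    answer: str
    depth: int       # length of the required sequential reasoning chain
    horizon: int     # number of interaction steps before the task resolves
    breadth: int     # number of distinct sub-problems to cover
    parallel: int    # how many sub-problems can be handled concurrently
    robustness: int  # amount of adversarial / poisoned content injected

def axis_split(tasks: List[MASBenchTask], axis: str, train_max: int
               ) -> Tuple[List[MASBenchTask], List[MASBenchTask]]:
    """Axis-specific split: train on tasks with axis value <= train_max, test on the rest."""
    train = [t for t in tasks if getattr(t, axis) <= train_max]
    test = [t for t in tasks if getattr(t, axis) > train_max]
    return train, test
```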
When the sub-agent is “weaker” (Qwen-7b), the MAS (MAS-Orchestra) outperforms the SAS across most task structures, except along the Depth axis.
Surprisingly, when the sub-agent is “stronger” (GPT-120b (low)), a different pattern emerges: the performance gains of MAS diminish across the Depth, Horizon, Breadth, and Parallel structures.
As tasks become more complex, performance decreases. Importantly, out-of-distribution complexity follows the same trend as in-distribution complexity.
MAS consistently exhibit superior robustness under data poisoning. While SAS performance collapses to near-zero accuracy in this adversarial setting, MAS retain substantially higher accuracy.
MAS are most effective at the edge of sub-agent competence. MAS provide clear gains over SAS when the underlying sub-agent is capable but not yet strong enough to reliably internalize complex task structure on its own. In this regime (e.g., when the task structure is not purely sequential or the task contains potentially adversarial sub-tasks), explicit decomposition, orchestration, and moderation help expose and utilize latent reasoning capacity.
Surprisingly, an instruction-tuned LLM used as the orchestrator outperforms its strong reasoning language model (RLM) counterparts.
Instruction-tuned LLMs can be better orchestrators than RLMs. Effective orchestration prioritizes task decomposition, delegation, and coordination over directly solving sub-tasks. Current RLM training objectives emphasize end-to-end problem solving rather than structural control, making them misaligned with the requirements of orchestration (especially when paired with a strong sub-agent). However, the gains diminish as the sub-agent becomes stronger.
Higher reasoning effort is not necessarily better. We consistently observe that increasing reasoning effort makes it easier to hit the length limit (a default response length of 512 tokens was used), especially as task complexity grows. When the limit is reached, both MAS and SAS experience performance degradation—MAS are not immune to this effect.
More reasoning effort is not always better. From the Reasoning Effort and Max Context Length plots above, pushing for longer reasoning increases the chance of hitting length limits (e.g., 512 tokens), which hurts both MAS and SAS. Effective context management likely requires dedicated training and better budgeting—not simply increasing reasoning length.
MAS-Orchestra consistently outperforms the underlying sub-agents as well as SoTA inference-time and training-time orchestration systems across all evaluated benchmarks, and it demonstrates strong OOD generalization on GPQA.
Under high DoM, MAS-Orchestra generates substantially more sub-agents. It learns to invoke SEARCHAGENT and employs multiple parallel search processes, typically using 3 to 4 parallel searches per question.
Under low DoM, MAS-Orchestra learns to delegate the task entirely to a single sub-agent (100% delegation after 20 steps) and dynamically selects strong sub-agents, primarily REFLEXIONAGENT and DEBATEAGENT, which are the best-performing SAS baselines.
Taken together, the observations under low and high DoM indicate that MAS-Orchestra dynamically adapts to the given task by proposing MAS designs that align with the underlying sub-task structure and by delegating execution to the most effective agent configurations.
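For intuition, the orchestration plans described above might look roughly like the following; the call format and sub-task wording are illustrative assumptions, with only the agent names taken from the examples above.

```python
# Illustrative shapes of holistic plans (not actual MAS-Orchestra outputs).

# High DoM: decompose the question and fan out several parallel SearchAgent calls,
# whose retrieved evidence is then merged into a final answer.
high_dom_plan = [
    {"agent": "SearchAgent", "subtask": "Retrieve evidence for the first hop of the question"},
    {"agent": "SearchAgent", "subtask": "Retrieve evidence for the second hop of the question"},
    {"agent": "SearchAgent", "subtask": "Retrieve evidence linking the two hops"},
]

# Low DoM: delegate the whole task to a single strong sub-agent.
low_dom_plan = [
    {"agent": "ReflexionAgent", "subtask": "Solve the original question end to end"},
]
```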
Browse real, generated multi-agent designs.
@misc{ke2025masr1,
title={MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
author={Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
year={2026},
archivePrefix={arXiv},
primaryClass={cs.CL},
}