A training-time framework that learns to build an entire multi-agent system at once using function-calling reinforcement learning, enabling global system-level reasoning instead of sequential code execution. Together with MASBENCH, a benchmark that measures task structure across five axes, our study shows when multi-agent systems truly outperform single-agent systems and delivers consistent gains on math, multi-hop QA, and search-based tasks.
Salesforce AI Research
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design often under-deliver. These shortcomings stem from two key factors: (1) methodological complexity—agent orchestration is performed using sequential, code-level execution that limits global system-level reasoning and scales poorly with agent complexity—and (2) efficacy uncertainty—MAS are deployed without understanding whether they provide tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis shows that MAS gains depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH support better training and understanding of MAS in the pursuit of multi-agent intelligence.
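The function-calling abstraction at the heart of MAS-Orchestra can be sketched in a few lines. Everything below is a hypothetical illustration: the sub-agent names (`search_agent`, `solver_agent`), the plan format, and the `run_mas` helper are invented for exposition and are not the paper's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub-agents: complex, goal-oriented sub-agents are abstracted
# as callable functions; these trivial stand-ins only show the interface.
def search_agent(query: str) -> str:
    return f"evidence for: {query}"

def solver_agent(task: str) -> str:
    return f"answer to: {task}"

SUB_AGENTS = {"search_agent": search_agent, "solver_agent": solver_agent}

def run_mas(plan: list) -> list:
    """Execute an entire MAS design (a list of sub-agent calls) at once.

    Each call names a sub-agent and its input. Internal execution details
    stay hidden behind the function interface, so an orchestrator reasons
    only over the system structure, not over code-level execution.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(SUB_AGENTS[c["name"]], c["input"]) for c in plan]
        return [f.result() for f in futures]

# A trained orchestrator would emit a whole plan like this as one
# holistic function-calling action, rather than step-by-step code:
plan = [
    {"name": "search_agent", "input": "sub-question A"},
    {"name": "search_agent", "input": "sub-question B"},
    {"name": "solver_agent", "input": "combine evidence and answer"},
]
```

The key design point, under these assumptions, is that the orchestrator's action space is the plan itself: the whole system is generated in one shot, and sub-agent internals never enter the orchestrator's context.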
MAS-Orchestra introduces a novel, scalable, and effective RL formulation for MAS orchestration, featuring an explicit notion of DoM and a function-calling abstraction that encapsulates complex sub-agents.
MAS-Orchestra introduces MASBENCH, a controlled benchmark tailored for MAS evaluation that enables systematic empirical analysis of when MAS outperform single-agent systems and when they do not. To the best of our knowledge, this is the first benchmark designed to evaluate the benefits of MAS.
MAS-Orchestra investigates the benefits of MAS across three analysis directions over a broad range of MAS configurations, covering three orchestrator settings and five sub-agent settings across different model sizes and families.
MAS-Orchestra demonstrates that our approach achieves strong performance on public benchmarks, including math, multi-hop question answering, and multi-step search-based QA.
The resulting benchmark covers all five axes, with axis values ranging from 2 to 12, and provides axis-specific training and test splits:
When the sub-agent is “weaker” (Qwen-7b), the MAS (MAS-Orchestra) outperforms the SAS across most sub-task structures, except along the Depth axis.
When the sub-agent is “stronger” (GPT-120b (low)), the performance gains of MAS diminish across the Depth, Horizon, Breadth, and Parallel axes.
MAS are most effective at the edge of sub-agent competence. MAS provide clear gains over SAS when the underlying sub-agent is capable but not yet strong enough to reliably internalize complex task structure on its own. In this regime (e.g., task structure is not purely sequential or the task contains potential adversarial sub-tasks), explicit decomposition, orchestration and moderation help expose and utilize latent reasoning capacity.
Instruction-tuned LLM orchestrator initialization outperforms RLM initialization. This is surprising, as RLMs have been shown to outperform LLMs on many reasoning tasks. However, prior work has not systematically examined their effectiveness as orchestrators.
A closer inspection reveals that the RLM tends to solve the task itself first and then delegate it to a single simple sub-agent, even when that sub-agent is stronger at solving the task. Effective orchestration prioritizes task decomposition, delegation, and coordination over directly solving sub-tasks. Current RLM training objectives emphasize end-to-end problem solving rather than structural control, making them misaligned with the requirements of orchestration.
MAS are not immune to exceeding maximum context length limits under higher reasoning effort. We consistently observe that increasing reasoning effort makes the length limit easier to hit (a default response length of 512 tokens was used), especially as task complexity grows; once the limit is reached, both MAS and SAS suffer performance degradation.
MAS improve Robustness under high reasoning effort when context lengths are longer. Adversarial robustness remains the primary regime where MAS provide consistent gains, including when sub-agents operate with high reasoning effort. In contrast, effective context management likely requires dedicated training rather than increased reasoning effort alone.
MAS-Orchestra consistently outperforms the underlying sub-agents as well as SoTA inference-time and training-time orchestration systems across all evaluated benchmarks, and it demonstrates strong OOD generalization on GPQA.
Under high DoM, MAS-Orchestra generates substantially more sub-agents. It learns to invoke SEARCHAGENT and employs multiple parallel search processes, typically using 3 to 4 parallel searches per question.
Under low DoM, MAS-Orchestra learns to delegate the task entirely to a single sub-agent (100% delegation after 20 steps) and dynamically selects strong sub-agents, primarily REFLEXIONAGENT and DEBATEAGENT, which are the best-performing SAS baselines.
Taken together, the observations under low and high DoM indicate that MAS-Orchestra dynamically adapts to the given task by proposing MAS designs that align with the underlying sub-task structure and by delegating execution to the most effective agent configurations.
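The two learned behaviors above can be rendered as minimal plan structures. The dict-based plan format below is invented for exposition; only the agent names and the delegation patterns follow the observations reported here.

```python
# Hypothetical plan structures mirroring the reported behaviors; the
# format is illustrative, not the system's actual representation.

# High DoM: multiple parallel searches per question (typically 3 to 4).
high_dom_plan = [
    {"agent": "SearchAgent", "input": "sub-question 1"},
    {"agent": "SearchAgent", "input": "sub-question 2"},
    {"agent": "SearchAgent", "input": "sub-question 3"},
]

# Low DoM: full delegation of the task to one strong sub-agent.
low_dom_plan = [
    {"agent": "ReflexionAgent", "input": "the original question"},
]
```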
@misc{Ke2026MASOrchestra,
title = {MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
author = {Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
year = {2026},
eprint = {2601.14652},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
note = {Preprint; Work in Progress},
}