
MAS-Orchestra

Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

A training-time framework that learns to build an entire multi-agent system at once using function-calling reinforcement learning, enabling global system-level reasoning instead of sequential code execution. Together with MASBENCH, a benchmark that characterizes task structure along five axes, our study shows when multi-agent systems truly outperform single-agent systems, and MAS-Orchestra delivers consistent gains on math, multi-hop QA, and search-based tasks.

Training-Time RL Framework · Goal-Oriented Sub-agents · Holistic Orchestration · Degree of MAS (DoM) · Rigorous Analysis of MAS Benefits and Limitations

Salesforce AI Research

Abstract

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design often under-deliver. These shortcomings stem from two key factors: (1) methodological complexity—agent orchestration is performed using sequential, code-level execution that limits global system-level reasoning and scales poorly with agent complexity—and (2) efficacy uncertainty—MAS are deployed without understanding whether they provide tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis shows that MAS gains depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH support better training and understanding of MAS in the pursuit of multi-agent intelligence.
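To make the formulation concrete, here is a minimal Python sketch of the function-calling abstraction, assuming a hypothetical sub-agent registry and plan format (the names and fields are illustrative, not the paper's actual API): each goal-oriented sub-agent is exposed as an opaque callable schema, and the orchestrator emits the entire MAS plan in one shot.

from typing import Any

# Hypothetical sub-agent registry; the names and fields are illustrative.
SUB_AGENT_SCHEMAS = [
    {"name": "search_agent",
     "description": "Answer a sub-question with web search.",
     "parameters": {"sub_task": "str", "depends_on": "list[str]"}},
    {"name": "reflexion_agent",
     "description": "Solve a sub-task with self-reflection.",
     "parameters": {"sub_task": "str", "depends_on": "list[str]"}},
]

def validate_mas_plan(plan: list[dict[str, Any]]) -> None:
    """Check that a holistically generated plan only calls known
    sub-agents and that every dependency points to an earlier node."""
    known = {schema["name"] for schema in SUB_AGENT_SCHEMAS}
    seen: set[str] = set()
    for node in plan:
        assert node["agent"] in known, f"unknown sub-agent: {node['agent']}"
        assert all(dep in seen for dep in node["depends_on"]), "bad dependency"
        seen.add(node["id"])

# The orchestrator emits the whole system at once (holistic orchestration)
# instead of interleaving code execution step by step.
plan = [
    {"id": "s1", "agent": "search_agent", "sub_task": "find fact X", "depends_on": []},
    {"id": "s2", "agent": "search_agent", "sub_task": "find fact Y", "depends_on": []},
    {"id": "r1", "agent": "reflexion_agent", "sub_task": "combine s1 and s2",
     "depends_on": ["s1", "s2"]},
]
validate_mas_plan(plan)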

Key Contributions

Novel RL Formulation for MAS Orchestration

MAS-Orchestra introduces a novel, scalable, and effective RL formulation for MAS orchestration, featuring an explicit notion of Degree of MAS (DoM) and a function-calling abstraction that encapsulates complex sub-agents.

Benchmark for MAS Evaluation

MAS-Orchestra introduces MASBENCH, a controlled benchmark tailored for MAS evaluation that enables systematic empirical analysis of when MAS outperform single-agent systems—and when they do not. To the best of our knowledge, this is the first benchmark designed to evaluate the benefits of MAS.

Analysis on MAS vs. SAS

MAS-Orchestra investigates the benefits of MAS across three analysis directions over a broad range of MAS configurations, covering three orchestrator settings and five sub-agent settings across different model sizes and families.

Improving Performance over Strong Baselines

MAS-Orchestra demonstrates that our approach achieves strong performance on public benchmarks, including math, multi-hop question answering, and multi-step search-based QA.

Contrast with Existing Work

Figure: Approach comparison. Comparison of different approaches to multi-agent system design.
Figure: Paradigm comparison. Inference-time orchestration systems typically adopt holistic orchestration, but without training. MAS-Orchestra belongs to the automatic-MAS family and formulates the problem as a function-calling RL problem with holistic orchestration.

MAS-Orchestra Approach

Figure: MAS-Orchestra overview. When DoM is configured to be low (dashed lines), the system instantiates at most one agent; when DoM is high, the number of agents is unconstrained.
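As a rough illustration of the gating described in the caption, a DoM setting can be read as a cap on how many sub-agents a plan may instantiate. The helper below is an assumption for exposition, not the released implementation.

def max_agents(dom: str) -> float:
    # Low DoM: at most one agent, so the plan degenerates to delegating
    # the whole task to a single sub-agent (the dashed-line case above).
    if dom == "low":
        return 1
    # High DoM: the number of agents is unconstrained.
    return float("inf")

def satisfies_dom(plan: list[dict], dom: str) -> bool:
    return len(plan) <= max_agents(dom)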

Factors Affecting MAS Performance

(a) A Five-Axis Evaluation Framework

Figure: Five-axis evaluation framework. Summary of the 5 axes. Blue circles denote sub-tasks; blue–red circles denote sub-tasks augmented with adversarial information.

(b) MASBench Benchmark

The resulting benchmark covers all five axes, with axis values ranging from 2 to 12, and provides axis-specific training and test splits (counts in examples):

Axis         Train   Test
Depth        3,993   1,195
Horizon      2,174     567
Breadth      2,000     676
Parallel     1,807     567
Robustness   3,000     600
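For concreteness, a MASBENCH item might be represented as follows. The field names are hypothetical; only the axis names, value range, and split sizes come from the table above.

from dataclasses import dataclass

@dataclass
class MASBenchItem:
    axis: str        # "Depth" | "Horizon" | "Breadth" | "Parallel" | "Robustness"
    axis_value: int  # structural complexity along the axis, from 2 to 12
    question: str
    answer: str
    split: str       # "train" | "test"

# Split sizes from the table above, as (train, test) counts.
SPLIT_SIZES = {
    "Depth": (3993, 1195),
    "Horizon": (2174, 567),
    "Breadth": (2000, 676),
    "Parallel": (1807, 567),
    "Robustness": (3000, 600),
}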

(c) How Different Evaluation Axes Affect MAS

Figure: Sub-agent analysis. Avg@8 (accuracy) for SAS (CoT) and MAS-Orchestra across different axes, with Qwen-7b as the orchestrator and Qwen-7b or GPT-120b (low) as the sub-agent.
Figure: Robustness analysis. Avg@8 in the Robustness setting with GPT-120b (low) as the sub-agent. (SAS performance is too low to be visible.)

Observations

When the sub-agent is “weaker” (Qwen-7b), the MAS (MAS-Orchestra) outperforms the SAS across most task structures, except along the Depth axis.

Surprisingly, when the sub-agent is “stronger” (GPT-120b (low)), a different pattern emerges—performance gains for MAS diminish across the Depth, Horizon, Breadth, and Parallel structures.

As tasks become more complex, performance decreases. Importantly, out-of-distribution complexity follows the same trend as in-distribution complexity.

MAS consistently exhibit superior robustness under data poisoning. While SAS performance collapses to near-zero accuracy in this adversarial setting, MAS retain substantially higher accuracy.

Takeaway

MAS are most effective at the edge of sub-agent competence. MAS provide clear gains over SAS when the underlying sub-agent is capable but not yet strong enough to reliably internalize complex task structure on its own. In this regime (e.g., task structure is not purely sequential or the task contains potential adversarial sub-tasks), explicit decomposition, orchestration and moderation help expose and utilize latent reasoning capacity.

(d) Are Reasoning Models Better Orchestrators?

Figure: LLM vs. RLM comparison. Avg@8 comparing an LLM and an RLM as the orchestrator.
Figure: Generated-MAS statistics. Statistics of agents in the generated MAS over training steps (using Depth = 4 as an example). The number of agents counts the total sub-agents; the sequential agent length measures the length of the dependency chain; the parallel agent width measures the in-degree of a sub-agent.
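The three statistics in the caption are straightforward to compute on the dependency DAG of a generated MAS. The sketch below (function names are mine, not the paper's) treats the MAS as a mapping from each sub-agent to the sub-agents it depends on.

def mas_statistics(deps: dict[str, list[str]]) -> dict[str, int]:
    # Number of agents: total sub-agents in the generated MAS.
    num_agents = len(deps)

    # Sequential agent length: length of the longest dependency chain.
    memo: dict[str, int] = {}
    def chain_length(node: str) -> int:
        if node not in memo:
            memo[node] = 1 + max((chain_length(d) for d in deps[node]), default=0)
        return memo[node]
    seq_length = max((chain_length(n) for n in deps), default=0)

    # Parallel agent width: maximum in-degree of any sub-agent,
    # i.e., how many sub-agents feed into a single node.
    par_width = max((len(d) for d in deps.values()), default=0)

    return {"num_agents": num_agents,
            "sequential_length": seq_length,
            "parallel_width": par_width}

# Example: two parallel searches feeding one aggregator.
print(mas_statistics({"s1": [], "s2": [], "agg": ["s1", "s2"]}))
# -> {'num_agents': 3, 'sequential_length': 2, 'parallel_width': 2}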

Observations

Surprisingly, an instruction-tuned LLM as the orchestrator outperforms strong RLM counterparts.

Takeaway

Instruction-tuned LLMs can be better orchestrators than RLMs. Effective orchestration prioritizes task decomposition, delegation, and coordination over directly solving sub-tasks. Current RLM training objectives emphasize end-to-end problem solving rather than structural control, making them misaligned with the requirements of orchestration (especially when paired with a strong sub-agent). However, the gains diminish as the sub-agent becomes stronger.

(e) How Does Sub-agent Reasoning Effort Affect MAS?

Figure: Reasoning-effort analysis. Avg@8 comparing different reasoning efforts, with Qwen-7b as the orchestrator and GPT-120b as the sub-agent.
Figure: Maximum-length analysis. Avg@8 comparing different maximum context lengths, with Qwen-7b as the orchestrator and GPT-120b as the sub-agent.

Observations

Higher reasoning effort is not necessarily better. We consistently observe that increasing reasoning effort makes it easier to hit the length limit (a default response length of 512 tokens was used), especially as task complexity grows. When the limit is reached, both MAS and SAS experience performance degradation—MAS are not immune to this effect.

Takeaway (from the two plots above)

More reasoning effort is not always better. From the Reasoning Effort and Max Context Length plots above, pushing for longer reasoning increases the chance of hitting length limits (e.g., 512 tokens), which hurts both MAS and SAS. Effective context management likely requires dedicated training and better budgeting—not simply increasing reasoning length.
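The failure mode is easy to state as a budget check. The numbers below are illustrative; only the 512-token default comes from the analysis above.

def fits_budget(reasoning_tokens: int, answer_tokens: int,
                max_response_tokens: int = 512) -> bool:
    # If reasoning plus answer exceeds the response cap, the output is
    # truncated and the final answer may be lost entirely.
    return reasoning_tokens + answer_tokens <= max_response_tokens

assert fits_budget(300, 100)        # low effort: the answer survives
assert not fits_budget(500, 100)    # high effort: the answer is cut off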

Evaluation on Public Benchmarks

Table: Main results. Avg@8 on the considered benchmarks. “—” indicates non-applicable.

Observation

MAS-Orchestra consistently outperforms the underlying sub-agents as well as SoTA inference-time and training-time orchestration systems across all evaluated benchmarks, and it demonstrates strong OOD generalization on GPQA.

BrowseComp+ (high DoM) results

Figure: Statistics of agents for high DoM (BrowseComp+).

Observation

Under high DoM, MAS-Orchestra generates substantially more sub-agents. It learns to invoke SEARCHAGENT and to employ multiple parallel search processes, typically using 3 to 4 parallel searches per question.
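The fan-out pattern looks roughly like the asyncio sketch below; search_agent here is a stand-in for the system's SEARCHAGENT sub-agent, whose real interface is not shown on this page.

import asyncio

async def search_agent(query: str) -> str:
    await asyncio.sleep(0)  # placeholder for a real retrieval call
    return f"evidence for: {query}"

async def fan_out(queries: list[str]) -> list[str]:
    # Launch the sub-searches concurrently and gather their evidence,
    # mirroring the 3-4 parallel searches observed per question.
    return await asyncio.gather(*(search_agent(q) for q in queries))

evidence = asyncio.run(fan_out(
    ["sub-question 1", "sub-question 2", "sub-question 3"]))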

AIME24 (low DoM) results

Figure: Statistics of agents for low DoM (AIME24).

Observation

Under low DoM, MAS-Orchestra learns to delegate the task entirely to a single sub-agent (100% delegation after 20 steps) and dynamically selects strong sub-agents, primarily REFLEXIONAGENT and DEBATEAGENT, which are the best-performing SAS baselines.

Takeaway

Taken together, the observations under low and high DoM indicate that MAS-Orchestra dynamically adapts to the given task by proposing MAS designs that align with the underlying sub-task structure and by delegating execution to the most effective agent configurations.

Generated MAS Examples

Browse real, generated multi-agent designs. Select an example to preview it inline, or open it in a new tab for the full interactive view.


BibTeX

@misc{ke2025masr1,
  title={MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
  author={Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}