
MAS-Orchestra

Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

A training-time framework that learns to build an entire multi-agent system at once using function-calling reinforcement learning, enabling global system-level reasoning instead of sequential code execution. Together with MASBENCH, a benchmark that characterizes task structure along five axes, our study shows when multi-agent systems truly outperform single-agent systems, and MAS-Orchestra delivers consistent gains on math, multi-hop QA, and search-based tasks.

Training-Time RL Framework · Goal-Oriented Sub-agents · Holistic Orchestration · Degree of MAS (DoM) · Rigorous Analysis of MAS Benefits and Limitations

Salesforce AI Research

Abstract

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design often under-deliver. These shortcomings stem from two key factors: (1) methodological complexity—agent orchestration is performed using sequential, code-level execution that limits global system-level reasoning and scales poorly with agent complexity—and (2) efficacy uncertainty—MAS are deployed without understanding whether they provide tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis shows that MAS gains depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH support better training and understanding of MAS in the pursuit of multi-agent intelligence.
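To make the formulation concrete, here is a minimal Python sketch of the function-calling abstraction, assuming a hypothetical sub-agent registry and plan format (the names and fields are illustrative, not the paper's actual API): each goal-oriented sub-agent is exposed as an opaque callable schema, and the orchestrator emits the entire MAS plan in one shot.

from typing import Any

# Hypothetical sub-agent registry; the names and fields are illustrative.
SUB_AGENT_SCHEMAS = [
    {"name": "search_agent",
     "description": "Answer a sub-question with web search.",
     "parameters": {"sub_task": "str", "depends_on": "list[str]"}},
    {"name": "reflexion_agent",
     "description": "Solve a sub-task with self-reflection.",
     "parameters": {"sub_task": "str", "depends_on": "list[str]"}},
]

def validate_mas_plan(plan: list[dict[str, Any]]) -> None:
    """Check that a holistically generated plan only calls known
    sub-agents and that every dependency points to an earlier node."""
    known = {schema["name"] for schema in SUB_AGENT_SCHEMAS}
    seen: set[str] = set()
    for node in plan:
        assert node["agent"] in known, f"unknown sub-agent: {node['agent']}"
        assert all(dep in seen for dep in node["depends_on"]), "bad dependency"
        seen.add(node["id"])

# The orchestrator emits the whole system at once (holistic orchestration)
# instead of interleaving code execution step by step.
plan = [
    {"id": "s1", "agent": "search_agent", "sub_task": "find fact X", "depends_on": []},
    {"id": "s2", "agent": "search_agent", "sub_task": "find fact Y", "depends_on": []},
    {"id": "r1", "agent": "reflexion_agent", "sub_task": "combine s1 and s2",
     "depends_on": ["s1", "s2"]},
]
validate_mas_plan(plan)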

Key Contributions

Novel RL Formulation for MAS Orchestration

MAS-Orchestra introduces a novel, scalable, and effective RL formulation for MAS orchestration, featuring an explicit notion of Degree of MAS (DoM) and a function-calling abstraction that encapsulates complex sub-agents.

Benchmark for MAS Evaluation

MAS-Orchestra introduces MASBENCH, a controlled benchmark tailored for MAS evaluation that enables systematic empirical analysis of when MAS outperform single-agent systems—and when they do not. To the best of our knowledge, this is the first benchmark designed to evaluate the benefits of MAS.

Analysis on MAS vs. SAS

MAS-Orchestra investigates the benefits of MAS across three analysis directions over a broad range of MAS configurations, covering three orchestrator settings and five sub-agent settings across different model sizes and families.

Improving Performance over Strong Baselines

MAS-Orchestra demonstrates that our approach achieves strong performance on public benchmarks, including math, multi-hop question answering, and multi-step search-based QA.

Contrast with Existing Work

Figure: Approach comparison. Comparison of different approaches to multi-agent system design.
Figure: Paradigm comparison. Inference-time orchestration systems typically adopt holistic orchestration, but without training. MAS-Orchestra belongs to the automatic-MAS family and formulates the problem as a function-calling RL problem with holistic orchestration.

MAS-Orchestra Approach

Figure: MAS-Orchestra overview. When DoM is configured to be low (dashed lines), the system instantiates at most one agent; when DoM is high, the number of agents is unconstrained.
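As a rough illustration of the gating described in the caption, a DoM setting can be read as a cap on how many sub-agents a plan may instantiate. The helper below is an assumption for exposition, not the released implementation.

def max_agents(dom: str) -> float:
    # Low DoM: at most one agent, so the plan degenerates to delegating
    # the whole task to a single sub-agent (the dashed-line case above).
    if dom == "low":
        return 1
    # High DoM: the number of agents is unconstrained.
    return float("inf")

def satisfies_dom(plan: list[dict], dom: str) -> bool:
    return len(plan) <= max_agents(dom)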

Factors Affecting MAS Performance

(a) A Five-Axis Evaluation Framework

Figure: Five-axis evaluation framework. Summary of the 5 axes. Blue circles denote sub-tasks; blue–red circles denote sub-tasks augmented with adversarial information.

(b) MASBench Benchmark

The resulting benchmark covers all five axes, with axis values ranging from 2 to 12, and provides axis-specific training and test splits (counts in examples):

Axis         Train   Test
Depth        3,993   1,195
Horizon      2,174     567
Breadth      2,000     676
Parallel     1,807     567
Robustness   3,000     600
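For concreteness, a MASBENCH item might be represented as follows. The field names are hypothetical; only the axis names, value range, and split sizes come from the table above.

from dataclasses import dataclass

@dataclass
class MASBenchItem:
    axis: str        # "Depth" | "Horizon" | "Breadth" | "Parallel" | "Robustness"
    axis_value: int  # structural complexity along the axis, from 2 to 12
    question: str
    answer: str
    split: str       # "train" | "test"

# Split sizes from the table above, as (train, test) counts.
SPLIT_SIZES = {
    "Depth": (3993, 1195),
    "Horizon": (2174, 567),
    "Breadth": (2000, 676),
    "Parallel": (1807, 567),
    "Robustness": (3000, 600),
}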

(c) How Different Evaluation Axes Affect MAS

Figure: Sub-agent analysis. Avg@8 (accuracy) for SAS (CoT) and MAS-Orchestra across different axes, with Qwen-7b as the orchestrator and Qwen-7b or GPT-120b (low) as the sub-agent.
Figure: Robustness analysis. Avg@8 in the Robustness setting with GPT-120b (low) as the sub-agent. (SAS performance is too low to be visible.)

Observations

When the sub-agent is “weaker” (Qwen-7b), the MAS (MAS-Orchestra) outperforms the SAS across most task structures, except along the Depth axis.

Surprisingly, when the sub-agent is “stronger” (GPT-120b (low)), a different pattern emerges—performance gains for MAS diminish across the Depth, Horizon, Breadth, and Parallel structures.

As tasks become more complex, performance decreases. Importantly, out-of-distribution complexity follows the same trend as in-distribution complexity.

MAS consistently exhibit superior robustness under data poisoning. While SAS performance collapses to near-zero accuracy in this adversarial setting, MAS retain substantially higher accuracy.

Takeaway

MAS are most effective at the edge of sub-agent competence. MAS provide clear gains over SAS when the underlying sub-agent is capable but not yet strong enough to reliably internalize complex task structure on its own. In this regime (e.g., task structure is not purely sequential or the task contains potential adversarial sub-tasks), explicit decomposition, orchestration and moderation help expose and utilize latent reasoning capacity.

(d) Are Reasoning Models Better Orchestrators?

Figure: LLM vs. RLM comparison. Avg@8 comparing an LLM and an RLM as the orchestrator.
Figure: Generated-MAS statistics. Statistics of agents in the generated MAS over training steps (using Depth = 4 as an example). The number of agents counts the total sub-agents; the sequential agent length measures the length of the dependency chain; the parallel agent width measures the in-degree of a sub-agent.
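The three statistics in the caption are straightforward to compute on the dependency DAG of a generated MAS. The sketch below (function names are mine, not the paper's) treats the MAS as a mapping from each sub-agent to the sub-agents it depends on.

def mas_statistics(deps: dict[str, list[str]]) -> dict[str, int]:
    # Number of agents: total sub-agents in the generated MAS.
    num_agents = len(deps)

    # Sequential agent length: length of the longest dependency chain.
    memo: dict[str, int] = {}
    def chain_length(node: str) -> int:
        if node not in memo:
            memo[node] = 1 + max((chain_length(d) for d in deps[node]), default=0)
        return memo[node]
    seq_length = max((chain_length(n) for n in deps), default=0)

    # Parallel agent width: maximum in-degree of any sub-agent,
    # i.e., how many sub-agents feed into a single node.
    par_width = max((len(d) for d in deps.values()), default=0)

    return {"num_agents": num_agents,
            "sequential_length": seq_length,
            "parallel_width": par_width}

# Example: two parallel searches feeding one aggregator.
print(mas_statistics({"s1": [], "s2": [], "agg": ["s1", "s2"]}))
# -> {'num_agents': 3, 'sequential_length': 2, 'parallel_width': 2}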

Observations

Surprisingly, an instruction-tuned LLM as the orchestrator outperforms strong RLM counterparts.

Takeaway

Instruction-tuned LLMs can be better orchestrators than RLMs. Effective orchestration prioritizes task decomposition, delegation, and coordination over directly solving sub-tasks. Current RLM training objectives emphasize end-to-end problem solving rather than structural control, making them misaligned with the requirements of orchestration (especially when paired with a strong sub-agent). However, the gains diminish as the sub-agent becomes stronger.

(e) How Does Sub-agent Reasoning Effort Affect MAS?

Figure: Reasoning-effort analysis. Avg@8 comparing different reasoning efforts, with Qwen-7b as the orchestrator and GPT-120b as the sub-agent.
Figure: Maximum-length analysis. Avg@8 comparing different maximum context lengths, with Qwen-7b as the orchestrator and GPT-120b as the sub-agent.

Observations

Higher reasoning effort is not necessarily better. We consistently observe that increasing reasoning effort makes it easier to hit the length limit (a default response length of 512 tokens was used), especially as task complexity grows. When the limit is reached, both MAS and SAS experience performance degradation—MAS are not immune to this effect.

Takeaway (from the two plots above)

More reasoning effort is not always better. From the Reasoning Effort and Max Context Length plots above, pushing for longer reasoning increases the chance of hitting length limits (e.g., 512 tokens), which hurts both MAS and SAS. Effective context management likely requires dedicated training and better budgeting—not simply increasing reasoning length.
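The failure mode is easy to state as a budget check. The numbers below are illustrative; only the 512-token default comes from the analysis above.

def fits_budget(reasoning_tokens: int, answer_tokens: int,
                max_response_tokens: int = 512) -> bool:
    # If reasoning plus answer exceeds the response cap, the output is
    # truncated and the final answer may be lost entirely.
    return reasoning_tokens + answer_tokens <= max_response_tokens

assert fits_budget(300, 100)        # low effort: the answer survives
assert not fits_budget(500, 100)    # high effort: the answer is cut off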

Evaluation on Public Benchmarks

Table: Main results. Avg@8 on the considered benchmarks. “—” indicates non-applicable.

Observation

MAS-Orchestra consistently outperforms the underlying sub-agents as well as SoTA inference-time and training-time orchestration systems across all evaluated benchmarks, and it demonstrates strong OOD generalization on GPQA.

BrowseComp+ (high DoM) results

Figure: Statistics of agents for high DoM (BrowseComp+).

Observation

Under high DoM, MAS-Orchestra generates substantially more sub-agents. It learns to invoke SEARCHAGENT and to employ multiple parallel search processes, typically using 3 to 4 parallel searches per question.
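The fan-out pattern looks roughly like the asyncio sketch below; search_agent here is a stand-in for the system's SEARCHAGENT sub-agent, whose real interface is not shown on this page.

import asyncio

async def search_agent(query: str) -> str:
    await asyncio.sleep(0)  # placeholder for a real retrieval call
    return f"evidence for: {query}"

async def fan_out(queries: list[str]) -> list[str]:
    # Launch the sub-searches concurrently and gather their evidence,
    # mirroring the 3-4 parallel searches observed per question.
    return await asyncio.gather(*(search_agent(q) for q in queries))

evidence = asyncio.run(fan_out(
    ["sub-question 1", "sub-question 2", "sub-question 3"]))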

AIME24 (low DoM) results

Figure: Statistics of agents for low DoM (AIME24).

Observation

Under low DoM, MAS-Orchestra learns to delegate the task entirely to a single sub-agent (100% delegation after 20 steps) and dynamically selects strong sub-agents, primarily REFLEXIONAGENT and DEBATEAGENT, which are the best-performing SAS baselines.

Takeaway

Taken together, the observations under low and high DoM indicate that MAS-Orchestra dynamically adapts to the given task by proposing MAS designs that align with the underlying sub-task structure and by delegating execution to the most effective agent configurations.

Generated MAS Examples

Browse real, generated multi-agent designs. Select an example to preview it inline, or open it in a new tab for the full interactive view.


BibTeX

@misc{ke2025masr1,
  title={MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
  author={Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}