
MAS-Zero

Designing Multi-Agent Systems with Zero Supervision

The first inference-time-only framework that meta-designs agent teams through self-evolved planning, feedback, and verification loops.

Meta-design · Inference-time Scaling · No Training & Validation Set · Dynamic Agent Composition

📍 Lightning talk - Salesforce Booth #1129
Wed Dec. 3, 4:20–4:45 PM PST · San Diego Convention Center

📍 Lightning talk - SEA Workshop
Sun Dec. 7, Upper Level Room 23ABC · San Diego Convention Center

Salesforce AI Research

Abstract

Multi-agent systems (MAS) leveraging Large Language Models hold enormous promise, yet most current designs rely on manually specified roles and protocols that fail to align with LLM strengths or adapt to new tasks. Automatic approaches reduce this burden but usually require validation sets, stay static at inference, and cannot gracefully collapse into simpler solutions. We introduce MAS-Zero, the first self-evolved, inference-time framework for automatic MAS design. MAS-Zero iteratively designs, critiques, and refines MAS configurations tailored to each instance, using meta-feedback on solvability, completeness, and, when beneficial, reduction to simpler systems. Experiments across reasoning (math, graduate-level QA), coding, and agentic (search-based) benchmarks with both open- and closed-source LLM backbones show that MAS-Zero surpasses strong manual and automatic baselines, delivering accuracy gains of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks while staying cost-efficient.

Key Contributions

Inference-Time-Only Framework

MAS-Zero is the first automatic MAS system that runs entirely at inference time—no precomputed validation set or outcome supervision—while still inventing bespoke agent hierarchies per instance.

State-of-the-Art Automatic MAS

The meta-design + verification loop delivers substantial accuracy gains over strong manual and automatic baselines across reasoning, coding, and agentic tasks while remaining cost efficient.

Comprehensive Evaluation & Insights

Benchmarks spanning multiple domains, difficulty levels, and both open- and closed-source LLM backbones surface key insights about meta-iterations, verifier strength, and structure selection.

Contrast with Existing Work

Manual MAS design vs. existing automatic MAS design vs. MAS-Zero. MAS-Zero keeps humans out of the loop by automatically inventing the structure, evaluating it, and self-verifying results.

Approach at a Glance

MAS-Zero runs a three-stage meta-loop every time it confronts a new question, continually refining both structure and answers without any offline supervision; a minimal code sketch of the loop follows the three stages below.

1. MAS-Init

Instantiate a library of established building blocks (CoT, CoT-SC, Debate, Self-Refine) as executable code and run them to seed a diverse pool of candidate solutions.

2. MAS-Evolve

Iteratively generate MAS code that reconfigures agent roles, task decompositions, and communication, scoring each design on solvability and completeness.

3. MAS-Verify

Use a verifier to consolidate all intermediate solutions and surface the most reliable final answer.
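To make the loop concrete, here is a minimal Python sketch of the three stages under stated assumptions: it presumes a generic `llm(prompt) -> str` callable, and every helper name and prompt below is illustrative rather than the actual MAS-Zero API.

```python
"""Minimal sketch of the MAS-Zero meta-loop (illustrative, not the real API)."""
from dataclasses import dataclass

BUILDING_BLOCKS = ("CoT", "CoT-SC", "Debate", "Self-Refine")

@dataclass
class Candidate:
    design: str   # the MAS configuration (as code/text) that produced the answer
    answer: str   # the intermediate solution, kept for MAS-Verify

def mas_zero(llm, question: str, n_iter: int = 3) -> str:
    # 1. MAS-Init: execute each established building block to seed
    #    a diverse pool of candidate solutions.
    candidates = [
        Candidate(block, llm(f"[{block}] Solve: {question}"))
        for block in BUILDING_BLOCKS
    ]

    # 2. MAS-Evolve: iteratively propose new MAS code (agent roles,
    #    task decomposition, communication), execute it, and collect
    #    meta-feedback on solvability and completeness.
    design = candidates[0].design
    for _ in range(n_iter):
        answer = llm(f"Execute this MAS design.\nDesign: {design}\nQuestion: {question}")
        candidates.append(Candidate(design, answer))
        # Meta-feedback steers the next iteration and may collapse the
        # design back to a simpler system when that is beneficial.
        design = llm(
            "Critique this design for solvability and completeness; return a "
            f"(possibly simpler) revised design.\nDesign: {design}\nAnswer: {answer}"
        )

    # 3. MAS-Verify: consolidate all intermediate solutions and let the
    #    verifier surface the most reliable final answer.
    options = "\n".join(f"({i}) {c.answer}" for i, c in enumerate(candidates))
    pick = llm(
        f"Question: {question}\nCandidates:\n{options}\n"
        "Reply with only the index of the most reliable answer."
    )
    return candidates[int(pick.strip().strip('()'))].answer
```

Because every intermediate solution is retained for verification, a bad late-iteration design does not discard earlier good answers, which is one reason the meta-loop can degrade gracefully.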

MAS-Zero Overview
Purple highlights the given input and final output, while orange highlights the MAS-Zero components and steps. Dashed arrows show information flow within Meta-feedback. MAS-Zero consumes the question and building blocks, then solves the task via three stages: MAS-Init, MAS-Evolve, and MAS-Verify.

Illustrated Workflow

MAS-Zero Detailed Overview
MAS-Zero in action: prompts, code generation, execution feedback, and verification woven together into a single inference-time pipeline.

Main Results

MAS-Zero main results table
MAS-Zero establishes a new Pareto frontier across reasoning, coding, and agentic benchmarks, outperforming strong manual and automatic MAS baselines while remaining cost-efficient. Color highlights distinguish single-agent, manual MAS, validation-pruning automatic MAS, validation-generation automatic MAS, training-based automatic MAS, and our method. For fair comparison with validation-based baselines, each benchmark is split into 20% validation and 80% test; methods without validation (including MAS-Zero) are evaluated on the same 80% split. “×” marks zero accuracy when a validation-selected MAS fails, and “↑” reports MAS-Zero’s improvement over that baseline.
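For readers who want to replicate the evaluation protocol from the caption above, here is a minimal sketch of the 20%/80% split. The uniform random shuffle and fixed seed are our assumptions for reproducibility, not details taken from the paper.

```python
# Sketch of the evaluation split: 20% validation for baselines that need
# it, 80% test for every method (validation-free methods, including
# MAS-Zero, are scored only on the test split).
import random

def split_benchmark(instances, val_frac=0.2, seed=0):
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)          # assumed: uniform random split
    cut = int(len(idx) * val_frac)
    val = [instances[i] for i in idx[:cut]]
    test = [instances[i] for i in idx[cut:]]
    return val, test
```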

Further Analysis

Visual breakdowns that showcase how MAS-Zero restructures workflows, improves the accuracy-cost frontier, scales with more iterations, and benefits from stronger verifiers.

MAS-Zero ablation study
Ablation study: removing MAS-Init, MAS-Evolve, or MAS-Verify shows that verification drives the largest gains, especially when paired with simple yet strong single-agent or manual baselines.
MAS moment example
“MAS moment” example: MAS-Zero reorganizes the workflow across iterations to crack a challenging reasoning task.
Pareto frontier
Pareto frontier: MAS-Zero delivers higher accuracy at lower cost compared to both manual and automatic MAS baselines.
Iteration performance
Performance steadily increases with more meta-iterations, showcasing the value of inference-time scaling.
Upper bound with verification
Oracle verification reveals the headroom of structural improvements—automatic baselines cannot capitalize on external verifiers.

Key Takeaways

1. MAS-Zero Is Effective Across Domains & Agents

MAS-Zero consistently improves reasoning, coding, and agentic benchmarks while adapting to both open- and closed-source LLMs, indicating robustness across domains and backbone choices.

2. When to Use MAS Is Important

The surprising strength of single-agent methods (CoT, CoT-SC) and lightweight manual MAS (Debate, Self-Refine) shows that MAS-Verify is the critical stage: an oracle verifier unlocks the largest performance jump (see the sketch after these takeaways).

3. Sub-Agent Capability Is the Bottleneck

Better meta-agents yield gains, but sub-agent strength ultimately caps improvements; enhancing solver tools and base models is key to further progress.
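To illustrate the second takeaway, here is a minimal sketch of how the oracle-verification upper bound can be measured; the data layout, exact-match scoring, and the `verifier` callable are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative measurement of the oracle-verification upper bound
# (takeaway 2). Data layout and exact-match scoring are assumptions.

def oracle_accuracy(candidates_per_q, gold):
    # An oracle verifier succeeds whenever ANY candidate answer for a
    # question matches the gold answer; this bounds what any real
    # verifier could achieve over the same candidate pool.
    hits = sum(any(c == g for c in cands)
               for cands, g in zip(candidates_per_q, gold))
    return hits / len(gold)

def verifier_accuracy(candidates_per_q, gold, verifier):
    # A real verifier (e.g., an LLM judge) picks one candidate per
    # question; its gap to oracle_accuracy is the headroom a stronger
    # verifier could still unlock.
    picks = [verifier(cands) for cands in candidates_per_q]
    return sum(p == g for p, g in zip(picks, gold)) / len(gold)
```

For example, `oracle_accuracy([["a", "b"], ["c"]], ["b", "d"])` returns 0.5: the first question has a correct candidate in its pool, the second does not.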

BibTeX

@misc{ke2025maszero,
  title={MAS-Zero: Designing Multi-Agent Systems with Zero Supervision},
  author={Zixuan Ke and Austin Xu and Yifei Ming and Xuan-Phi Nguyen and Caiming Xiong and Shafiq Joty},
  year={2025},
  eprint={2505.14996},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14996},
}