
MAS-Zero

Designing Multi-Agent Systems with Zero Supervision

The first inference-time-only framework that meta-designs agent teams through self-evolved planning, feedback, and verification loops.

Meta-design · Inference-time Scaling · No Training & Validation Set · Dynamic Agent Composition

📍 Lightning talk - Salesforce Booth #1129
Wed Dec. 3, 4:20–4:45 PM PST · San Diego Convention Center

📍 Lightning talk - SEA Workshop
Sun Dec. 7, Upper Level Room 23ABC · San Diego Convention Center

Salesforce AI Research

Abstract

Multi-agent systems (MAS) leveraging Large Language Models hold enormous promise, yet most current designs rely on manually specified roles and protocols that fail to align with LLM strengths or adapt to new tasks. Automatic approaches reduce this burden but usually require validation sets, stay static at inference, and cannot gracefully collapse into simpler solutions. We introduce MAS-Zero, the first self-evolved, inference-time framework for automatic MAS design. MAS-Zero iteratively designs, critiques, and refines MAS configurations tailored to each instance, using meta-feedback on solvability, completeness, and, when beneficial, reduction to simpler systems. Experiments across reasoning (math, graduate-level QA), coding, and agentic (search-based) benchmarks with both open- and closed-source LLM backbones show that MAS-Zero surpasses strong manual and automatic baselines, delivering accuracy gains of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks while staying cost-efficient.

Key Contributions

Inference-Time-Only Framework

MAS-Zero is the first automatic MAS system that runs entirely at inference time—no precomputed validation set or outcome supervision—while still inventing bespoke agent hierarchies per instance.

State-of-the-Art Automatic MAS

The meta-design + verification loop delivers substantial accuracy gains over strong manual and automatic baselines across reasoning, coding, and agentic tasks while remaining cost efficient.

Comprehensive Evaluation & Insights

Benchmarks spanning multiple domains, difficulty levels, and both open- and closed-source LLM backbones surface key insights about meta-iterations, verifier strength, and structure selection.

Contrast with Existing Work

Manual MAS design vs. existing automatic MAS design vs. MAS-Zero. MAS-Zero keeps humans out of the loop by automatically inventing the structure, evaluating it, and self-verifying results.

Approach at a Glance

MAS-Zero runs a three-stage meta-loop every time it confronts a new question, continually refining both structure and answers without any offline supervision; a minimal code sketch of the loop follows the three stages below.

1. MAS-Init

Instantiate a library of established building blocks (CoT, CoT-SC, Debate, Self-Refine) as executable code and run them to seed a diverse pool of candidate solutions.

2. MAS-Evolve

Iteratively generate MAS code that reconfigures agent roles, task decompositions, and communication, scoring each design on solvability and completeness.

3. MAS-Verify

Use a verifier to consolidate all intermediate solutions and surface the most reliable final answer.
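To make the loop concrete, here is a minimal Python sketch of the three stages under stated assumptions: it presumes a generic `llm(prompt) -> str` callable, and every helper name and prompt below is illustrative rather than the actual MAS-Zero API.

```python
"""Minimal sketch of the MAS-Zero meta-loop (illustrative, not the real API)."""
from dataclasses import dataclass

BUILDING_BLOCKS = ("CoT", "CoT-SC", "Debate", "Self-Refine")

@dataclass
class Candidate:
    design: str   # the MAS configuration (as code/text) that produced the answer
    answer: str   # the intermediate solution, kept for MAS-Verify

def mas_zero(llm, question: str, n_iter: int = 3) -> str:
    # 1. MAS-Init: execute each established building block to seed
    #    a diverse pool of candidate solutions.
    candidates = [
        Candidate(block, llm(f"[{block}] Solve: {question}"))
        for block in BUILDING_BLOCKS
    ]

    # 2. MAS-Evolve: iteratively propose new MAS code (agent roles,
    #    task decomposition, communication), execute it, and collect
    #    meta-feedback on solvability and completeness.
    design = candidates[0].design
    for _ in range(n_iter):
        answer = llm(f"Execute this MAS design.\nDesign: {design}\nQuestion: {question}")
        candidates.append(Candidate(design, answer))
        # Meta-feedback steers the next iteration and may collapse the
        # design back to a simpler system when that is beneficial.
        design = llm(
            "Critique this design for solvability and completeness; return a "
            f"(possibly simpler) revised design.\nDesign: {design}\nAnswer: {answer}"
        )

    # 3. MAS-Verify: consolidate all intermediate solutions and let the
    #    verifier surface the most reliable final answer.
    options = "\n".join(f"({i}) {c.answer}" for i, c in enumerate(candidates))
    pick = llm(
        f"Question: {question}\nCandidates:\n{options}\n"
        "Reply with only the index of the most reliable answer."
    )
    return candidates[int(pick.strip().strip('()'))].answer
```

Because every intermediate solution is retained for verification, a bad late-iteration design does not discard earlier good answers, which is one reason the meta-loop can degrade gracefully.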

MAS-Zero Overview
Purple highlights the given input and final output, while orange highlights the MAS-Zero components and steps. Dashed arrows show information flow within Meta-feedback. MAS-Zero consumes the question and building blocks, then solves the task via three stages: MAS-Init, MAS-Evolve, and MAS-Verify.

Illustrated Workflow

MAS-Zero Detailed Overview
MAS-Zero in action: prompts, code generation, execution feedback, and verification woven together into a single inference-time pipeline.

Main Results

MAS-Zero main results table
MAS-Zero establishes a new Pareto frontier across reasoning, coding, and agentic benchmarks, outperforming strong manual and automatic MAS baselines while remaining cost-efficient. Color highlights distinguish single-agent, manual MAS, validation-pruning automatic MAS, validation-generation automatic MAS, training-based automatic MAS, and our method. For fair comparison with validation-based baselines, each benchmark is split into 20% validation and 80% test; methods without validation (including MAS-Zero) are evaluated on the same 80% split. “×” marks zero accuracy when a validation-selected MAS fails, and “↑” reports MAS-Zero’s improvement over that baseline.
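For readers who want to replicate the evaluation protocol from the caption above, here is a minimal sketch of the 20%/80% split. The uniform random shuffle and fixed seed are our assumptions for reproducibility, not details taken from the paper.

```python
# Sketch of the evaluation split: 20% validation for baselines that need
# it, 80% test for every method (validation-free methods, including
# MAS-Zero, are scored only on the test split).
import random

def split_benchmark(instances, val_frac=0.2, seed=0):
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)          # assumed: uniform random split
    cut = int(len(idx) * val_frac)
    val = [instances[i] for i in idx[:cut]]
    test = [instances[i] for i in idx[cut:]]
    return val, test
```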

Further Analysis

Visual breakdowns that showcase how MAS-Zero restructures workflows, improves the accuracy-cost frontier, scales with more iterations, and benefits from stronger verifiers.

MAS-Zero ablation study
Ablation study: removing MAS-Init, MAS-Evolve, or MAS-Verify shows that verification drives the largest gains, especially when paired with simple yet strong single-agent or manual baselines.
MAS moment example
“MAS moment” example: MAS-Zero reorganizes the workflow across iterations to crack a challenging reasoning task.
Pareto frontier
Pareto frontier: MAS-Zero delivers higher accuracy at lower cost compared to both manual and automatic MAS baselines.
Iteration performance
Performance steadily increases with more meta-iterations, showcasing the value of inference-time scaling.
Upper bound with verification
Oracle verification reveals the headroom of structural improvements—automatic baselines cannot capitalize on external verifiers.

Key Takeaways

1. MAS-Zero Is Effective Across Domains & Agents

MAS-Zero consistently improves reasoning, coding, and agentic benchmarks while adapting to both open- and closed-source LLMs, indicating robustness across domains and backbone choices.

2. When to Use MAS Is Important

The surprising strength of single-agent methods (CoT, CoT-SC) and lightweight manual MAS (Debate, Self-Refine) shows that MAS-Verify is the critical stage: an oracle verifier unlocks the largest performance jump (see the sketch after these takeaways).

3. Sub-Agent Capability Is the Bottleneck

Better meta-agents yield gains, but sub-agent strength ultimately caps improvements; enhancing solver tools and base models is key to further progress.
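To illustrate the second takeaway, here is a minimal sketch of how the oracle-verification upper bound can be measured; the data layout, exact-match scoring, and the `verifier` callable are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative measurement of the oracle-verification upper bound
# (takeaway 2). Data layout and exact-match scoring are assumptions.

def oracle_accuracy(candidates_per_q, gold):
    # An oracle verifier succeeds whenever ANY candidate answer for a
    # question matches the gold answer; this bounds what any real
    # verifier could achieve over the same candidate pool.
    hits = sum(any(c == g for c in cands)
               for cands, g in zip(candidates_per_q, gold))
    return hits / len(gold)

def verifier_accuracy(candidates_per_q, gold, verifier):
    # A real verifier (e.g., an LLM judge) picks one candidate per
    # question; its gap to oracle_accuracy is the headroom a stronger
    # verifier could still unlock.
    picks = [verifier(cands) for cands in candidates_per_q]
    return sum(p == g for p, g in zip(picks, gold)) / len(gold)
```

For example, `oracle_accuracy([["a", "b"], ["c"]], ["b", "d"])` returns 0.5: the first question has a correct candidate in its pool, the second does not.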

BibTeX

@misc{ke2025maszero,
  title={MAS-Zero: Designing Multi-Agent Systems with Zero Supervision},
  author={Zixuan Ke and Austin Xu and Yifei Ming and Xuan-Phi Nguyen and Caiming Xiong and Shafiq Joty},
  year={2025},
  eprint={2505.14996},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14996},
}