AFLOW: Automating Agentic Workflow Generation

Paper Info:

Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Venue: ICLR 2025
Year: 2025
Code URL: https://github.com/FoundationAgents/AFlow
Pages: 38

Research Summary

Large Language Models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, yet their effectiveness depends heavily on manually designed agentic workflows: structured sequences of LLM invocations accompanied by detailed instructions. The central problem this paper addresses is the significant human effort required to construct and refine these workflows, which limits the scalability and generalizability of LLMs to new domains. Each new task demands careful orchestration of multiple LLM calls, precise prompt engineering, and logical structuring of dependencies, creating a bottleneck that hinders wider adoption of agentic systems.

The key intellectual contribution of AFLOW lies in reframing workflow optimization not as a manual design task, but as a search problem over a vast space of code-represented workflows. This perspective shift transforms the challenge from "how do humans design good workflows?" to "how can an algorithm efficiently discover optimal workflows?" The authors introduce a framework that employs Monte Carlo Tree Search (MCTS) to systematically explore this space, iteratively refining workflows through code modification, tree-structured experience preservation, and execution feedback. This approach departs from prior methods that either relied on fixed workflow templates or used simple linear search strategies.

The paper makes three main contributions that form a coherent research narrative. First, it formalizes the workflow optimization problem, generalizing prior approaches as specific cases within a unified framework. This theoretical foundation establishes a common language for future research at both the node and workflow optimization levels. Second, it introduces AFLOW, an MCTS-based method that automatically discovers effective workflows across multiple domains with minimal human intervention. The framework's novelty lies in its soft mixed-probability selection mechanism, LLM-driven expansion, and experience backpropagation — all designed to handle the unique challenges of workflow search. Third, it provides extensive empirical evaluation across six benchmark datasets spanning question answering, code generation, and mathematical reasoning.

The experimental results are striking. AFLOW achieves an average improvement of 5.7% over state-of-the-art manually designed methods and surpasses existing automated approaches by 19.5%. Perhaps most surprisingly, workflows generated by AFLOW enable smaller models like GPT-4o-mini to outperform GPT-4o on specific tasks at merely 4.55% of the inference cost. This cost-performance breakthrough has profound implications for democratizing access to high-performance agentic systems, removing the barrier that previously restricted effective workflows to expensive, manually designed implementations.

Readers should care about this work because it represents a significant step toward fully automated agentic workflow generation. By eliminating the need for manual workflow design, AFLOW opens the door to scalable deployment across countless domains and tasks. The framework's ability to discover workflows that outperform human-designed alternatives challenges the assumption that human expertise is always superior in structuring LLM interactions. Moreover, the cost efficiency gains suggest a paradigm shift where smaller, cheaper models can achieve superior performance through better workflow design, potentially reshaping economic incentives in the AI industry.

Theoretical Framework

The intellectual lineage of AFLOW traces through multiple converging research streams. From the perspective of automated machine learning, the work builds upon the tradition of AutoML, where the goal is to automate the design of machine learning pipelines. Just as AutoML systems like Auto-Sklearn search over preprocessing, model, and hyperparameter configurations, AFLOW searches over workflow structures. From the search and reinforcement learning perspective, AFLOW draws on MCTS, a method best known for its success in game playing, most prominently in AlphaGo. Applying MCTS to workflow optimization transfers a game-playing search strategy to the domain of LLM orchestration.

The theoretical foundation rests on the formalization of an agentic workflow as a series of LLM-invoking nodes connected by edges. Each node $N_{i}$ represents a specific operation performed by an LLM and is characterized by four parameters: the model $M$, the prompt $P$, the temperature $\tau$, and the output format $F$. The search space $\mathcal{S}$ encompasses all possible configurations of these node parameters and edge structures, expressed formally as $\mathcal{S} = \{(N, E) \mid E \in \mathcal{E}\}$, where $N = \{N(M, \tau, P, F) \mid M \in \mathcal{M}, \tau \in [0, 1], P \in \mathcal{P}, F \in \mathcal{F}\}$. This formulation transforms workflow optimization into a search process where an algorithm $A$ explores the space $\mathcal{S}$ to determine the optimal workflow configuration $W^{*}$ that maximizes the evaluation function $G$ for a given task $T$: $W^{*} = \arg\max_{W \in \mathcal{S}} G(W, T)$.
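The formalization above can be made concrete with a small Python sketch. It is illustrative only: the `Node` fields mirror the four parameters $(M, \tau, P, F)$, while the `Workflow` and `Task` aliases, the exact-match `evaluate` function standing in for $G$, and the two toy workflows are all assumptions introduced here, not part of the paper. It also enumerates candidates explicitly, which the real search space is far too large to allow.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    """One LLM-invoking node N(M, tau, P, F)."""
    model: str          # M
    temperature: float  # tau, expected to lie in [0, 1]
    prompt: str         # P
    output_format: str  # F


# Assumption: a workflow W is modelled as any callable from problem to answer.
Workflow = Callable[[str], str]
Task = Sequence[Tuple[str, str]]  # (problem, reference answer) pairs


def evaluate(workflow: Workflow, task: Task) -> float:
    """G(W, T): plain exact-match accuracy, chosen only for illustration."""
    return sum(workflow(q) == a for q, a in task) / len(task)


def best_workflow(candidates: Sequence[Workflow], task: Task) -> Workflow:
    """W* = argmax of G over an explicitly enumerated candidate set."""
    return max(candidates, key=lambda w: evaluate(w, task))


# Two hypothetical toy workflows (no LLM involved).
def always_two(problem: str) -> str:
    return "2"


def adder(problem: str) -> str:
    left, right = problem.split("+")
    return str(int(left) + int(right))


example_node = Node("gpt-4o-mini", 0.0, "Solve the problem.", "text")
toy_task = [("1+1", "2"), ("2+3", "5")]
winner = best_workflow([always_two, adder], toy_task)
```

Here `winner` is `adder`, since it answers both toy problems correctly while `always_two` answers only one.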

The concept of edges deserves particular attention. The authors compare three representations: graphs, neural networks, and code. While graph structures offer flexibility for representing hierarchical and parallel relationships, they require complex extensions beyond basic DAGs to express conditional logic. Neural networks enable adaptive transitions but sacrifice precise control over execution. Code representation, chosen by AFLOW, inherently supports linear sequences, conditional logic, loops, and complex branching through standard programming constructs. This choice maximizes expressivity while maintaining executability — a crucial consideration when workflows must actually run and produce results.

To manage the vastness of this search space, AFLOW introduces operators: predefined, reusable combinations of nodes representing common agentic operations. These include Generate, Format, Review & Revise, Ensemble, Test, Programmer, and Custom. An operator encapsulates a pattern of nodes and edges that has proven effective across tasks, serving as a building block that constrains the search space to promising regions. The formal optimization problem incorporating operators becomes $\mathcal{S}_{AFLOW} = \{(P_{1}, \dots, P_{n}, E, O_{1}, \dots, O_{n}) \mid P_{i} \in \mathcal{P}, E \in \mathcal{E}, O_{i} \in \mathcal{O}\}$, where $\mathcal{O}$ represents the set of available operators.

The MCTS variant employed by AFLOW diverges from traditional implementations in a crucial way: each tree node represents a complete workflow rather than an individual action. This design choice enables the discovery of universal solutions for classes of problems rather than task-specific action sequences. The search operates through an iterative cycle of selection, expansion, evaluation, and backpropagation. The selection phase uses a soft mixed probability strategy that combines uniform and score-based weighted distributions to select from the top-$k$ workflows and the initial workflow. The formula for this strategy is $P_{\mathrm{mixed}}(i) = \lambda \cdot \frac{1}{n} + (1 - \lambda) \cdot \frac{\exp(\alpha \cdot (s_{i} - s_{\max}))}{\sum_{j=1}^{n} \exp(\alpha \cdot (s_{j} - s_{\max}))}$, where $n$ is the number of candidate workflows, $s_{i}$ is the score of workflow $i$, $s_{\max}$ is the maximum score, $\alpha = 0.4$ controls score influence, and $\lambda = 0.2$ balances exploration and exploitation. Including the initial workflow ensures persistent exploration capability and helps avoid local optima.
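The mixed-probability formula translates almost directly into code. The following is a minimal sketch, assuming scores arrive as a plain list of floats; the function name `mixed_probabilities` is hypothetical, and the defaults are the values quoted above.

```python
import math
from typing import List


def mixed_probabilities(scores: List[float], lam: float = 0.2,
                        alpha: float = 0.4) -> List[float]:
    """Blend a uniform distribution with a softmax over shifted scores."""
    n = len(scores)
    s_max = max(scores)
    # Subtracting s_max keeps every exponent <= 0, which avoids overflow.
    weights = [math.exp(alpha * (s - s_max)) for s in scores]
    total = sum(weights)
    return [lam / n + (1 - lam) * w / total for w in weights]


probs = mixed_probabilities([0.5, 0.6, 0.7])
```

The result always sums to 1, and every candidate keeps a probability of at least $\lambda / n$, so low-scoring workflows are never excluded outright. With $\alpha = 0.4$ the scale of the scores matters: scores in $[0, 1]$ yield a nearly uniform distribution, whereas the same scores on a percentage scale yield a much sharper one.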

The theoretical framework assumes that workflow performance is measurable through explicit evaluation functions, which limits the direct applicability to tasks with clear numerical feedback. However, the authors discuss extensions to open-ended tasks using LLM-as-a-judge evaluation, demonstrating the framework's potential adaptability beyond its core assumptions.

Technical Architecture

AFLOW's technical architecture emerges from the elegant marriage of code-represented workflows and Monte Carlo Tree Search. The system operates as an iterative optimization loop where each iteration refines a population of candidate workflows, guided by execution feedback and structured experience preservation. At the highest level, AFLOW can be understood as a meta-optimization system: an LLM optimizer modifies workflows, an execution environment evaluates them, and a tree-structured memory guides future modifications.

The data journey through AFLOW begins with a task dataset $D$, which is randomly partitioned into a validation set $D^{V}$ (20%) and a test set $D^{T}$ (80%). Before the search begins, AFLOW executes the blank template five times on the validation dataset and selects a subset of problems exhibiting high variance in scores. This filtering step optimizes computational efficiency by focusing the search on examples where workflow changes are most likely to produce measurable differences. The selected high-variance instances form the final validation set that drives the optimization process.
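The split and the variance-based filtering might look roughly like the sketch below. The function names, the fixed seed, and the "keep the top `keep` problems by variance" rule are assumptions made here; the text above does not say whether the paper uses a threshold, a fixed count, or some other criterion.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def split_dataset(problems: Sequence[str], val_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly partition problems into (validation, test)."""
    shuffled = list(problems)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * val_fraction)
    return shuffled[:cut], shuffled[cut:]


def select_high_variance(scores_by_problem: Dict[str, List[float]],
                         keep: int) -> List[str]:
    """Rank problems by score variance across repeated runs, keep the top ones."""
    ranked = sorted(
        scores_by_problem,
        key=lambda p: statistics.pvariance(scores_by_problem[p]),
        reverse=True,
    )
    return ranked[:keep]


validation, test = split_dataset([f"q{i}" for i in range(10)])

# Hypothetical per-problem scores from five runs of the blank template.
runs = {
    "a": [1, 1, 1, 1, 1],  # always solved: no signal
    "b": [0, 1, 0, 1, 1],  # unstable: most informative
    "c": [1, 1, 1, 1, 0],  # occasionally fails
}
selected = select_high_variance(runs, keep=2)
```

In this toy data, problem `a` is dropped because the blank template already solves it every time, so no workflow change could register as an improvement on it.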

The workflow representation is where AFLOW's technical innovation becomes tangible. Workflows are implemented as Python classes with a __call__ method that defines the execution logic. Nodes are instantiated through operators, and edges are represented by the control flow of the code itself: if-else statements, loops, and variable passing between function calls. This representation is both human-readable and machine-optimizable: the LLM optimizer receives the workflow code as context and modifies it directly, while the execution engine runs that same code without interpretation overhead.
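A code-represented workflow in this style could be sketched as follows. This is not the repository's actual class layout: `Generate` and `MajorityVote` are simplified stand-ins for operators, `canned_llm` is a hypothetical stub replacing a real model call, and the real implementation may differ (for example, by being asynchronous).

```python
from collections import Counter
from typing import Callable, List

LLM = Callable[[str], str]  # assumption: an LLM is a prompt -> text function


class Generate:
    """Operator stand-in: one LLM call with an instruction prepended."""

    def __init__(self, llm: LLM) -> None:
        self.llm = llm

    def __call__(self, problem: str, instruction: str) -> str:
        return self.llm(f"{instruction}\n\n{problem}")


class MajorityVote:
    """Operator stand-in: pick the most frequent candidate answer."""

    def __call__(self, answers: List[str]) -> str:
        return Counter(answers).most_common(1)[0][0]


class Workflow:
    def __init__(self, llm: LLM, n_samples: int = 3) -> None:
        self.generate = Generate(llm)
        self.vote = MajorityVote()
        self.n_samples = n_samples

    def __call__(self, problem: str) -> str:
        candidates = []
        for _ in range(self.n_samples):  # a loop acts as an edge
            candidates.append(self.generate(problem, "Solve step by step."))
        if len(set(candidates)) == 1:    # a conditional acts as an edge
            return candidates[0]
        return self.vote(candidates)     # variable passing acts as an edge


def canned_llm(prompt: str) -> str:
    """Hypothetical stub; a real workflow would call a model API here."""
    return "42"


answer = Workflow(canned_llm)("What is 6 * 7?")
```

Because the edges are ordinary control flow, an optimizer can change the workflow's structure by editing a few lines of `__call__`, for instance inserting a review step before the vote.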

The component interactions form a carefully orchestrated cycle. During the selection phase, the system identifies a parent workflow from the search tree using the soft mixed probability strategy. The selected workflow's context is then loaded, including its entire modification history and execution outcomes. This context feeds into the LLM optimizer, which generates a new workflow by making a single modification — either adding an operator, modifying a prompt, or changing the control flow. The new workflow is executed five times on the validation set to compute a robust performance estimate, capturing both mean score and variance. After execution, the system records the workflow's performance, the specific modification made, and whether the modification improved or degraded performance relative to its parent. This experience propagates back through the tree, updating the statistics that guide future selection decisions.
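The select, expand, evaluate, and record cycle can be outlined in a few dozen lines. Everything below is a simplified sketch under stated assumptions: workflows are plain strings, `toy_propose` stands in for the LLM optimizer, `toy_evaluate` stands in for running the workflow on the validation set, and `k = 3` is an illustrative value rather than one taken from the paper.

```python
import math
import random
import statistics
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)  # compare tree nodes by identity, not by field values
class TreeNode:
    workflow: str
    score: float
    parent: Optional["TreeNode"] = None
    modification: str = ""
    children: List["TreeNode"] = field(default_factory=list)


def select(nodes: List[TreeNode], root: TreeNode, k: int, rng: random.Random,
           lam: float = 0.2, alpha: float = 0.4) -> TreeNode:
    """Sample a parent from the top-k nodes plus the initial workflow."""
    pool = sorted(nodes, key=lambda n: n.score, reverse=True)[:k]
    if root not in pool:
        pool.append(root)
    s_max = max(n.score for n in pool)
    weights = [math.exp(alpha * (n.score - s_max)) for n in pool]
    total = sum(weights)
    probs = [lam / len(pool) + (1 - lam) * w / total for w in weights]
    return rng.choices(pool, weights=probs, k=1)[0]


def search(initial: str,
           propose: Callable[[TreeNode], Tuple[str, str]],
           evaluate: Callable[[str], float],
           rounds: int = 20, k: int = 3, runs: int = 5,
           seed: int = 0) -> Tuple[TreeNode, List[TreeNode]]:
    rng = random.Random(seed)

    def mean_score(workflow: str) -> float:
        # Average several executions to reduce evaluation noise.
        return statistics.mean([evaluate(workflow) for _ in range(runs)])

    root = TreeNode(initial, mean_score(initial))
    nodes = [root]
    for _ in range(rounds):
        parent = select(nodes, root, k, rng)            # selection
        workflow, modification = propose(parent)        # expansion
        child = TreeNode(workflow, mean_score(workflow),
                         parent, modification)          # evaluation
        parent.children.append(child)                   # record experience
        nodes.append(child)
    return max(nodes, key=lambda n: n.score), nodes


def toy_propose(parent: TreeNode) -> Tuple[str, str]:
    """Hypothetical optimizer: always appends one more operator."""
    return parent.workflow + "+op", "append an operator"


def toy_evaluate(workflow: str) -> float:
    """Hypothetical scorer: more operators score higher, capped at 1.0."""
    return min(1.0, 0.5 + 0.05 * workflow.count("+op"))


best, tree = search("base", toy_propose, toy_evaluate)
```

Storing the modification and the child on the parent is what later lets the optimizer see which edits to that specific workflow helped or hurt.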

The expansion phase deserves particular attention as it showcases AFLOW's key innovation over prior methods. While ADAS and similar approaches accumulate all historical workflows in a linear list, suffering from information loss as the context grows, AFLOW leverages the tree structure to preserve only the relevant experience. When the optimizer revisits a workflow, it receives precisely the modifications and outcomes that branch from that node, avoiding the noise of unrelated exploration paths. This targeted experience preservation is what enables AFLOW to learn from past iterations effectively rather than drowning in an ever-growing history.
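The difference between linear accumulation and tree-structured experience comes down to which records are handed to the optimizer. The sketch below assumes a minimal `TreeNode` with hypothetical fields and shows one plausible way to collect only the experience that branches from a given node; the exact record format used by AFLOW is not specified in the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class TreeNode:
    name: str
    score: float
    modification: str = ""
    parent: Optional["TreeNode"] = None
    children: List["TreeNode"] = field(default_factory=list)

    def add_child(self, name: str, score: float, modification: str) -> "TreeNode":
        child = TreeNode(name, score, modification, parent=self)
        self.children.append(child)
        return child


def experience_for(node: TreeNode) -> List[Dict[str, object]]:
    """Collect only the modifications tried directly on this node."""
    return [
        {
            "modification": child.modification,
            "delta": round(child.score - node.score, 3),
            "improved": child.score > node.score,
        }
        for child in node.children
    ]


# Toy tree with made-up scores.
root = TreeNode("round1", 0.50)
good = root.add_child("round2", 0.60, "add a code-execution step")
bad = root.add_child("round3", 0.45, "add a review step")
deeper = good.add_child("round4", 0.62, "add an ensemble step")

root_experience = experience_for(root)
good_experience = experience_for(good)
```

When the optimizer revisits `good`, it sees only the ensemble attempt, not the failed review step tried on `root`, which keeps the context short and relevant.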

Engineering choices throughout the system reflect pragmatic optimization of the search process. The maximum number of iterations is set to 20, with early stopping if the top-$k$ average score shows no improvement for 5 consecutive rounds. The optimizer (Claude-3.5-sonnet) and executor (various models including GPT-4o-mini, DeepSeek-V2.5, GPT-4o, and Claude-3.5-sonnet) are decoupled, allowing the search to discover workflows optimized for specific execution models. Temperature settings vary by model: 1.0 for DeepSeek-V2.5 and 0.0 for others, reflecting the different models' characteristics. These choices matter because they balance exploration breadth, evaluation reliability, and computational cost in a system where each evaluation requires expensive LLM inference.
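The early-stopping rule can be expressed as a comparison between the current top-$k$ average and the one from a few rounds earlier. This is one possible reading of the rule, sketched under assumptions: `k = 3` is illustrative, and the function names are hypothetical.

```python
from typing import List


def top_k_average(scores: List[float], k: int) -> float:
    """Mean of the k best scores seen so far (fewer if not enough rounds)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def should_stop(score_history: List[float], k: int = 3,
                patience: int = 5) -> bool:
    """Stop when the last `patience` rounds did not raise the top-k average."""
    if len(score_history) <= patience:
        return False
    current = top_k_average(score_history, k)
    earlier = top_k_average(score_history[:-patience], k)
    return current <= earlier


stalled = [0.5, 0.6, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1]
improving = [0.5, 0.6, 0.7, 0.1, 0.1, 0.1, 0.1, 0.9]
stop_stalled = should_stop(stalled)
stop_improving = should_stop(improving)
```

In the first history the five most recent rounds never enter the top three, so the search stops; in the second, the final round raises the top-$k$ average and the search continues.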

Experimental Evaluation

The experimental design of AFLOW reflects a comprehensive strategy to validate three core hypotheses: that automatically discovered workflows can outperform manually designed ones, that the MCTS-based search is more effective than prior automated methods, and that discovered workflows transfer across different execution models while optimizing cost-performance trade-offs.

The evaluation spans six benchmark datasets carefully selected to cover diverse reasoning domains. GSM8K and MATH (level 5 problems) test mathematical reasoning with solve rate as the metric. HumanEval and MBPP assess code generation through pass@1. HotpotQA and DROP evaluate multi-hop question answering and discrete reasoning over paragraphs using F1 score. This diversity is intentional — it tests whether AFLOW can discover qualitatively different workflow structures for different task types, from the step-by-step verification needed in mathematics to the generate-test-fix cycles effective in code generation.

The baseline comparison is extensive and methodologically sound. The authors compare against direct LLM invocation (IO), Chain-of-Thought prompting (Wei et al., 2022), Self-Consistency CoT (Wang et al., 2022), MultiPersona Debate (Wang et al., 2024a), Self-Refine (Madaan et al., 2023), MedPrompt (Nori et al., 2023), and the automated workflow optimization method ADAS (Hu et al., 2024). All methods are executed with GPT-4o-mini to ensure fair comparison, and each experiment is run three times with averaged results to account for variance.

The main results, presented in Table 1, demonstrate AFLOW's consistent superiority across all benchmarks. Measured on the average score, workflows optimized by AFLOW outperform the best manually designed method by a relative 5.7% and surpass ADAS by a relative 19.5%. The most dramatic improvements occur on challenging tasks: on MATH level 5, AFLOW achieves a 56.2% solve rate compared to ADAS's 35.4%, a relative improvement of roughly 59%. On MBPP, AFLOW reaches 83.4% pass@1 versus ADAS's 53.4%, a relative improvement of roughly 56%. These results are particularly meaningful because they show AFLOW excels precisely where tasks are most difficult and manual design is most challenging.

Method	HotpotQA	DROP	HumanEval	MBPP	GSM8K	MATH	Avg.
IO (GPT-4o-mini)	68.1	68.3	87.0	71.8	92.7	48.6	72.8
CoT (Wei et al., 2022)	67.9	78.5	88.6	71.8	92.4	48.8	74.7
CoT SC (Wang et al., 2022)	68.9	78.8	91.6	73.6	92.7	50.4	76.0
MedPrompt (Nori et al., 2023)	68.3	78.0	91.6	73.6	90.0	50.0	75.3
MultiPersona (Wang et al., 2024a)	69.2	74.4	89.3	73.6	92.8	50.8	75.1
SelfRefine (Madaan et al., 2023)	60.8	70.2	87.8	69.8	89.6	46.1	70.7
ADAS (Hu et al., 2024)	64.5	76.6	82.4	53.4	90.8	35.4	67.2
AFLOW (Ours)	73.5	80.6	94.7	83.4	93.5	56.2	80.3
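The headline percentages can be checked against the table with a few lines of arithmetic. The sketch below copies the per-benchmark scores from the rows above and treats the reported gains as relative improvements in the average score, which is an interpretation inferred from the numbers rather than something the table states.

```python
# Columns: HotpotQA, DROP, HumanEval, MBPP, GSM8K, MATH (copied from the table).
scores = {
    "IO": [68.1, 68.3, 87.0, 71.8, 92.7, 48.6],
    "CoT": [67.9, 78.5, 88.6, 71.8, 92.4, 48.8],
    "CoT SC": [68.9, 78.8, 91.6, 73.6, 92.7, 50.4],
    "MedPrompt": [68.3, 78.0, 91.6, 73.6, 90.0, 50.0],
    "MultiPersona": [69.2, 74.4, 89.3, 73.6, 92.8, 50.8],
    "SelfRefine": [60.8, 70.2, 87.8, 69.8, 89.6, 46.1],
    "ADAS": [64.5, 76.6, 82.4, 53.4, 90.8, 35.4],
    "AFLOW": [73.5, 80.6, 94.7, 83.4, 93.5, 56.2],
}
manual = ["IO", "CoT", "CoT SC", "MedPrompt", "MultiPersona", "SelfRefine"]

average = {name: sum(row) / len(row) for name, row in scores.items()}
best_manual = max(manual, key=lambda name: average[name])


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1) * 100


gain_over_manual = relative_gain(average["AFLOW"], average[best_manual])
gain_over_adas = relative_gain(average["AFLOW"], average["ADAS"])
math_gain = relative_gain(scores["AFLOW"][5], scores["ADAS"][5])
mbpp_gain = relative_gain(scores["AFLOW"][3], scores["ADAS"][3])
```

Under this reading, the strongest manual baseline by average is CoT SC, and the computed gains round to 5.7% over it and 19.5% over ADAS, matching the reported figures.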

The model-agnosticity experiment reveals an important nuance: while workflows discovered using one execution model generally improve performance when transferred to others, they perform best on the model used during search. This finding suggests that optimal workflow structure is not entirely universal but has model-specific components, likely reflecting different models' strengths in reasoning, code generation, or instruction following. The cost analysis, visualized through Pareto front plots, demonstrates that AFLOW can identify workflows that let weaker models beat stronger ones on cost-effectiveness. For instance, executing AFLOW-discovered workflows with GPT-4o-mini outperforms direct invocation of GPT-4o on specific tasks at only 4.55% of its inference cost.

The ablation study on GSM8K provides insight into the role of operators. With predefined operators, AFLOW discovers better workflows more efficiently. However, even without operators — using only the basic Custom operator — AFLOW maintains strong performance (93.1%) and autonomously develops ensemble-like structures. This result is significant because it demonstrates AFLOW's capacity for independent workflow design, marking progress toward full automation without human-provided building blocks.

Case Studies

The GSM8K optimization trace provides a compelling window into AFLOW's iterative learning process. Starting from a blank template containing only a single node without prompts, the system evolves through twenty rounds of modification. In the optimal path, Round 2 adds the Programmer operator to generate and execute Python code for mathematical calculations, improving the score from 0.487 to 0.524. Round 3 introduces ScEnsemble to select from multiple solutions, further increasing performance to 0.528. Round 5 adds detailed step-by-step solution generation, pushing the score to 0.551 — the best achieved in the search. This progression reveals how AFLOW discovers the importance of combining computational verification (Programmer), solution diversity (Ensemble), and reasoning transparency (step-by-step generation).

The unsuccessful explorations are equally instructive. In Round 5 of a suboptimal branch, AFLOW introduces a custom review node that directly modifies answers without additional reasoning, which decreases accuracy. In Round 14, it attempts to rephrase the problem but overly focuses on "discount" information, leading to decreased performance. These failures demonstrate that the tree structure's value lies not just in preserving successes but in enabling the system to learn from failures without being trapped by them. When a modification degrades performance, the branch is effectively pruned, and exploration continues from more promising nodes.

The MBPP case study showcases AFLOW's ability to rediscover expert-designed workflows automatically. The discovered workflow generates multiple code solutions, selects the best through ensemble voting, tests it against public test cases, and if tests fail, fixes the solution using error feedback. This structure closely mirrors the manually designed AlphaCodium workflow (Ridnik et al., 2024), demonstrating that AFLOW can converge to known effective designs without human guidance. The HotpotQA case reveals another dimension: AFLOW discovers that formatting matters. The optimal workflow includes a formatting step that produces concise answers without prefixes like "The answer is," which improves F1 score by matching the expected answer format more precisely. This discovery of formatting optimization through execution feedback illustrates how AFLOW captures subtle performance factors that might be overlooked in manual design.
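The formatting effect on HotpotQA is easy to reproduce with a token-level F1 score. The function below is a simplified version written for illustration: it only lowercases and splits on whitespace, whereas official scorers typically also normalize punctuation and articles, so the exact numbers would differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


concise = token_f1("Paris", "Paris")                 # 1.0
verbose = token_f1("The answer is Paris", "Paris")   # 0.4
```

The verbose answer is factually identical but has a precision of only 1/4, which pulls F1 down to 0.4. A formatting step that strips such prefixes therefore raises the score without changing the underlying reasoning.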

Synthesis: Value and Limitations

The theoretical significance of AFLOW lies in its reframing of workflow optimization as a search problem and its demonstration that tree-structured experience preservation is superior to linear accumulation. By formalizing the workflow space and showing that MCTS can effectively navigate it, the paper provides a conceptual foundation that future research can build upon. The tool of "code as workflow representation" is particularly powerful because it unifies expressivity and executability, offering a standard that could enable interoperability between different workflow optimization systems.

Practically, AFLOW's impact could be substantial for organizations seeking to deploy LLM-based systems across multiple domains. The elimination of manual workflow design reduces both time-to-deployment and the need for specialized expertise. The cost-performance breakthrough — enabling smaller models to outperform larger ones — could democratize access to high-quality agentic systems, particularly for applications where budget constraints currently force compromises on model capability.

The paper's strongest aspects are its comprehensive empirical validation and its careful comparison with both manual and automated baselines. The ablation studies thoughtfully isolate the contribution of operators, and the cost analysis provides actionable insights for deployment decisions. The tree-structured experience visualization makes an abstract concept concrete, helping readers understand why AFLOW outperforms methods like ADAS that use linear experience accumulation.

However, the work has real limitations. The framework's effectiveness depends on having explicit evaluation functions, limiting direct applicability to open-ended creative tasks where success is subjective. While the authors propose LLM-as-a-judge as a workaround, this introduces its own biases and may not capture human preferences accurately. The search process is computationally expensive: 20 iterations with 5 evaluations each amounts to up to 100 full passes over the validation set per optimization, each pass requiring LLM calls for every validation problem, which may be prohibitive for some applications. Additionally, the discovered workflows, while effective, are not guaranteed to be globally optimal; MCTS finds good solutions but does not certify optimality. The model-specificity of discovered workflows also means that optimization must be repeated when switching execution models, though the transfer experiments show that some benefits persist.

Broader implications connect to the trend of moving from prompt engineering to flow engineering in LLM application development. AFLOW accelerates this transition by making flow engineering automated rather than manual. It also raises intriguing questions about the future role of human designers — if algorithms can discover better workflows than humans, what remains for human expertise? The answer may lie in defining the optimization objectives, curating evaluation data, and interpreting discovered workflows to build theoretical understanding.

Further Reading and Reflection

AFLOW builds most directly upon ADAS (Hu et al., 2024), which first explored code-represented workflows and automated workflow optimization. AFLOW extends ADAS by replacing linear heuristic search with MCTS, introducing named nodes and operators, and implementing tree-structured experience preservation. The prompt optimization lineage includes DSPy (Khattab et al., 2024), which optimizes prompts within fixed workflows, and TextGrad (Yükseköğnül et al., 2024), which uses gradient-like text optimization. AFLOW goes beyond these by optimizing entire workflow structures, not just prompts or parameters. GPTSwarm (Zhuge et al., 2024) explores graph-structured agents with reinforcement learning but faces limitations in representing conditional workflows that AFLOW overcomes through code representation.

Related approaches for automated workflow discovery include AutoFlow (Li et al., 2024b) and Archon (Saad-Falcon et al., 2024), which focus on architecture search for inference-time techniques. These methods share the philosophy of automating what humans traditionally design but differ in their search spaces and optimization algorithms. AFLOW's distinctive contribution is the combination of code representation with MCTS, which enables more expressive workflows than graph-based methods while maintaining search efficiency through structured experience.

Future directions opened by this work include extending AFLOW to multi-objective optimization where latency, cost, and accuracy are jointly optimized; developing hierarchical MCTS where high-level workflow structure and low-level prompt optimization occur at different time scales; and creating theoretical guarantees about search convergence in the workflow space. The most intriguing unanswered question is whether AFLOW could eventually discover entirely novel agentic paradigms that humans have not conceived, or whether it is fundamentally limited to recombining known patterns.

The deepest unsolved challenge in this area is creating workflows that are not just effective but also interpretable and trustworthy. AFLOW discovers what works but does not necessarily explain why. As agentic systems are deployed in high-stakes domains, understanding the reasoning behind workflow structure will become essential. AFLOW's insights about which combinations of operators tend to succeed could inform interpretability research, bridging the gap between performance and understanding.

The most thought-provoking aspect of this work is its demonstration that in the domain of LLM orchestration, search can outperform human design. This challenges our intuitions about where human expertise adds value and suggests that the combinatorial complexity of workflow design may exceed human cognitive capacity in many cases. I would want to explore whether AFLOW's discovered workflows reveal any general principles about effective LLM interaction — for instance, whether certain topological patterns (sequential vs. parallel vs. iterative) consistently emerge for particular task types, and what this might tell us about the fundamental nature of computational reasoning.
