Agent-E: DOM Distillation and Hierarchical Orchestration for Reliable Web Navigation

A browser automation system built on the AG2 multi-agent framework. DOM distillation filters content by task type, hierarchical group chat orchestration coordinates planning and execution, and MutationObserver validation guards against LLM hallucination.

16Yun Engineering Team | May 30, 2026 | 4 min read

Introduction: The Cognitive Layer

The first six articles in this series all addressed the same problem: how to let AI control a browser. Nanobrowser via extensions, Steel via cloud API, agent-browser via CLI, and Camoufox via engine-level modification each solved "physical control" at a different layer.

Agent-E addresses a different problem: how AI understands what it sees and decides what to do next.

Even with unlimited browser control, an agent is useless if it cannot find the right information in an ocean of HTML, cannot reliably locate target elements, and cannot recover from errors.

Built on the AG2 (formerly AutoGen) multi-agent framework, Agent-E introduces three technologies for cognitive reliability: DOM distillation, mmid injection, and hierarchical execution orchestration.

Architecture

┌──────────────────────────────────────────────┐
│              Agent-E System                   │
│                                               │
│  ┌────────────────────────────────────────┐  │
│  │       AG2 Multi-Agent Framework        │  │
│  │                                        │  │
│  │  ┌──────────────────┐                  │  │
│  │  │ High-Level       │ ← Analyze task   │  │
│  │  │ Planner Agent    │   Decompose into │  │
│  │  │                  │   subtasks       │  │
│  │  └────────┬─────────┘                  │  │
│  │           │                            │  │
│  │  ┌────────▼─────────┐                  │  │
│  │  │ Browser Nav      │ ← Execute        │  │
│  │  │ Agent            │   interactions   │  │
│  │  │                  │   Read DOM state │  │
│  │  └────────┬─────────┘                  │  │
│  │           │                            │  │
│  │  ┌────────▼─────────┐                  │  │
│  │  │ Skill Registry   │ ← click, type,   │  │
│  │  │                  │   get_dom, etc.  │  │
│  │  └──────────────────┘                  │  │
│  └────────────────────────────────────────┘  │
│                                               │
│  ┌────────────────────────────────────────┐  │
│  │  Preprocessing Pipeline (DOM Distill)  │  │
│  │  ┌──────────┐ ┌──────────┐ ┌────────┐ │  │
│  │  │ Text     │ │ Input    │ │ All    │ │  │
│  │  │ only     │ │ fields   │ │ content│ │  │
│  │  └──────────┘ └──────────┘ └────────┘ │  │
│  └────────────────────────────────────────┘  │
│                                               │
│  ┌────────────────────────────────────────┐  │
│  │ Validation Layer (MutationObserver)    │  │
│  │ Monitor DOM changes → Confirm actions  │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘

DOM Distillation: Filter on Demand

Agent-E's core innovation is DOM distillation. Instead of passing raw DOM or a fixed A11y tree to the LLM, it dynamically cleanses and filters the DOM based on the current task objective.

Three Distillation Modes

| Mode | Content Kept | Filtered Out | Best For |
| --- | --- | --- | --- |
| Text Only | Pure text content | Buttons, inputs, links, images | Reading articles, extracting info |
| Input Fields | Form elements (inputs, buttons, selects) | Text content, images, decorations | Form filling, login |
| All Content | All visible elements | Hidden elements, scripts, styles | Full page understanding |
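
One way to picture the three modes is as predicates over page elements. The sketch below is only an illustration, not Agent-E's implementation: the `Element` record, the tag sets, and the `distill` function are hypothetical names invented for this example, and dropping hidden elements in every mode is an assumption.

```python
from dataclasses import dataclass

# Hypothetical tag sets for this sketch; the real filter rules may differ.
INTERACTIVE_TAGS = {"input", "button", "select", "textarea"}
NON_TEXT_TAGS = INTERACTIVE_TAGS | {"a", "img"}
NEVER_KEPT_TAGS = {"script", "style"}


@dataclass
class Element:
    tag: str
    text: str = ""
    visible: bool = True


def distill(elements, mode):
    """Return the elements that a given distillation mode would keep."""
    # Assumption: hidden elements, scripts, and styles are dropped in every mode.
    candidates = [e for e in elements if e.visible and e.tag not in NEVER_KEPT_TAGS]
    if mode == "text_only":
        return [e for e in candidates if e.tag not in NON_TEXT_TAGS and e.text.strip()]
    if mode == "input_fields":
        return [e for e in candidates if e.tag in INTERACTIVE_TAGS]
    if mode == "all_content":
        return candidates
    raise ValueError(f"unknown distillation mode: {mode}")


page = [
    Element("p", "Order summary"),
    Element("input"),
    Element("button", "Submit Order"),
    Element("img"),
    Element("script", "track()"),
    Element("div", "promo", visible=False),
]

for mode in ("text_only", "input_fields", "all_content"):
    print(mode, [e.tag for e in distill(page, mode)])
```

The point of the design is that the LLM only sees what the current subtask needs, so a login step never pays the token cost of the article body around the form.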

mmid Element Targeting

Many web pages don't assign unique id attributes to elements. Agent-E injects custom mmid attributes during distillation:

<!-- Original element -->
<button class="btn-primary btn-lg">Submit Order</button>
 
<!-- After distillation (mmid injected) -->
<button class="btn-primary btn-lg" mmid="ae-127">Submit Order</button>

mmid is deterministic: the same page structure produces the same mmid values, enabling reliable element referencing.
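
A minimal sketch of deterministic injection, assuming ids are assigned by document-order position. The numbering scheme, the `inject_mmid` name, and the `ae-` prefix handling are assumptions for illustration; the sketch also works on a well-formed markup string rather than a live page so that it can stay within the standard library.

```python
import xml.etree.ElementTree as ET


def inject_mmid(markup, prefix="ae-"):
    """Tag every element with an mmid derived from its document-order position."""
    root = ET.fromstring(markup)
    # root.iter() walks the tree in document order, so the numbering is repeatable.
    for index, element in enumerate(root.iter(), start=1):
        element.set("mmid", f"{prefix}{index}")
    return ET.tostring(root, encoding="unicode")


snippet = (
    '<form><input name="email" />'
    '<button class="btn-primary">Submit Order</button></form>'
)
first = inject_mmid(snippet)
second = inject_mmid(snippet)
print(first)
print("same output on both runs:", first == second)
```

Because the id depends only on position, any change to the page structure shifts the ids that follow it, which is why ids would need to be regenerated on each fresh observation under this scheme.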

Hierarchical Execution Orchestration

Instead of a simple "observe-think-act" loop, Agent-E implements nested group chats.

Execution Planner

The Execution Planner is a distinct cognitive layer. It observes the distilled DOM, decides the current step, then predicts the steps that follow and executes them as one batch:

Traditional linear loop (one LLM call per step):
  Observe → Think → Click #1 → Observe → Think → Type → Observe → Think → Click #2
 
Agent-E batch mode (one call handles multiple steps):
  Observe → Think → [Click #1, Type text, Click #2] (batch execute)

This batching reduces API latency by 40-60% in complex form-filling scenarios.
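
The saving in LLM round trips can be worked out directly from the two traces above. This sketch counts calls only; it makes no claim about wall-clock latency, which also depends on model response time and page loads. The function names and the `batch_size` parameter are hypothetical.

```python
def llm_calls_linear(num_actions):
    """One observe-think round trip (one LLM call) per action."""
    return num_actions


def llm_calls_batched(num_actions, batch_size):
    """One LLM call per batch of up to batch_size actions."""
    return -(-num_actions // batch_size)  # ceiling division


actions = ["click #1", "type text", "click #2"]
linear = llm_calls_linear(len(actions))
batched = llm_calls_batched(len(actions), batch_size=3)
print(f"linear: {linear} calls, batched: {batched} call(s)")
```

For the three-action example this is three calls against one. The benefit shrinks when a step's outcome cannot be predicted, because the batch must end wherever the planner needs a fresh observation.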

Communication Flow

User request
    │
    ▼
High-Level Planner Agent
    │  Analyze task → Decompose into subtasks
    │  Allocate resources to Browser Nav Agent
    ▼
Browser Navigation Agent
    │  Read distilled DOM
    │  Call Execution Planner for batch operations
    ▼
Execution Planner
    │  Generate operation sequence: [click, type, click]
    │  Invoke specific skills from registry
    ▼
Skill Execution
    │  click_using_selector()
    │  enter_text_using_selector()
    │  get_dom_with_content_type()
    ▼
Validation Layer (MutationObserver)
    │  Listen for DOM changes
    │  Confirm operation effectiveness
    ▼
Feedback → High-Level Planner
    │  Success → Next step
    │  Failure → Adjust strategy
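
The lower half of this flow (registry lookup, batch execution, feedback) can be sketched in a few lines. The skill names are taken from the flow above, but their signatures and bodies here are stand-ins that operate on a plain dictionary instead of a browser; `SKILLS`, `run_batch`, and the page dictionary layout are hypothetical.

```python
def click_using_selector(page, selector):
    # Stub: record the click and report whether the element exists.
    found = selector in page["elements"]
    if found:
        page["log"].append(f"click {selector}")
    return found


def enter_text_using_selector(page, selector, text):
    # Stub: store the text on the element if it exists.
    found = selector in page["elements"]
    if found:
        page["elements"][selector] = text
        page["log"].append(f"type {selector}")
    return found


# Skill registry: maps skill names to callables.
SKILLS = {
    "click_using_selector": click_using_selector,
    "enter_text_using_selector": enter_text_using_selector,
}


def run_batch(page, operations):
    """Execute planned operations in order, stopping at the first failure.

    Returns (ok, index): index is the failing step on failure, or the number
    of completed steps on success, so a planner can adjust its strategy.
    """
    for index, (skill_name, kwargs) in enumerate(operations):
        if not SKILLS[skill_name](page, **kwargs):
            return False, index
    return True, len(operations)


page = {"elements": {"[mmid='ae-1']": "", "[mmid='ae-2']": ""}, "log": []}
plan = [
    ("enter_text_using_selector", {"selector": "[mmid='ae-1']", "text": "John Smith"}),
    ("click_using_selector", {"selector": "[mmid='ae-2']"}),
]
ok, steps = run_batch(page, plan)
print(ok, steps, page["log"])
```

Reporting the index of the failing step is one plausible way to give the planner enough context to replan from that point rather than from the start.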

MutationObserver Validation: Fighting Hallucination

A common problem with AI agents is hallucination: the model believes it clicked a button when the click never took effect.

Agent-E uses JavaScript's MutationObserver to monitor DOM changes after actions, providing deterministic feedback:

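# Simplified MutationObserver logic (JavaScript held in a Python string,
# to be evaluated in the page before an action runs)
observer_code = """
// The callback fires once per batch of DOM mutations.
const observer = new MutationObserver((mutations) => {
  mutations.forEach(m => {
    console.log('DOM changed:', m.type, m.target);
  });
  // Flag that the DOM changed; the agent can read this after the action.
  window.__agent_e_last_action_effect = 'observed';
});
// Watch the whole document for added/removed nodes and attribute changes.
observer.observe(document.body, {
  childList: true, subtree: true, attributes: true
});
"""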

If no DOM change is observed after a click, the agent treats the action as failed and can retry or adjust its strategy.
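
The check itself can be sketched as "clear the flag, act, wait, read the flag". The flag name comes from the snippet above. Everything else is an assumption: `action_had_effect` and `settle_ms` are hypothetical, and `page` is assumed to expose `evaluate` and `wait_for_timeout` in the style of Playwright's synchronous `Page`. A small test double stands in for the browser so the sketch runs on its own.

```python
def action_had_effect(page, action, settle_ms=500):
    """Run action and report whether the observer flagged a DOM mutation."""
    # Clear the flag so a mutation from an earlier action is not counted.
    page.evaluate("window.__agent_e_last_action_effect = null")
    action()
    # Give the page a moment to apply asynchronous updates.
    page.wait_for_timeout(settle_ms)
    return page.evaluate("window.__agent_e_last_action_effect") == "observed"


class FakePage:
    """Test double that mimics the two page methods used above."""

    def __init__(self):
        self.flag = None

    def evaluate(self, expression):
        # An assignment to null resets the flag; any other call reads it.
        if expression.endswith("= null"):
            self.flag = None
        return self.flag

    def wait_for_timeout(self, ms):
        pass  # No real waiting is needed in the test double.


fake = FakePage()
changed = action_had_effect(fake, lambda: setattr(fake, "flag", "observed"))
unchanged = action_had_effect(fake, lambda: None)
print(changed, unchanged)
```

A fixed settle time is the simplest choice but not the only one. Some valid actions produce no DOM mutation at all, so a "no change" result is best read as a prompt to re-observe rather than as proof of failure.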

Installation

# Clone and setup
git clone <repository-url>
cd Agent-E
 
# Python environment with uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
 
# Configure
cp .env-example .env
# Edit .env with LLM API key and model name

Configuration

# .env
AUTOGEN_MODEL_NAME=gpt-4-turbo
AUTOGEN_MODEL_API_KEY=your-api-key
BROWSER_STORAGE_DIR=/path/to/chrome/profile
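
As a sanity check before launching, the three settings can be validated up front. This is a sketch under assumptions: `load_config` is a hypothetical helper, not part of Agent-E, and treating all three variables as required is a guess based on the example file.

```python
import os


def load_config(env=os.environ):
    """Collect the three settings shown above, failing fast if one is missing."""
    # Assumption: all three variables are required.
    required = ("AUTOGEN_MODEL_NAME", "AUTOGEN_MODEL_API_KEY", "BROWSER_STORAGE_DIR")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name: env[name] for name in required}


example_env = {
    "AUTOGEN_MODEL_NAME": "gpt-4-turbo",
    "AUTOGEN_MODEL_API_KEY": "your-api-key",
    "BROWSER_STORAGE_DIR": "/path/to/chrome/profile",
}
config = load_config(example_env)
print(sorted(config))
```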

Run

python -m ae.main

Agent-E opens a browser with a chat interface icon. Click it to start conversing in natural language.

Practical Examples

Form Filling

Go to example.16yun.cn/checkout and fill in:
Name: John Smith
Phone: +1-555-0123
Address: 123 Main St, New York, NY 10001

Processing flow:

1. Planner analyzes → "Form filling task"
2. Nav Agent opens target page
3. DOM distillation → "Input Fields" mode
4. Execution Planner identifies form fields
5. Batch execute:
   → Fill name field
   → Fill phone field
   → Fill address field
6. MutationObserver validates each field
7. Submit form
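
Steps 4 and 5 amount to turning the distilled fields plus the user's data into a batch of operations. In the sketch below, exact label matching stands in for the LLM's judgment. The field records, the mmid values, and `plan_form_fill` are hypothetical; only the skill name comes from the article.

```python
def plan_form_fill(distilled_fields, form_data):
    """Pair each requested value with the mmid of the matching field."""
    by_label = {field["label"].lower(): field["mmid"] for field in distilled_fields}
    operations, unmatched = [], []
    for label, value in form_data.items():
        mmid = by_label.get(label.lower())
        if mmid is None:
            # Surface fields the page does not offer instead of guessing.
            unmatched.append(label)
        else:
            operations.append({
                "skill": "enter_text_using_selector",
                "selector": f"[mmid='{mmid}']",
                "text": value,
            })
    return operations, unmatched


fields = [
    {"label": "Name", "mmid": "ae-11"},
    {"label": "Phone", "mmid": "ae-12"},
    {"label": "Address", "mmid": "ae-13"},
]
data = {
    "Name": "John Smith",
    "Phone": "+1-555-0123",
    "Address": "123 Main St, New York, NY 10001",
}
operations, unmatched = plan_form_fill(fields, data)
for op in operations:
    print(op["selector"], "<-", op["text"])
```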

Search and Sort

Search example.16yun.cn for "wireless headphones", sort by price low to high

Processing flow:

1. Planner → "Search + sort task"
2. Nav Agent opens example.16yun.cn
3. Distillation → "Input Fields" mode
4. Locate search box, enter keywords
5. Wait for results
6. Distillation → "All Content" mode
7. Locate sort dropdown, select "Price: Low to High"
8. Verify sort took effect
9. Extract result list
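
Step 8 needs a check that does not rely on the model's impression of the page. One deterministic option, assuming prices are extracted as display strings with a dollar sign and thousands separators, is to parse them and test the ordering. `parse_price` and `is_sorted_low_to_high` are hypothetical helper names.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a display price such as '$1,299.00' to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def is_sorted_low_to_high(price_texts):
    """Return True when each price is no greater than the one after it."""
    prices = [parse_price(t) for t in price_texts]
    return all(a <= b for a, b in zip(prices, prices[1:]))


print(is_sorted_low_to_high(["$19.99", "$24.50", "$1,299.00"]))
print(is_sorted_low_to_high(["$24.50", "$19.99"]))
```

Comparing parsed numbers rather than strings matters here: as text, "$1,299.00" sorts before "$19.99".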

Comparison

| Dimension | Agent-E | agent-browser | Nanobrowser |
| --- | --- | --- | --- |
| Core method | DOM distillation + orchestration | A11y tree + Ref mapping | Multi-agent collaboration |
| DOM processing | Task-type dynamic filtering | Fixed A11y snapshot | Raw DOM + element classification |
| Element targeting | mmid + semantic queries | @e1, @e2 stable refs | DOM path + text match |
| Validation | MutationObserver | None built-in | Validator agent |
| Batching | Execution Planner batch | CLI batch commands | Single-step |
| LLM framework | AG2 (ex-AutoGen) | None (direct CDP) | Custom multi-agent |
| Setup complexity | High (Python + AG2) | Low (Rust binary) | Low (Chrome extension) |

Limitations & Best For

Agent-E excels at:

  • High-reliability form filling (e-commerce checkout, signup)
  • Complex multi-step reasoning (price comparison, cross-site data)
  • Scenarios with a tight LLM call budget

Not suitable for:

  • High-concurrency data extraction (single-instance design)
  • Advanced WAF countermeasures (no anti-detection design)
  • Simple one-shot tasks (high setup overhead)

Summary

Agent-E demonstrates the next evolution of AI browser agents: from "how to control the browser" to "how to understand the page."

DOM distillation filters content on demand, mmid injection solves element targeting, hierarchical orchestration reduces LLM calls, and MutationObserver validation fights hallucination. Together, these cognitive-layer designs make agents more reliable on complex, dynamic pages.

While Camoufox focuses on engine-level anti-detection and agent-browser on execution speed and token efficiency, Agent-E focuses on the agent's intelligence: can it correctly understand a page, recover from errors, and efficiently use limited LLM context?

The next article covers AWS Agentic Form Filling, an enterprise-grade solution from Amazon's cloud ecosystem that uses episodic memory to let agents continuously learn and optimize navigation strategies across runs.
