Agent-E: DOM Distillation and Hierarchical Orchestration for Reliable Web Navigation

Introduction: The Cognitive Layer

The first six articles in this series all addressed the same problem: how to let AI control a browser. Nanobrowser via extensions, Steel via cloud API, agent-browser via CLI, Camoufox via engine-level modification — each solved "physical control" at a different layer.

Agent-E addresses a different problem: how AI understands what it sees and decides what to do next.

Even with unlimited browser control, an agent is useless if it cannot find the right information in an ocean of HTML, cannot reliably locate target elements, and cannot recover from errors.

Built on the AG2 (formerly AutoGen) multi-agent framework, Agent-E introduces three technologies for cognitive reliability: DOM distillation, mmid injection, and hierarchical execution orchestration.

Architecture

┌──────────────────────────────────────────────┐
│              Agent-E System                   │
│                                               │
│  ┌────────────────────────────────────────┐  │
│  │       AG2 Multi-Agent Framework        │  │
│  │                                        │  │
│  │  ┌──────────────────┐                  │  │
│  │  │ High-Level       │ ← Analyze task   │  │
│  │  │ Planner Agent    │   Decompose into │  │
│  │  │                  │   subtasks       │  │
│  │  └────────┬─────────┘                  │  │
│  │           │                            │  │
│  │  ┌────────▼─────────┐                  │  │
│  │  │ Browser Nav      │ ← Execute       │  │
│  │  │ Agent            │   interactions  │  │
│  │  │                  │   Read DOM state│  │
│  │  └────────┬─────────┘                  │  │
│  │           │                            │  │
│  │  ┌────────▼─────────┐                  │  │
│  │  │ Skill Registry   │ ← click, type,   │  │
│  │  │                  │   get_dom, etc.  │  │
│  │  └──────────────────┘                  │  │
│  └────────────────────────────────────────┘  │
│                                               │
│  ┌────────────────────────────────────────┐  │
│  │  Preprocessing Pipeline (DOM Distill)  │  │
│  │  ┌──────────┐ ┌──────────┐ ┌────────┐ │  │
│  │  │ Text     │ │ Input    │ │ All    │ │  │
│  │  │ only     │ │ fields   │ │ content│ │  │
│  │  └──────────┘ └──────────┘ └────────┘ │  │
│  └────────────────────────────────────────┘  │
│                                               │
│  ┌────────────────────────────────────────┐  │
│  │ Validation Layer (MutationObserver)     │  │
│  │ Monitor DOM changes → Confirm actions  │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘

DOM Distillation: Filter on Demand

Agent-E's core innovation is DOM distillation. Instead of passing raw DOM or a fixed A11y tree to the LLM, it dynamically cleanses and filters the DOM based on the current task objective.

Three Distillation Modes

Mode	Content Kept	Filtered Out	Best For
Text Only	Pure text content	Buttons, inputs, links, images	Reading articles, extracting info
Input Fields	Form elements (inputs, buttons, selects)	Text content, images, decorations	Form filling, login
All Content	All visible elements	Hidden elements, scripts, styles	Full page understanding
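
The three modes in the table can be pictured as three filters over the same element list. The sketch below is only an illustration under stated assumptions: the `Element` record, the mode names `text_only` / `input_fields` / `all_content`, and the tag groupings are hypothetical and are not taken from Agent-E's source.

```python
from dataclasses import dataclass

# Hypothetical tag groupings; Agent-E's real filter rules may differ.
FORM_TAGS = {"input", "button", "select", "textarea"}
INTERACTIVE_TAGS = FORM_TAGS | {"a"}
NON_VISUAL_TAGS = {"script", "style"}


@dataclass
class Element:
    tag: str
    text: str = ""
    hidden: bool = False


def distill(elements: list[Element], mode: str) -> list[Element]:
    """Return the subset of elements that the given mode keeps."""
    # Every mode drops hidden elements, scripts, and styles first.
    visible = [e for e in elements if not e.hidden and e.tag not in NON_VISUAL_TAGS]
    if mode == "text_only":
        # Keep plain text; drop buttons, inputs, links, and images.
        return [
            e for e in visible
            if e.tag not in INTERACTIVE_TAGS and e.tag != "img" and e.text
        ]
    if mode == "input_fields":
        # Keep only form controls.
        return [e for e in visible if e.tag in FORM_TAGS]
    if mode == "all_content":
        return visible
    raise ValueError(f"unknown mode: {mode}")


page = [
    Element("h1", "Checkout"),
    Element("p", "Enter your shipping details."),
    Element("input"),
    Element("button", "Submit Order"),
    Element("img"),
    Element("script", "track()"),
    Element("div", "debug", hidden=True),
]

text_tags = [e.tag for e in distill(page, "text_only")]      # h1, p
form_tags = [e.tag for e in distill(page, "input_fields")]   # input, button
all_tags = [e.tag for e in distill(page, "all_content")]     # everything visible
```

The point of the design is that the LLM only pays tokens for the slice of the page that matters to the current step: a login step never sees the article body, and a reading step never sees the form controls.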

mmid Element Targeting

Many web pages don't assign unique id attributes to elements. Agent-E injects custom mmid attributes during distillation:

<!-- Original element -->
<button class="btn-primary btn-lg">Submit Order</button>
 
<!-- After distillation (mmid injected) -->
<button class="btn-primary btn-lg" mmid="ae-127">Submit Order</button>

mmid is deterministic — the same page structure produces the same mmid, enabling reliable element referencing.
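
One simple way to get that determinism is to number elements in document order. The sketch below does this with Python's standard `html.parser`; it is an assumption-laden illustration, not Agent-E's injection code. The `MmidInjector` class and `inject_mmids` function are hypothetical names, the `ae-N` format simply mirrors the example above, and comments, doctypes, and script contents are not handled.

```python
from html import escape
from html.parser import HTMLParser


class MmidInjector(HTMLParser):
    """Re-emit HTML, adding mmid="ae-N" to each start tag in document order."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.counter = 0

    def _open_tag(self, tag, attrs, closing=""):
        # The counter depends only on element order, so the same
        # structure always yields the same identifiers.
        self.counter += 1
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.parts.append(f'<{tag}{rendered} mmid="ae-{self.counter}"{closing}>')

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.parts.append(escape(data, quote=False))


def inject_mmids(html: str) -> str:
    parser = MmidInjector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


snippet = (
    '<form><input name="q">'
    '<button class="btn-primary btn-lg">Submit Order</button></form>'
)
first = inject_mmids(snippet)
second = inject_mmids(snippet)  # same structure, so same identifiers
```

Order-based numbering has an obvious trade-off: if the page inserts an element near the top, every later identifier shifts, so identifiers are only safe to reuse within one distilled snapshot.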

Hierarchical Execution Orchestration

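Instead of a simple "observe-think-act" loop, Agent-E implements nested group chats: an outer planning conversation delegates each subtask to an inner conversation that executes it in the browser.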

Execution Planner

The Execution Planner is a distinct cognitive layer. It observes the distilled DOM, decides the current step, and predicts the steps that follow so they can be executed as one batch:

Traditional linear loop (one LLM call per step):
  Observe → Think → Click #1 → Observe → Think → Type → Observe → Think → Click #2
 
Agent-E batch mode (one call handles multiple steps):
  Observe → Think → [Click #1, Type text, Click #2] (batch execute)

By collapsing several LLM round trips into one, this batching reduces API latency by roughly 40-60% in complex form-filling scenarios.
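
Where a saving of that size can come from is easy to see with back-of-the-envelope arithmetic. The sketch below is a toy cost model, not a measurement: the function names are hypothetical, and the per-call and per-action timings are invented purely to make the arithmetic concrete.

```python
from math import ceil


def llm_calls_linear(num_actions: int) -> int:
    """Linear loop: one LLM call before every browser action."""
    return num_actions


def llm_calls_batched(num_actions: int, batch_size: int) -> int:
    """Batch mode: one LLM call plans up to batch_size actions."""
    return ceil(num_actions / batch_size)


def total_seconds(llm_calls: int, num_actions: int,
                  llm_s: float, action_s: float) -> float:
    """Total wall time = LLM round trips + time spent acting in the browser."""
    return llm_calls * llm_s + num_actions * action_s


# Assumed numbers: 3 actions, 2.0 s per LLM call, 1.0 s per browser action.
actions, llm_s, action_s = 3, 2.0, 1.0

linear = total_seconds(llm_calls_linear(actions), actions, llm_s, action_s)
batched = total_seconds(llm_calls_batched(actions, 3), actions, llm_s, action_s)
saving = 1 - batched / linear
```

With these made-up timings the linear loop takes 9.0 s and the batched run 5.0 s, a saving of about 44%. The real figure depends on how slow the model is relative to the browser and on how many steps can safely be predicted in advance.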

Communication Flow

User request
    │
    ▼
High-Level Planner Agent
    │  Analyze task → Decompose into subtasks
    │  Delegate subtasks to the Browser Nav Agent
    ▼
Browser Navigation Agent
    │  Read distilled DOM
    │  Call Execution Planner for batch operations
    ▼
Execution Planner
    │  Generate operation sequence: [click, type, click]
    │  Invoke specific skills from registry
    ▼
Skill Execution
    │  click_using_selector()
    │  enter_text_using_selector()
    │  get_dom_with_content_type()
    ▼
Validation Layer (MutationObserver)
    │  Listen for DOM changes
    │  Confirm operation effectiveness
    ▼
Feedback → High-Level Planner
    │  Success → Next step
    │  Failure → Adjust strategy
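
The lower half of this flow (operation sequence, skill lookup, feedback) can be sketched as a small registry plus a runner that stops at the first failed step. Everything here is a stand-in: the two skill functions borrow names from the flow above but are stubs with assumed signatures, and `registry`, `skill`, `run_sequence`, and the result dictionaries are hypothetical.

```python
from typing import Callable

SkillFn = Callable[..., bool]
registry: dict[str, SkillFn] = {}


def skill(fn: SkillFn) -> SkillFn:
    """Register a function under its own name so plans can refer to it."""
    registry[fn.__name__] = fn
    return fn


# Stub page state standing in for a real browser.
page_state: dict[str, str] = {}
CLICKABLE = {"[mmid='ae-127']"}


@skill
def enter_text_using_selector(selector: str, text: str) -> bool:
    page_state[selector] = text
    return True


@skill
def click_using_selector(selector: str) -> bool:
    # Pretend that only known selectors can be clicked.
    return selector in CLICKABLE


def run_sequence(ops: list[tuple[str, dict]]) -> dict:
    """Run operations in order; stop at the first failure and report it."""
    for index, (name, kwargs) in enumerate(ops):
        fn = registry.get(name)
        if fn is None or not fn(**kwargs):
            return {"status": "failure", "failed_step": index, "skill": name}
    return {"status": "success", "steps": len(ops)}


good = run_sequence([
    ("enter_text_using_selector", {"selector": "[mmid='ae-11']", "text": "John Smith"}),
    ("click_using_selector", {"selector": "[mmid='ae-127']"}),
])
bad = run_sequence([
    ("click_using_selector", {"selector": "[mmid='ae-999']"}),
    ("enter_text_using_selector", {"selector": "[mmid='ae-11']", "text": "never runs"}),
])
```

Stopping at the first failure matters in batch mode: later steps were planned against a page state that no longer holds, so the planner needs the failing index back before it re-plans.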

MutationObserver Validation: Fighting Hallucination

A common problem with AI agents is hallucination — the model believes it clicked a button when the click didn't actually take effect.

Agent-E uses JavaScript's MutationObserver to monitor DOM changes after actions, providing deterministic feedback:

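# Simplified MutationObserver logic: JavaScript source held in a Python
# string, to be evaluated in the page before an action is performed.
observer_code = """
// Reset the flag so mutations from an earlier action are not counted again.
window.__agent_e_last_action_effect = null;
const observer = new MutationObserver((mutations) => {
  mutations.forEach(m => {
    console.log('DOM changed:', m.type, m.target);
  });
  // Any recorded mutation marks the action as having had a visible effect.
  window.__agent_e_last_action_effect = 'observed';
});
// Watch the whole body for added or removed nodes and attribute changes.
observer.observe(document.body, {
  childList: true, subtree: true, attributes: true
});
"""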

If no DOM change is observed after a click, the agent treats the action as having had no effect and can retry or adjust its strategy.
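
On the Python side, the check amounts to "perform the action, then wait a bounded time for evidence of a mutation." The sketch below shows that control flow with the browser abstracted away: `action_had_effect` and `FakePage` are hypothetical, and `read_mutation_count` stands in for whatever call would read the observer's state from the page.

```python
import time
from typing import Callable


def action_had_effect(
    perform: Callable[[], None],
    read_mutation_count: Callable[[], int],
    timeout_s: float = 1.0,
    poll_s: float = 0.05,
) -> bool:
    """Perform an action, then poll until a DOM mutation is seen or time runs out."""
    before = read_mutation_count()
    perform()
    deadline = time.monotonic() + timeout_s
    while True:
        if read_mutation_count() > before:
            return True   # the page reacted to the action
        if time.monotonic() >= deadline:
            return False  # nothing changed: treat as "no effect"
        time.sleep(poll_s)


class FakePage:
    """Stand-in for a browser page that counts observed mutations."""

    def __init__(self) -> None:
        self.mutations = 0

    def live_click(self) -> None:
        self.mutations += 1  # a working button changes the DOM

    def dead_click(self) -> None:
        pass                 # a broken or covered button changes nothing


page = FakePage()
worked = action_had_effect(page.live_click, lambda: page.mutations, timeout_s=0.2)
no_effect = action_had_effect(page.dead_click, lambda: page.mutations, timeout_s=0.2)
```

A mutation is evidence that something happened, not proof that the right thing happened, so a check like this is best read as a cheap first filter before the planner inspects the new DOM.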

Installation

# Clone and setup
git clone <repository-url>
cd Agent-E
 
# Python environment with uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
 
# Configure
cp .env-example .env
# Edit .env with LLM API key and model name

Configuration

# .env
AUTOGEN_MODEL_NAME=gpt-4-turbo
AUTOGEN_MODEL_API_KEY=your-api-key
BROWSER_STORAGE_DIR=/path/to/chrome/profile
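
For readers wiring these settings into their own scripts, a small loader that fails early on a missing key is often worth having. This is a sketch rather than Agent-E's own loading code: `AgentConfig` and `load_config` are hypothetical names, and treating the model name and API key as required while the browser profile path is optional is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these two are mandatory; the profile directory is optional.
REQUIRED_KEYS = ("AUTOGEN_MODEL_NAME", "AUTOGEN_MODEL_API_KEY")


@dataclass(frozen=True)
class AgentConfig:
    model_name: str
    api_key: str
    browser_storage_dir: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read settings from an environment-like mapping, failing early if incomplete."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return AgentConfig(
        model_name=env["AUTOGEN_MODEL_NAME"],
        api_key=env["AUTOGEN_MODEL_API_KEY"],
        browser_storage_dir=env.get("BROWSER_STORAGE_DIR") or None,
    )


config = load_config({
    "AUTOGEN_MODEL_NAME": "gpt-4-turbo",
    "AUTOGEN_MODEL_API_KEY": "your-api-key",
})
```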

Run

python -m ae.main

Agent-E opens a browser with a chat interface icon. Click it to start conversing in natural language.

Practical Examples

Form Filling

Go to example.16yun.cn/checkout and fill in:
Name: John Smith
Phone: +1-555-0123
Address: 123 Main St, New York, NY 10001

Processing flow:

1. Planner analyzes → "Form filling task"
2. Nav Agent opens target page
3. DOM distillation → "Input Fields" mode
4. Execution Planner identifies form fields
5. Batch execute:
   → Fill name field
   → Fill phone field
   → Fill address field
6. MutationObserver validates each field
7. Submit form
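
Steps 4 and 5 of this flow boil down to pairing each piece of user data with a distilled input and emitting one batch. In Agent-E that pairing is the LLM's job; the sketch below substitutes a naive substring match purely to show the shape of the data. The `distilled_inputs` records, their mmid values, and the `plan_fill` function are all hypothetical.

```python
# Hypothetical output of "Input Fields" distillation for the checkout page.
distilled_inputs = [
    {"mmid": "ae-11", "label": "Full name"},
    {"mmid": "ae-12", "label": "Phone number"},
    {"mmid": "ae-13", "label": "Shipping address"},
]

form_data = {
    "name": "John Smith",
    "phone": "+1-555-0123",
    "address": "123 Main St, New York, NY 10001",
}


def plan_fill(inputs: list[dict], data: dict[str, str]):
    """Pair each data key with the first input whose label mentions it."""
    operations = []
    unmatched = []
    for key, value in data.items():
        target = next(
            (field for field in inputs if key.lower() in field["label"].lower()),
            None,
        )
        if target is None:
            unmatched.append(key)  # leave for the planner to resolve
            continue
        selector = f"[mmid='{target['mmid']}']"
        operations.append(("enter_text", selector, value))
    return operations, unmatched


batch, unmatched = plan_fill(distilled_inputs, form_data)
```

Returning the unmatched keys separately, instead of guessing, keeps the batch trustworthy: anything the matcher could not place goes back to the planner rather than into the wrong field.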

Search and Sort

Search example.16yun.cn for "wireless headphones", sort by price low to high

1. Planner → "Search + sort task"
2. Nav Agent opens example.16yun.cn
3. Distillation → "Input Fields" mode
4. Locate search box, enter keywords
5. Wait for results
6. Distillation → "All Content" mode
7. Locate sort dropdown, select "Price: Low to High"
8. Verify sort took effect
9. Extract result list
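
Step 8 ("verify sort took effect") is the kind of check that is cheap to do deterministically once the price labels have been extracted. A minimal sketch, assuming prices appear as text such as "$19.99"; the `parse_price` and `is_sorted_low_to_high` helpers are hypothetical, and the regular expression only covers simple formats with an optional thousands comma.

```python
import re


def parse_price(label: str) -> float:
    """Extract the first number from a price label such as '$1,299.00'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", label)
    if match is None:
        raise ValueError(f"no price found in {label!r}")
    return float(match.group().replace(",", ""))


def is_sorted_low_to_high(labels: list[str]) -> bool:
    """True if every price is less than or equal to the one after it."""
    prices = [parse_price(label) for label in labels]
    return all(a <= b for a, b in zip(prices, prices[1:]))


sorted_ok = is_sorted_low_to_high(["$19.99", "$24.50", "$1,299.00"])
sorted_bad = is_sorted_low_to_high(["$24.50", "$19.99"])
```

If the check fails, the agent has concrete evidence that the dropdown selection did not apply, which is a stronger signal than the presence or absence of a DOM mutation.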

Comparison

Dimension	Agent-E	agent-browser	Nanobrowser
Core method	DOM distillation + orchestration	A11y tree + Ref mapping	Multi-agent collaboration
DOM processing	Task-type dynamic filtering	Fixed A11y snapshot	Raw DOM + element classification
Element targeting	mmid + semantic queries	@e1, @e2 stable refs	DOM path + text match
Validation	MutationObserver	None built-in	Validator agent
Batching	Execution Planner batch	CLI batch commands	Single-step
LLM framework	AG2 (ex-AutoGen)	None (direct CDP)	Custom multi-agent
Setup complexity	High (Python + AG2)	Low (Rust binary)	Low (Chrome extension)

Limitations & Best For

Agent-E excels at:

High-reliability form filling (e-commerce checkout, signup)
Complex multi-step reasoning (price comparison, cross-site data)
LLM call budget-sensitive scenarios

Not suitable for:

High-concurrency data extraction (single-instance design)
Advanced WAF countermeasures (no anti-detection design)
Simple one-shot tasks (high setup overhead)

Summary

Agent-E demonstrates the next evolution of AI browser agents: from "how to control the browser" to "how to understand the page."

DOM distillation filters content on demand, mmid injection solves element targeting, hierarchical orchestration reduces LLM calls, and MutationObserver validation counters hallucination. Together, these cognitive-layer designs make agents more reliable on complex, dynamic pages.

While Camoufox focuses on engine-level anti-detection and agent-browser on execution speed and token efficiency, Agent-E focuses on the agent's intelligence: can it correctly understand a page, recover from errors, and make efficient use of limited LLM context?

The next article covers AWS Agentic Form Filling — an enterprise-grade solution from Amazon's cloud ecosystem that uses episodic memory to let agents continuously learn and optimize navigation strategies across runs.