Agent-E: DOM Distillation and Hierarchical Orchestration for Reliable Web Navigation
A browser automation system built on the AG2 multi-agent framework. DOM distillation filters page content by task type, hierarchical group chats orchestrate execution, and MutationObserver validation checks that actions actually took effect, guarding against LLM hallucination.
Introduction: The Cognitive Layer
The first six articles in this series all addressed the same problem: how to let AI control a browser. Nanobrowser via extensions, Steel via cloud API, agent-browser via CLI, Camoufox via engine-level modification — each solved "physical control" at a different layer.
Agent-E addresses a different problem: how AI understands what it sees and decides what to do next.
Even with unlimited browser control, an agent is useless if it cannot find the right information in an ocean of HTML, cannot reliably locate target elements, and cannot recover from errors.
Built on the AG2 (formerly AutoGen) multi-agent framework, Agent-E introduces three technologies for cognitive reliability: DOM distillation, mmid injection, and hierarchical execution orchestration.
Architecture
┌──────────────────────────────────────────────┐
│ Agent-E System │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ AG2 Multi-Agent Framework │ │
│ │ │ │
│ │ ┌──────────────────┐ │ │
│ │ │ High-Level │ ← Analyze task │ │
│ │ │ Planner Agent │ Decompose into │ │
│ │ │ │ subtasks │ │
│ │ └────────┬─────────┘ │ │
│ │ │ │ │
│ │ ┌────────▼─────────┐ │ │
│ │ │ Browser Nav │ ← Execute │ │
│ │ │ Agent │ interactions │ │
│ │ │ │ Read DOM state│ │
│ │ └────────┬─────────┘ │ │
│ │ │ │ │
│ │ ┌────────▼─────────┐ │ │
│ │ │ Skill Registry │ ← click, type, │ │
│ │ │ │ get_dom, etc. │ │
│ │ └──────────────────┘ │ │
│ └────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ Preprocessing Pipeline (DOM Distill) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌────────┐ │ │
│ │ │ Text │ │ Input │ │ All │ │ │
│ │ │ only │ │ fields │ │ content│ │ │
│ │ └──────────┘ └──────────┘ └────────┘ │ │
│ └────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ Validation Layer (MutationObserver) │ │
│ │ Monitor DOM changes → Confirm actions │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
DOM Distillation: Filter on Demand
Agent-E's core innovation is DOM distillation. Instead of passing raw DOM or a fixed A11y tree to the LLM, it dynamically cleanses and filters the DOM based on the current task objective.
Three Distillation Modes
| Mode | Content Kept | Filtered Out | Best For |
|---|---|---|---|
| Text Only | Pure text content | Buttons, inputs, links, images | Reading articles, extracting info |
| Input Fields | Form elements (inputs, buttons, selects) | Text content, images, decorations | Form filling, login |
| All Content | All visible elements | Hidden elements, scripts, styles | Full page understanding |
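As a rough illustration of the three modes in the table, the sketch below filters a simplified list of element records. The `Element` record, the tag groups, the mode strings, and the `distill` function are hypothetical names chosen for this example; they are not Agent-E's actual implementation, which works on the live DOM. Dropping hidden elements, scripts, and styles in every mode is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical tag groups; the real classification in Agent-E may differ.
INTERACTIVE_TAGS = {"input", "button", "select", "textarea"}
NON_TEXT_TAGS = INTERACTIVE_TAGS | {"a", "img"}
NEVER_KEPT_TAGS = {"script", "style"}


@dataclass
class Element:
    tag: str
    text: str = ""
    visible: bool = True


def distill(elements, mode):
    """Return the elements kept by the given distillation mode."""
    # Assumption: hidden elements, scripts, and styles are dropped in every mode.
    candidates = [
        e for e in elements if e.visible and e.tag not in NEVER_KEPT_TAGS
    ]
    if mode == "text_only":
        return [e for e in candidates if e.tag not in NON_TEXT_TAGS and e.text]
    if mode == "input_fields":
        return [e for e in candidates if e.tag in INTERACTIVE_TAGS]
    if mode == "all_content":
        return candidates
    raise ValueError(f"unknown mode: {mode}")


page = [
    Element("p", "Order summary"),
    Element("input"),
    Element("button", "Submit Order"),
    Element("script", "track()"),
    Element("div", "Hidden promo", visible=False),
]

for mode in ("text_only", "input_fields", "all_content"):
    print(mode, [e.tag for e in distill(page, mode)])
```

The point of the sketch is that the same page yields a much smaller payload for a reading task or a form task than for full-page understanding, which is where the token savings come from.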
mmid Element Targeting
Many web pages don't assign unique id attributes to elements. Agent-E injects custom mmid attributes during distillation:
<!-- Original element -->
<button class="btn-primary btn-lg">Submit Order</button>
<!-- After distillation (mmid injected) -->
<button class="btn-primary btn-lg" mmid="ae-127">Submit Order</button>
mmid is deterministic: the same page structure produces the same mmid, enabling reliable element referencing.
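One way to get that kind of determinism is to number elements in document order. The sketch below does this on a small well-formed snippet using Python's standard `xml.etree.ElementTree`. The `inject_mmid` function and the counter-based numbering are assumptions made for illustration; the source only shows the `ae-127` format, not how Agent-E actually derives the value.

```python
import xml.etree.ElementTree as ET


def inject_mmid(root, prefix="ae-"):
    """Assign an mmid attribute to every element in document order."""
    # Element.iter() walks the tree depth-first in document order, so the
    # numbering depends only on the structure of the tree.
    for index, element in enumerate(root.iter(), start=1):
        element.set("mmid", f"{prefix}{index}")
    return root


markup = (
    "<form><input name='q'/>"
    "<button class='btn-primary'>Submit Order</button></form>"
)

# Parsing the same markup twice yields the same identifiers.
first = inject_mmid(ET.fromstring(markup))
second = inject_mmid(ET.fromstring(markup))

ids_first = [el.get("mmid") for el in first.iter()]
ids_second = [el.get("mmid") for el in second.iter()]
print(ids_first)
print(ids_first == ids_second)
```

A counter like this is only stable while the page structure is unchanged; if elements are inserted earlier in the document, later identifiers shift, so identifiers would need to be refreshed after each distillation pass.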
Hierarchical Execution Orchestration
Instead of a simple "observe-think-act" loop, Agent-E implements nested group chats.
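The nesting can be pictured as two control loops, one inside the other. The plain-Python sketch below is only an analogy: it does not use any AG2 API, and `plan`, `navigate`, `run`, and `fake_perform` are hypothetical names. The outer loop plays the role of the planner, the inner loop the role of the navigation agent retrying a single subtask.

```python
def plan(task):
    """Hypothetical planner: split a task description into ordered subtasks."""
    return [step.strip() for step in task.split(",") if step.strip()]


def navigate(subtask, perform, max_attempts=2):
    """Inner loop: try one subtask until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if perform(subtask, attempt):
            return True
    return False


def run(task, perform):
    """Outer loop: hand each subtask to the inner loop and collect outcomes."""
    results = []
    for subtask in plan(task):
        ok = navigate(subtask, perform)
        results.append((subtask, ok))
        if not ok:
            break  # stop here so the planner can adjust its strategy
    return results


# Stand-in for real browser actions: "sort results" only works on the retry.
def fake_perform(subtask, attempt):
    return subtask != "sort results" or attempt == 2


outcome = run("open site, search headphones, sort results", fake_perform)
print(outcome)
```

Keeping retries inside the inner loop means the planner only sees the final outcome of each subtask, not every low-level attempt.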
Execution Planner
The Execution Planner is a distinct cognitive layer. It observes the distilled DOM, decides the current step, and predicts the steps that follow so they can be executed as a single batch:
Traditional linear loop (one LLM call per step):
Observe → Think → Click #1 → Observe → Think → Type → Observe → Think → Click #2
Agent-E batch mode (one call handles multiple steps):
Observe → Think → [Click #1, Type text, Click #2] (batch execute)
This batching reduces API latency by 40-60% in complex form-filling scenarios.
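The difference between the two loops can be sketched as a count of LLM round trips plus a batch executor that stops at the first failed action. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it counts calls rather than measuring latency, so it does not reproduce the percentage above.

```python
import math


def llm_calls_linear(num_actions):
    """Linear loop: one observe-think round trip per action."""
    return num_actions


def llm_calls_batched(num_actions, batch_size):
    """Batch mode: one round trip per batch of up to batch_size actions."""
    return math.ceil(num_actions / batch_size)


def execute_batch(actions, perform):
    """Run actions in order and stop at the first failure.

    Returns the number of actions that completed, so the caller can
    re-observe the page and re-plan from that point.
    """
    completed = 0
    for action in actions:
        if not perform(action):
            break
        completed += 1
    return completed


batch = ["click #1", "type text", "click #2"]
print(llm_calls_linear(len(batch)))
print(llm_calls_batched(len(batch), batch_size=3))
print(execute_batch(batch, lambda action: action != "type text"))
```

Stopping at the first failure matters because later actions in a predicted batch were planned against a page state that no longer holds once an earlier step fails.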
Communication Flow
User request
│
▼
High-Level Planner Agent
│ Analyze task → Decompose into subtasks
│ Allocate resources to Browser Nav Agent
▼
Browser Navigation Agent
│ Read distilled DOM
│ Call Execution Planner for batch operations
▼
Execution Planner
│ Generate operation sequence: [click, type, click]
│ Invoke specific skills from registry
▼
Skill Execution
│ click_using_selector()
│ enter_text_using_selector()
│ get_dom_with_content_type()
▼
Validation Layer (MutationObserver)
│ Listen for DOM changes
│ Confirm operation effectiveness
▼
Feedback → High-Level Planner
│ Success → Next step
│ Failure → Adjust strategy
MutationObserver Validation: Fighting Hallucination
A common problem with AI agents is hallucination — the model believes it clicked a button when the click didn't actually take effect.
Agent-E uses JavaScript's MutationObserver to monitor DOM changes after actions, providing deterministic feedback:
# Simplified MutationObserver logic
# JavaScript source, held as a Python string, to be run inside the page.
observer_code = """
const observer = new MutationObserver((mutations) => {
  mutations.forEach(m => {
    console.log('DOM changed:', m.type, m.target);
  });
  // Flag that at least one DOM change followed the last action.
  window.__agent_e_last_action_effect = 'observed';
});
observer.observe(document.body, {
  childList: true, subtree: true, attributes: true
});
"""
If no DOM change occurs after a click, the agent can treat the action as having had no effect and retry or adjust.
Installation
# Clone and setup
git clone <repository-url>
cd Agent-E
# Python environment with uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
# Configure
cp .env-example .env
# Edit .env with LLM API key and model name
Configuration
# .env
AUTOGEN_MODEL_NAME=gpt-4-turbo
AUTOGEN_MODEL_API_KEY=your-api-key
BROWSER_STORAGE_DIR=/path/to/chrome/profile
Run
python -m ae.main
Agent-E opens a browser with a chat interface icon. Click it to start conversing in natural language.
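Before launching, it can save a failed start to check that the `.env` values are filled in. The sketch below parses `.env`-style text and reports blank or missing entries. The `parse_env` and `missing_settings` helpers are hypothetical, and treating the model name and API key as required is an assumption; the variable names themselves come from the configuration shown above.

```python
# Assumption: these two settings are needed for any run.
REQUIRED_SETTINGS = ("AUTOGEN_MODEL_NAME", "AUTOGEN_MODEL_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_settings(settings):
    """Return the required names that are absent or blank."""
    return [name for name in REQUIRED_SETTINGS if not settings.get(name)]


sample = """
# .env
AUTOGEN_MODEL_NAME=gpt-4-turbo
AUTOGEN_MODEL_API_KEY=
BROWSER_STORAGE_DIR=/path/to/chrome/profile
"""

print(missing_settings(parse_env(sample)))
```

To check a real file, the same functions can be applied to the contents of `.env` read from disk.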
Practical Examples
Form Filling
Go to example.16yun.cn/checkout and fill in:
Name: John Smith
Phone: +1-555-0123
Address: 123 Main St, New York, NY 10001
Processing flow:
1. Planner analyzes → "Form filling task"
2. Nav Agent opens target page
3. DOM distillation → "Input Fields" mode
4. Execution Planner identifies form fields
5. Batch execute:
→ Fill name field
→ Fill phone field
→ Fill address field
6. MutationObserver validates each field
7. Submit form
Search and Sort
Search example.16yun.cn for "wireless headphones", sort by price low to high
1. Planner → "Search + sort task"
2. Nav Agent opens example.16yun.cn
3. Distillation → "Input Fields" mode
4. Locate search box, enter keywords
5. Wait for results
6. Distillation → "All Content" mode
7. Locate sort dropdown, select "Price: Low to High"
8. Verify sort took effect
9. Extract result list
Comparison
| Dimension | Agent-E | agent-browser | Nanobrowser |
|---|---|---|---|
| Core method | DOM distillation + orchestration | A11y tree + Ref mapping | Multi-agent collaboration |
| DOM processing | Task-type dynamic filtering | Fixed A11y snapshot | Raw DOM + element classification |
| Element targeting | mmid + semantic queries | @e1, @e2 stable refs | DOM path + text match |
| Validation | MutationObserver | None built-in | Validator agent |
| Batching | Execution Planner batch | CLI batch commands | Single-step |
| LLM framework | AG2 (ex-AutoGen) | None (direct CDP) | Custom multi-agent |
| Setup complexity | High (Python + AG2) | Low (Rust binary) | Low (Chrome extension) |
Limitations & Best For
Agent-E excels at:
- High-reliability form filling (e-commerce checkout, signup)
- Complex multi-step reasoning (price comparison, cross-site data)
- Scenarios where the LLM call budget is tight
Not suitable for:
- High-concurrency data extraction (single-instance design)
- Advanced WAF countermeasures (no anti-detection design)
- Simple one-shot tasks (high setup overhead)
Summary
Agent-E demonstrates the next evolution of AI browser agents: from "how to control the browser" to "how to understand the page."
DOM distillation filters content on demand. mmid injection solves element targeting. Hierarchical orchestration reduces LLM calls. MutationObserver fights hallucination — these cognitive-layer designs make agents more reliable on complex, dynamic pages.
While Camoufox focuses on engine-level anti-detection and agent-browser on execution speed and token efficiency, Agent-E focuses on the agent's intelligence — can it correctly understand a page, recover from errors, and efficiently use limited LLM context?
The next article covers AWS Agentic Form Filling — an enterprise-grade solution from Amazon's cloud ecosystem that uses episodic memory to let agents continuously learn and optimize navigation strategies across runs.