Nanobrowser Source Code: Markdown and Readability Extraction Pipelines

Introduction: How a Browser Extension Extracts Data

Nanobrowser is a Chrome extension that runs inside the user's browser and controls the active tab through Puppeteer over the Chrome DevTools Protocol (CDP). When the agent needs to read information from a page, it does not feed raw HTML directly to the LLM. Instead, it uses built-in extraction pipelines that transform page content into formats the model can consume efficiently.

Both pipelines live in browser/dom/service.ts and take fewer than 40 lines of core code: getMarkdownContent and getReadabilityContent.

Pipeline 1: turn2Markdown (Full Conversion)

Source Location

browser/dom/service.ts:44-61

export async function getMarkdownContent(
  tabId: number, selector?: string
): Promise<string> {
  const results = await chrome.scripting.executeScript({
    target: { tabId: tabId },
    func: sel => {
      return window.turn2Markdown(sel);
    },
    args: [selector || ''],
  });
 
  const result = results[0]?.result;
  if (!result) {
    throw new Error('Failed to get markdown content');
  }
  return result as string;
}

The function injects code into the target page's JavaScript context via chrome.scripting.executeScript, calling window.turn2Markdown(selector).
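One detail in the result handling is worth spelling out: `!result` is a truthiness check, so if the injected function were to return an empty string, it would be treated the same as a missing result. The sketch below is a hypothetical helper, not part of Nanobrowser, that separates the two cases. `FrameResult` is a simplified stand-in for the entries that executeScript resolves to.

```typescript
// Simplified stand-in for one entry of the array executeScript resolves to.
interface FrameResult<T> {
  frameId: number;
  result?: T | null;
}

// Hypothetical helper: return the first frame's result, failing only when it
// is actually absent (undefined or null), not merely falsy.
function unwrapFirstResult<T>(results: FrameResult<T>[], what: string): T {
  const first = results[0];
  if (first === undefined || first.result === undefined || first.result === null) {
    throw new Error(`Failed to get ${what}`);
  }
  return first.result;
}
```

Whether an empty string should count as a failure is a design choice; the original code treats it as one.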

turn2Markdown is a global function that Nanobrowser injects during initialization. It wraps the turndown library — a mature open-source tool for converting HTML to Markdown.

Key design points:

Optional CSS selector: the selector parameter limits conversion to a specific page region (e.g., #product-description). This is useful for focused scraping tasks.
Executed via chrome.scripting: this is not a regular function call. executeScript runs the function inside the tab (in an isolated world by default, since no world option is set here) and serializes the result back. turn2Markdown must therefore already be defined on the window object of the world where the function runs.
Simple error handling: the function throws on failure and has no retry logic.
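The article does not show the body of turn2Markdown, so the following is only a sketch of what a selector-scoped wrapper could look like. `makeTurn2Markdown`, `DocLike`, `ElementLike`, and `HtmlToMarkdown` are hypothetical names, and throwing on an unmatched selector is an assumption. In the extension, `convert` would presumably be backed by turndown and `doc` would be the page's `document`.

```typescript
// Minimal shapes so the sketch runs without a browser; real code would use
// the DOM's Document and Element types.
interface ElementLike {
  innerHTML: string;
}
interface DocLike {
  body: ElementLike;
  querySelector(selector: string): ElementLike | null;
}

type HtmlToMarkdown = (html: string) => string;

// Build a turn2Markdown-style function: convert the region matched by
// `selector`, or the whole body when the selector is empty.
function makeTurn2Markdown(doc: DocLike, convert: HtmlToMarkdown) {
  return (selector = ''): string => {
    const root = selector ? doc.querySelector(selector) : doc.body;
    if (!root) {
      throw new Error(`No element matches selector: ${selector}`);
    }
    return convert(root.innerHTML);
  };
}
```

In a page context this might be wired up as `window.turn2Markdown = makeTurn2Markdown(document, html => new TurndownService().turndown(html))`, assuming turndown is bundled into the injected script.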

Flow

Nanobrowser background
  │
  ├── chrome.scripting.executeScript({
  │     target: { tabId },
  │     func: (sel) => window.turn2Markdown(sel),
  │     args: ['#product-info']
  │   })
  │
  ▼
Target page context
  │
  ├── window.turn2Markdown('#product-info')
  │     │
  │     ├── Locate #product-info element
  │     ├── Get innerHTML
  │     ├── turndown converts HTML → Markdown
  │     └── Return Markdown string
  │
  ▼
Returns to Nanobrowser → LLM consumes

Use Cases

Scenario	Good For	Not Good For
Full page extraction	Simple structured documents	Pages with lots of irrelevant content
Region extraction	Product detail, article with known container	Content spread across multiple regions
Tabular data	Simple tables	Complex nested tables
Mixed content	Text + image pages	Dynamic/lazy-loaded content

Pipeline 2: parserReadability (Article Extraction)

Source Location

browser/dom/service.ts:63-81

export interface ReadabilityResult {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
  publishedTime: string;
}
 
export async function getReadabilityContent(
  tabId: number
): Promise<ReadabilityResult> {
  const results = await chrome.scripting.executeScript({
    target: { tabId },
    func: () => {
      return window.parserReadability();
    },
  });
  const result = results[0]?.result;
  if (!result) {
    throw new Error('Failed to get readability content');
  }
  return result as ReadabilityResult;
}

parserReadability wraps Mozilla's Readability library, the same algorithm behind Firefox's Reader Mode. It analyzes the DOM, finds the most likely article body, and strips navigation, sidebars, and ads.

ReadabilityResult Fields

Field	Type	Purpose	Scraping Use
`title`	string	Article title	Collected title
`content`	string	Article HTML	Raw content for further processing
`textContent`	string	Plain text	Direct LLM input
`length`	number	Article length in characters	Token estimation
`excerpt`	string	Summary	Quick preview
`byline`	string	Author	Metadata
`siteName`	string	Site name	Source attribution
`publishedTime`	string	Publish date	Temporal dimension
`lang`	string	Language	Model selection
`dir`	string	Text direction	Layout handling
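As an illustration of how these fields might be combined, the sketch below builds a compact text block for an LLM prompt and gives a rough token estimate. `ArticleFields`, `toPromptText`, and `estimateTokens` are hypothetical names, and the four-characters-per-token ratio is a common rule of thumb for English text, not something Nanobrowser defines. Metadata is treated as optional here on the assumption that some pages do not expose it.

```typescript
// Subset of ReadabilityResult used by this sketch. Metadata is optional
// because many pages do not expose an author or publish date.
interface ArticleFields {
  title: string;
  textContent: string;
  length: number; // article length in characters
  byline?: string | null;
  publishedTime?: string | null;
}

// Rough token estimate: about 4 characters per token (rule of thumb).
function estimateTokens(article: ArticleFields): number {
  return Math.ceil(article.length / 4);
}

// Build a header from the available metadata, followed by the plain-text body.
function toPromptText(article: ArticleFields): string {
  const header = [
    `Title: ${article.title}`,
    article.byline ? `Author: ${article.byline}` : null,
    article.publishedTime ? `Published: ${article.publishedTime}` : null,
  ].filter((line): line is string => line !== null);
  return `${header.join('\n')}\n\n${article.textContent.trim()}`;
}
```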

Readability Scoring Algorithm (Simplified)

1. Scan page for text-dense elements (<p>, <pre>, <td>)
2. Score each candidate container:
   - Class contains "article", "post", "content" → bonus
   - Class contains "sidebar", "comment", "ad" → penalty
   - Has <p> children → bonus
   - High link density → penalty
3. Select highest-scoring container
4. Clean: remove scripts, styles, ads
5. Extract metadata: title, author, date
6. Return structured ReadabilityResult
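The scoring steps above can be sketched as a small function. This is a toy model: the keyword lists, the weights, and the `Candidate` shape are assumptions chosen for illustration and do not reproduce Readability's actual heuristics.

```typescript
// Hypothetical summary of one candidate container.
interface Candidate {
  className: string;
  paragraphCount: number; // number of <p> children
  textLength: number;     // characters of text inside the container
  linkTextLength: number; // characters of that text inside <a> elements
}

// Illustrative keyword lists; not Readability's real patterns.
const POSITIVE = /article|post|content/i;
const NEGATIVE = /sidebar|comment|\bads?\b/i;

function scoreCandidate(c: Candidate): number {
  let score = 0;
  if (POSITIVE.test(c.className)) score += 25; // class-name bonus
  if (NEGATIVE.test(c.className)) score -= 25; // class-name penalty
  score += c.paragraphCount;                   // one point per paragraph
  const linkDensity = c.textLength > 0 ? c.linkTextLength / c.textLength : 0;
  score -= linkDensity * 25;                   // link-heavy containers lose points
  return score;
}

// Return the highest-scoring candidate, or undefined for an empty list.
function pickBest(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const s = scoreCandidate(c);
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  return best;
}
```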

Pipeline Comparison

Aspect	getMarkdownContent	getReadabilityContent
Library	turndown	Mozilla Readability
Output	Markdown string	Structured object (HTML + text + metadata)
Content filtering	By CSS selector region	Algorithmic article identification
Best for	Any page	Article pages (blogs, news, docs)
Not for	Dynamic content	Pages without article structure
Metadata	None	Title, author, date, site name
Token efficiency	Moderate (about 40% reduction vs. raw HTML, varies by page)	High (60-80% reduction for articles)
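The reduction figures in the table are the article's estimates and will vary by page. One way to check them on your own pages is to compare sizes before and after extraction. The helper below is a hypothetical sketch that uses character counts as a rough proxy for tokens.

```typescript
// Percentage by which `extracted` is smaller than `rawHtml`, by character count.
// Character counts are only a proxy; real token counts depend on the tokenizer.
function reductionPercent(rawHtml: string, extracted: string): number {
  if (rawHtml.length === 0) return 0;
  return (1 - extracted.length / rawHtml.length) * 100;
}
```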

Injection Mechanism

Both pipelines depend on functions available in the page's global scope. These are injected during extension initialization via content scripts or other mechanisms.

The key is the chrome.scripting.executeScript API:

// Method 1: Inline function (used here)
chrome.scripting.executeScript({
  target: { tabId },
  func: () => window.parserReadability(),
})
 
// Method 2: From file (used for buildDomTree)
chrome.scripting.executeScript({
  target: { tabId },
  files: ['buildDomTree.js'],
})

The func parameter is serialized to a string and deserialized in the target tab. Closure variables and external imports are unavailable, so every dependency must either be passed through args or already exist on the window object of the world where the function runs. That is why turn2Markdown and parserReadability are injected as global functions.
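A small simulation makes the consequence concrete. The sketch below does not call any Chrome API: `runSerialized` is a hypothetical stand-in that mimics serialization by converting a function to source text and rebuilding it in a fresh scope. The function that relies on a closure variable fails after the round trip, while the one that receives everything through arguments still works.

```typescript
// Simulate serialization: turn the function into source text, then rebuild it
// in a fresh scope, so it no longer sees its original closure.
function runSerialized(fn: (...args: any[]) => unknown, args: unknown[]): unknown {
  const source = fn.toString();
  const rebuilt = new Function(`return (${source});`)() as (...a: unknown[]) => unknown;
  return rebuilt(...args);
}

// Relies on a closure variable: breaks after serialization.
function makeClosureVersion(): (text: string) => string {
  const localSuffix = '!';
  return (text: string) => text + localSuffix;
}

// Receives everything through arguments: survives serialization.
const argsVersion = (text: string, suffix: string): string => text + suffix;
```

Calling `runSerialized(argsVersion, ['hi', '!'])` returns the concatenated string, whereas `runSerialized(makeClosureVersion(), ['hi'])` throws because `localSuffix` no longer exists in the rebuilt function's scope.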

Lessons for Scraper Developers

When to Use Markdown Pipeline

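# Scraping pseudocode (Playwright-style API): full page → Markdown
# Assumes a turn2Markdown helper has already been injected into the page;
# it is not a browser built-in.
page = await browser.new_page()
await page.goto("https://example.16yun.cn/products")
markdown = await page.evaluate("window.turn2Markdown()")
# Roughly 40% fewer tokens than raw HTML for the LLM (varies by page)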

When to Use Readability Pipeline

# Scraping pseudocode (Playwright-style API): article → structured result
# Assumes a parserReadability helper has already been injected into the page.
page = await browser.new_page()
await page.goto("https://example.16yun.cn/blog/article")
result = await page.evaluate("window.parserReadability()")
# result["textContent"] → plain text for the LLM
# result["content"] → HTML with structure preserved
# result["title"], result["byline"], result["publishedTime"] → metadata

Selection Decision Tree

[Diagram: selection decision tree, not rendered in this copy]
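The selection logic described in this article can be sketched as a small function. `choosePipeline` and `PageTraits` are hypothetical names, and the branch order is one reasonable reading of the comparison and use-case tables rather than anything defined by Nanobrowser.

```typescript
type Pipeline = 'markdown' | 'readability' | 'custom';

// Hypothetical description of the page being scraped.
interface PageTraits {
  isArticle: boolean;           // blog post, news story, documentation page
  hasKnownContainer: boolean;   // a CSS selector for the target region is known
  needsDynamicContent: boolean; // lazy-loaded data, comments, and similar
}

function choosePipeline(page: PageTraits): Pipeline {
  // Neither built-in pipeline handles dynamic content or comments.
  if (page.needsDynamicContent) return 'custom';
  // Articles without a known container benefit from algorithmic extraction.
  if (page.isArticle && !page.hasKnownContainer) return 'readability';
  // Everything else: full-page or selector-scoped Markdown conversion.
  return 'markdown';
}
```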

Summary

getMarkdownContent and getReadabilityContent are Nanobrowser's two content extraction pipelines for AI agents. Their implementations are under 40 lines, backed by turndown and Mozilla Readability.

For scraper developers: full extraction → Markdown. Article extraction → Readability. Neither works for dynamic content or comments — that's when you need a custom third pipeline.

The next article analyzes Nanobrowser's clickable element detection system — how the agent knows what can be clicked, how it distinguishes interactive from static elements, and how it uses hash-based deduplication to avoid redundant operations.
