Nanobrowser Source Code: Markdown Extraction and Readability — Two Data Pipelines for Scraping
Inside Nanobrowser: turndown-based getMarkdownContent and Mozilla Readability-based getReadabilityContent. How they inject code, execute in page context, and what each pipeline is good for.
Introduction: How a Browser Extension Extracts Data
Nanobrowser is a Chrome extension that runs inside the user's browser and controls the active tab through Puppeteer over the Chrome DevTools Protocol (CDP). When the agent needs to read information from a page, it doesn't feed raw HTML directly to the LLM. Instead, it uses built-in extraction pipelines that transform page content into formats the AI can consume efficiently.
Both pipelines live in browser/dom/service.ts — fewer than 40 lines of core code. They are getMarkdownContent and getReadabilityContent.
Pipeline 1: turn2Markdown (Full Conversion)
Source Location
browser/dom/service.ts:44-61
export async function getMarkdownContent(
  tabId: number, selector?: string
): Promise<string> {
  const results = await chrome.scripting.executeScript({
    target: { tabId: tabId },
    func: sel => {
      return window.turn2Markdown(sel);
    },
    args: [selector || ''],
  });
  const result = results[0]?.result;
  if (!result) {
    throw new Error('Failed to get markdown content');
  }
  return result as string;
}
The function injects code into the target page's JavaScript context via chrome.scripting.executeScript, calling window.turn2Markdown(selector).
turn2Markdown is a global function that Nanobrowser injects during initialization. It wraps the turndown library — a mature open-source tool for converting HTML to Markdown.
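As a rough sketch of what such a wrapper might look like, the factory below picks a root element (optionally by selector), reads its HTML, and hands it to a converter. The names `makeTurn2Markdown`, `DocumentLike`, and `HtmlToMarkdown` are hypothetical, and the behavior on a missing element is an assumption; Nanobrowser's actual helper may differ. In a real page, `convert` would typically be a turndown call.

```typescript
// Minimal structural types so the sketch runs without a real DOM.
interface ElementLike {
  innerHTML: string;
}

interface DocumentLike {
  body: ElementLike;
  querySelector(selector: string): ElementLike | null;
}

// Any HTML-to-Markdown converter; in a page this could be
// (html) => new TurndownService().turndown(html).
type HtmlToMarkdown = (html: string) => string;

// Hypothetical factory: returns a function shaped like turn2Markdown(selector).
function makeTurn2Markdown(doc: DocumentLike, convert: HtmlToMarkdown) {
  return (selector = ""): string => {
    // An empty selector means "convert the whole body".
    const root = selector ? doc.querySelector(selector) : doc.body;
    if (!root) {
      // Assumption: fail loudly when the selector matches nothing.
      throw new Error(`No element matches selector: ${selector}`);
    }
    return convert(root.innerHTML);
  };
}

// Usage with a fake document and a trivial stand-in converter.
const fakeDoc: DocumentLike = {
  body: { innerHTML: '<nav>Menu</nav><div id="main"><p>Hello</p></div>' },
  querySelector: (sel) => (sel === "#main" ? { innerHTML: "<p>Hello</p>" } : null),
};
const stripTags: HtmlToMarkdown = (html) => html.replace(/<[^>]+>/g, "").trim();
const sketchTurn2Markdown = makeTurn2Markdown(fakeDoc, stripTags);
```

Passing the document and converter in as parameters keeps the sketch testable outside a browser; the selector-scoped call returns only the region's content, while the no-argument call converts everything, navigation included.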
Key design points:
- Optional CSS selector: the selector parameter limits conversion to a specific page region (e.g., #product-description). Useful for focused scraping tasks.
- Executed via chrome.scripting: this is not a regular function call. executeScript runs the code inside the target tab (by default in the extension's isolated world) and serializes the result back, so turn2Markdown must already be defined on window in the context where the code runs.
- Simple error handling: throw on failure, no retry logic.
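Because the function throws on the first failure, any retry policy is left to the caller. A caller-side wrapper could look like the sketch below; `withRetry` and its parameters are hypothetical and not part of Nanobrowser.

```typescript
// Hypothetical helper: retry an async task a fixed number of times.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  delayMs = 200
): Promise<T> {
  let lastError: unknown = new Error("withRetry: attempts must be at least 1");
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the last one.
      if (i < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}
```

In an extension this would wrap the extraction call, for example `withRetry(() => getMarkdownContent(tabId, '#product-description'))`, which gives a page that is still rendering a second chance before the error propagates.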
Flow
Nanobrowser background
│
├── chrome.scripting.executeScript({
│ target: { tabId },
│ func: (sel) => window.turn2Markdown(sel),
│ args: ['#product-info']
│ })
│
▼
Target page context
│
├── window.turn2Markdown('#product-info')
│ │
│ ├── Locate #product-info element
│ ├── Get innerHTML
│ ├── turndown converts HTML → Markdown
│ └── Return Markdown string
│
▼
Returns to Nanobrowser → LLM consumes
Use Cases
| Scenario | Good For | Not Good For |
|---|---|---|
| Full page extraction | Simple structured documents | Pages with lots of irrelevant content |
| Region extraction | Product detail, article with known container | Content spread across multiple regions |
| Tabular data | Simple tables | Complex nested tables |
| Mixed content | Text + image pages | Dynamic/lazy-loaded content |
Pipeline 2: parserReadability (Article Extraction)
Source Location
browser/dom/service.ts:63-81
export interface ReadabilityResult {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
  publishedTime: string;
}
export async function getReadabilityContent(
  tabId: number
): Promise<ReadabilityResult> {
  const results = await chrome.scripting.executeScript({
    target: { tabId },
    func: () => {
      return window.parserReadability();
    },
  });
  const result = results[0]?.result;
  if (!result) {
    throw new Error('Failed to get readability content');
  }
  return result as ReadabilityResult;
}
parserReadability wraps Mozilla's Readability library, the same algorithm behind Firefox's Reader Mode. It analyzes the DOM, finds the most likely article body, and strips navigation, sidebars, and ads.
ReadabilityResult Fields
| Field | Type | Purpose | Scraping Use |
|---|---|---|---|
| title | string | Article title | Collected title |
| content | string | Article HTML | Raw content for further processing |
| textContent | string | Plain text | Direct LLM input |
| length | number | Content length | Token estimation |
| excerpt | string | Summary | Quick preview |
| byline | string | Author | Metadata |
| siteName | string | Site name | Source attribution |
| publishedTime | string | Publish date | Temporal dimension |
| lang | string | Language | Model selection |
| dir | string | Text direction | Layout handling |
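To illustrate how these fields might feed an LLM, the sketch below builds a compact text block from a result and derives a rough token estimate from `length`. The helper name `toLlmInput`, the `LlmInput` shape, and the four-characters-per-token ratio are assumptions for illustration, not something Nanobrowser defines.

```typescript
// Mirrors the ReadabilityResult interface shown earlier.
interface ReadabilityResult {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
  publishedTime: string;
}

interface LlmInput {
  text: string;
  estimatedTokens: number;
}

// Hypothetical helper: metadata header plus plain text, skipping empty fields.
function toLlmInput(result: ReadabilityResult): LlmInput {
  const fields: Array<[string, string]> = [
    ["Title", result.title],
    ["Author", result.byline],
    ["Site", result.siteName],
    ["Published", result.publishedTime],
  ];
  const header = fields
    .filter(([, value]) => value.trim() !== "")
    .map(([label, value]) => `${label}: ${value}`)
    .join("\n");

  const body = result.textContent.trim();
  // Assumption: roughly 4 characters per token for English text.
  const estimatedTokens = Math.ceil(result.length / 4);
  return { text: header ? `${header}\n\n${body}` : body, estimatedTokens };
}
```

Using `textContent` rather than `content` keeps HTML markup out of the prompt, and the estimate lets a caller decide whether to truncate before sending.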
Readability Scoring Algorithm (Simplified)
1. Scan page for text-dense elements (<p>, <pre>, <td>)
2. Score each candidate container:
- Class contains "article", "post", "content" → bonus
- Class contains "sidebar", "comment", "ad" → penalty
- Has <p> children → bonus
- High link density → penalty
3. Select highest-scoring container
4. Clean: remove scripts, styles, ads
5. Extract metadata: title, author, date
6. Return structured ReadabilityResult
Pipeline Comparison
| Aspect | getMarkdownContent | getReadabilityContent |
|---|---|---|
| Library | turndown | Mozilla Readability |
| Output | Markdown string | Structured object (HTML + text + metadata) |
| Content filtering | By CSS selector region | Algorithmic article identification |
| Best for | Any page | Article pages (blogs, news, docs) |
| Not for | Dynamic content | Pages without article structure |
| Metadata | None | Title, author, date, site name |
| Token efficiency | Moderate (roughly 40% fewer tokens than raw HTML) | High (60-80% reduction for articles) |
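The "algorithmic article identification" entry in the table refers to the scoring steps listed earlier. A toy version of that heuristic is sketched below; the weights, the class-name patterns, and the `Candidate` shape are illustrative assumptions and do not reproduce Mozilla Readability's actual implementation.

```typescript
// Simplified stand-in for a DOM container under consideration.
interface Candidate {
  className: string;
  paragraphCount: number; // number of <p> children
  textLength: number; // total characters of text
  linkTextLength: number; // characters of text inside <a> elements
}

const POSITIVE_HINT = /article|post|content/i;
const NEGATIVE_HINT = /sidebar|comment|\bad\b/i;

// Illustrative weights only; Readability's real scoring is more involved.
function scoreCandidate(c: Candidate): number {
  let score = 0;
  if (POSITIVE_HINT.test(c.className)) score += 25;
  if (NEGATIVE_HINT.test(c.className)) score -= 25;
  score += c.paragraphCount * 5;
  // Link density: share of the text that sits inside links.
  const linkDensity = c.textLength > 0 ? c.linkTextLength / c.textLength : 0;
  // Menus and link lists have high density and are penalized.
  score -= linkDensity * 50;
  return score;
}

// Returns the highest-scoring candidate, or undefined for an empty list.
function pickBest(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const s = scoreCandidate(c);
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  return best;
}
```

A container with an article-like class name, many paragraphs, and few links wins over a link-heavy sidebar, which is the behavior that makes this pipeline unsuitable for pages without a dominant article body.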
Injection Mechanism
Both pipelines depend on functions available in the page's global scope. These are injected during extension initialization via content scripts or other mechanisms.
The key is the chrome.scripting.executeScript API:
// Method 1: Inline function (used here)
chrome.scripting.executeScript({
  target: { tabId },
  func: () => window.parserReadability(),
})
// Method 2: From file (used for buildDomTree)
chrome.scripting.executeScript({
  target: { tabId },
  files: ['buildDomTree.js'],
})
The func parameter is serialized to a string and deserialized in the target page. Closure variables and external imports are unavailable; all dependencies must exist on the page's window scope. Hence turn2Markdown and parserReadability are injected as global functions.
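The effect of that serialization can be reproduced outside the browser. The sketch below is a simulation, not the Chrome API: `simulateInjection` is a hypothetical helper that rebuilds a function from its source text and JSON-clones its arguments, which roughly mirrors the constraints that `func` and `args` are subject to.

```typescript
// Hypothetical simulation of how executeScript treats func and args.
function simulateInjection<A extends unknown[], R>(
  func: (...args: A) => R,
  args: A
): R {
  // Rebuilding from source text drops every captured (closure) variable.
  const revived = new Function(`return (${func.toString()});`)() as (...a: A) => R;
  // Arguments cross the boundary as serialized data, not live references.
  const cloned = JSON.parse(JSON.stringify(args)) as A;
  return revived(...cloned);
}

// Works: everything the function needs arrives through its parameters.
const joinFromArgs = (prefix: string, id: string): string => prefix + id;

// Breaks after injection: `prefix` lives only in the enclosing closure.
function makeClosureBased(): (id: string) => string {
  const prefix = "#";
  return (id: string): string => prefix + id;
}
```

Calling `simulateInjection(joinFromArgs, ["#", "main"])` succeeds, while the function returned by `makeClosureBased()` fails once revived because `prefix` no longer exists. This is why `getMarkdownContent` passes the selector through `args` rather than capturing it.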
Lessons for Scraper Developers
When to Use Markdown Pipeline
# Scraping pseudocode: full page → Markdown
# Assumes a turn2Markdown helper has already been injected into the page
page = await browser.new_page()
await page.goto("https://example.16yun.cn/products")
markdown = await page.evaluate("window.turn2Markdown()")
# ~40% fewer tokens than raw HTML for the LLM
When to Use Readability Pipeline
# Scraping pseudocode: article → structured result
page = await browser.new_page()
await page.goto("https://example.16yun.cn/blog/article")
result = await page.evaluate("window.parserReadability()")
# textContent → plain text for LLM
# content → HTML with structure preserved
# title, byline, publishedTime → metadata
Selection Decision Tree
[Figure: selection decision tree]
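One way to express the selection logic, based on the comparison table above, is a small function. The `PageTraits` flags and the `"custom"` outcome are hypothetical names introduced here; how a caller detects each trait is left open.

```typescript
type Pipeline = "readability" | "markdown-region" | "markdown-full" | "custom";

// Hypothetical description of the page and task at hand.
interface PageTraits {
  isDynamic: boolean; // content is lazy-loaded or rendered late
  isArticle: boolean; // blog post, news story, documentation page
  regionSelector?: string; // known container, e.g. "#product-description"
}

function choosePipeline(traits: PageTraits): Pipeline {
  // Neither built-in pipeline handles dynamic content well.
  if (traits.isDynamic) return "custom";
  // A known container is the most precise filter available.
  if (traits.regionSelector) return "markdown-region";
  // Article pages benefit from Readability's filtering and metadata.
  if (traits.isArticle) return "readability";
  // Everything else: convert the whole page.
  return "markdown-full";
}
```

The order of the checks is a design choice: a known container takes priority over article detection because an explicit selector is more precise than a heuristic.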
Summary
getMarkdownContent and getReadabilityContent are Nanobrowser's two content extraction pipelines for AI agents. Their implementations are under 40 lines, backed by turndown and Mozilla Readability.
For scraper developers: full extraction → Markdown. Article extraction → Readability. Neither works for dynamic content or comments — that's when you need a custom third pipeline.
The next article analyzes Nanobrowser's clickable element detection system — how the agent knows what can be clicked, how it distinguishes interactive from static elements, and how it uses hash-based deduplication to avoid redundant operations.