Nanobrowser Source Code: Markdown Extraction and Readability — Two Data Pipelines for Scraping

Inside Nanobrowser: turndown-based getMarkdownContent and Mozilla Readability-based getReadabilityContent. How they inject code, execute in page context, and what each pipeline is good for.

16Yun Engineering Team | Apr 9, 2026 | 4 min read

Introduction: How a Browser Extension Extracts Data

Nanobrowser is a Chrome extension that runs inside the user's browser and controls the active tab through Puppeteer over the Chrome DevTools Protocol (CDP). When the agent needs to read information from a page, it doesn't feed raw HTML directly to the LLM. Instead, it uses built-in extraction pipelines that transform page content into formats the AI can consume efficiently.

Both pipelines live in browser/dom/service.ts — fewer than 40 lines of core code. They are getMarkdownContent and getReadabilityContent.

Pipeline 1: turn2Markdown (Full Conversion)

Source Location

browser/dom/service.ts:44-61

export async function getMarkdownContent(
  tabId: number, selector?: string
): Promise<string> {
  const results = await chrome.scripting.executeScript({
    target: { tabId: tabId },
    func: sel => {
      return window.turn2Markdown(sel);
    },
    args: [selector || ''],
  });
 
  const result = results[0]?.result;
  if (!result) {
    throw new Error('Failed to get markdown content');
  }
  return result as string;
}

The function uses chrome.scripting.executeScript to run a small function in the target tab, which in turn calls window.turn2Markdown(selector) and returns the resulting string to the extension.

turn2Markdown is a global function that Nanobrowser injects during initialization. It wraps the turndown library — a mature open-source tool for converting HTML to Markdown.

Key design points:

  1. Optional CSS selector — the selector parameter limits conversion to a specific page region (e.g., #product-description). Useful for focused scraping tasks.

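  2. Executed via chrome.scripting, not a regular function call. executeScript serializes the function, runs it in the target tab, and serializes the return value back to the extension. Because the call does not set world, Chrome uses the default ISOLATED world, which shares the page's DOM but not its JavaScript globals, so turn2Markdown must already be defined on window in that same world.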

  3. Simple error handling — throw on failure, no retry logic.
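Because getMarkdownContent throws on the first failure, a caller that expects transient failures may want to wrap it. The following is a minimal sketch, assuming a hypothetical helper named withRetry that is not part of Nanobrowser; the attempt count and delay are illustrative defaults.

```typescript
// Hypothetical helper, not part of Nanobrowser: retry an async extraction call.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  delayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err; // remember the failure and try again
      if (i < attempts - 1) {
        // Wait before the next attempt; skipped after the final one.
        await new Promise<void>(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}

// Usage inside the extension, where getMarkdownContent is in scope:
//   const markdown = await withRetry(
//     () => getMarkdownContent(tabId, '#product-description'),
//   );
```

Keeping retries in the caller leaves getMarkdownContent itself small, and lets each task decide how patient it should be.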

Flow

Nanobrowser background

  ├── chrome.scripting.executeScript({
  │     target: { tabId },
  │     func: (sel) => window.turn2Markdown(sel),
  │     args: ['#product-info']
  │   })


Target page context

  ├── window.turn2Markdown('#product-info')
  │     │
  │     ├── Locate #product-info element
  │     ├── Get innerHTML
  │     ├── turndown converts HTML → Markdown
  │     └── Return Markdown string


Returns to Nanobrowser → LLM consumes
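The page-side steps in the flow above (locate the element, read its innerHTML, convert) can be sketched as follows. This is an assumption-based sketch, not Nanobrowser's actual helper: makeTurn2Markdown, DocumentLike, and ElementLike are hypothetical names, the converter is passed in so the logic can be exercised without a browser, and converting document.body when the selector is empty is an assumed default.

```typescript
// Hypothetical sketch of the page-side helper; the real implementation may differ.
interface ElementLike {
  innerHTML: string;
}

interface DocumentLike {
  body: ElementLike;
  querySelector(selector: string): ElementLike | null;
}

function makeTurn2Markdown(
  doc: DocumentLike,
  htmlToMarkdown: (html: string) => string,
): (selector?: string) => string {
  return (selector = '') => {
    // Empty selector: convert the whole body (assumed default).
    const element = selector ? doc.querySelector(selector) : doc.body;
    if (!element) {
      // An empty string is falsy, so getMarkdownContent's `!result` check throws.
      return '';
    }
    return htmlToMarkdown(element.innerHTML);
  };
}

// In a content script this could be wired up roughly as:
//   import TurndownService from 'turndown';
//   const service = new TurndownService();
//   window.turn2Markdown = makeTurn2Markdown(document, html => service.turndown(html));
```

Returning an empty string for a missing element means a wrong selector surfaces on the extension side as the "Failed to get markdown content" error rather than as silent empty output.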

Use Cases

| Scenario | Good For | Not Good For |
| --- | --- | --- |
| Full page extraction | Simple structured documents | Pages with lots of irrelevant content |
| Region extraction | Product detail, article with known container | Content spread across multiple regions |
| Tabular data | Simple tables | Complex nested tables |
| Mixed content | Text + image pages | Dynamic/lazy-loaded content |

Pipeline 2: parserReadability (Article Extraction)

Source Location

browser/dom/service.ts:63-81

export interface ReadabilityResult {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
  publishedTime: string;
}
 
export async function getReadabilityContent(
  tabId: number
): Promise<ReadabilityResult> {
  const results = await chrome.scripting.executeScript({
    target: { tabId },
    func: () => {
      return window.parserReadability();
    },
  });
  const result = results[0]?.result;
  if (!result) {
    throw new Error('Failed to get readability content');
  }
  return result as ReadabilityResult;
}

parserReadability wraps Mozilla's Readability library, the same algorithm behind Firefox's Reader Mode. It analyzes the DOM, finds the most likely article body, and strips navigation, sidebars, and ads.

ReadabilityResult Fields

| Field | Type | Purpose | Scraping Use |
| --- | --- | --- | --- |
| title | string | Article title | Collected title |
| content | string | Article HTML | Raw content for further processing |
| textContent | string | Plain text | Direct LLM input |
| length | number | Content length | Token estimation |
| excerpt | string | Summary | Quick preview |
| byline | string | Author | Metadata |
| siteName | string | Site name | Source attribution |
| publishedTime | string | Publish date | Temporal dimension |
| lang | string | Language | Model selection |
| dir | string | Text direction | Layout handling |
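To show how these fields might be combined on the consuming side, here is a small sketch. toLlmInput and ArticleFields are hypothetical names, and the four-characters-per-token ratio is a rough assumption for English text, not a figure from Nanobrowser.

```typescript
// Subset of the ReadabilityResult fields used by this sketch.
interface ArticleFields {
  title: string;
  textContent: string;
  length: number;
  byline: string;
  siteName: string;
  publishedTime: string;
}

// Hypothetical helper: build a compact prompt block from the extracted article.
function toLlmInput(article: ArticleFields): { text: string; estimatedTokens: number } {
  // Metadata header; empty fields are skipped.
  const header = [
    `Title: ${article.title}`,
    article.byline ? `Author: ${article.byline}` : '',
    article.siteName ? `Source: ${article.siteName}` : '',
    article.publishedTime ? `Published: ${article.publishedTime}` : '',
  ].filter(line => line !== '');

  const text = `${header.join('\n')}\n\n${article.textContent.trim()}`;
  // Rough heuristic (assumption): about 4 characters per token.
  const estimatedTokens = Math.ceil(article.length / 4);
  return { text, estimatedTokens };
}
```

Using textContent for the body and keeping the metadata in a short header gives the model provenance without the HTML overhead of content.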

Readability Scoring Algorithm (Simplified)

1. Scan page for text-dense elements (<p>, <pre>, <td>)
2. Score each candidate container:
   - Class contains "article", "post", "content" → bonus
   - Class contains "sidebar", "comment", "ad" → penalty
   - Has <p> children → bonus
   - High link density → penalty
3. Select highest-scoring container
4. Clean: remove scripts, styles, ads
5. Extract metadata: title, author, date
6. Return structured ReadabilityResult
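The scoring step above can be illustrated with a toy model. This is not Mozilla Readability's code: the Candidate shape, the keyword patterns, and the weights (25 points per class match, one point per paragraph, up to 25 points of link-density penalty) are illustrative assumptions that mirror only the simplified list.

```typescript
// Toy model of the simplified scoring steps; weights are illustrative only.
interface Candidate {
  className: string;
  paragraphCount: number; // number of <p> children
  textLength: number;     // characters of text inside the container
  linkTextLength: number; // characters of text inside <a> descendants
}

const POSITIVE = /article|post|content/i;
const NEGATIVE = /sidebar|comment|\bads?\b/i;

function scoreCandidate(c: Candidate): number {
  let score = 0;
  if (POSITIVE.test(c.className)) score += 25; // class-name bonus
  if (NEGATIVE.test(c.className)) score -= 25; // class-name penalty
  score += c.paragraphCount;                   // one point per <p> child
  const linkDensity = c.textLength > 0 ? c.linkTextLength / c.textLength : 0;
  return score - 25 * linkDensity;             // link-heavy containers lose points
}

// Select the highest-scoring container, or undefined for an empty list.
function pickBest(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

A container with class "post-content" and many paragraphs outscores a link-heavy "sidebar" under these weights, which is the behavior the list describes.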

Pipeline Comparison

| Aspect | getMarkdownContent | getReadabilityContent |
| --- | --- | --- |
| Library | turndown | Mozilla Readability |
| Output | Markdown string | Structured object (HTML + text + metadata) |
| Content filtering | By CSS selector region | Algorithmic article identification |
| Best for | Any page | Article pages (blogs, news, docs) |
| Not for | Dynamic content | Pages without article structure |
| Metadata | None | Title, author, date, site name |
| Token efficiency | Fixed compression ratio | High (60-80% reduction for articles) |

Injection Mechanism

Both pipelines depend on functions available in the page's global scope. These are injected during extension initialization via content scripts or other mechanisms.

The key is the chrome.scripting.executeScript API:

// Method 1: Inline function (used here)
chrome.scripting.executeScript({
  target: { tabId },
  func: () => window.parserReadability(),
})
 
// Method 2: From file (used for buildDomTree)
chrome.scripting.executeScript({
  target: { tabId },
  files: ['buildDomTree.js'],
})

The func value is serialized to its source text and re-created in the target tab, and args must be JSON-serializable. Closure variables and imports from the extension bundle are therefore unavailable inside func: anything it needs must be passed through args or already exist on window in the world where it runs. That is why turn2Markdown and parserReadability are injected as global functions.
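The loss of closure scope can be reproduced outside Chrome. The sketch below is a local simulation only: simulateInjection is a hypothetical stand-in that rebuilds a function from its source text with new Function, which approximates the behavior described above but is not how chrome.scripting is implemented. It is meant to be run in Node, since extension pages typically block new Function.

```typescript
// Simulation only: rebuild a function from its source text so that it
// loses access to the scope it was defined in.
function simulateInjection<A extends unknown[], R>(
  func: (...args: A) => R,
  args: A,
): R {
  const source = func.toString();
  const revived = new Function(`return (${source});`)() as (...args: A) => R;
  return revived(...args);
}

function demo(): { viaArgs: string; closureError: string } {
  const prefix = 'md:'; // local variable, visible only through closure

  // Works: everything the function needs arrives through args.
  const viaArgs = simulateInjection((p: string, s: string) => p + s, [prefix, 'hello']);

  // Fails: the rebuilt function no longer sees `prefix`.
  let closureError = '';
  try {
    simulateInjection((s: string) => prefix + s, ['hello']);
  } catch (err) {
    closureError = err instanceof Error ? err.name : String(err);
  }
  return { viaArgs, closureError };
}
```

Calling demo() shows the first call succeeding with data passed through args, while the closure-based call fails with a ReferenceError because prefix does not exist where the rebuilt function runs.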

Lessons for Scraper Developers

When to Use Markdown Pipeline

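# Scraping pseudocode (Playwright-style Python): full page → Markdown
# Assumes `browser` is already open and that a turn2Markdown helper has been
# injected into the page beforehand; it is not a browser built-in.
page = await browser.new_page()
await page.goto("https://example.16yun.cn/products")
markdown = await page.evaluate("window.turn2Markdown()")
# ~40% fewer tokens than raw HTML for the LLM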

When to Use Readability Pipeline

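# Scraping pseudocode (Playwright-style Python): article → structured result
# Assumes `browser` is already open and that a parserReadability helper has been
# injected into the page beforehand; it is not a browser built-in.
page = await browser.new_page()
await page.goto("https://example.16yun.cn/blog/article")
result = await page.evaluate("window.parserReadability()")
# result["textContent"] → plain text for LLM
# result["content"] → HTML with structure preserved
# result["title"], result["byline"], result["publishedTime"] → metadata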

Selection Decision Tree

[Figure: pipeline selection decision tree]
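A selection rule consistent with the comparison table can also be written as code. This is an assumption-based sketch: choosePipeline, PageHints, and the ordering of the checks are hypothetical and are inferred from the table, not taken from Nanobrowser's source.

```typescript
type Pipeline =
  | { kind: 'readability' }
  | { kind: 'markdown'; selector?: string };

interface PageHints {
  isArticle: boolean;         // e.g. blog post, news story, docs page
  containerSelector?: string; // known region holding the target content
}

// Hypothetical chooser inferred from the comparison table.
function choosePipeline(hints: PageHints): Pipeline {
  if (hints.containerSelector) {
    // Known region: convert just that container to Markdown.
    return { kind: 'markdown', selector: hints.containerSelector };
  }
  if (hints.isArticle) {
    // Article-like page: let Readability find the body and metadata.
    return { kind: 'readability' };
  }
  // Anything else: full-page Markdown conversion.
  return { kind: 'markdown' };
}
```

The caller then maps the result to getMarkdownContent(tabId, selector) or getReadabilityContent(tabId).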

Summary

getMarkdownContent and getReadabilityContent are Nanobrowser's two content extraction pipelines for AI agents. Their implementations are under 40 lines, backed by turndown and Mozilla Readability.

For scraper developers: full extraction → Markdown. Article extraction → Readability. Neither works for dynamic content or comments — that's when you need a custom third pipeline.

The next article analyzes Nanobrowser's clickable element detection system — how the agent knows what can be clicked, how it distinguishes interactive from static elements, and how it uses hash-based deduplication to avoid redundant operations.
