Running AI Browser Tests in CI/CD (Part 1): Version Pinning and Flaky Tests

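Problem 1: Chrome auto-updates

The M138 AutoDeElevate incident in agent-browser is a textbook case. A single Chrome security update disconnected every agent-browser daemon, and every test that depended on agent-browser went red. No code had changed; Chrome had simply updated itself.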

Solution: pin the version

A simple pinning strategy keeps CI unaffected by browser version updates.

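# Dockerfile.ci
FROM node:22-slim

# Record the pinned version once so later steps can reuse it
ENV CHROME_VERSION=137.0.7150.0

# Install exactly that Chrome version
RUN npx agent-browser install --version "$CHROME_VERSION"

# Fail the image build if the installed version does not match the pin
RUN test "$(google-chrome --version | grep -oP '\d+\.\d+\.\d+\.\d+')" = "$CHROME_VERSION"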
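
The build-time check above only runs when the image is built. As a minimal sketch, the same comparison can be repeated at test start-up so that drift is caught even if the image is reused. The names `parse_chrome_version`, `assert_pinned`, and `PINNED_CHROME_VERSION` are hypothetical helpers, not part of agent-browser.

```python
import re

# Assumption: mirrors the CHROME_VERSION value pinned in Dockerfile.ci.
PINNED_CHROME_VERSION = "137.0.7150.0"

_VERSION_RE = re.compile(r"\d+\.\d+\.\d+\.\d+")


def parse_chrome_version(version_output: str) -> str:
    """Extract the four-part version from `google-chrome --version` output."""
    match = _VERSION_RE.search(version_output)
    if match is None:
        raise ValueError(f"no Chrome version found in: {version_output!r}")
    return match.group(0)


def assert_pinned(version_output: str, expected: str = PINNED_CHROME_VERSION) -> None:
    """Raise if the reported Chrome version differs from the pinned one."""
    actual = parse_chrome_version(version_output)
    if actual != expected:
        raise RuntimeError(f"Chrome version drift: expected {expected}, got {actual}")
```

In a test session this could be fed with `subprocess.check_output(["google-chrome", "--version"], text=True)`, assuming the binary is on the PATH under that name.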

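Validating a version upgrade

When the team decides to upgrade Chrome, a verification step is needed to confirm compatibility before the pin is moved: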

# .github/workflows/chrome-upgrade-check.yml
on:
  workflow_dispatch:
    inputs:
      chrome_version:
        description: Chrome version to test against
        required: true
jobs:
  check-chrome-version:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test with new Chrome
        run: |
          npx agent-browser install --version ${{ inputs.chrome_version }}
          npm test  # run the full test suite

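Problem 2: Headless vs. headed differences

This is one of the best-hidden pitfalls. Your tests pass on the headless CI runner, yet the same flow may fail outright in a real user's environment (a headed desktop browser).

The differences include:

A11y tree element order: in headless mode some invisible elements are not rendered, which shifts the ref IDs in the A11y tree
Font rendering: the CI container lacks the target users' fonts, so text may be truncated or rendered incorrectly
No GPU: pages that use WebGL behave differently in headless mode
Media queries: CSS media queries such as prefers-color-scheme and prefers-reduced-motion may resolve to their default values in a headless environment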

class HeadlessCompatibilityChecker:
    def __init__(self):
        self.diffs = []

    def compare_a11y_tree(self, headless_tree, headed_tree):
        """Compare the A11y trees captured in headless and headed mode."""
        headless_refs = {e["ref"]: e["text"] for e in headless_tree}
        headed_refs = {e["ref"]: e["text"] for e in headed_tree}

        # Elements that exist only in headed mode
        for ref in headed_refs:
            if ref not in headless_refs:
                self.diffs.append(f"Element {ref} missing in headless mode")

        # Elements present in both modes whose text differs
        for ref in headless_refs:
            if ref in headed_refs and headless_refs[ref] != headed_refs[ref]:
                self.diffs.append(
                    f"Element {ref} text differs: "
                    f"headless='{headless_refs[ref]}' vs headed='{headed_refs[ref]}'"
                )
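
Because ref IDs themselves shift when headless mode skips an element, a ref-keyed comparison can report a cascade of text mismatches for what is really one missing node. A rough complementary sketch is to compare the trees by text content only. `diff_by_text` is a hypothetical helper, and it assumes the same `{"ref": ..., "text": ...}` element shape used above.

```python
from collections import Counter


def diff_by_text(headless_tree, headed_tree):
    """Compare two A11y trees by element text, ignoring ref IDs.

    Returns (missing_in_headless, extra_in_headless) as sorted lists of texts.
    Duplicated texts are counted, so two "Submit" buttons vs. one is a diff.
    """
    headless_counts = Counter(e["text"] for e in headless_tree)
    headed_counts = Counter(e["text"] for e in headed_tree)

    # Counter subtraction keeps only positive counts.
    missing_in_headless = headed_counts - headless_counts
    extra_in_headless = headless_counts - headed_counts

    return (
        sorted(missing_in_headless.elements()),
        sorted(extra_in_headless.elements()),
    )
```

This ignores ordering and structure, so it is best treated as a first-pass filter that separates truly absent elements from renumbered ones.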

Problem 3: Session isolation in concurrent tests

When several test cases run in parallel on the same CI runner and share one browser instance, their sessions collide.

# Unsafe concurrent test
@pytest.mark.parametrize("url", test_urls)
async def test_extraction(url, shared_browser):  # Bad: shared browser
    page = await shared_browser.new_page()
    await page.goto(url)
    # Two test cases may end up operating on the same tab

# Safe concurrent test
@pytest.mark.parametrize("url", test_urls)
async def test_extraction(url):
    browser = await launch_browser()           # Good: one browser per test
    try:
        page = await browser.new_page()
        await page.goto(url)
    finally:
        await browser.close()                  # always release the browser, even on failure

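Configure concurrency isolation in CI:

# Run with pytest-xdist, giving each worker its own port
# (--browser-port-range is not a built-in pytest option; it is assumed to be registered by the project's conftest.py)
pytest -n auto --browser-port-range 9222-9322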
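
One way such a port range might be mapped to workers is sketched below. pytest-xdist exposes the worker name (`gw0`, `gw1`, ...) through the `PYTEST_XDIST_WORKER` environment variable; the helper names `port_for_worker` and `current_worker_port`, and the idea of adding the worker index to the base port, are assumptions for illustration.

```python
import os


def port_for_worker(worker_id: str, base: int = 9222, limit: int = 9322) -> int:
    """Map a pytest-xdist worker id such as "gw3" to a unique debugging port."""
    if worker_id == "master":
        # Not running under xdist: a single process can use the base port.
        return base
    if not (worker_id.startswith("gw") and worker_id[2:].isdigit()):
        raise ValueError(f"unexpected worker id: {worker_id!r}")

    port = base + int(worker_id[2:])
    if port > limit:
        raise ValueError(f"worker {worker_id} exceeds port range {base}-{limit}")
    return port


def current_worker_port() -> int:
    """Port for the current process, based on the xdist environment variable."""
    return port_for_worker(os.environ.get("PYTEST_XDIST_WORKER", "master"))
```

A fixture could then pass `current_worker_port()` to whatever launches the browser, so that two workers never attach to the same debugging endpoint.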

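Problem 4: Unstable results (flaky tests)

Flaky tests are more common in AI browser testing than in traditional UI testing, because of:

Network latency: the target page's load time is not stable
Dynamic page content: ads, recommendations, and A/B tests change the page structure between runs
Probabilistic AI decisions: for the same task, the LLM may pick a different interaction path each time

Handling strategy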

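import logging

# tenacity is used here because its @retry decorator supports async functions
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)


class FlakyTestHandler:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1), reraise=True)
    async def flaky_operation(self, page, operation_fn):
        """Retry a known-flaky operation with exponential backoff."""
        return await operation_fn(page)

    def should_retry(self, error, attempt):
        """Decide which errors are worth retrying."""
        retryable = [
            "TimeoutError",
            "NavigationError",
            "ElementNotFoundError",
        ]
        error_type = type(error).__name__
        should = error_type in retryable and attempt < self.max_retries

        # Log which errors were retried, for later analysis
        if should:
            logger.warning(f"Retrying {error_type} (attempt {attempt + 1})")

        return should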
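
The retryable-error list and the backoff can also be combined in one explicit loop without a decorator library. The following is a minimal sketch under stated assumptions: `run_with_retry` and `RETRYABLE` are hypothetical names, `max_attempts` counts total attempts, and errors are matched by class name as in `should_retry` above.

```python
import asyncio

# Error class names considered transient, as in should_retry above.
RETRYABLE = ("TimeoutError", "NavigationError", "ElementNotFoundError")


async def run_with_retry(operation, max_attempts=3, base_delay=1.0):
    """Await `operation()` and retry transient failures with exponential backoff.

    Non-retryable errors, and the last failure once attempts are exhausted,
    are re-raised unchanged.
    """
    attempt = 0
    while True:
        try:
            return await operation()
        except Exception as error:
            attempt += 1
            if type(error).__name__ not in RETRYABLE or attempt >= max_attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

A call might look like `await run_with_retry(lambda: page.goto(url))`. Keeping the non-retryable path as an immediate re-raise means assertion failures surface on the first attempt instead of being masked by retries.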

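Summary

Four key problems when running AI browser tests in CI/CD:

Chrome version pinning: without it, a single auto-update can turn the whole CI red
Headless vs. headed differences: the A11y tree, fonts, and GPU behavior all differ
Session isolation: concurrent tests each use their own browser instance
Flaky test handling: a retry strategy, plus a clear split between retryable and non-retryable errors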
