Skip to content

fix(douban): read search results from page data#1213

Open
warkcod wants to merge 4 commits intojackwener:mainfrom
warkcod:fix/douban-search-data-fallback
Open

fix(douban): read search results from page data#1213
warkcod wants to merge 4 commits intojackwener:mainfrom
warkcod:fix/douban-search-data-fallback

Conversation

@warkcod
Copy link
Copy Markdown
Contributor

@warkcod warkcod commented Apr 29, 2026

Summary

  • read Douban search results from embedded page data when rendered DOM results are unavailable
  • use CDP mode for Douban when OPENCLI_CDP_ENDPOINT is configured
  • reuse a named CDP tab per workspace to avoid accumulating browser tabs

Verification

  • npx vitest run src/browser/cdp.test.ts src/runtime.test.ts clis/douban/utils.test.js
  • npm run typecheck
  • npm run build
  • manual: repeated douban search calls against localhost:9222 reused the same named target instead of opening new tabs

@jackwener
Copy link
Copy Markdown
Owner

@warkcod thanks for the PR. The Douban page-data fallback is fine, but our reviewers found two browser-core blockers in src/browser/cdp.ts that need to be fixed before merge. Since maintainerCanModify=false, please apply the patch below directly to your branch.

Blocker 1: named-tab hijack

When selectNamedCDPTarget(targets, tabName) returns nothing, the current implementation falls back to selectCDPTarget(targets), which takes over any inspectable tab in the browser and re-marks it as opencli:<workspace>. This is not "named-tab reuse" — it is hijack of a user's existing tab.

Fix: in named path, only selectNamedCDPTarget(...) ?? createCDPTarget(...). Never fall back to a generic target.

Blocker 2: registered Electron apps get implicitly covered

buildCDPTabName() now applies to every site:<workspace>, including registered Electron apps (cursor, codex, notion, …). These should keep the old score-based selection / OPENCLI_CDP_TARGET behavior, otherwise opencli cursor … may attach to the wrong process or create an about:blank target.

Fix: buildCDPTabName() returns undefined for Electron apps by default. Users can still opt-in via OPENCLI_CDP_TAB_NAME=….

Combined patch (apply with git apply)

diff --git a/src/browser/cdp.test.ts b/src/browser/cdp.test.ts
index 836269c..120f266 100644
--- a/src/browser/cdp.test.ts
+++ b/src/browser/cdp.test.ts
@@ -80,6 +80,11 @@ describe('CDP target reuse', () => {
     expect(__test__.buildCDPTabName('site:douban', { OPENCLI_CDP_REUSE_TAB: 'false' })).toBeUndefined();
   });

+  it('does not enable named-tab reuse by default for registered Electron apps', () => {
+    expect(__test__.buildCDPTabName('site:cursor', {})).toBeUndefined();
+    expect(__test__.buildCDPTabName('site:cursor', { OPENCLI_CDP_TAB_NAME: 'cursor-fixed' })).toBe('cursor-fixed');
+  });
+
   it('selects an existing CDP target by persistent window.name', async () => {
     const targets = [
       {
@@ -107,6 +112,36 @@ describe('CDP target reuse', () => {
     expect(selected?.id).toBe('b');
   });

+  it('creates a new target instead of hijacking an unrelated tab when no named tab exists', async () => {
+    const targets = [
+      {
+        id: 'a',
+        type: 'page',
+        title: '普通标签页',
+        url: 'https://www.douban.com/',
+        webSocketDebuggerUrl: 'ws://127.0.0.1/a',
+      },
+    ];
+
+    const created = {
+      id: 'new',
+      type: 'page',
+      title: 'about:blank',
+      url: 'about:blank',
+      webSocketDebuggerUrl: 'ws://127.0.0.1/new',
+    };
+
+    const selected = await __test__.resolveCDPTarget(
+      'http://127.0.0.1:9222',
+      targets,
+      'opencli:site:douban',
+      async () => '',
+      async () => created,
+    );
+
+    expect(selected).toBe(created);
+  });
+
   it('does not pick Chrome internal popup targets', () => {
     expect(__test__.scoreCDPTarget({
       type: 'page',
diff --git a/src/browser/cdp.ts b/src/browser/cdp.ts
index 859d760..b595bb0 100644
--- a/src/browser/cdp.ts
+++ b/src/browser/cdp.ts
@@ -17,7 +17,7 @@ import { wrapForEval } from './utils.js';
 import { generateStealthJs } from './stealth.js';
 import { waitForDomStableJs } from './dom-helpers.js';
 import { isRecord, saveBase64ToFile } from '../utils.js';
-import { getAllElectronApps } from '../electron-apps.js';
+import { getAllElectronApps, isElectronApp } from '../electron-apps.js';
 import { BasePage } from './base-page.js';

 export interface CDPTarget {
@@ -65,9 +65,7 @@ export class CDPBridge implements IBrowserFactory {
     if (endpoint.startsWith('http')) {
       const baseEndpoint = endpoint.replace(/\/$/, '');
       const targets = await fetchJsonDirect(\`\${baseEndpoint}/json\`) as CDPTarget[];
-      const target = (tabName ? await selectNamedCDPTarget(targets, tabName) : undefined)
-        ?? selectCDPTarget(targets)
-        ?? await createCDPTarget(baseEndpoint);
+      const target = await resolveCDPTarget(baseEndpoint, targets, tabName);
       if (!target || !target.webSocketDebuggerUrl) {
         throw new Error('No inspectable targets found at CDP endpoint');
       }
@@ -404,6 +402,20 @@ function selectCDPTarget(targets: CDPTarget[]): CDPTarget | undefined {
   return rankCDPTargets(targets)[0]?.target;
 }

+async function resolveCDPTarget(
+  baseEndpoint: string,
+  targets: CDPTarget[],
+  tabName?: string,
+  readWindowName: (target: CDPTarget) => Promise<string | undefined> = readTargetWindowName,
+  createTarget: (endpoint: string) => Promise<CDPTarget | undefined> = createCDPTarget,
+): Promise<CDPTarget | undefined> {
+  if (tabName) {
+    return await selectNamedCDPTarget(targets, tabName, readWindowName)
+      ?? await createTarget(baseEndpoint);
+  }
+  return selectCDPTarget(targets) ?? await createTarget(baseEndpoint);
+}
+
 async function selectNamedCDPTarget(
   targets: CDPTarget[],
   tabName: string,
@@ -481,6 +493,7 @@ function escapeRegExp(value: string): string {

 export const __test__ = {
   buildCDPTabName,
+  resolveCDPTarget,
   selectCDPTarget,
   selectNamedCDPTarget,
   scoreCDPTarget,
@@ -496,6 +509,9 @@ function buildCDPTabName(
   const explicitName = env.OPENCLI_CDP_TAB_NAME?.trim();
   if (explicitName) return explicitName;

+  const siteMatch = workspace?.match(/^site:(.+)$/);
+  if (siteMatch && isElectronApp(siteMatch[1])) return undefined;
+
   const suffix = workspace?.trim() || 'default';
   return \`opencli:\${suffix}\`;
 }

How to apply

```bash

save the diff above as /tmp/cdp-fix.patch, then:

git apply /tmp/cdp-fix.patch
git add src/browser/cdp.ts src/browser/cdp.test.ts
git commit -m "fix(cdp): scope named-tab reuse to user browser sessions"
git push
```

After push we will re-review on the new head and run the heavyweight CI gate (e2e-headed) before merge.

@jackwener
Copy link
Copy Markdown
Owner

@warkcod follow-up: a third blocker on the adapter side.

Blocker 3: silent empty-result when both DOM and page-data are missing

The current searchDouban() returns [] when wait(selector) times out, the DOM has no rendered items, and window.__DATA__.items is also absent. That hides structure / readiness failure as an empty search result. The PR's stated intent is "fall back to page-data when DOM is empty" — not "succeed silently when both sources are gone."

Fix: when both DOM rendering and page-data are missing, throw EmptyResultError('douban search', ...). Backward-compatible: any non-empty result still returns the array; only the double-empty case fails fast.

Sequential patch (apply after the cdp combined patch)

diff --git a/clis/douban/utils.js b/clis/douban/utils.js
index c038999..a094e7f 100644
--- a/clis/douban/utils.js
+++ b/clis/douban/utils.js
@@ -564,7 +564,8 @@ export async function searchDouban(page, type, keyword, limit) {
       const normalize = (value) => (value || '').replace(/\\s+/g, ' ').trim();
       const seen = new Set();
       const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
-      const rawItems = Array.isArray(window.__DATA__?.items) ? window.__DATA__.items : [];
+      const hasPageDataItems = Array.isArray(window.__DATA__?.items);
+      const rawItems = hasPageDataItems ? window.__DATA__.items : [];
       const rawItemsById = new Map(
         rawItems
           .map((item) => [String(item?.id || '').trim(), item])
@@ -648,10 +649,19 @@ export async function searchDouban(page, type, keyword, limit) {
           if (results.length >= \${safeLimit}) break;
         }
       }
-      return results;
+      return {
+        results,
+        hasRenderedItems: items.length > 0,
+        hasPageDataItems,
+      };
     })()
   \`);
     });
+    if (Array.isArray(data)) return data;
+    if (!data?.results?.length && data?.hasRenderedItems === false && data?.hasPageDataItems === false) {
+        throw new EmptyResultError('douban search', 'Search page rendered neither DOM results nor window.__DATA__.items. Douban page structure may have changed or the page did not finish loading.');
+    }
+    if (Array.isArray(data?.results)) return data.results;
     return Array.isArray(data) ? data : [];
 }
 /**
diff --git a/clis/douban/utils.test.js b/clis/douban/utils.test.js
index 2c7aa2c..5f41f10 100644
--- a/clis/douban/utils.test.js
+++ b/clis/douban/utils.test.js
@@ -241,6 +241,21 @@ describe('douban utils', () => {
         ]);
     });

+    it('fails fast when neither rendered DOM nor window data items are available', async () => {
+        const page = {
+            goto: vi.fn().mockResolvedValue(undefined),
+            wait: vi.fn().mockResolvedValue(undefined),
+            evaluate: vi.fn()
+                .mockResolvedValueOnce({ blocked: false, title: 'empty - 豆瓣搜索', href: 'https://search.douban.com/book/subject_search?search_text=empty&cat=1001' })
+                .mockImplementationOnce((script) => runSearchEvaluate(script, undefined, [])),
+        };
+
+        await expect(searchDouban(page, 'book', 'empty', 3)).rejects.toMatchObject({
+            code: 'EMPTY_RESULT',
+            hint: expect.stringContaining('neither DOM results nor window.__DATA__.items'),
+        });
+    });
+
     it('normalizes douban book subject raw data into structured fields', () => {
         const normalized = normalizeDoubanBookSubject({
             id: '2567698',

How to apply (after the cdp combined patch)

```bash

save the diff above as /tmp/douban-empty.patch, then:

git apply /tmp/douban-empty.patch
git add clis/douban/utils.js clis/douban/utils.test.js
git commit -m "fix(douban): fail fast when both DOM and page-data are empty"
git push
```

Once both patches are pushed, we'll re-review the new head and run the heavyweight CI gate before merge.

@warkcod
Copy link
Copy Markdown
Contributor Author

warkcod commented Apr 29, 2026

Addressed the review feedback in two follow-up commits:\n\n- Named CDP reuse no longer falls back to arbitrary inspectable tabs. When a named tab is requested, OpenCLI now reuses only a matching window.name target or creates a new target.\n- Default named-tab reuse is skipped for registered Electron app workspaces; explicit OPENCLI_CDP_TAB_NAME still opts in.\n- Douban search now returns source metadata from page evaluation and throws EMPTY_RESULT when neither rendered result items nor window.DATA.items are present.\n\nVerification run locally:\n\n- npx vitest run src/browser/cdp.test.ts src/runtime.test.ts clis/douban/utils.test.js\n- npm run typecheck\n- npm run build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants