Page Segmenter

A structure-aware tool that parses HTML pages into non-overlapping logical segments (e.g., header, navigation, sidebar, main content, footer, cards). It relies on visual and structural DOM heuristics rather than content-density metrics.

Features

  • Logical Structure Parsing: Infers segments from developer intent as expressed in DOM structure, ARIA landmarks, and semantic HTML.
  • Dynamic Threshold Configuration: Adjusts structural thresholds based on page_type (e.g., commerce, content, marketing) to handle varied website architectures without splintering or collapsing components.
  • Playwright-Backed Evaluation: Executes visual heuristics directly in the browser context to account for exact layout constraints (widths, heights, visibilities).
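As a rough illustration of the first feature, the sketch below collects semantic tags and ARIA landmark roles from static HTML using only the standard library. It is not the package's implementation: the label names, the LandmarkCollector class, and the find_landmarks helper are hypothetical, and the real tool evaluates its heuristics in a browser through Playwright rather than on raw markup.

```python
from html.parser import HTMLParser

# Semantic tags and ARIA landmark roles mapped to segment labels.
# The label vocabulary here is illustrative, not the package's actual output.
TAG_LABELS = {
    "header": "header",
    "nav": "navigation",
    "main": "main",
    "aside": "sidebar",
    "footer": "footer",
}
ROLE_LABELS = {
    "banner": "header",
    "navigation": "navigation",
    "main": "main",
    "complementary": "sidebar",
    "contentinfo": "footer",
}


class LandmarkCollector(HTMLParser):
    """Records (label, tag) pairs for landmark-like elements in document order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        role = dict(attrs).get("role")
        # An explicit ARIA role takes precedence over the tag name.
        label = ROLE_LABELS.get(role) or TAG_LABELS.get(tag)
        if label:
            self.found.append((label, tag))


def find_landmarks(html):
    """Hypothetical helper: return landmark candidates found in an HTML string."""
    collector = LandmarkCollector()
    collector.feed(html)
    return collector.found
```

This simplification ignores context (for example, a header nested inside an article is not a page banner), which is one reason a full segmenter needs the structural and visual checks described under How It Works.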

Installation

You can install the package directly from PyPI (once published):

pip install page-segmenter

For development, clone the repository and install it using make:

git clone https://github.com/innerkorehq/page_segmenter.git
cd page_segmenter
make install
# or for dev dependencies
make install-dev

Usage

Command-Line Interface (CLI)

You can segment a live URL directly from the terminal. Use the optional --type argument to apply type-specific heuristics.

python main.py "https://example.com" --type "marketing"

You can also segment a local HTML file:

python main.py --html ./path/to/page.html --type "doc_page"

Visual Tester

To visually debug and inspect the detected segments inside a browser window:

python visual_tester.py "https://example.com" --type "product_list"

Python API

You can use the segmenter programmatically in your asynchronous Python applications:

import asyncio
import json
from page_segmenter import find_segments, find_segments_from_html

async def main():
    # Segment a live URL
    url = "https://docs.python.org/3/"
    segments = await find_segments(url, page_type="doc_page")
    print(json.dumps(segments, indent=2))

    # Segment from raw HTML
    html_content = "<html>...</html>"
    segments = await find_segments_from_html(html_content, base_url="https://example.com", page_type="commerce")

if __name__ == "__main__":
    asyncio.run(main())
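When several pages need segmenting, it may help to bound how many run at once, since each call presumably drives a browser. The helper below is a minimal sketch under that assumption: segment_many and max_concurrency are hypothetical names that are not part of the package, and the segmenting coroutine is passed in as an argument so that nothing is assumed about its return value.

```python
import asyncio


async def segment_many(urls, segment_fn, page_type=None, max_concurrency=3):
    """Run segment_fn over urls with bounded concurrency.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failing page does not abort the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url):
        # At most max_concurrency calls are in flight at any time.
        async with semaphore:
            return await segment_fn(url, page_type=page_type)

    results = await asyncio.gather(
        *(run_one(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))
```

With the package installed, a call could look like asyncio.run(segment_many(urls, find_segments, page_type="doc_page")).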

How It Works

The segmenter processes the DOM in a series of logical phases:

  1. Pruning: Discards invisible nodes, noise such as script elements and modals, and microscopic elements.
  2. Decision Logic: Recursively traverses the DOM evaluating ARIA landmarks, semantic tags, parent identity scores (padding, borders, shadows), raw text density, structural similarity (card grids), and orphaned child checks.
  3. Adaptive Thresholds: Adjusts internal thresholds (such as MIN_SUBTREE_NODES or MIN_HEIGHT) according to the family of the supplied page_type (commerce, content, marketing, etc.).
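The adaptive-threshold phase can be pictured as a lookup from page type to a family and from a family to a set of limits. The sketch below is only an illustration: the numeric values, the page-type-to-family mapping, and the helper names are assumptions, and only the threshold names MIN_SUBTREE_NODES and MIN_HEIGHT and the family names come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_subtree_nodes: int  # counterpart of MIN_SUBTREE_NODES
    min_height: int         # counterpart of MIN_HEIGHT, in CSS pixels


# Hypothetical values chosen for illustration only.
FAMILY_THRESHOLDS = {
    "commerce": Thresholds(min_subtree_nodes=4, min_height=40),
    "content": Thresholds(min_subtree_nodes=8, min_height=60),
    "marketing": Thresholds(min_subtree_nodes=6, min_height=80),
}
DEFAULT_THRESHOLDS = Thresholds(min_subtree_nodes=6, min_height=50)

# Assumed mapping from page_type values to families.
PAGE_TYPE_FAMILY = {
    "commerce": "commerce",
    "product_list": "commerce",
    "content": "content",
    "doc_page": "content",
    "marketing": "marketing",
}


def thresholds_for(page_type=None):
    """Return the thresholds for a page type, falling back to the defaults."""
    family = PAGE_TYPE_FAMILY.get(page_type)
    return FAMILY_THRESHOLDS.get(family, DEFAULT_THRESHOLDS)


def is_candidate(node_count, height, page_type=None):
    """A subtree qualifies as a segment only if it clears both thresholds."""
    limits = thresholds_for(page_type)
    return node_count >= limits.min_subtree_nodes and height >= limits.min_height
```

Under these made-up numbers, a small product tile that counts as a segment on a commerce page would be folded into its parent on a content page, which is the kind of splintering or collapsing that the thresholds are meant to control.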

Read the full algorithm details in algo.md.
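For a feel of the structural-similarity check named in the Decision Logic phase, one possible approach is to compare the tag shape of sibling subtrees and treat a run of matching shapes as a card grid. This is a sketch of that idea, not the algorithm documented in algo.md; the (tag, children) node representation, the function names, and the min_cards and min_share parameters are all hypothetical.

```python
from collections import Counter


def signature(node):
    """Reduce a (tag, children) node to a nested, hashable tag-only shape."""
    tag, children = node
    return (tag, tuple(signature(child) for child in children))


def looks_like_card_grid(children, min_cards=3, min_share=0.8):
    """Return True when most siblings share one structural signature."""
    if len(children) < min_cards:
        return False
    counts = Counter(signature(child) for child in children)
    # Size of the largest group of structurally identical siblings.
    _, top_count = counts.most_common(1)[0]
    return top_count >= min_cards and top_count / len(children) >= min_share
```

Exact matching is deliberately strict; a practical version would likely tolerate small differences between cards, such as an optional badge or a missing image.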

Development

A Makefile is included to streamline development tasks:

  • make install: Install the project.
  • make install-dev: Install with development dependencies.
  • make build: Build the distribution packages (sdist and wheel).
  • make publish: Build and publish the package to PyPI using twine.
  • make clean: Clean up build artifacts and cache directories.
  • make lint: Run basic syntax checks.
  • make docs: Build Sphinx documentation.
  • make test-run: Run a quick smoke test.
