A structure-aware tool that parses HTML pages into non-overlapping logical segments (e.g., header, navigation, sidebar, main content, footer, cards). It relies on visual and structural DOM heuristics rather than content-density metrics.
- Logical Structure Parsing: Determines segments using developer intent via DOM structure, ARIA landmarks, and semantic HTML.
- Dynamic Threshold Configuration: Adjusts structural thresholds based on page_type (e.g., commerce, content, marketing) to handle varied website architectures without splintering or collapsing components.
- Playwright-Backed Evaluation: Executes visual heuristics directly in the browser context to account for exact layout constraints (widths, heights, visibility).
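As a rough illustration of the first feature, the sketch below collects semantic tags and ARIA landmark roles from a small HTML string using only the standard library. It is not the package's implementation (which evaluates pages in a browser through Playwright); the names SEMANTIC_TAGS, LANDMARK_ROLES, LandmarkCollector, and collect_landmarks are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Semantic tags and ARIA landmark roles treated as region hints in this sketch.
# Both sets are illustrative assumptions, not the package's actual lists.
SEMANTIC_TAGS = {"header", "nav", "main", "aside", "footer"}
LANDMARK_ROLES = {"banner", "navigation", "main", "complementary", "contentinfo"}


class LandmarkCollector(HTMLParser):
    """Collects (tag, role) pairs for elements that signal a logical region."""

    def __init__(self):
        super().__init__()
        self.landmarks = []

    def handle_starttag(self, tag, attrs):
        role = dict(attrs).get("role")
        if tag in SEMANTIC_TAGS or role in LANDMARK_ROLES:
            self.landmarks.append((tag, role))


def collect_landmarks(html):
    """Return the landmark hints found in an HTML string, in document order."""
    parser = LandmarkCollector()
    parser.feed(html)
    parser.close()
    return parser.landmarks


sample = (
    "<body><header>Logo</header>"
    "<div role='navigation'>Menu</div>"
    "<main><p>Article</p></main>"
    "<footer>Contact</footer></body>"
)
print(collect_landmarks(sample))
```

A plain div is picked up here only because it carries role="navigation", which is the sense in which landmarks can express developer intent even when the tag itself is generic.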
You can install the package directly from PyPI (once published):
pip install page-segmenter
For development, clone the repository and install it using make:
git clone https://github.com/innerkorehq/page_segmenter.git
cd page_segmenter
make install
# or for dev dependencies
make install-dev
You can segment a live URL directly from the terminal. Use the optional --type argument to apply type-specific heuristics.
python main.py "https://example.com" --type "marketing"
You can also segment a local HTML file:
python main.py --html ./path/to/page.html --type "doc_page"
To visually debug and inspect the detected segments inside a browser window:
python visual_tester.py "https://example.com" --type "product_list"
You can use the segmenter programmatically in your asynchronous Python applications:
import asyncio
import json
from page_segmenter import find_segments, find_segments_from_html

async def main():
    # Segment a live URL
    url = "https://docs.python.org/3/"
    segments = await find_segments(url, page_type="doc_page")
    print(json.dumps(segments, indent=2))

    # Segment from raw HTML
    html_content = "<html>...</html>"
    segments = await find_segments_from_html(html_content, base_url="https://example.com", page_type="commerce")
    print(json.dumps(segments, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
The segmenter processes the DOM in a series of logical phases:
- Pruning: Discards invisible nodes, tracking and noise elements (like script or modal), and microscopic elements.
- Decision Logic: Recursively traverses the DOM evaluating ARIA landmarks, semantic tags, parent identity scores (padding, borders, shadows), raw text density, structural similarity (card grids), and orphaned child checks.
- Adaptive Thresholds: Changes internal variables (like MIN_SUBTREE_NODES or MIN_HEIGHT) dynamically based on the passed page_type family (commerce, content, marketing, etc.).
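The adaptive-threshold phase can be pictured as a lookup keyed by page-type family. The sketch below is only an illustration under stated assumptions: the numeric values are invented, and the mapping of doc_page to the content family and product_list to the commerce family is a guess, not something taken from the package. Thresholds, FAMILY_THRESHOLDS, PAGE_TYPE_FAMILY, and thresholds_for are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_subtree_nodes: int  # smallest subtree considered a segment
    min_height: int         # minimum rendered height in pixels


# Hypothetical values chosen for illustration only.
FAMILY_THRESHOLDS = {
    "commerce": Thresholds(min_subtree_nodes=4, min_height=40),
    "content": Thresholds(min_subtree_nodes=8, min_height=80),
    "marketing": Thresholds(min_subtree_nodes=6, min_height=120),
}
DEFAULT_THRESHOLDS = Thresholds(min_subtree_nodes=6, min_height=60)

# Assumed mapping from specific page types to families.
PAGE_TYPE_FAMILY = {
    "product_list": "commerce",
    "doc_page": "content",
}


def thresholds_for(page_type=None):
    """Resolve a page_type (or a family name) to a set of thresholds."""
    if page_type is None:
        return DEFAULT_THRESHOLDS
    family = PAGE_TYPE_FAMILY.get(page_type, page_type)
    return FAMILY_THRESHOLDS.get(family, DEFAULT_THRESHOLDS)


print(thresholds_for("product_list"))
print(thresholds_for("unknown_type"))
```

Falling back to a default for unknown types keeps the lookup total, so a misspelled page_type degrades to generic behavior instead of raising an error. Whether the package does the same is not stated here.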
Read the full algorithm details in algo.md.
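To give a feel for the structural-similarity check mentioned under Decision Logic, here is a minimal sketch that detects repeated sibling shapes, as in a card grid. It assumes a simplified tree of (tag, children) tuples and uses exact shape equality. The real heuristic is described in algo.md and may well be fuzzier. The names signature and find_repeated_group and the min_repeats parameter are hypothetical.

```python
from collections import Counter


def signature(node):
    """Shape of a subtree: its tag plus the shapes of its children, ignoring text."""
    tag, children = node
    return (tag, tuple(signature(child) for child in children))


def find_repeated_group(children, min_repeats=3):
    """Return indices of siblings that share the most common shape.

    Returns an empty list when fewer than min_repeats siblings match.
    """
    shapes = [signature(child) for child in children]
    if not shapes:
        return []
    common, count = Counter(shapes).most_common(1)[0]
    if count < min_repeats:
        return []
    return [i for i, shape in enumerate(shapes) if shape == common]


# Three identical cards with an unrelated separator between them.
card = ("div", [("img", []), ("h3", []), ("p", [])])
grid_children = [card, card, ("hr", []), card]
print(find_repeated_group(grid_children))
```

Siblings that match can then be treated as individual card segments instead of being merged into one large block or split into their inner elements.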
A Makefile is included to streamline development tasks:
- make install: Install the project.
- make install-dev: Install with development dependencies.
- make build: Build the distribution packages (sdist and wheel).
- make publish: Build and publish the package to PyPI using twine.
- make clean: Clean up build artifacts and cache directories.
- make lint: Run basic syntax checks.
- make docs: Build Sphinx documentation.
- make test-run: Run a quick smoke test.