Add memory-efficient inspection tools for HTML documents to help agents extract specific data without parsing entire pages into context.
Motivation
HTML documents can be large, especially modern web pages with embedded scripts and styles. Agents often need to extract specific elements, count elements, or inspect structure without loading the entire DOM into context.
Proposed Functions
High Priority - Selective Extraction
get_html_text_at_selector - Extract text from specific element(s) by CSS selector
get_html_element_at_selector - Extract element HTML by CSS selector
extract_html_attributes - Get all attributes from elements matching selector
extract_html_links - List all links (href) without full parse
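As a rough illustration of what "without full parse" could look like for extract_html_links, here is a minimal sketch built on the standard library's streaming html.parser, which fires callbacks per tag instead of building a DOM. The signature, the helper class name, and the returned dict shape (href plus link text) are assumptions for illustration, not the proposed API.

```python
from html.parser import HTMLParser
from typing import Optional


class _LinkCollector(HTMLParser):
    """Collects href values and link text from <a> tags as the parser streams."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.links: list = []
        self._current: Optional[dict] = None  # link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip anchors without an href
                self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace so the text stays compact.
            self._current["text"] = " ".join(self._current["text"].split())
            self.links.append(self._current)
            self._current = None


def extract_html_links(html_content: str) -> list[dict[str, str]]:
    """Hypothetical signature: return [{"href": ..., "text": ...}, ...]."""
    collector = _LinkCollector()
    collector.feed(html_content)
    collector.close()
    return collector.links
```

Only the matched links are held in memory; the rest of the page is discarded as it streams past. An anchor that is never closed is dropped in this sketch.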
Medium Priority - Inspection
count_html_elements - Count elements by tag name or selector
get_html_structure - Get DOM tree overview (tag hierarchy) without content
get_html_metadata - Extract meta tags, title, description only
search_html_text - Find elements containing text pattern
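The inspection functions above could share one streaming pass. The sketch below covers count_html_elements (by tag name only) and get_html_structure (an indented tag outline with no text content). The signatures, the max_depth parameter, and the helper names are assumptions. Depth tracking relies on explicit closing tags, so documents that omit optional end tags such as </li> or </p> will produce a drifting outline.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not deepen the outline.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _StructureScanner(HTMLParser):
    """Records tag counts and a (depth, tag) outline; text content is ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()
        self.outline = []  # list of (depth, tag) tuples in document order
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        self.outline.append((self._depth, tag))
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth > 0:
            self._depth -= 1


def _scan(html_content: str) -> _StructureScanner:
    scanner = _StructureScanner()
    scanner.feed(html_content)
    scanner.close()
    return scanner


def count_html_elements(html_content: str, tag_name: str) -> int:
    """Hypothetical signature: count start tags with the given name."""
    return _scan(html_content).counts[tag_name.lower()]


def get_html_structure(html_content: str, max_depth: int) -> list[str]:
    """Hypothetical signature: indented tag outline down to max_depth."""
    return ["  " * depth + tag
            for depth, tag in _scan(html_content).outline
            if depth <= max_depth]
```

Counting by full CSS selector would need a selector engine on top of this; the sketch deliberately stops at tag names.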
Medium Priority - Data Extraction
extract_html_tables_simple - Extract tables as structured data (complement to existing extract_table)
extract_html_lists - Extract ul/ol lists as arrays
extract_html_forms - Extract form structure (fields, actions)
preview_html_elements - Get first N elements matching selector
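To make the "form structure (fields, actions)" idea concrete, here is a minimal sketch of extract_html_forms using the same streaming approach. The output shape (action, method, and a list of fields with tag, name, and type) and the set of tags treated as fields are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from typing import Any, Optional

# Tags treated as form fields in this sketch (an assumption, not a spec).
FIELD_TAGS = {"input", "select", "textarea", "button"}


class _FormCollector(HTMLParser):
    """Collects each form's action, method, and fields while streaming."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: list = []
        self.current: Optional[dict] = None  # form being read, if any

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; normalise them to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "form":
            self.current = {
                "action": attr_map.get("action", ""),
                # HTML treats a missing method as GET.
                "method": (attr_map.get("method") or "get").lower(),
                "fields": [],
            }
        elif tag in FIELD_TAGS and self.current is not None:
            self.current["fields"].append({
                "tag": tag,
                "name": attr_map.get("name", ""),
                "type": attr_map.get("type", ""),
            })

    def handle_endtag(self, tag):
        if tag == "form" and self.current is not None:
            self.forms.append(self.current)
            self.current = None


def extract_html_forms(html_content: str) -> list[dict[str, Any]]:
    """Hypothetical signature: one dict per form, JSON-serializable."""
    collector = _FormCollector()
    collector.feed(html_content)
    collector.close()
    if collector.current is not None:  # keep a form that was never closed
        collector.forms.append(collector.current)
    return collector.forms
```

Fields outside any form are ignored here; whether they should be reported separately is a design decision the proposal leaves open.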
Lower Priority - Analysis
get_html_element_stats - Statistics for element types (count, attributes, depth)
validate_html_structure_simple - Quick validation without full parse
get_html_selector_path - Get CSS selector path for element
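The proposal does not say how get_html_selector_path identifies its target element. The sketch below assumes the caller passes an id attribute value and receives a child-combinator path such as "html > body > div > p#intro". Both the signature and that lookup strategy are hypothetical. An id containing characters that need CSS escaping is not handled.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _PathFinder(HTMLParser):
    """Tracks the stack of open tags until the element with the target id appears."""

    def __init__(self, element_id: str) -> None:
        super().__init__()
        self._target = element_id
        self._stack = []  # currently open ancestor tags
        self.path = ""    # stays empty if the id is never seen

    def handle_starttag(self, tag, attrs):
        if self.path:
            return  # already found; ignore the rest of the document
        if dict(attrs).get("id") == self._target:
            self.path = " > ".join(self._stack + [f"{tag}#{self._target}"])
        elif tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self.path or tag not in self._stack:
            return
        # Pop up to and including the most recent matching open tag,
        # which also discards any unclosed tags nested inside it.
        while self._stack.pop() != tag:
            pass


def get_html_selector_path(html_content: str, element_id: str) -> str:
    """Hypothetical signature: selector path to the element with this id, or ""."""
    finder = _PathFinder(element_id)
    finder.feed(html_content)
    finder.close()
    return finder.path
```

Because the parser does not apply HTML's implied-end-tag rules, a sibling that follows an unclosed <li> or <p> will appear nested under it in the returned path.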
Design Principles
Google ADK compliant (JSON-serializable types, no default parameter values)
@strands_tool decorator
CSS selector support for element selection
Memory-efficient (selective parsing where possible)
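Taken together, these principles suggest a function shape like the sketch below for get_html_text_at_selector: every parameter is required, the return value is a plain list of strings, and parsing is streamed. The signature, the max_results parameter, and the tiny selector subset (tag, #id, .class, and compounds such as p.note) are assumptions. Full CSS selectors would need a real selector engine. The project's @strands_tool decorator would be applied to the public function, but it is left out here because its import path is not shown in this issue.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# One compound selector only: optional tag, optional #id, zero or more .classes.
_SIMPLE = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)?(?:#(?P<id>[\w-]+))?(?P<classes>(?:\.[\w-]+)*)$"
)


class _TextAtSelector(HTMLParser):
    """Buffers text only while inside an element that matches the selector."""

    def __init__(self, tag, element_id, classes, max_results) -> None:
        super().__init__()
        self._tag, self._id, self._classes = tag, element_id, classes
        self._max = max_results
        self.results = []
        self._depth = 0      # > 0 while inside a matched element
        self._open_tag = ""  # tag name of the matched element
        self._buffer = []

    def _matches(self, tag, attrs) -> bool:
        attr_map = {name: (value or "") for name, value in attrs}
        if self._tag and tag != self._tag:
            return False
        if self._id and attr_map.get("id") != self._id:
            return False
        return self._classes <= set(attr_map.get("class", "").split())

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == self._open_tag:  # same tag nested inside the match
                self._depth += 1
        elif (len(self.results) < self._max and tag not in VOID_TAGS
              and self._matches(tag, attrs)):
            self._open_tag, self._depth, self._buffer = tag, 1, []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._depth and tag == self._open_tag:
            self._depth -= 1
            if self._depth == 0:
                self.results.append(" ".join("".join(self._buffer).split()))


# The project's @strands_tool decorator would go here (import path not shown).
def get_html_text_at_selector(
    html_content: str, selector: str, max_results: int
) -> list[str]:
    """Hypothetical signature: all parameters required, JSON-serializable result."""
    match = _SIMPLE.match(selector.strip())
    if match is None or not selector.strip():
        raise ValueError(f"Unsupported selector: {selector!r}")
    classes = {name for name in match.group("classes").split(".") if name}
    parser = _TextAtSelector(
        (match.group("tag") or "").lower(), match.group("id") or "",
        classes, max_results,
    )
    parser.feed(html_content)
    parser.close()
    return parser.results
```

Making max_results a required argument keeps the signature free of defaults while still letting the caller cap how much text comes back into context.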
Related
Module
html/parsing.py