Auto-extract extension taxonomy facts from XBRL linkbases #766

@dej-h

Feature Request: Auto-extract extension taxonomy facts from XBRL linkbases

Feature Category

  • New API functionality
  • Performance improvement
  • Developer experience improvement
  • Documentation enhancement
  • Tool/utility addition

Problem Statement

Is your feature request related to a problem? Please describe.

When fetching financials for companies that define custom XBRL extension concepts (Tesla with tsla_AutomotiveRevenue, Berkshire with brka_*, banks, utilities, and many others), all extension facts are silently dropped from the normalized output. These concepts live in the company's own XBRL namespace and are never matched against the standard us-gaap tag map. The result is that the company-specific data that is most interesting from a fundamental-analysis perspective (segment revenue breakdowns, custom balance sheet line items, non-standard cash flow adjustments) is invisible.

Today, company_mappings/ ships hardcoded JSON for exactly 3 companies (TSLA, MSFT, BRKA). This cannot scale: there are thousands of SEC filers with extension namespaces, and the mappings go stale as companies rename concepts across filings.

The data to solve this is already in every filing. The XBRL submission package that edgartools already fetches contains three files (the .xsd schema plus the _lab.xml and _cal.xml linkbases) that together provide the human-readable label for every extension concept and, where the company declares one, its structural GAAP parent relationship, with no hardcoding required.

Who would benefit from this feature?

  • Beginner Python users working with SEC filings
  • ✅ Financial analysts and researchers
  • ✅ Advanced developers building financial applications
  • ✅ Data scientists working with financial datasets

Proposed Solution

Describe the solution you'd like

When edgartools parses an XBRL filing package, add a linkbase extraction pass that reads the three files already present in the submission:

  1. .xsd - identifies which extension elements are non-abstract (actual reportable values) vs. structural (axes, domains, members, table/line-item wrappers). The abstract="true" attribute is the authoritative signal.

  2. _lab.xml (label linkbase) - resolves the arc chain (loc, labelArc, label) to get the company-authored human-readable label for each extension concept. This is exactly the label that would appear in the printed 10-K/10-Q.

  3. _cal.xml (calculation linkbase) - for each extension concept that appears as a child of a us-gaap_ parent in a calculation arc, records the parent concept and weight (+1.0 additive, -1.0 subtractive). This is the company's own declaration of where the concept sits within the financial statement hierarchy.
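The label and calculation passes described above can be sketched with the standard library alone. This is a minimal illustration, not edgartools' existing parser: the function names (`extract_labels`, `extract_calc_parents`), the `abc` prefix, and the inline sample XML are all made up for the example. Only the XBRL 2.1 namespace, role, and arcrole URIs are taken from the specification.

```python
import xml.etree.ElementTree as ET

LINK = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"
STANDARD_LABEL_ROLE = "http://www.xbrl.org/2003/role/label"
NS = ('xmlns:link="http://www.xbrl.org/2003/linkbase" '
      'xmlns:xlink="http://www.w3.org/1999/xlink"')

# Made-up stand-ins for a filing's _lab.xml and _cal.xml.
LAB_XML = f"""<link:linkbase {NS}>
 <link:labelLink xlink:type="extended" xlink:role="http://www.xbrl.org/2003/role/link">
  <link:loc xlink:type="locator" xlink:label="loc_1"
            xlink:href="abc-20250930.xsd#abc_WidgetRevenue"/>
  <link:label xlink:type="resource" xlink:label="lab_1" xml:lang="en-US"
              xlink:role="{STANDARD_LABEL_ROLE}">Widget revenue</link:label>
  <link:labelArc xlink:type="arc" xlink:from="loc_1" xlink:to="lab_1"
                 xlink:arcrole="http://www.xbrl.org/2003/arcrole/concept-label"/>
 </link:labelLink>
</link:linkbase>"""

CAL_XML = f"""<link:linkbase {NS}>
 <link:calculationLink xlink:type="extended"
                       xlink:role="http://abc.example/role/ConsolidatedStatementsOfOperations">
  <link:loc xlink:type="locator" xlink:label="loc_p"
            xlink:href="https://xbrl.fasb.org/us-gaap/2025/elts/us-gaap-2025.xsd#us-gaap_Revenues"/>
  <link:loc xlink:type="locator" xlink:label="loc_c"
            xlink:href="abc-20250930.xsd#abc_WidgetRevenue"/>
  <link:calculationArc xlink:type="arc" xlink:from="loc_p" xlink:to="loc_c"
                       xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item"
                       weight="1.0" order="1"/>
 </link:calculationLink>
</link:linkbase>"""


def _locators(link):
    """Map each locator's xlink:label to the concept id after '#' in its href."""
    return {loc.get(XLINK + "label"): loc.get(XLINK + "href").rsplit("#", 1)[-1]
            for loc in link.findall(LINK + "loc")}


def extract_labels(lab_xml):
    """Follow loc -> labelArc -> label and return {concept_id: standard label}."""
    labels = {}
    for link in ET.fromstring(lab_xml).iter(LINK + "labelLink"):
        locs = _locators(link)
        resources = {res.get(XLINK + "label"): (res.text or "").strip()
                     for res in link.findall(LINK + "label")
                     if res.get(XLINK + "role") == STANDARD_LABEL_ROLE}
        for arc in link.findall(LINK + "labelArc"):
            concept = locs.get(arc.get(XLINK + "from"))
            text = resources.get(arc.get(XLINK + "to"))
            if concept and text:
                labels[concept] = text
    return labels


def extract_calc_parents(cal_xml, prefix):
    """Return (child, parent, weight, role) for extension children of us-gaap parents."""
    rows = []
    for link in ET.fromstring(cal_xml).iter(LINK + "calculationLink"):
        role = link.get(XLINK + "role")
        locs = _locators(link)
        for arc in link.findall(LINK + "calculationArc"):
            parent = locs.get(arc.get(XLINK + "from"))
            child = locs.get(arc.get(XLINK + "to"))
            if (parent and child and child.startswith(prefix + "_")
                    and parent.startswith("us-gaap_")):
                rows.append((child, parent, float(arc.get("weight", "1")), role))
    return rows


print(extract_labels(LAB_XML))
print(extract_calc_parents(CAL_XML, "abc"))
```

Keeping the role URI alongside each arc is a deliberate choice here, since the same concept can appear under several calculation roles in one filing.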

The output would be a collection of ExtensionFact objects (or similar) exposed alongside the standard normalized facts, annotated with label and parent relationship so downstream consumers can place them correctly without guessing.
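One possible shape for that output type, purely as a starting point for discussion: a frozen dataclass whose fields mirror the attributes used in the use case example in this issue. The class body and the `statement_role` field are assumptions, not an existing edgartools type, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtensionFact:
    """Hypothetical container for one extension-taxonomy fact."""
    concept: str                           # e.g. "abc_WidgetRevenue"
    value: float                           # reported value from the instance document
    period: str                            # reporting period, e.g. "Q3 2025"
    label: Optional[str] = None            # company-authored label from _lab.xml
    parent_concept: Optional[str] = None   # us-gaap parent from _cal.xml, if declared
    weight: Optional[float] = None         # +1.0 additive, -1.0 subtractive
    statement_role: Optional[str] = None   # role URI of the calculation link (assumed field)

    @property
    def has_parent(self) -> bool:
        """True when the filing declares a calculation parent for this concept."""
        return self.parent_concept is not None


# Invented sample values, for illustration only.
fact = ExtensionFact(
    concept="abc_WidgetRevenue",
    value=1_250_000.0,
    period="Q3 2025",
    label="Widget revenue",
    parent_concept="us-gaap_Revenues",
    weight=1.0,
)
print(fact.has_parent)
```

The parent-related fields are optional because only concepts that appear in a calculation arc carry a parent and weight.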

Describe alternatives you've considered

  • Expanding company_mappings/ manually: not scalable, goes stale, and requires maintenance per company per filing year.
  • Using the companyfacts API endpoint: this endpoint pre-aggregates facts but strips the linkbase metadata. The label and parent relationship information is only available in the per-filing XML package, which edgartools already accesses for its XBRL parsing.
  • Ignoring extension facts entirely: the status quo. For many companies this is a significant gap in coverage, particularly for segment-level analysis.

Use Case Example

How would you use this feature?

from edgar import Company

company = Company("TSLA")
financials = company.get_financials()

# Standard facts work as today
revenue = financials.income_statement["Revenue"]

# Extension facts — new
extension_facts = financials.extension_facts  # or similar

for fact in extension_facts:
    print(fact.concept)        # "tsla_RestructuringAndOtherExpenses"
    print(fact.label)          # "Restructuring And Other Expenses"
    print(fact.parent_concept) # "us-gaap_OperatingExpenses"
    print(fact.weight)         # 1.0  (additive component of parent)
    print(fact.value)          # 1_730_000_000
    print(fact.period)         # "Q3 2025"

For developers building applications that need segment-level data:

# Get all revenue sub-components Tesla reports that aren't standard GAAP
revenue_segments = [
    f for f in financials.extension_facts
    if f.parent_concept == "us-gaap_Revenues"
]
# → tsla_AutomotiveRevenues, tsla_EnergyGenerationAndStorageRevenues, tsla_ServicesAndOtherRevenues

Implementation Considerations

Proof of concept

Verified against Tesla's Q3 2025 10-Q using only stdlib + requests: 17 non-abstract extension concepts with full parent relationships were extracted from a single filing with zero hardcoding. Happy to share the extraction script if useful.

The GAAP parent relationship is precisely what's needed to slot each extension fact into the statement hierarchy, and it's declared by the company itself in the filing.
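To illustrate how the parent and weight let a consumer place facts without guessing, here is a small sketch that groups extension facts under their declared parent and applies the weight to get each signed contribution. The function name, tuple layout, concept names, and numbers are all invented for the example.

```python
from collections import defaultdict


def group_by_parent(facts):
    """Group (concept, parent, weight, value) tuples under their parent concept.

    Returns {parent: [(concept, signed_contribution), ...]}, where the signed
    contribution is weight * value, so subtractive children come out negative.
    """
    tree = defaultdict(list)
    for concept, parent, weight, value in facts:
        tree[parent].append((concept, weight * value))
    return dict(tree)


# Invented sample data; not taken from any real filing.
sample = [
    ("abc_WidgetRevenue", "us-gaap_Revenues", 1.0, 100),
    ("abc_GadgetRevenue", "us-gaap_Revenues", 1.0, 50),
    ("abc_WidgetRebates", "us-gaap_Revenues", -1.0, 30),
]

placed = group_by_parent(sample)
extension_total = sum(amount for _, amount in placed["us-gaap_Revenues"])
print(placed)
print(extension_total)
```

The total here covers only the extension children; a parent can also have standard us-gaap children, so it is not expected to equal the parent's reported value on its own.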

Complexity Level:

  • Simple (minor API addition)
  • ✅ Moderate (new functionality with existing patterns)
  • Complex (significant architectural changes)

The linkbase parsing code already exists in edgartools' xbrl/ module for other purposes. The main work is wiring the extraction into the normalization pipeline and defining the output type for extension facts.

Backwards Compatibility:

  • ✅ This feature maintains backwards compatibility

Extension facts would be surfaced as an additive property or collection; the existing .income_statement, .balance_sheet, and .cash_flow_statement access patterns are unchanged. The company_mappings/ JSON files could remain supported as an override layer for cases where manual curation is preferred.


Additional Context

Why the companyfacts API can't solve this

The SEC's data.sec.gov/api/xbrl/companyfacts/CIK.json endpoint, the most convenient EDGAR data source, pre-aggregates facts but strips all linkbase metadata. The label and parent relationship for extension concepts are only available in the per-filing XML submission package. edgartools already accesses these packages for its XBRL instance parsing; this feature would read the linkbases that are fetched alongside them.

Filtering structural elements

Not all extension namespace elements are facts. XBRL filings use extension namespaces for structural scaffolding: *Axis, *Domain, *Member, *Abstract, *Table, *LineItems. These should be excluded. The reliable filter is abstract="true" in the .xsd: structural elements declare themselves abstract; reportable value elements do not. In the Tesla example above, 37 of 83 extension concepts are abstract and filtered out, leaving 46 actual reportable elements.
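A minimal sketch of that filter, using only `xml.etree.ElementTree`. The function name and the sample schema are invented; the one behaviour relied on is that XML Schema treats a missing `abstract` attribute as false and accepts `"1"` as well as `"true"` for booleans.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Made-up stand-in for a filing's extension .xsd.
XSD_XML = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    targetNamespace="http://abc.example/20250930">
  <xs:element id="abc_WidgetRevenue" name="WidgetRevenue" abstract="false"
      type="xbrli:monetaryItemType" substitutionGroup="xbrli:item"
      nillable="true" xbrli:periodType="duration"/>
  <xs:element id="abc_WidgetCount" name="WidgetCount"
      type="xbrli:integerItemType" substitutionGroup="xbrli:item"
      nillable="true" xbrli:periodType="instant"/>
  <xs:element id="abc_ProductLineAxis" name="ProductLineAxis" abstract="true"
      type="xbrli:stringItemType" nillable="true" xbrli:periodType="duration"/>
  <xs:element id="abc_WidgetMember" name="WidgetMember" abstract="true"
      type="xbrli:stringItemType" nillable="true" xbrli:periodType="duration"/>
</xs:schema>"""


def split_extension_elements(xsd_xml):
    """Return (reportable_ids, structural_ids) based on the abstract attribute."""
    reportable, structural = [], []
    for element in ET.fromstring(xsd_xml).findall(XS + "element"):
        concept_id = element.get("id")
        if concept_id is None:
            continue  # no id means linkbase locators cannot reference it
        # A missing attribute defaults to non-abstract, per XML Schema.
        if element.get("abstract", "false") in ("true", "1"):
            structural.append(concept_id)
        else:
            reportable.append(concept_id)
    return reportable, structural


reportable, structural = split_extension_elements(XSD_XML)
print(reportable)
print(structural)
```

Filtering on the attribute rather than on name suffixes such as `Axis` or `Member` avoids misclassifying a reportable concept whose name happens to end in one of those words.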

Statement role filtering

Some extension concepts appear in calculation links only under disclosure schedule roles (note tables, supplemental detail schedules) rather than primary financial statement roles (ConsolidatedBalanceSheets, ConsolidatedStatementsofOperations, etc.). These could optionally be flagged with lower prominence, since primary statement facts are what most consumers want by default.
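One rough way to flag prominence is a keyword heuristic on the last segment of the calculation link's role URI. The keyword lists, function name, and example URIs below are assumptions made for illustration; role naming varies between filers, so a real implementation would probably need something sturdier, such as the role definitions in the .xsd.

```python
# Hypothetical keyword lists; real filers name roles inconsistently.
PRIMARY_HINTS = (
    "balancesheet",
    "statementsofoperations",
    "statementofoperations",
    "incomestatement",
    "statementsofincome",
    "cashflow",
    "comprehensiveincome",
)
DISCLOSURE_HINTS = ("details", "tables", "policies")


def is_primary_statement_role(role_uri):
    """Guess whether a role URI names a primary financial statement."""
    name = role_uri.rstrip("/").rsplit("/", 1)[-1].lower()
    if any(hint in name for hint in DISCLOSURE_HINTS):
        return False  # note tables and detail schedules get lower prominence
    return any(hint in name for hint in PRIMARY_HINTS)


# Example URIs are invented.
roles = [
    "http://abc.example/role/ConsolidatedBalanceSheets",
    "http://abc.example/role/ConsolidatedStatementsofOperations",
    "http://abc.example/role/SegmentReportingDetails",
]
for role in roles:
    print(role, is_primary_statement_role(role))
```

Returning a flag rather than dropping disclosure-only concepts keeps them available to consumers who want the note-level detail.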

Related Issues/Features:

  • The existing company_mappings/ system in edgar/xbrl/standardization/ addresses the same problem via hardcoding; this feature would make that approach unnecessary for the common case.
  • The unmapped_logger.py in the same directory already tracks unrecognised concepts, suggesting this gap is known.

Feature requests are evaluated based on EdgarTools' core principles: Simple yet powerful, accurate financials, beginner-friendly, and joyful UX.

Labels: enhancement (New feature or request)
