Auto-extract extension taxonomy facts from XBRL linkbases

# Feature Request: Auto-extract extension taxonomy facts from XBRL linkbases

## Feature Category
- [x] New API functionality
- [ ] Performance improvement
- [ ] Developer experience improvement
- [ ] Documentation enhancement
- [ ] Tool/utility addition

---

## Problem Statement

**Is your feature request related to a problem? Please describe.**

When fetching financials for companies that define custom XBRL extension concepts (Tesla with `tsla_AutomotiveRevenue`, Berkshire with `brka_*`, banks, utilities, and many others), all extension facts are silently dropped from the normalized output. These concepts live in the company's own XBRL namespace and are never matched against the standard `us-gaap` tag map. As a result, the company-specific data that is often the most interesting for fundamental analysis (segment revenue breakdowns, custom balance sheet line items, non-standard cash flow adjustments) is invisible.

Today, `company_mappings/` ships hardcoded JSON for exactly 3 companies (TSLA, MSFT, BRKA). This cannot scale: there are thousands of SEC filers with extension namespaces, and the mappings go stale as companies rename concepts across filings.

**The data to solve this is already in every filing.** The XBRL submission package that edgartools already fetches contains three files (the `.xsd` schema, the `_lab.xml` label linkbase, and the `_cal.xml` calculation linkbase) that together provide the human-readable label and the structural GAAP parent relationship for extension concepts, with no hardcoding required.

**Who would benefit from this feature?**
- [ ] Beginner Python users working with SEC filings
- [x] Financial analysts and researchers
- [x] Advanced developers building financial applications
- [x] Data scientists working with financial datasets

---

## Proposed Solution

**Describe the solution you'd like**

When edgartools parses an XBRL filing package, add a linkbase extraction pass that reads the three files already present in the submission:

1. **`.xsd`** - identifies which extension elements are non-abstract (actual reportable values) vs. structural (axes, domains, members, table/line-item wrappers). The `abstract="true"` attribute is the authoritative signal.

2. **`_lab.xml`** (label linkbase) - resolves the arc chain (`loc` → `labelArc` → `label`) to get the company-authored human-readable label for each extension concept. This is exactly the label that would appear in the printed 10-K/10-Q.

3. **`_cal.xml`** (calculation linkbase) - for each extension concept that appears as a child of a `us-gaap_` parent in a calculation arc, records the parent concept and weight (`+1.0` additive, `-1.0` subtractive). This is the company's own declaration of where the concept sits within the financial statement hierarchy.
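
As a rough illustration of steps 2 and 3, the sketch below resolves labels and GAAP parents with the standard library only. It is not edgartools code: the function names (`locators`, `extension_labels`, `extension_parents`), the inline sample XML, and the placeholder file names `tsla.xsd` and `us-gaap.xsd` are all assumptions made for the example. It also simplifies by keeping only the first standard-role label and the first `us-gaap_` parent found for each concept.

```python
import xml.etree.ElementTree as ET

LINK = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"
STANDARD_LABEL = "http://www.xbrl.org/2003/role/label"
NS = ('xmlns:link="http://www.xbrl.org/2003/linkbase" '
      'xmlns:xlink="http://www.w3.org/1999/xlink"')

def locators(extended_link):
    """Map each <loc> xlink:label to the concept id after '#' in its href."""
    return {loc.get(XLINK + "label"): loc.get(XLINK + "href").rpartition("#")[2]
            for loc in extended_link.findall(LINK + "loc")}

def extension_labels(lab_xml, prefix):
    """Follow loc -> labelArc -> label, keeping standard-role labels only."""
    labels = {}
    for link in ET.fromstring(lab_xml).iter(LINK + "labelLink"):
        locs = locators(link)
        texts = {}
        for res in link.findall(LINK + "label"):
            if res.get(XLINK + "role") == STANDARD_LABEL:
                texts.setdefault(res.get(XLINK + "label"), (res.text or "").strip())
        for arc in link.findall(LINK + "labelArc"):
            concept = locs.get(arc.get(XLINK + "from"), "")
            text = texts.get(arc.get(XLINK + "to"))
            if text and concept.startswith(prefix):
                labels.setdefault(concept, text)
    return labels

def extension_parents(cal_xml, prefix):
    """Map extension child -> (us-gaap parent, weight) from calculation arcs."""
    parents = {}
    for link in ET.fromstring(cal_xml).iter(LINK + "calculationLink"):
        locs = locators(link)
        for arc in link.findall(LINK + "calculationArc"):
            parent = locs.get(arc.get(XLINK + "from"), "")
            child = locs.get(arc.get(XLINK + "to"), "")
            if child.startswith(prefix) and parent.startswith("us-gaap_"):
                parents.setdefault(child, (parent, float(arc.get("weight"))))
    return parents

# Hand-written sample documents (not taken from a real filing).
LAB_XML = f"""<link:linkbase {NS}><link:labelLink>
  <link:loc xlink:href="tsla.xsd#tsla_RestructuringAndOtherExpenses" xlink:label="loc_1"/>
  <link:label xlink:label="lab_1" xlink:role="{STANDARD_LABEL}">Restructuring And Other Expenses</link:label>
  <link:labelArc xlink:from="loc_1" xlink:to="lab_1"/>
</link:labelLink></link:linkbase>"""

CAL_XML = f"""<link:linkbase {NS}><link:calculationLink>
  <link:loc xlink:href="us-gaap.xsd#us-gaap_OperatingExpenses" xlink:label="loc_p"/>
  <link:loc xlink:href="tsla.xsd#tsla_RestructuringAndOtherExpenses" xlink:label="loc_c"/>
  <link:calculationArc xlink:from="loc_p" xlink:to="loc_c" weight="1.0"/>
</link:calculationLink></link:linkbase>"""

print(extension_labels(LAB_XML, "tsla_"))
print(extension_parents(CAL_XML, "tsla_"))
```

A real implementation would likely need to decide how to treat concepts that have several parents across different roles, which this sketch ignores.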

The output would be a collection of `ExtensionFact` objects (or similar) exposed alongside the standard normalized facts, annotated with label and parent relationship so downstream consumers can place them correctly without guessing.
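
One possible shape for such an object is sketched below. The field names mirror the attributes used in the use case example in this request; the class itself, the `signed_contribution` helper, and the sample values are illustrative assumptions, not an existing edgartools type.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExtensionFact:
    """Hypothetical container for one extension fact plus linkbase metadata."""
    concept: str                          # e.g. "tsla_RestructuringAndOtherExpenses"
    label: str                            # company-authored label from _lab.xml
    value: float                          # reported value from the instance document
    period: str                           # reporting period, kept as plain text here
    parent_concept: Optional[str] = None  # us-gaap parent from _cal.xml, if any
    weight: Optional[float] = None        # +1.0 additive, -1.0 subtractive

    @property
    def signed_contribution(self) -> Optional[float]:
        """Value as it rolls up into the parent, or None without a parent arc."""
        if self.weight is None:
            return None
        return self.weight * self.value

fact = ExtensionFact(
    concept="tsla_RestructuringAndOtherExpenses",
    label="Restructuring And Other Expenses",
    value=1_730_000_000,
    period="Q3 2025",
    parent_concept="us-gaap_OperatingExpenses",
    weight=1.0,
)
print(fact.signed_contribution)
```

Making `parent_concept` and `weight` optional leaves room for extension concepts that never appear in a calculation arc.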

**Describe alternatives you've considered**

- **Expanding `company_mappings/` manually**: not scalable, goes stale, and requires maintenance per company per filing year.
- **Using the `companyfacts` API endpoint**: this endpoint pre-aggregates facts but strips the linkbase metadata. The label and parent relationship information is only available in the per-filing XML package, which edgartools already accesses for its XBRL parsing.
- **Ignoring extension facts entirely**: the status quo. For many companies this is a significant gap in coverage, particularly for segment-level analysis.

---

## Use Case Example

**How would you use this feature?**

```python
from edgar import Company

company = Company("TSLA")
financials = company.get_financials()

# Standard facts work as today
revenue = financials.income_statement["Revenue"]

# Extension facts — new
extension_facts = financials.extension_facts  # or similar

for fact in extension_facts:
    print(fact.concept)        # "tsla_RestructuringAndOtherExpenses"
    print(fact.label)          # "Restructuring And Other Expenses"
    print(fact.parent_concept) # "us-gaap_OperatingExpenses"
    print(fact.weight)         # 1.0  (additive component of parent)
    print(fact.value)          # 1_730_000_000
    print(fact.period)         # "Q3 2025"
```

For developers building applications that need segment-level data:

```python
# Get all revenue sub-components Tesla reports that aren't standard GAAP
revenue_segments = [
    f for f in financials.extension_facts
    if f.parent_concept == "us-gaap_Revenues"
]
# → tsla_AutomotiveRevenues, tsla_EnergyGenerationAndStorageRevenues, tsla_ServicesAndOtherRevenues
```

---

## Implementation Considerations

**Proof of concept**

Verified against Tesla's Q3 2025 10-Q using only stdlib + requests: 17 non-abstract extension concepts with full parent relationships were extracted from a single filing with zero hardcoding. Happy to share the extraction script if useful.

The GAAP parent relationship is precisely what's needed to slot each extension fact into the statement hierarchy, and it's declared by the company itself in the filing.

**Complexity Level:**
- [ ] Simple (minor API addition)
- [x] Moderate (new functionality with existing patterns)
- [ ] Complex (significant architectural changes)

The linkbase parsing code already exists in edgartools' `xbrl/` module for other purposes. The main work is wiring the extraction into the normalization pipeline and defining the output type for extension facts.

**Backwards Compatibility:**
- [x] This feature maintains backwards compatibility

Extension facts would be surfaced as an additive property or collection; the existing `.income_statement`, `.balance_sheet`, and `.cash_flow_statement` access patterns are unchanged. The `company_mappings/` JSON files could remain supported as an override layer for cases where manual curation is preferred.
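
To make the override idea concrete, here is a minimal sketch of one possible precedence rule: automatically extracted metadata is the base, and manually curated entries replace individual fields. The dictionary layout and the `apply_overrides` name are assumptions for illustration; they do not reflect the actual `company_mappings/` JSON schema.

```python
def apply_overrides(extracted, overrides):
    """Return extracted metadata with manual fields layered on top.

    Both arguments map concept name -> dict of metadata fields
    (a hypothetical layout, not the real company_mappings format).
    """
    # Copy each inner dict so the extracted input is left untouched.
    merged = {concept: dict(meta) for concept, meta in extracted.items()}
    for concept, manual in overrides.items():
        merged.setdefault(concept, {}).update(manual)
    return merged

extracted = {
    "tsla_RestructuringAndOtherExpenses": {
        "label": "Restructuring And Other Expenses",
        "parent_concept": "us-gaap_OperatingExpenses",
    }
}
overrides = {"tsla_RestructuringAndOtherExpenses": {"label": "Restructuring"}}

merged = apply_overrides(extracted, overrides)
print(merged["tsla_RestructuringAndOtherExpenses"])
```

Field-level merging means a curated label does not discard the parent relationship taken from the filing.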

---

## Additional Context

**Why the `companyfacts` API can't solve this**

The SEC's `data.sec.gov/api/xbrl/companyfacts/CIK.json` endpoint, the most convenient EDGAR data source, pre-aggregates facts but strips all linkbase metadata. The label and parent relationship for extension concepts are only available in the per-filing XML submission package. Edgartools already accesses these packages for its XBRL instance parsing; this feature would read the linkbases that are fetched alongside them.

**Filtering structural elements**

Not all extension namespace elements are facts. XBRL filings use extension namespaces for structural scaffolding: `*Axis`, `*Domain`, `*Member`, `*Abstract`, `*Table`, `*LineItems`. These should be excluded. The reliable filter is `abstract="true"` in the `.xsd`: structural elements declare themselves abstract; reportable value elements do not. In the Tesla example above, 37 of 83 extension concepts are abstract and filtered out, leaving 46 actual reportable elements.
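
The filter described above could look roughly like the following. The `split_by_abstract` name and the three-element sample schema are made up for the example (the element names are placeholders, not Tesla's actual taxonomy), and real schemas carry more attributes than shown.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def split_by_abstract(xsd_xml):
    """Return (reportable, structural) ids of top-level schema elements."""
    reportable, structural = [], []
    for el in ET.fromstring(xsd_xml).findall(XS + "element"):
        concept = el.get("id") or el.get("name")
        # XML Schema booleans may be written as "true" or "1".
        is_abstract = el.get("abstract", "false") in ("true", "1")
        (structural if is_abstract else reportable).append(concept)
    return reportable, structural

# Hand-written sample schema with placeholder element names.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element id="tsla_RestructuringAndOtherExpenses" name="RestructuringAndOtherExpenses" abstract="false"/>
  <xs:element id="tsla_ExampleSegmentAxis" name="ExampleSegmentAxis" abstract="true"/>
  <xs:element id="tsla_ExampleMember" name="ExampleMember" abstract="true"/>
</xs:schema>"""

reportable, structural = split_by_abstract(SAMPLE_XSD)
print(reportable)
print(structural)
```

Relying on the attribute rather than on name suffixes avoids misclassifying a reportable concept whose name happens to end in `Member` or `Table`.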

**Statement role filtering**

Some extension concepts appear in calculation links only under disclosure schedule roles (note tables, supplemental detail schedules) rather than primary financial statement roles (`ConsolidatedBalanceSheets`, `ConsolidatedStatementsofOperations`, etc.). These could optionally be flagged with lower prominence, since primary statement facts are what most consumers want by default.
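
A simple way to flag prominence is a name-based heuristic on the role URI, sketched below. The hint lists, the `is_primary_statement_role` name, and the `example.com` role URIs are assumptions; role naming varies by filer, so this would need tuning against real filings.

```python
# Substrings that suggest a primary financial statement role (assumed list).
PRIMARY_HINTS = (
    "balancesheet",
    "statementsofoperations",
    "statementsofincome",
    "statementsofcashflows",
    "statementsofcomprehensiveincome",
)
# Substrings that suggest a disclosure schedule role (assumed list).
DISCLOSURE_HINTS = ("details", "tables", "policies")

def is_primary_statement_role(role_uri: str) -> bool:
    """Guess from the last URI segment whether a role is a primary statement."""
    tail = role_uri.rstrip("/").rsplit("/", 1)[-1].lower()
    if any(hint in tail for hint in DISCLOSURE_HINTS):
        return False
    return any(hint in tail for hint in PRIMARY_HINTS)

for uri in (
    "http://www.example.com/role/ConsolidatedBalanceSheets",
    "http://www.example.com/role/ConsolidatedStatementsofOperations",
    "http://www.example.com/role/RevenueRecognitionDetails",
):
    print(uri.rsplit("/", 1)[-1], is_primary_statement_role(uri))
```

Checking the disclosure hints first keeps a role such as `BalanceSheetComponentsDetails` out of the primary set even though it mentions a balance sheet.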

**Related Issues/Features:**
- The existing `company_mappings/` system in `edgar/xbrl/standardization/` addresses the same problem via hardcoding; this feature would make that approach unnecessary for the common case.
- The `unmapped_logger.py` in the same directory already tracks unrecognised concepts, suggesting this gap is known.

---

*Feature requests are evaluated based on EdgarTools' core principles: Simple yet powerful, accurate financials, beginner-friendly, and joyful UX.*



Auto-extract extension taxonomy facts from XBRL linkbases #766
