Feat/rimmel etl#1
Open
rimmelasghar wants to merge 13 commits into
…aframe, SCHEMA, MAPPINGS, validators
…has no co-occurrences
…handling for non-WoS DBs
…hen frame is already standardised
…w, sidebar reveal for 3 entry points
…(bradford-on-cochrane is pre-existing)
…d both API sources
… bugs A-F, grader checklist
Group Details:
Overview
This PR refactors `bibliometrix-python` to reach the Advanced level of the UniNa Hardware & Software MOD-B assignment. The legacy per-database parsing chain (`format_functions.biblio_json` plus a long `if source == "…"` ladder) is replaced by a single declarative ETL that exposes one public entry point, `convert2df(source, file_path, file_type=None)`, and always returns the same 35-column WoS-style DataFrame, regardless of the input database or file type.
The full write-up is in `REPORT.md` (10 sections, including a matrix that maps each of the seven limitations of the original Python implementation to the concrete ETL element that resolves it).
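To make the idea of replacing an `if source == "…"` ladder with a declarative lookup concrete, here is a minimal sketch. It is not the code in this PR: `TOY_SCHEMA`, `TOY_MAPPINGS`, `toy_convert`, the three columns and the raw field names are all hypothetical, and plain dicts stand in for DataFrame rows so the example has no dependencies.

```python
# Hypothetical illustration only; none of these names or columns come from the PR.

# Target contract: every output row has exactly these columns.
TOY_SCHEMA = {
    "AU": {"mandatory": True},   # authors (list of strings)
    "TI": {"mandatory": True},   # title
    "PY": {"mandatory": False},  # publication year
}

# One recipe per (source, file_type) pair instead of one if-branch per source.
TOY_MAPPINGS = {
    ("scopus", "csv"): {
        "AU": {"src": "Authors", "list": True, "sep": ";"},
        "TI": {"src": "Title"},
        "PY": {"src": "Year", "cast": int},
    },
    ("pubmed", "txt"): {
        "AU": {"src": "FAU", "list": True, "sep": "|"},
        "TI": {"src": "TI"},
        "PY": {"src": "DP", "cast": lambda v: int(v[:4])},
    },
}


def toy_convert(source, file_type, raw_rows):
    """Rename, split and cast raw rows according to the matching recipe."""
    recipe = TOY_MAPPINGS[(source, file_type)]  # KeyError if the pair is unsupported
    converted = []
    for raw in raw_rows:
        row = {}
        for column in TOY_SCHEMA:
            spec = recipe.get(column, {})
            # Rename: read the source field, or fall back to the recipe default.
            value = raw.get(spec.get("src"), spec.get("default"))
            # Split list-valued fields on the recipe's separator.
            if value is not None and spec.get("list"):
                value = [part.strip() for part in value.split(spec["sep"]) if part.strip()]
            # Cast scalar fields when the recipe names a converter.
            if value is not None and "cast" in spec:
                value = spec["cast"](value)
            row[column] = value
        converted.append(row)
    return converted


if __name__ == "__main__":
    sample = [{"Authors": "Doe J.; Roe R.", "Title": "A title", "Year": "2020"}]
    print(toy_convert("scopus", "csv", sample))
```

Supporting a new database in a design like this means adding one entry to the mapping table rather than another branch in the conversion function.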
What's inside
1. New ETL package: `www/services/etl/`
Five-phase pipeline (`extract → rename → cast → derive → validate`) driven by two declarative dicts:
- `SCHEMA`: the 35-column WoS-style contract (type + mandatory per column)
- `MAPPINGS[(source, file_type)]`: per-source recipes (`src`, `list`, `sep`, `cast`, `builder`, `default`)

Public API:
- `convert2df(source, file_path, file_type=None, validate_strict=False)`
- `fetch_dataframe(source, query, *, max_results, **kwargs)`: OpenAlex + PubMed
- `validate(df, strict=False)`: contract check with per-column NA report

See `REPORT.md` §3.
2. Dashboard wiring: `app.py`
- … (`api_run_handler`): OpenAlex / PubMed query box that pipes results through `convert2df` and into the existing `df` reactive value, so every downstream analytical panel (Overview, Sources, Authors, Conceptual / Intellectual / Social structure, …) works unchanged.
- … (`csv_unified_run`): reloads any CSV produced by `tests/run_etl.py` and verifies it against `SCHEMA`. This is the cross-database round-trip the brief requires.
- … schema with pill badges for mandatory-column coverage, rendered below both the Fetch and Load buttons.
- `toggle_sidebar` and its JS click delegator now listen to all three entry points (`start_button`, `api_run`, `csv_unified_run`).

See `REPORT.md` §4.
3. Cross-DB robustness patches to legacy analytics (10 files, ~76 LoC, no formula changes)
- `functions/get_citedcountries.py`, `get_citeddocuments.py`, `get_localcitedauthors.py`, `get_localciteddocuments.py`, `get_localcitedreferences.py`, `get_localcitedsources.py`: guards for divide-by-zero on empty top-N slices and NaN-safe tick locators
- `functions/get_worldmapcollaboration.py`: empty-graph fallback
- `www/services/biblionetwork.py`: `None`-crossprod guard
- `www/services/histnetwork.py`: generic-WoS reference parser fallback + empty-`CR` handling (the original raised "Database not compatible…" on anything other than WoS)
- `www/services/format_functions.py`: `biblio_json` now tries the new ETL first and falls back to the legacy parser only on error (zero behavioural change for legacy callers)

See `REPORT.md` §9 for the full diff inventory.
4. Tests
- `tests/compat_etl.py`: 120 source × file-type round-trips
- `tests/smoke_etl.py`: one fast end-to-end per source
- `tests/dashboard_import_smoke.py`: top-level imports
- `tests/run_etl.py --sweep`: CSV exporter on all bundled samples
- `tests/dashboard_compat.py`: 7 standardised CSVs × 8 analyses

The single failure (`bradford` on Cochrane) is a pre-existing edge case in `get_bradford_law` (the same `KeyError` reproduces on the original import path), surfaced transparently in our matrix rather than hidden.
Full breakdown and the actual matrix in `REPORT.md` §5.
5. Demo notebook
`notebooks/ETL_Demonstration.ipynb`: 10 cells, all green, walks through `convert2df`, the live API, schema introspection, validation and a re-use of two unmodified legacy analytical functions through a tiny reactive shim.
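The notebook's shim is not reproduced in this description. As a rough sketch of what such a shim can look like, assuming the legacy functions read their input through a `.get()` call the way Shiny for Python reactive values are read, something like the following would let them run outside a live session. `ReactiveShim` and `legacy_style_count` are hypothetical names used only for illustration.

```python
class ReactiveShim:
    """Hypothetical stand-in for a reactive value outside a running Shiny session."""

    def __init__(self, value=None):
        self._value = value

    def get(self):
        # Return the wrapped object, mirroring how a reactive value is read.
        return self._value

    def set(self, value):
        # Replace the wrapped object; no invalidation or dependency tracking happens.
        self._value = value

    def __call__(self):
        # Reactive values can also be read by calling them.
        return self.get()


def legacy_style_count(df):
    """Hypothetical legacy-style function: it expects a reactive value, not raw data."""
    return len(df.get())


if __name__ == "__main__":
    # A list of rows stands in for the standardised DataFrame.
    shim = ReactiveShim([{"TI": "A title"}, {"TI": "Another title"}])
    print(legacy_style_count(shim))
```

The point of the design is that the legacy function body stays untouched; only the object handed to it changes.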
6. CLI exporter
`tests/run_etl.py`: `--sweep` / `--source` / `--file` / `--query` / `--max` / `--mailto` / `--strict`. Writes `out/etl/<source>__<stem>.csv`. Sample outputs are included so graders can feed them straight back into the dashboard.
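For readers who want a feel for how a CLI with these flags and this output naming could be wired, here is a minimal `argparse` sketch. Only the flag names and the `out/etl/<source>__<stem>.csv` pattern come from the description above; the argument types, help texts and the `build_parser` / `output_path` helpers are assumptions, not the contents of `tests/run_etl.py`.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical parser exposing the flags listed in the PR description."""
    parser = argparse.ArgumentParser(description="Export standardised CSVs (sketch).")
    parser.add_argument("--sweep", action="store_true", help="convert every bundled sample")
    parser.add_argument("--source", help="database name")
    parser.add_argument("--file", type=Path, help="input file to convert")
    parser.add_argument("--query", help="search query for an API source")
    parser.add_argument("--max", type=int, help="maximum number of API results")
    parser.add_argument("--mailto", help="contact e-mail sent with API requests")
    parser.add_argument("--strict", action="store_true", help="fail on contract violations")
    return parser


def output_path(source, input_file, out_dir=Path("out/etl")):
    """Build out/etl/<source>__<stem>.csv from the source name and input file."""
    return out_dir / f"{source}__{Path(input_file).stem}.csv"


if __name__ == "__main__":
    # Example values only; the file need not exist because nothing is read here.
    args = build_parser().parse_args(["--source", "scopus", "--file", "data/sample.csv"])
    print(output_path(args.source, args.file))
```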
How to run