Feat/rimmel etl by rimmelasghar · Pull Request #1 · PRAISELab-PicusLab/bibliometrix-python

rimmelasghar · 2026-05-25T18:49:35Z

Group Details:

Syed Muhammad Rimmel Asghar (D03000237)
Muhammad Bisham Khan (D03000239)
Muzammil Nisar (D03000277)

Overview

This PR refactors bibliometrix-python to reach the Advanced level of
the UniNa Hardware & Software MOD-B assignment. The legacy per-database
parsing chain (format_functions.biblio_json + a long if source == "…"
ladder) is replaced by a single declarative ETL that exposes one
public entry point — convert2df(source, file_path, file_type=None) —
and always returns the same 35-column WoS-style DataFrame, regardless of
the input database or file type.

The full write-up is in REPORT.md (10 sections,
including a matrix that maps each of the seven limitations of the
original Python implementation to the concrete ETL element that
resolves it).

What's inside

1. New ETL package — `www/services/etl/`

Five-phase pipeline (extract → rename → cast → derive → validate)
driven by two declarative dicts:

SCHEMA — the 35-column WoS-style contract (type + mandatory per column)
MAPPINGS[(source, file_type)] — per-source recipes (src, list, sep, cast, builder, default)
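A minimal sketch of how a declarative recipe can drive the rename and cast phases, assuming plain dict records instead of a DataFrame. The three-column `SCHEMA` excerpt, the `("scopus", "csv")` key, the raw column names and the helper names `apply_recipe` / `convert_records` are all hypothetical; only the rule keys (src, list, sep, cast, default) come from the description above. The real package may be organised differently.

```python
from typing import Any

# Hypothetical three-column excerpt; the real SCHEMA has 35 columns.
SCHEMA: dict[str, dict[str, Any]] = {
    "AU": {"type": list, "mandatory": True},   # authors
    "TI": {"type": str, "mandatory": True},    # title
    "PY": {"type": int, "mandatory": False},   # publication year
}

# Hypothetical recipe for one (source, file_type) pair.
MAPPINGS: dict[tuple[str, str], dict[str, dict[str, Any]]] = {
    ("scopus", "csv"): {
        "AU": {"src": "Authors", "list": True, "sep": ";"},
        "TI": {"src": "Title"},
        "PY": {"src": "Year", "cast": int, "default": None},
    },
}


def apply_recipe(raw, recipe):
    """Rename and cast one raw record according to a per-source recipe."""
    out = {}
    for column, rule in recipe.items():
        value = raw.get(rule["src"])
        if value is None or value == "":
            # Missing input: empty list for list columns, else the default.
            out[column] = [] if rule.get("list") else rule.get("default")
            continue
        if rule.get("list"):
            sep = rule.get("sep", ";")
            value = [part.strip() for part in value.split(sep) if part.strip()]
        cast = rule.get("cast")
        out[column] = cast(value) if cast else value
    return out


def convert_records(source, file_type, rows):
    """Apply the recipe for (source, file_type) and project onto SCHEMA."""
    recipe = MAPPINGS[(source, file_type)]
    converted = [apply_recipe(row, recipe) for row in rows]
    # Every output record carries every schema column, in schema order.
    return [{col: rec.get(col) for col in SCHEMA} for rec in converted]
```

Adding a new source in this style means adding one `MAPPINGS` entry rather than another branch in an `if source == ...` ladder.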

Public API:

convert2df(source, file_path, file_type=None, validate_strict=False)
fetch_dataframe(source, query, *, max_results, **kwargs) — OpenAlex + PubMed
validate(df, strict=False) — contract check with per-column NA report
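The contract check can be pictured roughly as below. This is a sketch under stated assumptions, not the PR's `validate`: it works on a list of dict records, takes the schema as an explicit argument, and treats "strict" as "raise if any mandatory column has a missing value". The name `validate_records` and the report layout are hypothetical.

```python
def validate_records(records, schema, strict=False):
    """Return a per-column missing-value report; raise in strict mode."""
    total = len(records)
    report = {}
    for column, spec in schema.items():
        # None, empty string and empty list all count as missing here.
        missing = sum(1 for rec in records if rec.get(column) in (None, "", []))
        report[column] = {
            "mandatory": spec["mandatory"],
            "missing": missing,
            "missing_share": missing / total if total else 0.0,
        }
    failing = [c for c, r in report.items() if r["mandatory"] and r["missing"]]
    if strict and failing:
        raise ValueError(f"mandatory columns with missing values: {failing}")
    return report
```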

See REPORT.md §3.

2. Dashboard wiring — `app.py`

- Live API panel (api_run_handler): OpenAlex / PubMed query box that pipes results through convert2df and into the existing df reactive value, so every downstream analytical panel (Overview, Sources, Authors, Conceptual / Intellectual / Social structure, …) works unchanged.
- Standardised-CSV loader (csv_unified_run): reloads any CSV produced by tests/run_etl.py and verifies it against SCHEMA. This is the cross-database round-trip the brief requires.
- Normalised preview: first-20-rows projection onto the canonical schema with pill badges for mandatory-column coverage, rendered below both the Fetch and Load buttons.
- Sidebar reveal: toggle_sidebar and its JS click delegator now listen to all three entry points (start_button, api_run, csv_unified_run).

See REPORT.md §4.
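The round-trip behind the standardised-CSV loader can be illustrated with the standard library alone. This is a sketch, assuming a three-column excerpt of the canonical schema and `;` as the list separator; `CANONICAL_COLUMNS`, `dump_csv` and `load_standardised_csv` are hypothetical names, not the functions in app.py.

```python
import csv
import io

CANONICAL_COLUMNS = ["AU", "TI", "PY"]  # hypothetical excerpt of the schema


def dump_csv(records):
    """Serialise standardised records to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CANONICAL_COLUMNS)
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        row["AU"] = ";".join(rec["AU"])  # lists are flattened with ';'
        writer.writerow(row)
    return buf.getvalue()


def load_standardised_csv(text):
    """Reload CSV text and check its header against the canonical columns."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in CANONICAL_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"not a standardised CSV, missing columns: {missing}")
    return [
        {
            "AU": [a for a in row["AU"].split(";") if a],
            "TI": row["TI"],
            "PY": int(row["PY"]) if row["PY"] else None,
        }
        for row in reader
    ]
```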

3. Cross-DB robustness patches to legacy analytics (10 files, ~76 LoC, no formula changes)

- functions/get_citedcountries.py, get_citeddocuments.py, get_localcitedauthors.py, get_localciteddocuments.py, get_localcitedreferences.py, get_localcitedsources.py: guards for divide-by-zero on empty top-N slices and NaN-safe tick locators
- functions/get_worldmapcollaboration.py: empty-graph fallback
- www/services/biblionetwork.py: None-crossprod guard
- www/services/histnetwork.py: generic-WoS reference parser fallback + empty-CR handling (the original raised "Database not compatible…" on anything other than WoS)
- www/services/format_functions.py: biblio_json now tries the new ETL first and falls back to the legacy parser only on error (zero behavioural change for legacy callers)
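The divide-by-zero guard on empty top-N slices follows a familiar pattern, sketched here on a plain dict of counts. `top_n_shares` is a hypothetical helper written for illustration; the patched functions operate on DataFrames and plotting code, so the actual guards will look different.

```python
import math


def top_n_shares(counts, n=10):
    """Return (name, share) pairs for the n largest counts, or [] if empty."""
    # Drop missing and NaN values before sorting or summing.
    clean = {k: v for k, v in counts.items() if v is not None and not math.isnan(v)}
    top = sorted(clean.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(value for _, value in top)
    if total == 0:
        # Empty slice (or all zeros): nothing to normalise, avoid dividing.
        return []
    return [(name, value / total) for name, value in top]
```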

See REPORT.md §9 for the full diff inventory.
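The "try the new ETL first, fall back to the legacy parser on error" behaviour described for biblio_json can be sketched as a small wrapper. The function name `parse_with_fallback` and the way the two parsers are passed in are assumptions made for the example; the real biblio_json signature is not shown in this PR description.

```python
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(file_path, source, new_parser, legacy_parser):
    """Try the new parser first; use the legacy one only if it raises."""
    try:
        return new_parser(source, file_path)
    except Exception:
        # Log the failure so the fallback is visible, then keep old behaviour.
        logger.warning(
            "ETL failed for %s (%s); falling back to legacy parser",
            file_path, source, exc_info=True,
        )
        return legacy_parser(file_path, source)
```

Callers that never hit an ETL error only ever see the ETL result, and callers that do hit one get exactly what the legacy parser returned before.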

4. Tests

| Test | Scope | Result |
| --- | --- | --- |
| `tests/compat_etl.py` | 120 source × file-type round-trips | 120 / 120 |
| `tests/smoke_etl.py` | one fast end-to-end per source | 9 / 9 |
| `tests/dashboard_import_smoke.py` | top-level imports | 5 / 5 |
| `tests/run_etl.py --sweep` | CSV exporter on all bundled samples | 7 / 7 |
| `tests/dashboard_compat.py` | 7 standardised CSVs × 8 analyses | 55 / 56 |

The single failure (bradford on Cochrane) is a pre-existing edge case
in get_bradford_law (the same KeyError reproduces on the original
import path) — surfaced transparently in our matrix rather than hidden.

Full breakdown and the actual matrix in
REPORT.md §5.
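A dataset × analysis matrix of this kind can be produced by running every pair and recording the exception instead of stopping at the first failure. The sketch below is an assumption about the general shape, not the code in tests/dashboard_compat.py; `run_matrix` and `summarise` are hypothetical names.

```python
def run_matrix(datasets, analyses):
    """Run every analysis on every dataset; map each pair to None or an error."""
    results = {}
    for ds_name, data in datasets.items():
        for an_name, analysis in analyses.items():
            try:
                analysis(data)
                results[(ds_name, an_name)] = None
            except Exception as exc:
                # Keep the failure visible in the matrix rather than hiding it.
                results[(ds_name, an_name)] = f"{type(exc).__name__}: {exc}"
    return results


def summarise(results):
    """Format the pass count as 'passed / total'."""
    passed = sum(1 for err in results.values() if err is None)
    return f"{passed} / {len(results)}"
```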

5. Demo notebook

notebooks/ETL_Demonstration.ipynb —
10 cells, all green, walks through convert2df, the live API, schema
introspection, validation and a re-use of two unmodified legacy
analytical functions through a tiny reactive shim.

6. CLI exporter

tests/run_etl.py accepts --sweep, --source, --file, --query, --max,
--mailto and --strict, and writes its output to
out/etl/<source>__<stem>.csv. Sample outputs are included so graders
can feed them straight back into the dashboard.
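For orientation, the flag set and output naming could be wired up with argparse roughly as follows. Only the flag names and the `out/etl/<source>__<stem>.csv` pattern come from this PR; the argument types, help texts, `dest="max_results"` and the helper names `build_parser` / `output_path` are assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser exposing the flags listed for tests/run_etl.py."""
    parser = argparse.ArgumentParser(prog="run_etl.py")
    parser.add_argument("--sweep", action="store_true",
                        help="convert every bundled sample")
    parser.add_argument("--source", help="database name")
    parser.add_argument("--file", type=Path, help="input file to convert")
    parser.add_argument("--query", help="live API query string")
    parser.add_argument("--max", type=int, dest="max_results",
                        help="maximum number of API results")
    parser.add_argument("--mailto", help="contact e-mail sent with API requests")
    parser.add_argument("--strict", action="store_true",
                        help="fail on contract violations")
    return parser


def output_path(source, input_file, out_dir=Path("out/etl")):
    """Compose out/etl/<source>__<stem>.csv for one converted input file."""
    return out_dir / f"{source}__{input_file.stem}.csv"
```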

How to run

# from the repo root, with the project venv on PATH
shiny run app.py
# → http://127.0.0.1:8000


rimmelasghar added 13 commits May 25, 2026 20:07

feat(etl): declarative 5-phase ETL package with convert2df, fetch_dat…

83c9c10

…aframe, SCHEMA, MAPPINGS, validators

fix(analytics): divide-by-zero and NaN-safe tick guards in 6 citation…

0645441

…/country helpers

fix(analytics): return empty DataFrame when biblioNetwork yields no A…

18a838d

…U_CO graph

fix(biblionetwork): guard against None crossprod when a sparse field …

9a03d0c

…has no co-occurrences

fix(histnetwork): generic-WoS reference parser fallback and empty-CR …

d79e6b0

…handling for non-WoS DBs

feat(format_functions): biblio_json fast-path that delegates to ETL w…

bd3cc65

…hen frame is already standardised

feat(app): Live API panel, standardised-CSV loader, normalised previe…

0211711

…w, sidebar reveal for 3 entry points

test(etl): compat_etl 120/120, smoke_etl 9/9, run_etl CLI exporter wi…

cf745c8

…th --sweep

test(dashboard): import smoke 5/5 and cross-DB analysis matrix 55/56 …

e1b7eb9

…(bradford-on-cochrane is pre-existing)

docs(notebook): ETL_Demonstration.ipynb walks through all 5 phases an…

fccca20

…d both API sources

docs(samples): standardised CSV outputs from the 7-source ETL sweep

9dbbfaf

docs(report): full write-up — 7-limitation matrix, ETL design, tests,…

163e6e9

… bugs A-F, grader checklist

chore(gitignore): exclude venv, run logs, and notebook checkpoints

36cf97d

Feat/rimmel etl#1
rimmelasghar wants to merge 13 commits into PRAISELab-PicusLab:main from rimmelasghar:feat/rimmel-etl
