Feat/rimmel etl#1

Open
rimmelasghar wants to merge 13 commits into PRAISELab-PicusLab:main from rimmelasghar:feat/rimmel-etl

@rimmelasghar rimmelasghar commented May 25, 2026

Group Details:

  • Syed Muhammad Rimmel Asghar (D03000237)
  • Muhammad Bisham Khan (D03000239)
  • Muzammil Nisar (D03000277)

Overview

This PR refactors bibliometrix-python to reach the Advanced level of
the UniNa Hardware & Software MOD-B assignment. The legacy per-database
parsing chain (format_functions.biblio_json plus a long if source == "…"
ladder) is replaced by a single declarative ETL with one public entry
point, convert2df(source, file_path, file_type=None), which always
returns the same 35-column WoS-style DataFrame regardless of the input
database or file type.

The full write-up is in REPORT.md (10 sections,
including a matrix that maps each of the seven limitations of the
original Python implementation to the concrete ETL element that
resolves it).

What's inside

1. New ETL package — www/services/etl/

Five-phase pipeline (extract → rename → cast → derive → validate)
driven by two declarative dicts:

  • SCHEMA — the 35-column WoS-style contract (type + mandatory per column)
  • MAPPINGS[(source, file_type)] — per-source recipes (src, list, sep, cast, builder, default)
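A minimal sketch of how a declarative pipeline of this shape could work. It is not the code in www/services/etl/: the four-column SCHEMA, the single MAPPINGS entry, the "toy_source" name, the FIRST_AU column and run_pipeline are all hypothetical, and plain dicts stand in for DataFrame rows so the sketch needs no dependencies.

```python
# Hypothetical miniature of a declarative ETL; NOT the code in www/services/etl/.
# Plain dicts stand in for DataFrame rows so the sketch needs no dependencies.

# Contract: column -> (type, mandatory). The real SCHEMA has 35 columns.
# The type is recorded here but not enforced by this sketch.
SCHEMA = {
    "AU": (list, True),         # authors
    "TI": (str, True),          # title
    "PY": (int, False),         # publication year
    "FIRST_AU": (str, False),   # invented derived column, for illustration only
}

# Recipe per (source, file_type): target column -> how to obtain it.
MAPPINGS = {
    ("toy_source", "csv"): {
        "AU": {"src": "authors", "list": True, "sep": ";"},
        "TI": {"src": "title"},
        "PY": {"src": "year", "cast": int, "default": None},
        "FIRST_AU": {"builder": lambda row: row["AU"][0] if row["AU"] else None},
    },
}


def run_pipeline(source, file_type, raw_rows):
    """Run rename, cast, derive and validate on already-extracted rows."""
    recipe = MAPPINGS[(source, file_type)]
    out = []
    for raw in raw_rows:
        row = {}
        for col, rule in recipe.items():
            if "builder" in rule:
                continue                                   # handled in derive
            value = raw.get(rule["src"])                   # rename
            if value in (None, ""):
                value = rule.get("default")                # fill gaps
            elif rule.get("list"):                         # cast: split a list
                value = [p.strip() for p in value.split(rule["sep"]) if p.strip()]
            elif "cast" in rule:                           # cast: scalar
                value = rule["cast"](value)
            row[col] = value
        for col, rule in recipe.items():                   # derive
            if "builder" in rule:
                row[col] = rule["builder"](row)
        out.append(row)
    for col, (_kind, mandatory) in SCHEMA.items():         # validate
        if mandatory and any(r.get(col) in (None, "", []) for r in out):
            raise ValueError(f"mandatory column {col!r} has missing values")
    return out
```

The point of the design is that supporting another database means adding one MAPPINGS entry rather than another branch in an if/elif ladder.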

Public API:

  • convert2df(source, file_path, file_type=None, validate_strict=False)
  • fetch_dataframe(source, query, *, max_results, **kwargs) — OpenAlex + PubMed
  • validate(df, strict=False) — contract check with per-column NA report
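The contract check with a per-column NA report can be pictured roughly as follows. This is a sketch under assumptions, not the PR's validate: na_report, its report layout and the column-to-mandatory schema dict are hypothetical, and rows are plain dicts rather than a DataFrame.

```python
def na_report(rows, schema, strict=False):
    """Count missing values per schema column.

    `schema` maps column name -> mandatory flag (hypothetical layout).
    With strict=True, missing values in a mandatory column raise ValueError;
    otherwise the report is returned and the caller decides what to do.
    """
    report = {}
    for col, mandatory in schema.items():
        missing = sum(1 for r in rows if r.get(col) in (None, "", []))
        report[col] = {"mandatory": mandatory, "missing": missing, "total": len(rows)}
    gaps = [c for c, info in report.items() if info["mandatory"] and info["missing"]]
    if strict and gaps:
        raise ValueError(f"mandatory columns with missing values: {gaps}")
    return report
```

A non-strict default suits a dashboard, where a partially filled record set is still worth showing, while the strict mode suits batch exports that must fail loudly.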

See REPORT.md §3.

2. Dashboard wiring — app.py

  • Live API panel (api_run_handler) — OpenAlex / PubMed query box
    that pipes results through convert2df and into the existing df
    reactive value, so every downstream analytical panel (Overview,
    Sources, Authors, Conceptual / Intellectual / Social structure, …)
    works unchanged.
  • Standardised-CSV loader (csv_unified_run) — reloads any CSV
    produced by tests/run_etl.py and verifies it against SCHEMA. This
    is the cross-database round-trip the brief requires.
  • Normalised preview — first-20-rows projection onto the canonical
    schema with pill badges for mandatory-column coverage, rendered below
    both the Fetch and Load buttons.
  • Sidebar reveal: toggle_sidebar and its JS click delegator now
    listen to all three entry points (start_button, api_run,
    csv_unified_run).

See REPORT.md §4.
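As an illustration of the round-trip idea behind the standardised-CSV loader, the sketch below writes rows with a fixed header, reloads them, and reports any header drift. It assumes a three-column stand-in for the real contract; EXPECTED_COLUMNS, write_standardised and load_and_check are hypothetical names, and the standard csv module is used in place of whatever the dashboard actually calls.

```python
import csv
import io

# Hypothetical three-column stand-in for the 35-column contract.
EXPECTED_COLUMNS = ["AU", "TI", "PY"]


def write_standardised(rows, handle, columns=EXPECTED_COLUMNS):
    """Write rows as CSV with exactly the expected header."""
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)


def load_and_check(handle, columns=EXPECTED_COLUMNS):
    """Reload a CSV and report header drift against the expected columns."""
    reader = csv.DictReader(handle)
    found = reader.fieldnames or []
    missing = [c for c in columns if c not in found]
    unexpected = [c for c in found if c not in columns]
    return list(reader), missing, unexpected


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_standardised([{"AU": "ROSSI A;BIANCHI B", "TI": "A STUDY", "PY": 2020}], buffer)
buffer.seek(0)
rows, missing, unexpected = load_and_check(buffer)
```

Values come back from CSV as strings (PY is reloaded as "2020"), so a reload path needs a cast step before the data matches the typed contract again.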

3. Cross-DB robustness patches to legacy analytics (10 files, ~76 LoC, no formula changes)

  • functions/get_citedcountries.py, get_citeddocuments.py,
    get_localcitedauthors.py, get_localciteddocuments.py,
    get_localcitedreferences.py, get_localcitedsources.py: guards
    for divide-by-zero on empty top-N slices and NaN-safe tick locators
  • functions/get_worldmapcollaboration.py: empty-graph fallback
  • www/services/biblionetwork.py: None-crossprod guard
  • www/services/histnetwork.py: generic-WoS reference parser
    fallback + empty-CR handling (the original raised
    "Database not compatible…" on anything other than WoS)
  • www/services/format_functions.py: biblio_json now tries the new
    ETL first and falls back to the legacy parser only on error
    (zero behavioural change for legacy callers)
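The divide-by-zero guards on empty top-N slices follow a familiar pattern, sketched here in isolation. top_n_shares is a hypothetical helper, not one of the patched functions, and it works on a plain dict of counts rather than the DataFrame columns the real analytics use.

```python
def top_n_shares(counts, n=10):
    """Return (label, share) pairs for the n largest counts.

    An empty input, or a slice whose counts sum to zero, yields an empty
    list instead of raising ZeroDivisionError.
    """
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(value for _, value in top)
    if total == 0:
        return []          # nothing to normalise; the caller can skip the plot
    return [(label, value / total) for label, value in top]
```

Returning an empty result leaves the formula untouched for every non-empty case, which matches the "no formula changes" constraint stated above.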

See REPORT.md §9 for the full diff inventory.
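The "new ETL first, legacy parser only on error" behaviour described for biblio_json is a try-then-fall-back wrapper. A generic sketch, assuming nothing about the real function bodies: parse_with_fallback and its two callables are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(primary, legacy, *args, **kwargs):
    """Call `primary`; if it raises, log the error and call `legacy` instead."""
    try:
        return primary(*args, **kwargs)
    except Exception:
        # Keep the traceback in the log so failures of the new path stay visible.
        logger.warning("primary parser failed; using legacy parser", exc_info=True)
        return legacy(*args, **kwargs)
```

Logging the failure matters here: a silent fallback would hide regressions in the new path behind the old one.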

4. Tests

Test                                                              Result
tests/compat_etl.py: 120 source × file-type round-trips           120 / 120
tests/smoke_etl.py: one fast end-to-end per source                9 / 9
tests/dashboard_import_smoke.py: top-level imports                5 / 5
tests/run_etl.py --sweep: CSV exporter on all bundled samples     7 / 7
tests/dashboard_compat.py: 7 standardised CSVs × 8 analyses       55 / 56

The single failure (bradford on Cochrane) is a pre-existing edge case
in get_bradford_law (the same KeyError reproduces on the original
import path) — surfaced transparently in our matrix rather than hidden.

Full breakdown and the actual matrix in
REPORT.md §5.
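A compatibility matrix that records failures instead of hiding them can be built with a small harness like the one below. This is a sketch of the general idea only: run_matrix, the dataset dicts and the analysis callables are hypothetical and are not taken from tests/dashboard_compat.py.

```python
def run_matrix(datasets, analyses):
    """Run every analysis on every dataset and record the outcome per cell.

    Returns {(dataset_name, analysis_name): "ok" or "<ErrorType>: <message>"}.
    An exception in one cell never stops the remaining cells from running.
    """
    results = {}
    for d_name, data in datasets.items():
        for a_name, func in analyses.items():
            try:
                func(data)
                results[(d_name, a_name)] = "ok"
            except Exception as exc:
                results[(d_name, a_name)] = f"{type(exc).__name__}: {exc}"
    return results
```

Counting the "ok" cells gives a score in the same passed / total form as the figures reported above, and the failing cells keep their error type for triage.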

5. Demo notebook

notebooks/ETL_Demonstration.ipynb: 10 cells, all green. It walks
through convert2df, the live API, schema introspection, validation, and
a re-use of two unmodified legacy analytical functions through a tiny
reactive shim.

6. CLI exporter

tests/run_etl.py: --sweep / --source / --file / --query /
--max / --mailto / --strict. Writes
out/etl/<source>__<stem>.csv. Sample outputs are included so graders
can feed them straight back into the dashboard.
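For orientation, the flag set and output naming listed above could be wired up with argparse roughly like this. Only the flag names and the out/etl/<source>__<stem>.csv pattern come from this PR; the argument types, the default for --max, the help texts, and the build_parser and output_path helpers are assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the flags named in the PR (types are assumed)."""
    parser = argparse.ArgumentParser(description="Hypothetical ETL exporter CLI")
    parser.add_argument("--sweep", action="store_true", help="process all bundled samples")
    parser.add_argument("--source", help="database name")
    parser.add_argument("--file", help="input file path")
    parser.add_argument("--query", help="query string for a live API fetch")
    parser.add_argument("--max", type=int, default=100, help="maximum results (assumed default)")
    parser.add_argument("--mailto", help="contact e-mail passed along with API requests")
    parser.add_argument("--strict", action="store_true", help="fail on contract violations")
    return parser


def output_path(source, file_path, out_dir="out/etl"):
    """Compose <out_dir>/<source>__<stem>.csv from the source and input file."""
    return Path(out_dir) / f"{source}__{Path(file_path).stem}.csv"


args = build_parser().parse_args(["--source", "scopus", "--file", "data/sample.bib"])
target = output_path(args.source, args.file)
```

Encoding both the source and the input stem in the file name keeps exports from different databases from overwriting each other in one folder.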

How to run

# from the repo root, with the project venv on PATH
shiny run app.py
# → http://127.0.0.1:8000
