Conversation
Pull request overview
This pull request implements performance optimizations for the DSpace import process, focusing on memory efficiency and import speed improvements.
Changes:
- Added memory profiling and performance checkpointing throughout the import process
- Implemented streaming JSONL processing for large metadata tables to reduce memory usage
- Optimized database operations including chunked exports and connection reuse
- Enhanced error handling with fail-fast behavior on critical import failures
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
Summary per file
| File | Description |
|---|---|
| src/utils.py | Added memory logging configuration parameter |
| src/repo_import.py | Added memory profiling, performance checkpointing, and changed default --resume to False |
| src/pump/_repo.py | Implemented chunked table export with progress tracking and JSONL support |
| src/pump/_metadata.py | Converted to streaming JSONL processing with tuple-based storage for memory efficiency |
| src/pump/_item.py | Added reset_progress method for progress tracking |
| src/pump/_handle.py | Added database check to skip already-imported handles |
| src/pump/_group.py | Enhanced error handling with concurrent creation detection |
| src/pump/_eperson.py | Added hydrate_uuid_map for efficient eperson mapping and netid support |
| src/pump/_db.py | Refactored connection health checks and added estimated vs exact count modes |
| src/pump/_community.py | Changed error from log-and-continue to raise exception |
| src/pump/_collection.py | Optimized to accept pre-computed collection-to-group mapping |
| src/pump/_bundle.py | Added reset_progress method |
| src/pump/_bitstream.py | Enhanced error handling with validation and fail-fast behavior |
| src/project_settings.py | Added memory log file config and changed localhost to 127.0.0.1 |
| src/dspace/_rest.py | Modified post method and added HTTP timeouts |
| scripts/start.local.dspace.db.bat | Enhanced script with environment variables and localhost binding |
| scripts/Readme.md | Added OXFORD mock setup documentation |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…d improve operator safety

- add startup configuration confirmation prompt before import begins, including key runtime flags and license URL rewrite settings
- introduce backend config options for parallel item and bundle imports and wire them through the import pipeline
- add worker-client spawning in REST layer to support safe per-thread API sessions
- improve auth resilience by retrying 401/403 responses with forced re-authentication
- add reusable parallel task runner and use it in item and bundle import flows
- optionally skip deleted bitstreams via configuration
- improve log messages
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 6 comments.
```python
import os
import argparse
import logging
import gc
import tracemalloc
```
`time.time()` / `time.perf_counter()` are used, but `time` is not imported in the shown imports. Add `import time` to avoid a runtime `NameError`.
`src/repo_import.py` (outdated diff)
```python
s = time.time()
if args.memory_profile and not tracemalloc.is_tracing():
    tracemalloc.start(25)
checkpoint_ts = time.perf_counter()
```
`time.time()` / `time.perf_counter()` are used, but `time` is not imported in the shown imports. Add `import time` to avoid a runtime `NameError`.
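The fix is a one-line import; a minimal self-contained sketch of the corrected snippet (the `memory_profile` flag stands in for `args.memory_profile`, which is not available outside the script):

```python
import time
import tracemalloc

memory_profile = True  # stands in for args.memory_profile (illustrative only)

s = time.time()
if memory_profile and not tracemalloc.is_tracing():
    tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
checkpoint_ts = time.perf_counter()
```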
```python
if memory_log_file:
    memory_log_dir = os.path.dirname(memory_log_file)
    os.makedirs(memory_log_dir, exist_ok=True)
```
`os.path.dirname(memory_log_file)` can be an empty string when `memory_log_file` is just a filename (no directory part), and `os.makedirs('')` raises `FileNotFoundError`. Guard with `if memory_log_dir:` before calling `os.makedirs`, or default to `'.'`.
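A minimal sketch of the suggested guard, wrapped in a helper (`ensure_log_dir` is an illustrative name, not part of the PR):

```python
import os

def ensure_log_dir(memory_log_file: str) -> None:
    """Create the log file's parent directory, tolerating bare filenames."""
    memory_log_dir = os.path.dirname(memory_log_file)
    if memory_log_dir:  # '' means the file lives in the current directory
        os.makedirs(memory_log_dir, exist_ok=True)
```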
```python
self._col2group[int(m.group(1))] = meta['resource_id']
if col2group is not None:
    self._col2group = col2group
else:
```
`metadata_file_str` is now optional, but it is still required when `col2group` is not provided. If `metadata_file_str` is `None`, this will call `read_json(None)` and fail unclearly. Consider validating inputs (raise a clear `ValueError`) when `col2group` is `None` and `metadata_file_str` is missing.
Suggested change:

```python
else:
    if metadata_file_str is None:
        raise ValueError(
            "metadata_file_str must be provided when col2group is None"
        )
```
`scripts/init.dspacedb5.sh` (outdated diff)
```diff
 run_with_retry "createdb clarin-dspace" createdb --username=postgres --owner=dspace --encoding=UNICODE clarin-dspace
 run_with_retry "psql import clarin-dspace" bash -lc "psql -U postgres clarin-dspace < ../dump/clarin-dspace.sql > /dev/null 2>&1"
 run_with_retry "psql import clarin-dspace log" bash -lc "psql -U postgres clarin-dspace < ../dump/clarin-dspace.sql > ./__clarin-dspace.log 2>&1"

 echo "Importing clarin-utilities"
-createdb --username=postgres --encoding=UNICODE clarin-utilities
-psql -U postgres clarin-utilities < ../dump/clarin-utilities.sql &> /dev/null
-psql -U postgres clarin-utilities < ../dump/clarin-utilities.sql &> ./__clarin-utilities.log
+run_with_retry "createdb clarin-utilities" createdb --username=postgres --encoding=UNICODE clarin-utilities
+run_with_retry "psql import clarin-utilities" bash -lc "psql -U postgres clarin-utilities < ../dump/clarin-utilities.sql > /dev/null 2>&1"
+run_with_retry "psql import clarin-utilities log" bash -lc "psql -U postgres clarin-utilities < ../dump/clarin-utilities.sql > ./__clarin-utilities.log 2>&1"
```
Each SQL dump is imported twice (once to `/dev/null`, then again to a log file), which is slow and can break if the dump isn't idempotent. Prefer a single import that logs output (e.g., redirect once to the log, optionally tail or print a summary on failure).
```python
expected = len([b for b in (self._bs or []) if not (
    skip_deleted and b.get('deleted', False))])
```
This builds a full filtered list just to compute `expected`, which can be expensive for large bitstream tables. Use a generator-based count (e.g., `sum(1 for b in ...)`) to avoid allocating a large intermediate list.
Suggested change:

```python
expected = sum(
    1
    for b in (self._bs or [])
    if not (skip_deleted and b.get('deleted', False))
)
```
…n performance

- add total runtime to final profile checkpoints with human-readable duration
- improve console log UX: 4-letter levels, status colorization (OK/WARN/CRIT), PyCharm ANSI handling
- remove redundant SQL fetch progress at INFO and move SQL-validation step telemetry to DEBUG
- classify table-length diffs by severity: CRIT when v7 < v5, WARN otherwise
- standardize status lines and alignment for import/test comparison output
- optimize metadatavalue normalization hot path by reducing per-row overhead and lookup costs
- add/adjust validation coverage for resourcepolicy and epersongroup2eperson paths
- improve sequence drift logging readability and bitstream/resourcepolicy progress reporting
Pull request overview
Copilot reviewed 25 out of 26 changed files in this pull request and generated 4 comments.
```python
try:
    _logger.info(
        "Press ENTER to continue if this configuration is OK (Ctrl+C to abort)...")
    input()
except EOFError:
    _logger.warning(
        "No interactive input available; continuing without confirmation keypress.")
```
The unconditional `input()` can block indefinitely in non-interactive runs (e.g., CI, or redirected stdin that isn't at EOF). Consider guarding with `sys.stdin.isatty()` (or a `--no-confirm` / `--non-interactive` flag) and skipping the prompt when not interactive.
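One way to combine the tty guard with the existing `EOFError` handling — a sketch, not the PR's implementation (`confirm_or_continue` is an illustrative name):

```python
import sys

def confirm_or_continue(logger) -> None:
    """Prompt for ENTER only when stdin is interactive; never block CI runs."""
    if not sys.stdin.isatty():
        logger.warning("Non-interactive stdin; skipping confirmation prompt.")
        return
    try:
        logger.info(
            "Press ENTER to continue if this configuration is OK (Ctrl+C to abort)...")
        input()
    except EOFError:
        logger.warning(
            "No interactive input available; continuing without confirmation keypress.")
```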
```python
def _update_progress(pbar_ref):
    """Centralized progress-bar postfix update."""
    pbar_ref.set_postfix(
        imported=self._imported['bitstream'],
        skipped_deleted=skipped_deleted,
        resumed=skipped_already_imported,
        errored=errored,
        checkpoints=checkpoints_saved,
        to_checkpoint=checkpoint_every - checkpoint_counter,
    )
```
`progress_bar()` can return a plain iterator (when `len(arr) < 2`), which does not have `set_postfix()`. This will raise `AttributeError` for very small bitstream datasets (0 or 1 elements) because `_update_progress()` is called unconditionally later. Make `_update_progress()` resilient (e.g., `if hasattr(pbar_ref, "set_postfix"):`) or ensure a tqdm instance is always used when postfix updates are needed.
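A minimal sketch of the resilient variant; `_FakeBar` below is a stand-in for a tqdm bar used only to demonstrate the behavior:

```python
def update_progress(pbar_ref, **stats) -> None:
    """Set the postfix only when supported; progress_bar() may return a
    plain iterator (no set_postfix) for inputs with fewer than 2 elements."""
    if hasattr(pbar_ref, "set_postfix"):
        pbar_ref.set_postfix(**stats)

class _FakeBar:  # minimal tqdm-like stand-in for the demo
    def set_postfix(self, **kwargs):
        self.postfix = kwargs

update_progress(iter([]), imported=0)  # plain iterator: silently a no-op
bar = _FakeBar()
update_progress(bar, imported=3, errored=1)
```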
```python
def run_tasks(tasks, worker, workers: int = 1, desc: str = None, on_result=None):
    tasks = list(tasks or [])
    workers = max(1, int(workers or 1))
```
Converting `tasks` to a list unconditionally duplicates memory and defeats streaming for large iterables (which matters for migration workloads). Consider only materializing when necessary (e.g., if `tasks` isn't a `Sequence` and you need `len()` or multiple passes), or accept a `total` parameter and iterate without copying.
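A hypothetical sketch of the streaming variant (the `total` parameter and this body are assumptions, not the PR's actual code):

```python
from collections.abc import Sequence
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

def run_tasks(tasks, worker, workers: int = 1,
              total: Optional[int] = None, on_result=None):
    """Stream tasks to workers without copying the iterable into a list.

    `total` lets callers pass the size of large generators (e.g. for a
    progress bar); len() is taken only when the input is already a Sequence.
    """
    workers = max(1, int(workers or 1))
    if total is None and isinstance(tasks, Sequence):
        total = len(tasks)  # cheap for lists/tuples; copies nothing
    results = []
    collect = on_result if on_result is not None else results.append
    if workers == 1:
        for task in tasks:  # single lazy pass over the iterable
            collect(worker(task))
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Note: Executor.map still submits everything up front; truly
            # bounded submission would batch the iterable (e.g. via islice).
            for result in pool.map(worker, tasks):
                collect(result)
    return results
```

`pool.map` keeps result order, so the multi-worker path returns results in task order even though execution interleaves.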
```diff
 def init_logging(
         logger,
         log_file: str,
+        memory_log_file: str = None,
         console_level=logging.INFO,
         file_level=logging.INFO,
-        format: str = '%(asctime)s:%(levelname)s: %(message)s'):
+        format: str = '%(asctime)s:%(levelname).4s: %(message)s'):
```
`memory_log_file: str = None` should be typed as `Optional[str]` (and `format` is oddly indented compared to the other parameters). Adjusting the typing and formatting will improve readability and static analysis without changing behavior.
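The corrected signature would then look like this (body elided; only the annotation and alignment change):

```python
import logging
from typing import Optional

def init_logging(
        logger,
        log_file: str,
        memory_log_file: Optional[str] = None,
        console_level=logging.INFO,
        file_level=logging.INFO,
        format: str = '%(asctime)s:%(levelname).4s: %(message)s'):
    ...  # body unchanged from the PR
```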
Summary
This PR improves migration reliability, observability, and performance in the import pipeline, with a strong focus on validation signal quality and long-running operation diagnostics.
What Changed
1) Runtime profiling and duration accuracy
- `src/repo_import.py`: added `total_elapsed` to final profile checkpoint logging, with a human-readable duration (e.g. `1d 03:12:45`).

2) Logging UX and status semantics
- `src/utils.py`: console colorization (`colorama` support), 4-letter level names (`DEBU`, `INFO`, `WARN`, `CRIT`), status colors: `CRIT` -> red, `WARN` -> orange, `OK` -> green.
- `src/pump/_utils.py`: status is `OK` when expected == imported, `WARN` otherwise.

3) Validation severity and signal quality
- `src/pump/_db.py`: table-length diffs classified by severity: `CRIT` when `v7 < v5`, `WARN` for other mismatches, `OK` when equal.
- item-flag comparisons (`withdrawn`, `in_archive`, `discoverable`, `deleted`) with true/false breakdowns.
- removed redundant SQL fetch progress at `INFO`; moved fetch/normalize/set-diff step logs to `DEBUG` (e.g. `preparing fetch v5 and v7`).

4) Metadata normalization performance improvements
- `src/pump/_metadata.py`: optimized the `_metadatavalue_process` hot path (reduced per-row overhead and lookup costs).

5) Import progress and resiliency enhancements
- `src/pump/_bitstream.py`: richer progress reporting (`skipped_deleted`, `resumed`, checkpoints).
- `src/pump/_resourcepolicy.py`: improved progress reporting.
- `src/pump/_eperson.py`: added validation coverage for `epersongroup2eperson`.
- `src/pump/_sequences.py`: improved sequence drift logging readability.

6) Dependencies
- `requirements.txt`: added `colorama>=0.4.6` for better ANSI support on Windows/PyCharm. Note: the `pysolr` line currently appears without a version pin in this branch.

Results / Impact
- regressions (`v7 < v5`) now stand out as `CRIT`.

Validation Performed
- `autopep8` -> passed
- `ruff` -> passed

Commit
`671ebb6` feat(import): improve validation signals, profiling, and normalization performance