MVP 2 Release: Merge dev into main#30

Merged
karim-sharkawy merged 89 commits into main from dev on Jun 12, 2026

@karim-sharkawy karim-sharkawy commented Jun 12, 2026

Collaborator

Contributors

Karim El-Sharkawy

Responsible for topic modeling, emotion classification, project architecture, repository organization, NLP pipeline development, documentation, Streamlit webapp development, dependency management, type safety improvements, and overall project direction.

Ardavan Shahrabi

Responsible for dynamic topic modeling, visualization systems, pipeline stabilization, CI/CD improvements, preprocessing modernization, documentation, LDA improvements, dependency cleanup, testing infrastructure, environment standardization, bug fixes, and code quality enhancements.

Luis Ticas

Provided technical guidance, architecture discussions, troubleshooting support, testing assistance, design reviews, and implementation feedback throughout development. While not directly represented in commit authorship, Luis contributed significantly to many of the improvements included in this merge and helped shape several of the architectural and reliability decisions adopted in the project.

This pull request consolidates approximately five months (January 12, 2026 – June 8, 2026) of development work including 75 commits across 3 contributors. Due to the breadth of changes, this PR should be reviewed at the system, feature, and architecture level rather than commit-by-commit.

Executive Summary

MVP 2 transforms the project from a research prototype into a maintainable, reproducible NLP platform through architectural modernization, CI/CD, testing infrastructure, dataset registry management, Streamlit visualization tooling, and significant pipeline reliability improvements.

Major improvements include:

  • Complete Streamlit visualization platform
  • Production-grade pipeline architecture
  • Dataset registry system
  • Centralized runtime and environment management
  • Structured logging
  • Extensive testing infrastructure
  • CI/CD automation
  • Type annotations and static analysis
  • Dependency modernization
  • Topic modeling improvements
  • Emotion analysis enhancements
  • Large-scale documentation expansion
  • Significant bug fixes and pipeline stabilization

Major Architecture Initiative: Production-Grade Pipeline Foundation

One of the largest changes included in this merge was a complete architectural modernization of the pipeline.

NLP and Topic Modeling Improvements

Major improvements include:

  • BERTopic grid search tooling
  • Dynamic Topic Modeling integration
  • Separate DTM workflows
  • Improved small-dataset handling
  • LDA cleanup and modernization
  • Emotion analysis integration
  • Improved keyword filtering
  • Cohere integration fixes

Runtime Configuration System

New runtime.py

Introduced a centralized runtime management system built around:

load_runtime()

which returns a typed RuntimeConfig object containing canonical paths for:

  • Data directories
  • Processed data directories
  • Output directories
  • Visualization directories
  • Model directories

Improvements

  • Automatic AzureML environment detection
  • Consistent path management
  • Local development support
  • Elimination of duplicated environment-loading code previously scattered across scripts
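
A minimal sketch of what a centralized loader of this kind could look like. Only `load_runtime()` and `RuntimeConfig` are named in this PR; the field names, the `PROJECT_ROOT` override, the use of `DATA_DIR` and `PROCESSED_DATA_DIR` as environment variables, and the use of `AZUREML_RUN_ID` as the AzureML detection signal are all assumptions made for illustration.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RuntimeConfig:
    """Canonical paths for one run (field names are illustrative)."""

    data_dir: Path
    processed_data_dir: Path
    output_dir: Path
    visualization_dir: Path
    model_dir: Path
    on_azureml: bool


def load_runtime(env: dict[str, str] | None = None) -> RuntimeConfig:
    """Build the path set once so scripts stop re-deriving it themselves."""
    env = dict(os.environ) if env is None else env
    # Assumption: the presence of AZUREML_RUN_ID means "running on AzureML".
    on_azureml = "AZUREML_RUN_ID" in env
    # Assumption: PROJECT_ROOT is a hypothetical override; default to the cwd.
    root = Path(env.get("PROJECT_ROOT", Path.cwd()))
    return RuntimeConfig(
        data_dir=Path(env.get("DATA_DIR", root / "data")),
        processed_data_dir=Path(env.get("PROCESSED_DATA_DIR", root / "data" / "processed")),
        output_dir=root / "outputs",
        visualization_dir=root / "outputs" / "visualizations",
        model_dir=root / "models",
        on_azureml=on_azureml,
    )
```

A script would then call `load_runtime()` once at startup and pass the resulting object around, rather than reading environment variables itself.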

Structured Logging

New logging_config.py

Replaced scattered print statements throughout the codebase with centralized structured logging.

Benefits include:

  • Timestamped logs
  • Log levels
  • Consistent formatting
  • Improved debugging
  • Improved observability
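
For illustration, a centralized logging setup along these lines can be sketched with the standard `logging` module. The function name `configure_logging`, the logger name `pipeline`, and the format string are hypothetical, not taken from `logging_config.py`.

```python
import io
import logging


def configure_logging(level: int = logging.INFO, stream=None) -> logging.Logger:
    """Attach one timestamped handler to a project logger (name is hypothetical)."""
    logger = logging.getLogger("pipeline")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(stream)  # writes to stderr when stream is None
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)-8s | %(name)s | %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger's handlers
    return logger


# Usage: what used to be print("Loaded 3 datasets") becomes a leveled log call.
buffer = io.StringIO()
log = configure_logging(stream=buffer)
log.info("Loaded %d datasets", 3)
log.debug("not emitted at INFO level")
```

Because every module asks for the same configured logger, timestamps, levels, and formatting stay consistent without each script setting them up.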

Dataset Registry System

New Dataset Configuration Architecture

Introduced:

src/config/datasets.yaml

and typed dataset specifications.

Previous Approach

Pipeline stages previously relied on logic such as:

"twitter" in filename.lower()

to determine processing behavior.

New Approach

Datasets are now defined through registry entries containing:

  • Filename patterns
  • Text columns
  • Timestamp columns
  • Topic modeling profiles
  • Emotion classification profiles

Benefits

Adding a new dataset now requires:

  1. Adding a YAML configuration entry
  2. Running the pipeline

No code modifications are required.
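
The registry lookup can be sketched as follows. The `DatasetSpec` class, its field names, and every value in `RAW_REGISTRY` are hypothetical; the dictionary stands in for the parsed contents of `src/config/datasets.yaml` so the example runs without a YAML parser.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class DatasetSpec:
    """Typed view of one registry entry (field names are illustrative)."""

    name: str
    filename_pattern: str
    text_column: str
    timestamp_column: str
    topic_profile: str
    emotion_profile: str


# Stand-in for the parsed YAML file; all keys and values are made up.
RAW_REGISTRY = {
    "twitter": {
        "filename_pattern": "*twitter*.csv",
        "text_column": "text",
        "timestamp_column": "created_at",
        "topic_profile": "short_text",
        "emotion_profile": "default",
    },
    "reddit": {
        "filename_pattern": "*reddit*.csv",
        "text_column": "body",
        "timestamp_column": "created_utc",
        "topic_profile": "small_dataset",
        "emotion_profile": "default",
    },
}

REGISTRY = [DatasetSpec(name=name, **fields) for name, fields in RAW_REGISTRY.items()]


def resolve_dataset(filename: str) -> DatasetSpec:
    """Return the first registry entry whose pattern matches the file name."""
    for spec in REGISTRY:
        if fnmatch(filename.lower(), spec.filename_pattern):
            return spec
    raise KeyError(f"No dataset registry entry matches {filename!r}")
```

With this shape, onboarding a dataset means adding one more entry to the configuration. An unknown file raises an error instead of silently falling through to default behavior.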


Shared IO Utilities

Introduced:

io_helpers.py

Including:

  • require_columns()
  • drop_missing_text()
  • safe_write_csv()

Benefits:

  • Early validation failures
  • Improved error visibility
  • Prevention of silent failures
  • Consistent file-writing behavior
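
A rough sketch of helpers with these names, written against the standard library so it runs on its own. The signatures are assumptions; the project's versions may well operate on pandas DataFrames rather than lists of row dictionaries, and the write-then-rename strategy in `safe_write_csv()` is one plausible reading of "safe", not a confirmed detail.

```python
from __future__ import annotations

import csv
import os
import tempfile
from pathlib import Path


def require_columns(columns: list[str], required: list[str], source: str) -> None:
    """Fail fast, naming the missing columns, instead of failing later mid-run."""
    missing = [col for col in required if col not in columns]
    if missing:
        raise ValueError(f"{source}: missing required columns {missing}")


def drop_missing_text(rows: list[dict[str, str]], text_column: str) -> list[dict[str, str]]:
    """Keep only rows whose text field is present and non-blank."""
    return [row for row in rows if (row.get(text_column) or "").strip()]


def safe_write_csv(rows: list[dict[str, str]], columns: list[str], path: Path) -> None:
    """Write to a temp file in the target directory, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_name, path)  # readers never see a half-written file
    except BaseException:
        os.unlink(tmp_name)  # clean up the partial temp file before re-raising
        raise
```

Validating columns up front turns a late, confusing failure into an immediate error that names the offending file and columns.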

Raw Data Protection

A major reliability issue was resolved.

Previous Behavior

Several pipeline stages overwrote raw CSV files directly.

Affected components included:

  • data_preprocessing.py
  • topic_modeling.py

A failed run could permanently alter source data.

New Behavior

Pipeline stages now:

  • Read from DATA_DIR
  • Write to PROCESSED_DATA_DIR

Raw inputs remain untouched.

Validation

Added smoke tests that:

  • Hash raw files before execution
  • Hash raw files after execution
  • Fail if any modification occurs

Any stage that writes to a raw file now fails the test suite.
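
The hash-before, hash-after check can be sketched like this. The helper names `snapshot_hashes` and `assert_raw_untouched` are hypothetical, and SHA-256 is an assumed choice of hash; the PR does not say which one the smoke tests use.

```python
from __future__ import annotations

import hashlib
from pathlib import Path
from typing import Callable


def snapshot_hashes(directory: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    return {
        str(path.relative_to(directory)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def assert_raw_untouched(raw_dir: Path, run_pipeline: Callable[[], None]) -> None:
    """Run the pipeline and fail if any raw file was added, removed, or changed."""
    before = snapshot_hashes(raw_dir)
    run_pipeline()
    after = snapshot_hashes(raw_dir)
    assert before == after, "raw data was modified during the run"
```

Comparing whole snapshots rather than individual digests also catches files that a stage creates or deletes inside the raw directory.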


Critical Pipeline Reliability Fixes

A broad stabilization effort fixed multiple issues that broke end-to-end execution of the pipeline. For the full list of changes, see commit dcc57fe97017eb4dcfd431e0d2eeec7933f856e3.

Emotion Classification

  • Resolved configuration and environment-loading issues.
  • Fixed dataset/output directory handling and summary reporting.
  • Improved execution reliability and logging.

Topic Modeling

  • Restored safe Cohere integration to prevent runtime failures.
  • Fixed dataset name detection and parameter selection, enabling dataset-specific tuning.
  • Improved topic assignment stability by handling missing-text rows correctly.
  • Eliminated duplicate dataset discovery and processing.
  • Corrected Reddit and Twitter processing/output issues.
  • Removed leftover notebook commands that caused LDA execution failures.

Result: More reliable dataset processing, correct configuration behavior, improved topic assignment stability, and fewer runtime failures.


Pipeline Component Modernization

Data Preprocessing

Refactored to:

  • Use dataset registry
  • Use shared utility functions
  • Eliminate duplicated prefix stripping
  • Write outputs to processed-data locations

Topic Modeling

Refactored to:

  • Use load_runtime()
  • Use registry-defined topic profiles
  • Read processed data instead of raw inputs
  • Use safe write operations
  • Remove hidden file mutation side effects

Emotion Classification

Modernized to:

  • Use registry-defined emotion profiles
  • Load models only once per profile
  • Use structured logging
  • Use safe output writing
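
One common way to load a model only once per profile is to memoize the loader. The sketch below uses `functools.lru_cache`. The function names, the placeholder "model", and the call counter are illustrative only, and the PR does not state which caching mechanism the project uses.

```python
from __future__ import annotations

from functools import lru_cache

LOAD_COUNT = {"calls": 0}  # instrumentation for this example only


@lru_cache(maxsize=None)
def load_emotion_model(profile: str) -> dict[str, str]:
    """Stand-in for an expensive model load; real code would build a classifier."""
    LOAD_COUNT["calls"] += 1
    return {"profile": profile}  # placeholder for a model object


def classify_files(files_by_profile: list[tuple[str, str]]) -> list[str]:
    """Process each (filename, profile) pair, reusing the cached model per profile."""
    results = []
    for filename, profile in files_by_profile:
        model = load_emotion_model(profile)  # cache hit after the first file
        results.append(f"{filename}:{model['profile']}")
    return results
```

With three files that share two profiles, the loader body runs twice rather than three times.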

Emotion Visualizations

Improved workflow by:

  • Reading classified outputs directly
  • Detecting existing emotion labels
  • Avoiding unnecessary reclassification

This removes redundant model execution and prevents inconsistent label taxonomies.


Utility Script Improvements

Reddit Filtering

  • Replaced broad exception handling
  • Improved error visibility

Twitter Cleaner

  • Removed unnecessary Colab dependencies
  • Improved error handling

Twitter Utilities

Rebuilt:

  • twitter_chunks.py
  • twitter_sample.py

as proper command-line tools using:

argparse

instead of notebook-style execution.
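
As an illustration of the notebook-to-CLI conversion, a sampling tool might expose its inputs as shown below. The argument names, defaults, and description are hypothetical and are not taken from `twitter_sample.py`.

```python
from __future__ import annotations

import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the tool's inputs explicitly instead of editing notebook cells."""
    parser = argparse.ArgumentParser(
        prog="twitter_sample.py",
        description="Draw a random sample of rows from a Twitter CSV.",
    )
    parser.add_argument("input", type=Path, help="path to the source CSV")
    parser.add_argument("--output", type=Path, required=True, help="where to write the sample")
    parser.add_argument("--n", type=int, default=1000, help="number of rows to sample")
    parser.add_argument("--seed", type=int, default=42, help="random seed for reproducibility")
    return parser


def main(argv: list[str] | None = None) -> argparse.Namespace:
    """Parse arguments; passing argv explicitly keeps the tool easy to test."""
    args = build_parser().parse_args(argv)
    # The sampling logic would go here; returning args keeps the sketch small.
    return args
```

Under these assumptions the tool would be invoked as `python twitter_sample.py tweets.csv --output sample.csv --n 500`, and `--help` documents the options automatically.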


Streamlit WebApp

A complete Streamlit web interface was introduced to replace the previous, overly complex one. The full code is in src/webui/app.


Testing Infrastructure

A comprehensive testing framework was added.

Test Coverage

Introduced:

  • Runtime configuration tests
  • Dataset registry tests
  • Dataset loading tests
  • IO helper tests
  • Preprocessing tests
  • Smoke tests

Coverage

33 unit and smoke tests covering:

  • Runtime configuration
  • Dataset registry
  • Shared utilities
  • Preprocessing helpers
  • End-to-end preprocessing workflows

Validation

Verified:

  • Raw files remain unchanged
  • Expected outputs are generated
  • Registry behavior functions correctly

CI/CD Automation

Introduced GitHub Actions CI.

Automated Checks

Every push and pull request now runs:

  • Ruff linting
  • Ruff format validation
  • Python compile checks
  • Pytest suite

Optimization

Heavy ML dependencies are skipped using:

pytest.importorskip

allowing CI to remain lightweight and fast.


Dependency Management Modernization

Packaging Improvements

PyProject

Established:

pyproject.toml

as the primary source of truth.

Requirements

Re-aligned dependencies with AzureML environments.

Removed

  • Stale editable installs
  • Redundant package definitions
  • Empty package skeletons

Added

  • Dependency validation tooling
  • Documentation for requirements generation

Type Safety and Static Analysis

Implemented:

  • Extensive type annotations
  • Mypy integration
  • Typed utility functions
  • Typed preprocessing modules
  • Typed pipeline components

Benefits:

  • Better IDE support
  • Improved maintainability
  • Earlier error detection

Documentation Expansion

Added or expanded:

  • Repository documentation
  • Pipeline documentation
  • Model documentation
  • LDA documentation
  • Visualization documentation
  • Azure configuration documentation
  • Data schema documentation
  • Environment variable documentation
  • Dataset onboarding guides
  • README updates

The project is now substantially better positioned for future model development, experimentation, deployment, collaboration, and public-facing demonstrations.

karim-sharkawy and others added 30 commits January 12, 2026 20:47
… well as a new update_model() code, but none of this is any good right now.
…s) and started working on the git structure
…min_samples to 3 for small Reddit datasets. Add dynamic topic modeling import and execution.
- Rename src/emotion_visualization.py -> src/emotion_visualizations.py
  so azureml/run_scripts.sh (which invokes emotion_visualizations.py)
  no longer crashes with FileNotFoundError.
- emotion_classification.py: fix undefined data_dir/output_dir
  references in load_environment, move "datasets loaded" summary
  print out of the per-file loop.
- topic_modeling.py:
  * Restore Cohere imports behind a try/except so cohere_integration()
    no longer NameErrors when COHERE_API_KEY is set.
  * Fix dataset-name regex: re.sub on Path.stem never matched .csv$,
    leaving the "filtered_" prefix intact.
  * Add _select_params() substring lookup so twitter-specific params
    are actually applied (exact-match lookup on real dataset names
    always fell back to defaults).
  * Drop NaN-text rows at load time so docs_dict and dfs[name] stay
    aligned, preventing length-mismatch crashes on topic assignment.
- data_preprocessing.py: dedupe dataset discovery - calling
  load_datasets twice produced duplicate entries processing the
  same file twice.
- reddit_data_filtering.py: error message referenced CODE_DIR but
  the variable checked is REDDIT_RAW_DIR.
- twitter_data_cleaner.py: "/n" -> "\n" in final print.
- LDA/lda_topic_modeling.py: comment out stray "!pip install" /
  "pip install" lines that were raw SyntaxErrors.

https://claude.ai/code/session_01SJy6pZnBRBJEgMCvfTQtGk
a-shahrabi and others added 29 commits May 5, 2026 21:50
…ect.toml as source of truth, added documentation to easily create requirements.txt again
@karim-sharkawy karim-sharkawy merged commit e906949 into main Jun 12, 2026
2 checks passed

3 participants