MVP 2 Release: Merge dev into main by karim-sharkawy · Pull Request #30 · Climate-Resilient-Communities/ClimateLens

karim-sharkawy · 2026-06-12T02:29:43Z

Contributors

Karim El-Sharkawy

Responsible for topic modeling, emotion classification, project architecture, repository organization, NLP pipeline development, documentation, Streamlit webapp development, dependency management, type safety improvements, and overall project direction.

Ardavan Shahrabi

Responsible for dynamic topic modeling, visualization systems, pipeline stabilization, CI/CD improvements, preprocessing modernization, documentation, LDA improvements, dependency cleanup, testing infrastructure, environment standardization, bug fixes, and code quality enhancements.

Luis Ticas

Provided technical guidance, architecture discussions, troubleshooting support, testing assistance, design reviews, and implementation feedback throughout development. While not directly represented in commit authorship, Luis contributed significantly to many of the improvements included in this merge and helped shape several of the architectural and reliability decisions adopted in the project.

This pull request consolidates approximately five months (January 12, 2026 – June 8, 2026) of development work including 75 commits across 3 contributors. Due to the breadth of changes, this PR should be reviewed at the system, feature, and architecture level rather than commit-by-commit.

Executive Summary

MVP 2 transforms the project from a research prototype into a maintainable, reproducible NLP platform through architectural modernization, CI/CD, testing infrastructure, dataset registry management, Streamlit visualization tooling, and significant pipeline reliability improvements.

Major improvements include:

Complete Streamlit visualization platform
Production-grade pipeline architecture
Dataset registry system
Centralized runtime and environment management
Structured logging
Extensive testing infrastructure
CI/CD automation
Type annotations and static analysis
Dependency modernization
Topic modeling improvements
Emotion analysis enhancements
Large-scale documentation expansion
Significant bug fixes and pipeline stabilization

Major Architecture Initiative: Production-Grade Pipeline Foundation

One of the largest changes included in this merge was a complete architectural modernization of the pipeline.

NLP and Topic Modeling Improvements

Major improvements include:

BERTopic grid search tooling
Dynamic Topic Modeling integration
Separate DTM workflows
Improved small-dataset handling
LDA cleanup and modernization
Emotion analysis integration
Improved keyword filtering
Cohere integration fixes

Runtime Configuration System

New `runtime.py`

Introduced a centralized runtime management system built around:

load_runtime()

which returns a typed RuntimeConfig object containing canonical paths for:

Data directories
Processed data directories
Output directories
Visualization directories
Model directories

Improvements

Automatic AzureML environment detection
Consistent path management
Local development support
Elimination of duplicated environment-loading code previously scattered across scripts
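
A minimal sketch of how such a loader could be shaped, for reviewers who have not opened `runtime.py`. Only `load_runtime()`, `RuntimeConfig`, `DATA_DIR` and `PROCESSED_DATA_DIR` come from this PR; the field names, the `PROJECT_ROOT`, `OUTPUT_DIR`, `VISUALIZATION_DIR` and `MODEL_DIR` variables, and the `AZUREML_RUN_ID` check are assumptions made for illustration and may not match the real module.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RuntimeConfig:
    """Canonical paths for one pipeline run (field names are illustrative)."""

    on_azureml: bool
    data_dir: Path
    processed_data_dir: Path
    output_dir: Path
    visualization_dir: Path
    model_dir: Path


def load_runtime(env=None):
    """Build a RuntimeConfig from environment variables, with local defaults."""
    env = os.environ if env is None else env
    # Assumption: a run-ID variable marks an AzureML job; the real check may differ.
    on_azureml = "AZUREML_RUN_ID" in env
    # Assumption: PROJECT_ROOT is a hypothetical override for the base directory.
    root = Path(env.get("PROJECT_ROOT", Path.cwd()))
    return RuntimeConfig(
        on_azureml=on_azureml,
        data_dir=Path(env.get("DATA_DIR", root / "data")),
        processed_data_dir=Path(env.get("PROCESSED_DATA_DIR", root / "data" / "processed")),
        output_dir=Path(env.get("OUTPUT_DIR", root / "outputs")),
        visualization_dir=Path(env.get("VISUALIZATION_DIR", root / "outputs" / "visualizations")),
        model_dir=Path(env.get("MODEL_DIR", root / "models")),
    )
```

A script would then call `load_runtime()` once at startup and pass the resulting object down, rather than reading environment variables in several places.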

Structured Logging

New `logging_config.py`

Replaced scattered print statements throughout the codebase with centralized structured logging.

Benefits include:

Timestamped logs
Log levels
Consistent formatting
Improved debugging
Improved observability
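
As a rough illustration of the pattern, the sketch below configures the standard-library `logging` module once and lets every module request a named logger. The function name `configure_logging`, the format string, and the logger name are assumptions; the actual `logging_config.py` may differ.

```python
import logging
import sys

# Assumed format: timestamp, level, logger name, message.
LOG_FORMAT = "%(asctime)s | %(levelname)-8s | %(name)s | %(message)s"


def configure_logging(level=logging.INFO, stream=None):
    """Install a single handler on the root logger so all modules log alike."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    root = logging.getLogger()
    root.handlers.clear()  # avoid duplicated lines if configured twice
    root.addHandler(handler)
    root.setLevel(level)


configure_logging()
log = logging.getLogger("pipeline.preprocessing")  # hypothetical logger name
log.info("loaded %d rows from %s", 1200, "example.csv")
```

Compared with `print`, the level and logger name make it possible to filter output per stage without editing the code.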

Dataset Registry System

New Dataset Configuration Architecture

Introduced:

src/config/datasets.yaml

and typed dataset specifications.

Previous Approach

Pipeline stages previously relied on logic such as:

"twitter" in filename.lower()

to determine processing behavior.

New Approach

Datasets are now defined through registry entries containing:

Filename patterns
Text columns
Timestamp columns
Topic modeling profiles
Emotion classification profiles

Benefits

Adding a new dataset now requires:

Adding a YAML configuration entry
Running the pipeline

No code modifications are required.
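
To make the registry idea concrete, here is a small sketch of pattern-based dataset resolution. The `DatasetSpec` class, its field names, and every value in `REGISTRY` are hypothetical stand-ins for whatever `src/config/datasets.yaml` actually contains; a Python list is used in place of parsed YAML so the example has no dependencies.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class DatasetSpec:
    """One registry entry (field names are illustrative)."""

    name: str
    filename_pattern: str
    text_column: str
    timestamp_column: str
    topic_profile: str
    emotion_profile: str


# Stand-in for the parsed contents of datasets.yaml; all values are made up.
REGISTRY = [
    DatasetSpec("twitter", "*twitter*.csv", "text", "created_at", "large_corpus", "short_text"),
    DatasetSpec("reddit", "*reddit*.csv", "body", "created_utc", "small_corpus", "long_text"),
]


def resolve_dataset(filename, registry=REGISTRY):
    """Return the first spec whose pattern matches the filename, or fail loudly."""
    lowered = filename.lower()
    for spec in registry:
        if fnmatchcase(lowered, spec.filename_pattern):
            return spec
    raise LookupError(f"No registry entry matches {filename!r}")
```

With this shape, pipeline stages ask the registry which text column or profile to use instead of branching on the filename themselves, and an unknown file raises an error rather than silently receiving default behavior.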

Shared IO Utilities

Introduced:

`io_helpers.py`

Including:

require_columns()
drop_missing_text()
safe_write_csv()

Benefits:

Early validation failures
Improved error visibility
Prevention of silent failures
Consistent file-writing behavior
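
The sketch below shows one plausible reading of the three helpers. Only the function names come from this PR; the signatures are guesses, and rows are represented as plain lists of dicts (rather than DataFrames) purely to keep the example dependency-free.

```python
import csv
import os
import tempfile
from pathlib import Path


def require_columns(rows, columns, source="input"):
    """Raise immediately if any expected column is absent."""
    present = set(rows[0].keys()) if rows else set()
    missing = [c for c in columns if c not in present]
    if missing:
        raise ValueError(f"{source} is missing required columns: {missing}")


def drop_missing_text(rows, text_column):
    """Keep only rows whose text field is non-empty after stripping whitespace."""
    return [r for r in rows if (r.get(text_column) or "").strip()]


def safe_write_csv(rows, path, fieldnames):
    """Write to a temp file in the target directory, then rename over the target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_name, path)  # readers never see a half-written file
    except Exception:
        os.unlink(tmp_name)  # clean up the partial temp file before re-raising
        raise
```

Writing through a temporary file is one way to get consistent file-writing behavior: an interrupted run leaves the previous output in place instead of a truncated CSV.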

Raw Data Protection

A major reliability issue was resolved.

Previous Behavior

Several pipeline stages overwrote raw CSV files directly.

Affected components included:

data_preprocessing.py
topic_modeling.py

A failed run could permanently alter source data.

New Behavior

Pipeline stages now:

Read from DATA_DIR
Write to PROCESSED_DATA_DIR

Raw inputs remain untouched.

Validation

Added smoke tests that:

Hash raw files before execution
Hash raw files after execution
Fail if any modification occurs

This guarantees raw dataset integrity.
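
The hash-before, hash-after check can be sketched as follows. The helper names `snapshot_hashes` and `assert_raw_untouched` are hypothetical and the choice of SHA-256 is an assumption; the real smoke tests may be organized differently.

```python
import hashlib
from pathlib import Path


def snapshot_hashes(directory):
    """Map each file's relative path to the SHA-256 digest of its contents."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def assert_raw_untouched(raw_dir, run_pipeline):
    """Run the pipeline and fail if any raw file was added, removed, or changed."""
    before = snapshot_hashes(raw_dir)
    run_pipeline()
    after = snapshot_hashes(raw_dir)
    assert before == after, "raw data was modified during the run"
```

Comparing whole dictionaries catches new and deleted files as well as edited ones, since a changed key set also makes the two snapshots unequal.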

Critical Pipeline Reliability Fixes

A broad stabilization effort fixed multiple issues affecting end-to-end execution across the pipeline. For the full list of changes, see commit dcc57fe97017eb4dcfd431e0d2eeec7933f856e3.

Emotion Classification

Resolved configuration and environment-loading issues.
Fixed dataset/output directory handling and summary reporting.
Improved execution reliability and logging.

Topic Modeling

Restored safe Cohere integration to prevent runtime failures.
Fixed dataset name detection and parameter selection, enabling dataset-specific tuning.
Improved topic assignment stability by handling missing-text rows correctly.
Eliminated duplicate dataset discovery and processing.
Corrected Reddit and Twitter processing/output issues.
Removed leftover notebook commands that caused LDA execution failures.

Result: More reliable dataset processing, correct configuration behavior, improved topic assignment stability, and fewer runtime failures.

Pipeline Component Modernization

Data Preprocessing

Refactored to:

Use dataset registry
Use shared utility functions
Eliminate duplicated prefix stripping
Write outputs to processed-data locations

Topic Modeling

Refactored to:

Use load_runtime()
Use registry-defined topic profiles
Read processed data instead of raw inputs
Use safe write operations
Remove hidden file mutation side effects

Emotion Classification

Modernized to:

Use registry-defined emotion profiles
Load models only once per profile
Use structured logging
Use safe output writing
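
Loading a model only once per profile is commonly done with a cache keyed on the profile name. The sketch below uses `functools.lru_cache`; the function names, the profile names, and the placeholder "model" are all assumptions and stand in for whatever the real script loads.

```python
from functools import lru_cache

load_calls = []  # records every real load so the caching is observable


@lru_cache(maxsize=None)
def get_classifier(profile):
    """Load the model for a profile once; later calls reuse the cached object."""
    load_calls.append(profile)
    # Placeholder for an expensive load such as a transformer pipeline.
    return lambda text: {"profile": profile, "label": "neutral", "text": text}


def classify_files(files_by_profile):
    """Classify every file's texts, sharing one model per profile."""
    results = []
    for profile, texts in files_by_profile:
        model = get_classifier(profile)
        results.extend(model(t) for t in texts)
    return results
```

If three files share two profiles, `load_calls` ends up with two entries rather than three, which is the saving this bullet describes.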

Emotion Visualizations

Improved workflow by:

Reading classified outputs directly
Detecting existing emotion labels
Avoiding unnecessary reclassification

This removes redundant model execution and prevents inconsistent label taxonomies.

Utility Script Improvements

Reddit Filtering

Replaced broad exception handling
Improved error visibility

Twitter Cleaner

Removed unnecessary Colab dependencies
Improved error handling

Twitter Utilities

Rebuilt:

twitter_chunks.py
twitter_sample.py

as proper command-line tools using:

argparse

instead of notebook-style execution.
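
For reference, a command-line surface built with `argparse` might look like the sketch below. The arguments (`input`, `--output`, `--n`, `--seed`) and their defaults are hypothetical, not the actual options of `twitter_sample.py`, and the sampling logic is omitted.

```python
import argparse


def build_parser():
    """Describe the CLI; every flag here is illustrative."""
    parser = argparse.ArgumentParser(
        prog="twitter_sample.py",
        description="Draw a random sample of rows from a large CSV.",
    )
    parser.add_argument("input", help="path to the source CSV")
    parser.add_argument("--output", required=True, help="where to write the sample")
    parser.add_argument("--n", type=int, default=1000, help="number of rows to sample")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    return parser


def main(argv=None):
    """Parse arguments; passing argv explicitly keeps the function testable."""
    args = build_parser().parse_args(argv)
    # The sampling itself is omitted; this sketch only shows the CLI surface.
    return args
```

A real script would end with the usual `if __name__ == "__main__": main()` guard, so that importing the module (for example from a test) does not trigger execution the way a notebook cell would.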

Streamlit WebApp

A complete Streamlit web interface was introduced, intended to replace an earlier, overly complex interface. See src/webui/app for the full code.

Testing Infrastructure

A comprehensive testing framework was added.

Test Coverage

Introduced:

Runtime configuration tests
Dataset registry tests
Dataset loading tests
IO helper tests
Preprocessing tests
Smoke tests

Coverage

33 unit and smoke tests covering:

Runtime configuration
Dataset registry
Shared utilities
Preprocessing helpers
End-to-end preprocessing workflows

Validation

Verified:

Raw files remain unchanged
Expected outputs are generated
Registry behavior functions correctly

CI/CD Automation

Introduced GitHub Actions CI.

Automated Checks

Every push and pull request now runs:

Ruff linting
Ruff format validation
Python compile checks
Pytest suite

Optimization

Tests that depend on heavy ML libraries are skipped when those libraries are not installed, using:

pytest.importorskip

allowing CI to remain lightweight and fast.
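
A test guarded this way might look like the following sketch. The test name and the assertion are hypothetical; `bertopic` is used as the example module only because BERTopic is mentioned elsewhere in this PR.

```python
import pytest


def test_topic_model_imports():
    # Reports the test as skipped, not failed, when the library is missing.
    bertopic = pytest.importorskip("bertopic")
    # Only reached on machines where the heavy dependency is installed.
    assert hasattr(bertopic, "BERTopic")
```

On a lightweight CI runner without the ML stack, pytest records this test as skipped and the rest of the suite still runs; on a full development environment the same test executes normally.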

Dependency Management Modernization

Packaging Improvements

PyProject

Established:

pyproject.toml

as the primary source of truth.

Requirements

Re-aligned dependencies with AzureML environments.

Removed

Stale editable installs
Redundant package definitions
Empty package skeletons

Added

Dependency validation tooling
Documentation for requirements generation

Type Safety and Static Analysis

Implemented:

Extensive type annotations
Mypy integration
Typed utility functions
Typed preprocessing modules
Typed pipeline components

Benefits:

Better IDE support
Improved maintainability
Earlier error detection

Documentation Expansion

Added or expanded:

Repository documentation
Pipeline documentation
Model documentation
LDA documentation
Visualization documentation
Azure configuration documentation
Data schema documentation
Environment variable documentation
Dataset onboarding guides
README updates

The project is now substantially better positioned for future model development, experimentation, deployment, collaboration, and public-facing demonstrations.


- Rename src/emotion_visualization.py -> src/emotion_visualizations.py so azureml/run_scripts.sh (which invokes emotion_visualizations.py) no longer crashes with FileNotFoundError.
- emotion_classification.py: fix undefined data_dir/output_dir references in load_environment, move "datasets loaded" summary print out of the per-file loop.
- topic_modeling.py:
  * Restore Cohere imports behind a try/except so cohere_integration() no longer NameErrors when COHERE_API_KEY is set.
  * Fix dataset-name regex: re.sub on Path.stem never matched .csv$, leaving the "filtered_" prefix intact.
  * Add _select_params() substring lookup so twitter-specific params are actually applied (exact-match lookup on real dataset names always fell back to defaults).
  * Drop NaN-text rows at load time so docs_dict and dfs[name] stay aligned, preventing length-mismatch crashes on topic assignment.
- data_preprocessing.py: dedupe dataset discovery - calling load_datasets twice produced duplicate entries processing the same file twice.
- reddit_data_filtering.py: error message referenced CODE_DIR but the variable checked is REDDIT_RAW_DIR.
- twitter_data_cleaner.py: "/n" -> "\n" in final print.
- LDA/lda_topic_modeling.py: comment out stray "!pip install" / "pip install" lines that were raw SyntaxErrors.

https://claude.ai/code/session_01SJy6pZnBRBJEgMCvfTQtGk


karim-sharkawy and others added 30 commits January 12, 2026 20:47

drasticaly sped up computation by changing models and increasing batch rate. Made small changes as well

90f31bd

removed classification code

1b967b0

minor changes

9613d99

added azure dependency

b2bab74

added some util files

10f80a3

added specific error handeling and changed some of the main() code as well as a new update_model() code, but none of this is any good right now.

3b3f5a9

small cleanup

a36619d

moved azure env recognition to separate file

3fd5849

made cohere integration a separate file

de118b6

removed unneccesary code

180f9aa

seperated dtm from regular topic modeling script

93cd18f

cleaned up code and fixed pipeline issues

f971a29

added grid search file for finding best bertopic configs

e202486

Resolves #8: cleaned up setup details (check comment for extra details) and started working on the git structure

39c3173

Closes #10: replaced unnecessary folders with useful ones for the future

26268ee

removed unnecessary README

308c72a

removed code and changed name of main() function

bc0fae0

Fix #12: adjust parameters for small files, set min_cluster_size and min_samples to 3 for small Reddit datasets. Add dynamic topic modeling import and execution.

6f517aa

added visual resizing

e47ad41

IDE/Cache additions

4746c9a

moved files to appropriate folders

c218509

added documentation: azure configuration and data schema

9860259

added LDA documentation in docs/

382e4ad

model documentation

2349de7

added environment loader as util

14cdb16

parent folder name change

58a7f6e

Resolves #5: check comments

1880c05

major refactor

ea8b5c4

repository documentation

6409129

a-shahrabi and others added 29 commits May 5, 2026 21:50

Fix import sorting and pin ruff version to match CI

41dde97

Pass data_path to process_datasets call

b545b5f

Fix process_datasets call signature in run_pipeline

acc4b06

Remove duplicate loop that mutated raw input files

2a6c6a3

feat: added script to manage differences in dependencies between pyproject.toml and requirements.txt

4bfc945

created emotion analysis script

fab2e60

fix reformating issues

a355323

Clean up: remove stray =2.6 file, drop nltk from CI, wire visualization into emotion_analysis

a1f5a3e

visualization docs

4a55e2e

updated dependency structure: removed requirements.txt to keep pyproject.toml as source of truth, added documentation to esily create requirements.txt again

b34bb63

removed traceback module due to logging_config file

f98d44e

Clean up LDA baseline: fix deps, remove duplicates, add type annotations

11ae34a

Add click as explicit dependency for spaCy in CI

17542d0

Add trailing newlines to LDA files

d2d69bb

Apply ruff formatting to LDA files

2e5a7ba

initialized webui structure

5d6f39a

added pages and working streamlit app

fc047c0

added pages and navigation organization, and added some content to insights and bashobard pages

4a5bd4c

added content and descriptions

2c95825

moved visualizations

018d487

added render_visualization() to find the proper folder in dashboards.py

3adec2b

added page icons in navigation bar + ruff fixes

60ee1f5

fixed problems with rendering files via render_visuaizations() in utils.py

8248811

ruff format fixes

d4537d3

added sprout logo

a94ce69

added try statements so app works on HF and locally without issues

c48148c

spacing and wording changes

a325d5f

added basic tests. A good start for #20

2a58e01

Fix #20, small changes to test_dataset_loading.py

1689309

karim-sharkawy merged commit e906949 into main Jun 12, 2026
2 checks passed

karim-sharkawy merged 89 commits into main from dev
