MVP 2 Release: Merge dev into main#30
Merged
…h rate. Made small changes as well
… well as a new update_model() code, but none of this is any good right now.
…s) and started working on the git structure
…min_samples to 3 for small Reddit datasets. Add dynamic topic modeling import and execution.
- Rename src/emotion_visualization.py -> src/emotion_visualizations.py
so azureml/run_scripts.sh (which invokes emotion_visualizations.py)
no longer crashes with FileNotFoundError.
- emotion_classification.py: fix undefined data_dir/output_dir
references in load_environment, move "datasets loaded" summary
print out of the per-file loop.
- topic_modeling.py:
* Restore Cohere imports behind a try/except so cohere_integration()
no longer NameErrors when COHERE_API_KEY is set.
* Fix dataset-name regex: re.sub on Path.stem never matched .csv$,
leaving the "filtered_" prefix intact.
* Add _select_params() substring lookup so twitter-specific params
are actually applied (exact-match lookup on real dataset names
always fell back to defaults).
* Drop NaN-text rows at load time so docs_dict and dfs[name] stay
aligned, preventing length-mismatch crashes on topic assignment.
- data_preprocessing.py: dedupe dataset discovery - calling
load_datasets twice produced duplicate entries processing the
same file twice.
- reddit_data_filtering.py: error message referenced CODE_DIR but
the variable checked is REDDIT_RAW_DIR.
- twitter_data_cleaner.py: "/n" -> "\n" in final print.
- LDA/lda_topic_modeling.py: comment out stray "!pip install" /
"pip install" lines that were raw SyntaxErrors.
https://claude.ai/code/session_01SJy6pZnBRBJEgMCvfTQtGk
…oject.toml and requirements.txt
…on into emotion_analysis
…ect.toml as source of truth, added documentation to easily create requirements.txt again
…sights and dashboard pages
Contributors
Karim El-Sharkawy
Responsible for topic modeling, emotion classification, project architecture, repository organization, NLP pipeline development, documentation, Streamlit webapp development, dependency management, type safety improvements, and overall project direction.
Ardavan Shahrabi
Responsible for dynamic topic modeling, visualization systems, pipeline stabilization, CI/CD improvements, preprocessing modernization, documentation, LDA improvements, dependency cleanup, testing infrastructure, environment standardization, bug fixes, and code quality enhancements.
Luis Ticas
Provided technical guidance, architecture discussions, troubleshooting support, testing assistance, design reviews, and implementation feedback throughout development. While not directly represented in commit authorship, Luis contributed significantly to many of the improvements included in this merge and helped shape several of the architectural and reliability decisions adopted in the project.
Executive Summary
MVP 2 transforms the project from a research prototype into a maintainable, reproducible NLP platform through architectural modernization, CI/CD, testing infrastructure, dataset registry management, Streamlit visualization tooling, and significant pipeline reliability improvements.
Major improvements include:
Major Architecture Initiative: Production-Grade Pipeline Foundation
One of the largest changes included in this merge was a complete architectural modernization of the pipeline.
NLP and Topic Modeling Improvements
Major improvements include:
Runtime Configuration System
New runtime.py
Introduced a centralized runtime management system built around load_runtime(), which returns a typed RuntimeConfig object containing canonical paths for:
Improvements
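As a rough illustration of the load_runtime() / RuntimeConfig pattern described above, the sketch below resolves paths once and hands them around as a frozen dataclass. Only the names load_runtime, RuntimeConfig, DATA_DIR and PROCESSED_DATA_DIR come from this PR; the field names, the OUTPUT_DIR variable, the default paths, and the use of environment variables are assumptions made for the example, not the actual contents of runtime.py.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RuntimeConfig:
    """Typed container for canonical pipeline paths (field names are assumptions)."""
    data_dir: Path
    processed_data_dir: Path
    output_dir: Path


def load_runtime(env=None) -> RuntimeConfig:
    """Resolve every path once, falling back to hypothetical local defaults."""
    env = os.environ if env is None else env
    return RuntimeConfig(
        data_dir=Path(env.get("DATA_DIR", "data/raw")),
        processed_data_dir=Path(env.get("PROCESSED_DATA_DIR", "data/processed")),
        output_dir=Path(env.get("OUTPUT_DIR", "outputs")),
    )


# Passing a mapping instead of os.environ keeps the example deterministic.
cfg = load_runtime({"DATA_DIR": "/tmp/raw"})
print(cfg.data_dir, cfg.processed_data_dir, cfg.output_dir)
```

Because the dataclass is frozen, a stage that receives cfg cannot silently repoint a path for later stages.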
Structured Logging
New logging_config.py
Replaced scattered print statements throughout the codebase with centralized structured logging.
Benefits include:
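A minimal sketch of what replacing print statements with centralized logging can look like, using only the standard library. The helper name configure_logging, the format string, and the logger name are assumptions for illustration; the real logging_config.py may be organized differently.

```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    """Install a single root handler with one consistent format (hypothetical helper)."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers configured earlier (Python 3.8+)
    )


configure_logging()
logger = logging.getLogger("pipeline.emotion_classification")

# Before: print("datasets loaded:", 3)
logger.info("datasets loaded: %d", 3)
```

Each module asks for its own named logger, so the origin of a message stays visible without per-file formatting code.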
Dataset Registry System
New Dataset Configuration Architecture
Introduced src/config/datasets.yaml and typed dataset specifications.
Previous Approach
Pipeline stages previously relied on logic such as:
to determine processing behavior.
New Approach
Datasets are now defined through registry entries containing:
Benefits
Adding a new dataset now requires:
No code modifications are required.
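To make the registry idea concrete, here is a small sketch of typed dataset specifications built from registry entries. The DatasetSpec class, its fields, and the sample entries are hypothetical; the dictionary stands in for whatever a YAML loader would return for src/config/datasets.yaml, so the example needs no third-party packages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """One registry entry; every field name here is an illustrative assumption."""
    name: str
    source: str
    text_column: str
    min_samples: int = 10


# Stand-in for the parsed contents of datasets.yaml (entries are made up).
RAW_REGISTRY = {
    "datasets": [
        {"name": "reddit_small", "source": "reddit", "text_column": "body", "min_samples": 3},
        {"name": "twitter_2020", "source": "twitter", "text_column": "text"},
    ]
}


def load_registry(raw: dict) -> dict[str, DatasetSpec]:
    """Build specs eagerly so an unknown or missing key fails at load time."""
    return {entry["name"]: DatasetSpec(**entry) for entry in raw["datasets"]}


registry = load_registry(RAW_REGISTRY)
spec = registry["reddit_small"]
print(spec.text_column, spec.min_samples)
```

With this shape, supporting another dataset means adding one more entry to the registry data; the pipeline code that consumes DatasetSpec stays unchanged.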
Shared IO Utilities
Introduced io_helpers.py, including:
- require_columns()
- drop_missing_text()
- safe_write_csv()
Benefits:
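The three helper names above suggest roughly the following behavior. This is a dependency-free sketch that works on lists of dictionaries and the standard csv module; the actual io_helpers.py most likely operates on DataFrames, and the signatures shown here are assumptions.

```python
import csv
import os
import tempfile
from pathlib import Path


def require_columns(rows, required):
    """Fail fast if any row lacks an expected column."""
    for i, row in enumerate(rows):
        missing = [c for c in required if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")


def drop_missing_text(rows, text_column="text"):
    """Keep only rows whose text field is a non-blank string."""
    return [r for r in rows
            if isinstance(r.get(text_column), str) and r[text_column].strip()]


def safe_write_csv(rows, path, fieldnames):
    """Write to a temp file beside the target, then atomically swap it in."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp, path)  # readers never observe a half-written file
    except BaseException:
        os.unlink(tmp)
        raise


rows = [{"id": "1", "text": "hello"}, {"id": "2", "text": None}, {"id": "3", "text": "  "}]
require_columns(rows, ["id", "text"])
clean = drop_missing_text(rows)
```

Writing through a temporary file means an interrupted run leaves either the old output or the new one, never a truncated CSV.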
Raw Data Protection
A major reliability issue was resolved.
Previous Behavior
Several pipeline stages overwrote raw CSV files directly.
Affected components included:
- data_preprocessing.py
- topic_modeling.py
A failed run could permanently alter source data.
New Behavior
Pipeline stages now read from DATA_DIR and write to PROCESSED_DATA_DIR. Raw inputs remain untouched.
Validation
Added smoke tests that:
This guarantees raw dataset integrity.
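One way such a smoke test can be written is to hash each raw file before and after a stage runs. The sketch below uses a toy stage in a temporary directory; run_stage and check_raw_untouched are hypothetical names and do not reproduce the tests added in this PR.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to detect any modification."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(data_dir: Path, processed_dir: Path) -> None:
    """Toy stand-in for a pipeline stage: reads raw CSVs, writes results elsewhere."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    for src in data_dir.glob("*.csv"):
        cleaned = src.read_text(encoding="utf-8").lower()
        (processed_dir / src.name).write_text(cleaned, encoding="utf-8")


def check_raw_untouched() -> bool:
    """Return True if the stage produced output without altering the raw file."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "raw"
        processed_dir = Path(tmp) / "processed"
        data_dir.mkdir()
        raw_file = data_dir / "sample.csv"
        raw_file.write_text("id,text\n1,Hello\n", encoding="utf-8")

        before = file_digest(raw_file)
        run_stage(data_dir, processed_dir)
        after = file_digest(raw_file)

        return before == after and (processed_dir / "sample.csv").exists()
```

Comparing digests rather than modification times catches in-place rewrites even when the timestamps happen to match.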
Critical Pipeline Reliability Fixes
A broad stabilization effort fixed multiple issues affecting end-to-end execution across the pipeline. For the full list of changes in detail, please see commit dcc57fe97017eb4dcfd431e0d2eeec7933f856e3.
Emotion Classification
Topic Modeling
Result: More reliable dataset processing, correct configuration behavior, improved topic assignment stability, and fewer runtime failures.
Pipeline Component Modernization
Data Preprocessing
Refactored to:
Topic Modeling
Refactored to:
load_runtime()
Emotion Classification
Modernized to:
Emotion Visualizations
Improved workflow by:
This removes redundant model execution and prevents inconsistent label taxonomies.
Utility Script Improvements
Reddit Filtering
Twitter Cleaner
Twitter Utilities
Rebuilt twitter_chunks.py and twitter_sample.py as proper command-line tools using argparse instead of notebook-style execution.
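For readers unfamiliar with the pattern, a script converted from notebook-style execution to an argparse tool typically takes the following shape. The argument names, defaults, and the chunking description are assumptions for illustration; they are not the actual interface of twitter_chunks.py or twitter_sample.py.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line interface in one place (arguments are hypothetical)."""
    parser = argparse.ArgumentParser(description="Split a large CSV into fixed-size chunks.")
    parser.add_argument("input", type=Path, help="CSV file to split")
    parser.add_argument("--output-dir", type=Path, default=Path("chunks"),
                        help="directory that receives the chunk files")
    parser.add_argument("--chunk-size", type=int, default=100_000,
                        help="number of rows per chunk")
    return parser


def main(argv=None) -> int:
    """Parse arguments and report the planned work; real processing would go here."""
    args = build_parser().parse_args(argv)
    print(f"would split {args.input} into chunks of {args.chunk_size} rows "
          f"under {args.output_dir}")
    return 0
```

Calling main(["tweets.csv", "--chunk-size", "500"]) exercises the tool from Python, and accepting argv as a parameter is what makes the entry point testable without touching sys.argv.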
Streamlit WebApp
A complete web interface was introduced to replace the previous, overly complex one. See src/webui/app for the full code.
Testing Infrastructure
A comprehensive testing framework was added.
Test Coverage
Introduced:
Coverage
33 unit and smoke tests covering:
Validation
Verified:
CI/CD Automation
Introduced GitHub Actions CI.
Automated Checks
Every push and pull request now runs:
Optimization
Heavy ML dependencies are skipped using:
allowing CI to remain lightweight and fast.
Dependency Management Modernization
Packaging Improvements
PyProject
Established pyproject.toml as the primary source of truth.
Requirements
Re-aligned dependencies with AzureML environments.
Removed
Added
Type Safety and Static Analysis
Implemented:
Benefits:
Documentation Expansion
Added or expanded:
The project is now substantially better positioned for future model development, experimentation, deployment, collaboration, and public-facing demonstrations.