The AI News Intelligence System is an end-to-end NLP pipeline that automatically analyzes news articles across four dimensions: multilabel topic classification, named entity recognition, entity-aware abstractive summarization, and misinformation risk scoring. It fine-tunes RoBERTa for multilabel classification with per-label threshold tuning, uses spaCy for NER, fine-tunes T5/BART for entity-grounded summarization, and engineers five composite misinformation signals into a single Mis-Risk Score [0-1]. The system is served through a live Streamlit application, providing media platforms, PR teams, and trading desks a real-time article intelligence dashboard.
- Ingested a multilabel news dataset with columns: `article_id`, `headline`, `body_text`, `source_domain`, `published_at`, `language`, `labels`, `summary_ref`, `entities_ref`, `mis_risk_label`, `word_count`, `scrape_noise`
- Profiled data quality issues: HTML tags in ~8% of headlines, encoding noise in ~5% of bodies, mixed-case source domains, ~10% null language codes, ~35% partial `mis_risk_label` values
- Flagged evaluation-only columns (`summary_ref`, `entities_ref`, `mis_risk_label`) as strictly no-input to prevent data leakage
- Stripped HTML tags, normalized Unicode, removed scrape boilerplate (nav text, ad fragments) using custom `preprocess.py`
- Programmatic language detection via `langdetect`; the `language` column is not trusted directly due to ~10% nulls and wrong labels
- Lowercased, removed excess whitespace, handled encoding errors
- Profiled article-length distribution and applied 512-token truncation strategy for BERT compatibility (headline + first 3 sentences used as input for efficiency)
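The cleaning and input-construction steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the contents of the project's `preprocess.py`; the function names `clean_text` and `build_model_text`, the regex-based tag stripping, and the naive sentence splitter are all assumptions, and language detection with `langdetect` is left out.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")                 # crude HTML tag matcher (assumption)
WS_RE = re.compile(r"\s+")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")    # naive sentence boundary (assumption)


def clean_text(raw: str) -> str:
    """Strip HTML tags, normalize Unicode, collapse whitespace, lowercase."""
    text = html.unescape(raw)                    # decode entities such as &nbsp;
    text = TAG_RE.sub(" ", text)                 # drop tags, keep their inner text
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = WS_RE.sub(" ", text).strip()
    return text.lower()


def build_model_text(headline: str, body: str, n_sentences: int = 3) -> str:
    """Join the headline with the first n sentences of the body."""
    sentences = [s for s in SENT_SPLIT_RE.split(body.strip()) if s]
    return " ".join([headline.strip()] + sentences[:n_sentences])
```

Keeping only the headline and the first few sentences shortens the input before the tokenizer applies its own 512-token truncation.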
- Plotted label co-occurrence heatmap across 10 topic categories to identify overlapping label pairs
- Computed label density (avg labels per article) and identified rare label combinations requiring special handling
- Analyzed headline vs. body length correlation; visualized entity type distribution per topic category
- Performed noise audit: quantified HTML tag rate, encoding error rate, and scrape fragment presence
- Constructed `model_text` column combining headline + body for transformer input
- Applied `safe_parse_labels()` to robustly parse multi-hot label lists from string representations
- Engineered five misinformation signals: clickbait score (headline sentiment intensity), emotional language ratio (NRC lexicon), source credibility (domain whitelist lookup), factual density (entity count per 100 words), quote authenticity (direct vs. indirect quote ratio)
- Extracted entity-level features: `entity_label_counts` per article, `entity_distribution` per topic label
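One way a parser like `safe_parse_labels()` could work is sketched below. The project's actual implementation is not shown in this document, so the body here is an assumption: it uses `ast.literal_eval` and falls back to an empty list on anything it cannot parse.

```python
import ast


def safe_parse_labels(value):
    """Parse a multi-hot label list stored as a string such as "[1, 0, 1]".

    Returns an empty list for nulls, NaN, or malformed strings
    (this fallback is an assumption, not the project's documented behavior).
    """
    if isinstance(value, (list, tuple)):
        return [int(v) for v in value]
    if not isinstance(value, str) or not value.strip():
        return []
    try:
        parsed = ast.literal_eval(value.strip())
        if not isinstance(parsed, (list, tuple)):
            return []
        return [int(v) for v in parsed]
    except (ValueError, SyntaxError, TypeError):
        return []
```

`ast.literal_eval` only evaluates Python literals, so it is safer than `eval` on strings read from a scraped CSV.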
- Applied label-weighted `BCEWithLogitsLoss` in RoBERTa fine-tuning to address label imbalance
- Searched for the F1-optimal threshold per label on the validation set (not a global 0.5), which is mandatory for imbalanced multilabel settings
- Used `OneVsRestClassifier` wrappers in baseline models to handle multi-hot targets
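The per-label threshold search can be illustrated with a small grid search in plain Python. This is a sketch under assumptions: the grid (0.05 to 0.95 in steps of 0.05), the tie-breaking rule (keep the lowest threshold that reaches the best F1), and the helper names are not taken from the project.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(y_true, y_prob, grid=None):
    """Pick, for each label column, the grid threshold with the best validation F1."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    thresholds = []
    for j in range(len(y_true[0])):
        truth = [row[j] for row in y_true]
        probs = [row[j] for row in y_prob]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            preds = [1 if p >= t else 0 for p in probs]
            score = f1_binary(truth, preds)
            if score > best_f1:          # strict ">" keeps the lowest best threshold
                best_t, best_f1 = t, score
        thresholds.append(best_t)
    return thresholds


# Toy validation set: 4 articles, 2 labels.
y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
y_prob = [[0.9, 0.2], [0.4, 0.1], [0.3, 0.35], [0.1, 0.05]]
thresholds = tune_thresholds(y_true, y_prob)
```

On this toy data a global 0.5 cutoff would miss the second article for label 0 and the third article for label 1, while the tuned thresholds recover both.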
- Built three baselines: TF-IDF + Logistic Regression (One-vs-Rest), TF-IDF + Linear SVM, Word2Vec averaged embeddings + MLP
- Evaluated with Micro-F1, Macro-F1, Hamming Loss, and Jaccard Similarity Score
- Compared results in a side-by-side table to justify moving to transformer fine-tuning
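For reference, the four evaluation metrics can be computed by hand on multi-hot matrices as in the sketch below. It is a plain-Python illustration rather than the project's evaluation code (which lists scikit-learn for metrics); the sample-averaged Jaccard and the convention of scoring an empty-vs-empty row as 1.0 are assumptions.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro-F1, Macro-F1, Hamming loss, and sample-averaged Jaccard."""
    n, k = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * k, [0] * k, [0] * k
    mismatches, jaccard_sum = 0, 0.0
    for t_row, p_row in zip(y_true, y_pred):
        inter = union = 0
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tp[j] += t & p
            fp[j] += (1 - t) & p
            fn[j] += t & (1 - p)
            mismatches += t ^ p          # label slots where truth and prediction differ
            inter += t & p
            union += t | p
        jaccard_sum += inter / union if union else 1.0  # assumption for empty rows

    def f1(a, b, c):
        denom = 2 * a + b + c
        return 2 * a / denom if denom else 0.0

    return {
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),      # pool counts over labels
        "macro_f1": sum(f1(tp[j], fp[j], fn[j]) for j in range(k)) / k,
        "hamming_loss": mismatches / (n * k),
        "jaccard": jaccard_sum / n,
    }


metrics = multilabel_metrics([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
```

Micro-F1 is dominated by frequent labels, while Macro-F1 weights every label equally, which is why the two are reported together on imbalanced data.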
- Fine-tuned `roberta-base` with a sigmoid output head (one sigmoid per label) using `BCEWithLogitsLoss`
- Applied label-weighted loss; used headline + first 3 sentences as input for efficiency within the 512-token limit
- Tuned per-label decision threshold using F1-optimal search on validation set
- Generated SHAP token attribution plots for model interpretability
- Achieved >0.81 Micro-F1 on the test set
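The label weighting can be illustrated without a GPU. The sketch below assumes the weights follow the common negatives-to-positives ratio per label, which this document does not confirm; `compute_pos_weight` and `weighted_bce_with_logits` are hypothetical helper names. The second function reproduces, for a single logit, the formula that `torch.nn.BCEWithLogitsLoss(pos_weight=...)` applies.

```python
import math


def compute_pos_weight(label_rows):
    """Per-label weight = (#negatives / #positives); 1.0 if a label never occurs."""
    n, k = len(label_rows), len(label_rows[0])
    weights = []
    for j in range(k):
        pos = sum(row[j] for row in label_rows)
        weights.append((n - pos) / pos if pos else 1.0)
    return weights


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_bce_with_logits(logit, target, pos_weight):
    """Loss for one label: pos_weight scales only the positive-class term."""
    return pos_weight * target * softplus(-logit) + (1 - target) * softplus(logit)


train_labels = [[1, 0], [0, 0], [0, 1], [0, 1]]
pos_weight = compute_pos_weight(train_labels)
# With PyTorch installed, the same weights would be passed as
# torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weight)).
```

A rare label gets a weight above 1, so missing one of its positives costs more than a false alarm.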
- Loaded spaCy `en_core_web_sm` as NER backbone; extracted entities with `text`, `label_`, `start_char`, `end_char`
- Built `extract_entities()` and `predict_entities()` functions; computed `entity_label_counts` per article
- Analyzed entity type distribution per topic label using `topic_entity_counts` (defaultdict of Counters)
- Saved spaCy model to disk (`spacy_ner_model/`) and entity labels as `entity_labels.pkl` for Streamlit inference
- Achieved >0.79 NER F1
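The per-topic aggregation can be sketched with `defaultdict(Counter)` as below. To keep the example runnable without spaCy, entities are represented as plain dicts whose keys mirror the span attributes named above; the dict layout, the `topics` field, and the helper bodies are assumptions rather than the project's code.

```python
from collections import Counter, defaultdict


def entity_label_counts(entities):
    """Count entity labels (PERSON, ORG, ...) for a single article."""
    return Counter(ent["label"] for ent in entities)


def build_topic_entity_counts(articles):
    """Accumulate entity label counts under every topic assigned to an article."""
    topic_entity_counts = defaultdict(Counter)
    for article in articles:
        counts = entity_label_counts(article["entities"])
        for topic in article["topics"]:
            topic_entity_counts[topic].update(counts)
    return topic_entity_counts


# Hypothetical, hand-written sample records.
articles = [
    {
        "topics": ["Politics", "Economy"],
        "entities": [
            {"text": "Jane Doe", "label": "PERSON", "start_char": 0, "end_char": 8},
            {"text": "Acme Corp", "label": "ORG", "start_char": 20, "end_char": 29},
        ],
    },
    {
        "topics": ["Economy"],
        "entities": [
            {"text": "Acme Corp", "label": "ORG", "start_char": 4, "end_char": 13},
        ],
    },
]
topic_entity_counts = build_topic_entity_counts(articles)
```

Because an article can carry several topics, its entities are counted once under each of them.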
- Fine-tuned T5/BART on (article, summary) pairs with entity-grounding constraint: summaries must include at least one Person, Organisation, or Location entity from the source article
- Evaluated with ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore (F1), since ROUGE alone is insufficient
- Generated a sample article → summary gallery for qualitative evaluation
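How the entity-grounding constraint is enforced is not described here, so the following is only a post-hoc check that a generated summary satisfies it. The mapping of Person/Organisation/Location to the spaCy-style labels `PERSON`, `ORG`, `GPE`, and `LOC`, the case-insensitive substring match, and the function name are all assumptions.

```python
GROUNDING_LABELS = {"PERSON", "ORG", "GPE", "LOC"}  # assumed label mapping


def is_entity_grounded(summary, source_entities):
    """True if the summary mentions at least one qualifying source entity."""
    summary_lower = summary.lower()
    return any(
        ent["label"] in GROUNDING_LABELS and ent["text"].lower() in summary_lower
        for ent in source_entities
    )


# Hypothetical entities extracted from a source article.
source_entities = [
    {"text": "Acme Corp", "label": "ORG"},
    {"text": "Tuesday", "label": "DATE"},
]
grounded = is_entity_grounded("Acme Corp announced a recall.", source_entities)
ungrounded = is_entity_grounded("A recall was announced on Tuesday.", source_entities)
```

A check like this could be used to reject or regenerate summaries that mention no qualifying entity.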
- Engineered five rule-based + model-based features into a composite Mis-Risk Score [0-1]
- Calibrated score vs. human-annotated `mis_risk_label`; computed Brier Score; analyzed top-10 highest-risk articles
- Risk labels: Low (green), Medium (amber), High (red) rendered with color-coded metrics in Streamlit
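A composite score, its risk band, and the Brier Score could be computed along these lines. The signal weights and band cutoffs are not given in this document, so this sketch assumes equal weights, cutoffs at 0.33 and 0.66, and signals already scaled to [0, 1] with higher meaning riskier; every name below is hypothetical.

```python
SIGNALS = (
    "clickbait",
    "emotional_ratio",
    "source_risk",          # 1 - source credibility (assumed orientation)
    "low_factual_density",
    "quote_risk",
)


def mis_risk_score(signals, weights=None):
    """Weighted average of the five signals, clamped to [0, 1]."""
    weights = weights or {name: 1.0 for name in SIGNALS}
    total = sum(weights[name] for name in SIGNALS)
    score = sum(weights[name] * signals[name] for name in SIGNALS) / total
    return min(1.0, max(0.0, score))


def risk_label(score, low=0.33, high=0.66):
    """Map a score to Low / Medium / High using assumed cutoffs."""
    if score < low:
        return "Low"
    if score < high:
        return "Medium"
    return "High"


def brier_score(probs, outcomes):
    """Mean squared difference between predicted risk and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


example = {name: 0.5 for name in SIGNALS}
score = mis_risk_score(example)
label = risk_label(score)
```

A lower Brier Score means the predicted risk tracks the annotated `mis_risk_label` more closely, which is what calibration aims for.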
- Built multi-section UI: headline + body + source domain input → Classification → NER → Summarization → Misinformation Risk
- Custom CSS: gradient prediction cards, white summary cards, styled metric labels, themed background
- Error-isolated try/except blocks per component; probability table sortable by score; entity dataframe sortable by label
- Shared tokenizer across classifier, NER, and summarizer components for parameter efficiency
- Modularized inference into four utility files: `classifier_utils.py`, `ner_utils.py`, `summarizer_utils.py`, `misinformation_utils.py`
- All model artifacts versioned (`model_roberta_v1.pt`, etc.); hyperparameters stored in `config.yaml`; seeds set for all random operations
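The per-component error isolation can be sketched as a small wrapper. This is a hedged illustration, not code from `app.py`; `run_component` and the result-dict layout are hypothetical.

```python
def run_component(name, fn, *args, **kwargs):
    """Run one pipeline stage and capture its failure instead of raising."""
    try:
        return {"name": name, "ok": True, "result": fn(*args, **kwargs), "error": None}
    except Exception as exc:  # deliberately broad: one stage must not stop the others
        return {
            "name": name,
            "ok": False,
            "result": None,
            "error": f"{type(exc).__name__}: {exc}",
        }


def failing_stage(text):
    """Stand-in for a component whose model failed to load."""
    raise ValueError("model not loaded")


results = [
    run_component("ner", failing_stage, "some article text"),
    run_component("summary", lambda text: text.upper(), "some article text"),
]
```

In a Streamlit app, each result could then be rendered either as its normal widget or as an error message, depending on the `ok` flag.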
Fine-tuned RoBERTa with per-label sigmoid heads and F1-optimal threshold tuning, so an article can be simultaneously tagged as Politics, Economy, and Health.
spaCy-powered NER extracting Person, Organisation, Location, Event, and Law entities with character-level span positions.
T5/BART abstractive summarizer constrained to ground every summary in at least one named entity from the source article.
Composite Mis-Risk Score combining clickbait detection, emotional language ratio, source credibility lookup, factual density, and quote authenticity into a calibrated [0-1] score.
Full per-label probability breakdown displayed as a sortable dataframe, not just top predicted labels.
Gradient prediction cards, custom metric typography, white summary cards, and a themed background: a production-grade UI built entirely in CSS within Streamlit.
Four independent utility modules (classifier_utils, ner_utils, summarizer_utils, misinformation_utils) with isolated error handling, so one component failure doesn't crash the app.
Label-weighted BCE loss + per-label F1-optimal threshold search replacing the naive global 0.5 threshold, which is mandatory for real-world imbalanced multilabel datasets.
TF-IDF + LR, TF-IDF + SVM, and Word2Vec + MLP baselines benchmarked with Hamming Loss, Micro/Macro F1, and Jaccard Score before transformer fine-tuning.
langdetect-based programmatic language filtering, because the dataset's language column has ~10% nulls and wrong labels and cannot be trusted directly.
- RoBERTa fine-tuned with multilabel sigmoid head and `BCEWithLogitsLoss`
- Per-label threshold tuning via F1-optimal search on validation set
- SHAP token attribution plots for interpretability
- Micro-F1 > 0.81 on held-out test set
- Gradient prediction cards rendered for each predicted label in Streamlit
- spaCy `en_core_web_sm` with entity types: PERSON, ORG, GPE, DATE, EVENT, LAW, MONEY, etc.
- Entity results displayed as sortable dataframe with entity text, label, start/end character offsets
- Entity label distribution analyzed per topic category using `defaultdict(Counter)`
- Model saved to disk as `spacy_ner_model/` for fast Streamlit loading
- Entity-grounding constraint: at least one Person/Org/Location entity must appear in the 3-sentence summary
- ROUGE-1, ROUGE-2, ROUGE-L + BERTScore (F1) evaluation
- Summary rendered in white card with large line-height for readability
- Clickbait score: headline sentiment intensity analysis
- Emotional language ratio: NRC lexicon-based emotion word density
- Source credibility: domain whitelist lookup from `source_domain` field
- Factual density: named entity count per 100 words
- Quote authenticity: direct vs. indirect quote ratio
- Risk Level (Low/Medium/High) color-coded with composite score displayed as metric
- Full feature value table rendered below risk metrics
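Two of the signals above have simple closed forms and can be sketched directly. The whitelist contents, the `www.` stripping, and the binary 1.0/0.0 credibility output are assumptions made for illustration; the domain normalization reflects the mixed-case `source_domain` values noted in the data profile.

```python
def factual_density(entity_count, word_count):
    """Named entities per 100 words; 0.0 for an empty article."""
    return 100.0 * entity_count / word_count if word_count else 0.0


def source_credibility(source_domain, whitelist):
    """1.0 if the normalized domain is whitelisted, else 0.0 (assumed binary signal)."""
    domain = (source_domain or "").strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return 1.0 if domain in whitelist else 0.0


# Hypothetical whitelist; the project's real list is not shown in this document.
TRUSTED_DOMAINS = {"example.com"}

density = factual_density(entity_count=9, word_count=300)
credibility = source_credibility("WWW.Example.COM", TRUSTED_DOMAINS)
```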
| Library | Role |
|---|---|
| `streamlit` | App framework, layout, widgets, custom CSS |

| Library | Role |
|---|---|
| `transformers` | RoBERTa, BERT-NER, T5/BART fine-tuning |
| `torch` | Model training, BCE loss, sigmoid heads |
| `spacy` | NER pipeline (`en_core_web_sm`) |
| `scikit-learn` | TF-IDF, Logistic Regression, SVM, MLP, metrics |

| Library | Role |
|---|---|
| `pandas` | DataFrame operations, label parsing, feature tables |
| `numpy` | Numerical ops, threshold tuning |
| `ast` | Safe label list parsing from string representations |
| `langdetect` | Programmatic language detection & filtering |
| `joblib` | Model artifact serialization (`entity_labels.pkl`) |

| Library | Role |
|---|---|
| `rouge_score` | ROUGE-1/2/L for summarization |
| `bert_score` | BERTScore (F1) for semantic similarity |
| `shap` | Token attribution plots for classifier interpretability |

| Library | Role |
|---|---|
| `collections` | Counter, defaultdict for entity distribution analysis |
| `os`, `shutil` | Model artifact path management |

| Tool | Role |
|---|---|
| Streamlit | Live demo app |
| Amazon EC2 | Cloud hosting |
| Git | Version control (one branch per component) |

    git clone https://github.com/your-username/ai-news-intelligence.git
    cd ai-news-intelligence

    # Windows
    python -m venv venv
    venv\Scripts\activate

    # macOS / Linux
    python3 -m venv venv
    source venv/bin/activate

    pip install -r requirements.txt

Key libraries:

    streamlit transformers torch spacy scikit-learn pandas numpy
    joblib langdetect rouge-score bert-score shap

Download spaCy model:

    python -m spacy download en_core_web_sm

Place the dataset CSV files in the NoteBooks/ directory:

    news_train_model.csv
    news_test_model.csv

Run notebooks in order:

    01_Data_Preprocessing.ipynb
    02_EDA_Label_Analysis.ipynb
    03_Baseline_ML_Models.ipynb
    04_Roberta_multilabel_news.ipynb
    05_NER_Entity_Extraction.ipynb
    06_Entity_Aware_Summarization.ipynb
    07_Misinformation_Scoring.ipynb

After running notebooks, confirm these exist:

    models/
    ├── roberta/
    │   └── model_roberta_v1.pt
    ├── ner/
    │   ├── spacy_ner_model/
    │   ├── entity_labels.pkl
    │   └── entity_distribution.csv
    ├── summarizer/
    │   └── model_t5_v1.pt
    └── misinformation/
        └── misinfo_model_v1.pkl

    streamlit run app.py

Visit http://localhost:8501 in your browser.

- News Aggregator Auto-Tagging: Automatically assign multiple topic labels to every incoming article before it hits personalized feeds. Handles the reality that a government healthcare budget article is simultaneously Politics, Economy, and Health.
- Brand & Reputation Monitoring: Companies monitor thousands of daily news mentions. The NER layer flags when an organization appears alongside lawsuits, product launches, or regulatory actions; multilabel classification routes the alert to the right internal team.
- Financial News Intelligence: Trading desks ingest news as signals. An article tagged Economy + International + Politics simultaneously signals a different market impact than one tagged Economy alone. The summarizer compresses a 900-word article to 3 sentences in milliseconds.
- Misinformation Detection Pipeline: Media platforms and fact-checkers can triage articles by Mis-Risk Score, prioritizing human review for high-risk articles flagged for clickbait headlines, emotional language spikes, or low source credibility.
- Editorial Workflow Automation: News desks use predicted labels + entity-aware summaries as a first draft of article metadata, reducing manual tagging time and surfacing the key named entities for sub-editors.
- NLP Research Benchmark: Provides a replicable multilabel classification + NER + summarization pipeline with per-label threshold tuning, BERTScore evaluation, and SHAP interpretability, making it a strong research baseline.
- Multilingual Support: Extend classification and NER to non-English articles using multilingual BERT (`bert-base-multilingual-cased`)
- Real-Time News Ingestion: RSS feed integration for automatic article ingestion and live scoring without manual input
- SHAP Dashboard Integration: Embed token attribution heatmaps directly in the Streamlit UI for end-user interpretability
- Custom NER Fine-Tuning: Fine-tune `dslim/bert-base-NER` on domain-specific entity types (Law, Event, Financial Instrument)
- Automated Calibration Reports: Scheduled Brier Score recalibration as the misinformation signal weights drift over time
- FastAPI Backend: Decouple inference from Streamlit UI with a FastAPI REST endpoint for production-scale throughput
- Active Learning Loop: Flag low-confidence predictions for human review and feed corrections back into periodic model retraining
- Containerized Deployment: Dockerize the full stack (model artifacts + Streamlit app) for reproducible EC2 / ECS deployment

┌──────────────────────────────────────────────────────────────────────────┐
│                          STREAMLIT UI (app.py)                           │
│      [Headline Input]  [Article Body Input]  [Source Domain Input]       │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      │
                             Analyze News Button
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            ▼                         ▼                         ▼
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│ classifier_utils.py  │  │ ner_utils.py         │  │ summarizer_utils.py  │
│ predict_categories() │  │ predict_entities()   │  │ summarize_article()  │
│ (RoBERTa)            │  │ (spaCy NER)          │  │ (T5/BART)            │
└───────────┬──────────┘  └───────────┬──────────┘  └───────────┬──────────┘
            │                         │                         │
            ▼                         ▼                         ▼
    labels + probs            entity list               summary text
    (per-label                (text, label,             (3 sentences,
    threshold)                start, end)               entity-grounded)
            │                         │                         │
            └─────────────────────────┼─────────────────────────┘
                                      │
                                      ▼
                     ┌────────────────────────────────┐
                     │ misinformation_utils.py        │
                     │ predict_misinformation_risk()  │
                     │ (5-signal composite)           │
                     └────────────────┬───────────────┘
                                      │
                                      ▼
                     ┌────────────────────────────────┐
                     │ mis_risk_score [0-1]           │
                     │ risk_label (L/M/H)             │
                     │ feature breakdown              │
                     └────────────────┬───────────────┘
                                      │
                                      ▼
                     ┌────────────────────────────────┐
                     │ STREAMLIT OUTPUT               │
                     │ Predicted Cards                │
                     │ Probability Table              │
                     │ Entity DataFrame               │
                     │ Summary Card                   │
                     │ Risk Metrics + Table           │
                     └────────────────────────────────┘

MODEL ARTIFACTS
├── model_roberta_v1.pt    (classifier)
├── spacy_ner_model/       (NER)
├── entity_labels.pkl      (NER labels)
├── model_t5_v1.pt         (summarizer)
└── misinfo_model_v1.pkl   (risk scorer)

The AI News Intelligence System is a production-grade multilabel NLP platform built to address the limitations of single-label news classification at scale. By fine-tuning roberta-base with a sigmoid output head and BCEWithLogitsLoss, the classifier assigns overlapping topic labels (Politics, Economy, Health, Crime, etc.) to each article, achieving over 0.81 Micro-F1 through per-label F1-optimal threshold tuning rather than a naive global 0.5 cutoff. A spaCy en_core_web_sm NER pipeline extracts named entities with character-level span positions, achieving over 0.79 NER F1, while an entity-grounding constraint on the T5/BART summarizer ensures every generated 3-sentence summary is anchored to at least one Person, Organisation, or Location entity from the source text. The misinformation scoring module engineers five signals (clickbait headline intensity, NRC-lexicon emotional language ratio, source domain credibility, factual entity density, and quote authenticity) into a calibrated composite Mis-Risk Score [0-1]. All four components are modularized into isolated utility files and served through a custom-styled Streamlit dashboard with gradient prediction cards, sortable probability tables, and color-coded risk metrics. The system targets media platforms, PR monitoring teams, and financial trading desks requiring real-time article intelligence at scale.
⭐ If you find this project useful, give it a star on GitHub and share your feedback!