🕸️ Telecom GraphRAG

Mining hidden relationships from 300K customer conversations, and asking an LLM to explain what it sees.

Python spaCy Neo4j LangChain Llama 3.1 Streamlit License: MIT


Buried inside 300,000+ raw support conversations are answers no one has time to read: which devices fail most, which teams really handle escalations, which issues recur in which cities. This project pulls those hidden relationships out into a typed knowledge graph, then puts an LLM on top of it as an analyst, so you can ask the graph questions in plain English and get grounded, evidence-backed answers. You don't read the graph. The graph reads itself for you.


Telecom knowledge graph

Hidden dependencies surfaced from 300K+ telecom conversations: typed entities, labeled relations, every edge validated by an LLM with confidence ≥ 0.90.


🎬 Live Demo

Ask the graph anything. It retrieves the relevant triples, hands them to an LLM, and the LLM gives you a synthesized answer, with the source evidence shown below so you can verify the grounding yourself.

Streamlit app: landing page
Project pipeline in the sidebar; suggested questions in the main panel.

Aggregation query: device frequency
Aggregation across thousands of triples: the LLM surfaces which device is most associated with reported issues, with raw graph counts as verifiable evidence.

Pattern detection across the graph
Hidden pattern discovery: the LLM reads the retrieved subgraph and reports three patterns a human would have needed hours of EDA to find (device concentration, recurring use case, geographic neutrality).

Long-form reasoning across the graph
End-to-end process reconstruction: the LLM traces an entire escalation workflow from disconnected graph fragments, and explicitly flags what the schema does not yet capture.

🧬 The Core Idea

Raw text is unstructured. Sales meetings, customer calls, support tickets, complaint logs: all of it sits in massive piles that nobody reads end-to-end. Inside those piles are real signals:

  • Which products keep breaking together?
  • Which agents handle which kinds of escalations?
  • Are issues geographic or device-specific?
  • Which billing problems map to which service tiers?

Traditional analytics requires you to already know what to look for. Vanilla RAG with embeddings can find similar passages, but it cannot tell you how things connect. This project takes a different route:

Step 1: Convert text into a graph. Every entity (customer, device, issue, agent, account) becomes a typed node. Every relationship found in the text becomes a labeled edge with the original source sentence preserved as evidence. This turns scattered language into queryable structure.

Step 2: Validate the graph with an LLM. A rule-based extractor will pick up garbage, so every candidate triple is reviewed by Llama-3.1 before insertion; the LLM can fix it, normalize it, or reject it. Only triples passing valid=True AND confidence ≥ 0.90 enter the final graph.

Step 3: Use the LLM as the analyst. When you ask a question, the system retrieves the relevant subgraph and hands it to the LLM along with strict grounding instructions. The LLM reads the graph, finds the pattern, and explains it in plain English. You never have to look at the graph itself.

The result is a system that discovers relationships you didn't know were there and surfaces them as readable insights.


✨ Why This Project Stands Out

| | Most RAG projects | This project |
|---|---|---|
| Knowledge representation | Opaque embeddings | Typed nodes + labeled relations |
| What it surfaces | Semantically similar passages | Structural relationships, dependencies, recurring patterns |
| Provenance | Lost after chunking | Every fact traces back to a source sentence |
| Hallucination handling | "Best-effort" answers | Explicit "I don't know" when graph evidence is missing |
| Validation | None | LLM reviews every triple before it enters the graph |
| Schema | Ad-hoc | Formal OWL ontology: 9 classes, 14 object properties |
| Data sources | Single dataset | 3 heterogeneous telecom datasets unified into one graph |

The novelty isn't any single component; it's the end-to-end discipline: every triple is typed, validated, scored, and traceable. The LLM is constrained to act as a reader of graph evidence, not a free-form text generator.


🗺️ The Pipeline

┌───────────────────────────────────────────────────────────────┐
│                300K+ Telecom Conversations                    │
│       (Talkmap 200K · Comcast 5K · Bitext 27K)                │
└─────────────────────────┬─────────────────────────────────────┘
                          ▼
              ┌───────────────────────┐
              │  1. Preprocessing     │  9-step custom cleaner
              │                       │  emoji · url · noise · dedup
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  2. NER               │  spaCy en_core_web_lg
              │                       │  + EntityRuler + regex
              │                       │  → 6 typed entities
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  3. Relation Mining   │  dependency-parse rules
              │                       │  → (subject, relation, object)
              │                       │  with confidence scores
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  4. LLM Validation  ★ │  Llama-3.1-8B reviewer
              │                       │  validate · fix · reject
              │                       │  → conf ≥ 0.90 gate
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  5. Neo4j Graph       │  typed nodes + weighted
              │                       │  edges in Neo4j Aura
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  6. GraphRAG Analyst  │  Cypher retrieval +
              │                       │  LLM reads the subgraph
              │                       │  and reports insights
              └───────────────────────┘

★ The LLM-in-the-loop validation stage is the project's novel contribution. The rule-based extractor produces ~98K candidate triples; the validator prunes them to ~1.1K high-confidence facts (roughly 1% of the candidates). Aggressive precision was a deliberate design choice: better a small graph the system can trust than a large graph full of noise.


🗂️ The Data

Three heterogeneous telecom datasets, unified into one coherent corpus:

| Dataset | Records | Type | Content |
|---|---|---|---|
| Talkmap Telecom | 200,000 | Multi-turn dialogues | Customer-agent conversation transcripts |
| Comcast Complaints | ~5,000 | Free-form text | Customer complaints with category labels |
| Bitext Customer Q&A | 27,000 | Intent-labeled pairs | Instruction · response · intent · category |
| Total | ~300K | Mixed | |

A stratified 100K sample of Talkmap was used during development; the full 200K is available for production runs.
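
A stratified sample like the one described above could be drawn roughly as follows. This is a minimal sketch using only the standard library; the strata field (`intent`), the record layout, and the function name `stratified_sample` are assumptions for illustration, since the field Talkmap was stratified on is not stated here.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=42):
    """Sample roughly `fraction` of `records` while preserving each stratum's share.

    `key` names the field that defines the strata (hypothetical here).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)

    sample = []
    for group in strata.values():
        # Keep at least one record per stratum so rare groups are not lost.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy corpus: 80 "billing" and 20 "network" conversations.
corpus = [{"id": i, "intent": "billing" if i < 80 else "network"} for i in range(100)]
half = stratified_sample(corpus, key="intent", fraction=0.5)
```

Fixing the seed keeps the development sample reproducible across notebook runs.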


🔬 Stage-by-Stage

1️⃣ Data Exploration: 2_explore_data.ipynb

Systematic noise profiling across all three datasets revealed dataset-specific quirks that drove custom cleaning per source:

| Noise type | Talkmap | Comcast | Bitext |
|---|---|---|---|
| URLs / links | ✅ frequent | rare | rare |
| Emojis | ✅ frequent | rare | none |
| Repeated tokens ("to to to") | ✅ frequent | rare | none |
| Parenthetical asides ("(to self)") | ✅ frequent | none | none |
| Non-ASCII garbage | ✅ frequent | occasional | rare |

A reusable filter_by_pattern() utility was built for fast regex-based inspection across millions of rows.
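
The notebook's implementation of filter_by_pattern() is not shown here, so the following is only a sketch of what such a utility might look like. It assumes the texts are a plain list of strings and uses the standard `re` module; the signature, the `limit` parameter, and the two example patterns are hypothetical.

```python
import re


def filter_by_pattern(texts, pattern, flags=re.IGNORECASE, limit=None):
    """Return (index, text) pairs whose text matches `pattern`.

    Hypothetical signature: the notebook's actual utility may differ.
    """
    compiled = re.compile(pattern, flags)
    hits = []
    for index, text in enumerate(texts):
        if compiled.search(text):
            hits.append((index, text))
            # Stop early when only a handful of examples is needed.
            if limit is not None and len(hits) >= limit:
                break
    return hits


rows = [
    "Please visit http://example.com for details",
    "I need to to to reset my router",
    "Thanks for your help",
]
url_rows = filter_by_pattern(rows, r"https?://|www\.")
repeat_rows = filter_by_pattern(rows, r"\b(\w+)(?:\s+\1\b)+")
```

Returning the row index alongside the text makes it easy to jump back to the original record when profiling a noise type.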

2️⃣ Preprocessing: 3_preprocessing.ipynb

Custom 9-step text cleaner built from scratch; no off-the-shelf cleaner was used:

clean_pipeline = [
    remove_emojis,              # Unicode emoji stripping
    remove_urls,                # http/www links
    mask_emails,                # → [EMAIL] token
    remove_mentions_hashtags,   # @user, #tag
    remove_parenthetical_noise, # "(To self)", "(Thumbs up)"
    remove_repeated_tokens,     # "to to to" → "to"
    normalize_punctuation,      # "!!!" → "!"
    remove_garbage_symbols,     # non-ASCII noise
    clean_whitespace,           # canonical spacing
]

All three sources are unified into one clean CSV, the foundation for everything downstream.
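
To make the pipeline idea concrete, here is a minimal sketch of four of the steps and of how a list of cleaning functions can be applied in order. The function names mirror the list above, but the regular expressions and the `clean_text` helper are assumptions, not the notebook's actual code.

```python
import re
from functools import reduce

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)
PUNCT_RE = re.compile(r"([!?.,])\1+")


def remove_urls(text):
    # Replace links with a space so neighbouring words do not merge.
    return URL_RE.sub(" ", text)


def remove_repeated_tokens(text):
    # "to to to" -> "to"
    return REPEAT_RE.sub(r"\1", text)


def normalize_punctuation(text):
    # "!!!" -> "!"
    return PUNCT_RE.sub(r"\1", text)


def clean_whitespace(text):
    return " ".join(text.split())


# Order matters: whitespace is normalized last, after other steps leave gaps.
sketch_pipeline = [remove_urls, remove_repeated_tokens, normalize_punctuation, clean_whitespace]


def clean_text(text, steps=sketch_pipeline):
    """Apply each cleaning step in order."""
    return reduce(lambda current, step: step(current), steps, text)


cleaned = clean_text("I need to to to reset my router!!!   See www.example.com/help")
```

Keeping every step as a plain `str -> str` function means steps can be reordered or dropped per dataset without touching the others.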

3️⃣ Named Entity Recognition: 4_ner_extraction.ipynb

The first step toward turning text into a graph: find the entities. Hybrid NER combines spaCy's statistical model with hand-crafted deterministic rules for telecom-specific concepts:

| Entity Type | Examples | Method |
|---|---|---|
| SERVICE | data plan, broadband, roaming, voicemail | EntityRuler patterns |
| PRODUCT | router, SIM card, iPhone, Android, modem | EntityRuler patterns |
| ISSUE | no signal, billing issue, network outage | EntityRuler patterns |
| ACTION | refund, reset, escalate, cancel, upgrade | EntityRuler patterns |
| ACCOUNT_ID | AB-12345, JKL87654321 | Regex `[A-Z]{2,5}-?\d{4,}` |
| PHONE_NUMBER | +1-800-XXX-XXXX | Regex (international) |

GPU acceleration via thinc.prefer_gpu() for efficient batch processing at scale.
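
The regex half of the hybrid approach can be illustrated with the ACCOUNT_ID pattern from the table. This sketch uses only the standard `re` module; the surrounding word boundaries and the helper name `find_account_ids` are assumptions added for the example, and the wiring into spaCy is left out.

```python
import re

# Pattern taken from the entity table; the \b word boundaries are an added assumption.
ACCOUNT_ID_RE = re.compile(r"\b[A-Z]{2,5}-?\d{4,}\b")


def find_account_ids(text):
    """Return (matched_text, start, end) for every account-like ID in `text`."""
    return [(m.group(), m.start(), m.end()) for m in ACCOUNT_ID_RE.finditer(text)]


sentence = "Accounts AB-12345 and JKL87654321 were merged; ref A-1 ignored."
spans = find_account_ids(sentence)
```

Returning character offsets rather than just the matched strings keeps the result usable for span-based entity annotation later on.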

4️⃣ Relation Mining: 5_relation_extraction.ipynb

The crucial step: finding how entities connect. Dependency-parse-based rules walk every sentence looking for typed relationships between extracted entities. Each candidate is produced as a fully-traceable triple:

{
  "subject": "customer",
  "subject_type": "PERSON",
  "relation": "REPORTED",
  "object": "billing issue",
  "object_type": "ISSUE",
  "evidence_text": "The customer reported a billing issue on their account.",
  "confidence": 0.87,
  "extraction_method": "dep_rule",
  "source": "telecom_200k"
}

Typed entities, relation label, source sentence, confidence score, dataset origin: every triple in the graph is fully auditable. This is what makes the system trustworthy: every claim points back to a real sentence in real data.
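
One way to carry these fields through the pipeline is a small immutable record type. The sketch below mirrors the field names of the JSON example above; the `Triple` class, its `key` property, and the keep-the-highest-confidence deduplication rule are assumptions for illustration rather than the notebook's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    subject_type: str
    relation: str
    object: str
    object_type: str
    evidence_text: str
    confidence: float
    extraction_method: str
    source: str

    @property
    def key(self):
        # Case-insensitive identity of the fact, ignoring where it was found.
        return (self.subject.lower(), self.relation, self.object.lower())


def deduplicate(triples):
    """Keep one triple per (subject, relation, object), preferring higher confidence."""
    best = {}
    for triple in triples:
        current = best.get(triple.key)
        if current is None or triple.confidence > current.confidence:
            best[triple.key] = triple
    return list(best.values())


record = {
    "subject": "customer", "subject_type": "PERSON", "relation": "REPORTED",
    "object": "billing issue", "object_type": "ISSUE",
    "evidence_text": "The customer reported a billing issue on their account.",
    "confidence": 0.87, "extraction_method": "dep_rule", "source": "telecom_200k",
}
weaker = dict(record, subject="Customer", confidence=0.60)
unique = deduplicate([Triple(**weaker), Triple(**record)])
```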

5️⃣ LLM Validation: 6_relation_cleaning.ipynb ⭐

The novel stage. Rule-based extraction is fast but noisy. So every candidate triple gets reviewed by Llama-3.1-8B-Instruct via a LangChain validation chain:

You are a knowledge graph validation assistant.
Your task:
- Decide if the relation is correct
- If incorrect, fix it
- If meaningless, reject it
- Normalize entity names
- Normalize relation label
Return ONLY valid JSON.

The LLM acts as a quality gate, catching errors the rule-based extractor cannot:

  • Semantic nonsense (e.g. "ticket REPORTED_VIA ticket")
  • Mis-typed arguments (e.g. VPN labeled as a device when it's a service)
  • Ambiguous references and partial entity captures

Quality gate: only triples where valid=True AND llm_confidence ≥ 0.90 survive. The threshold is deliberately strict: the goal is a trustworthy graph, not a large one.
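
The gate itself is simple to express in code. The sketch below assumes the reviewer's reply is a JSON object carrying the `valid` and `llm_confidence` fields named above; the rest of the reply schema, the helper names, and the choice to treat unparseable replies as rejections are assumptions.

```python
import json

CONFIDENCE_THRESHOLD = 0.90


def parse_verdict(raw_reply):
    """Parse the reviewer's JSON reply; return None when it is not usable."""
    try:
        verdict = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    return verdict if isinstance(verdict, dict) else None


def passes_gate(verdict, threshold=CONFIDENCE_THRESHOLD):
    """True only for valid=True AND llm_confidence >= threshold."""
    if verdict is None:
        return False
    confidence = verdict.get("llm_confidence")
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return verdict.get("valid") is True and confidence >= threshold


replies = [
    '{"valid": true, "llm_confidence": 0.95, "relation": "REPORTED"}',
    '{"valid": true, "llm_confidence": 0.72, "relation": "USES"}',
    '{"valid": false, "llm_confidence": 0.99, "relation": "REPORTED_VIA"}',
    "Sure! Here is the JSON you asked for...",
]
kept = [verdict for verdict in map(parse_verdict, replies) if passes_gate(verdict)]
```

Rejecting malformed replies outright, instead of retrying or repairing them, matches the precision-first stance of the gate, at the cost of discarding some recoverable triples.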

6️⃣ The Knowledge Graph: 7_neo4j_RAG.ipynb

Validated triples loaded into Neo4j Aura with typed node labels and confidence-weighted edges:

MERGE (a:Entity {name: $subject})
MERGE (b:Entity {name: $object})
MERGE (a)-[rel:REPORTED {confidence: $conf}]->(b)

Once loaded, the graph reveals patterns that were invisible in the raw text: which devices co-occur with which issues, which agents handle which escalations, how billing amounts cluster across accounts. The graph is also exported to NetworkX for local visualization and statistical analysis.
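
The Cypher snippet above hard-codes the REPORTED relationship type. Query parameters in Cypher have traditionally not been usable for relationship types, so a loader that handles several relation labels usually has to interpolate the type into the query text after validating it. The sketch below shows that idea without connecting to Neo4j; the helper names and the allowed-character rule are assumptions, and setting the confidence with `SET` instead of inside `MERGE` is a design choice of this sketch, not something taken from the notebook.

```python
import re

# Only upper-case identifiers are accepted as relationship types (assumed rule).
REL_TYPE_RE = re.compile(r"[A-Z][A-Z0-9_]*")


def build_merge_query(relation):
    """Return a parameterized MERGE query for one validated relationship type."""
    if not REL_TYPE_RE.fullmatch(relation):
        raise ValueError(f"unsafe relation type: {relation!r}")
    return (
        "MERGE (a:Entity {name: $subject})\n"
        "MERGE (b:Entity {name: $object})\n"
        f"MERGE (a)-[rel:{relation}]->(b)\n"
        "SET rel.confidence = $conf"
    )


def to_params(triple):
    """Map a validated triple onto the query's parameter names."""
    return {
        "subject": triple["subject"],
        "object": triple["object"],
        "conf": triple["confidence"],
    }


triple = {"subject": "customer", "relation": "REPORTED",
          "object": "billing issue", "confidence": 0.95}
query = build_merge_query(triple["relation"])
params = to_params(triple)
```

Merging on the bare relationship and then setting `confidence` avoids creating a second edge between the same two nodes when the same fact arrives with a different score.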

7️⃣ GraphRAG as an Analyst: 7_neo4j_RAG.ipynb

The chatbot doesn't generate answers; it reads the graph and reports what it finds. Three steps:

  1. Question → Cypher: natural language mapped to graph traversal
  2. Graph → Context: Neo4j returns relevant (subject)-[relation]→(object) triples + evidence sentences
  3. Context → Insight: the LLM is strictly constrained to answer from those triples

prompt = f"""
You are a customer support assistant.
Answer ONLY using these facts:

{context}

Question: {question}
"""

The "Answer ONLY using these facts" constraint is what keeps the answers grounded in graph evidence, including the explicit "I don't know" when the graph lacks evidence. The user gets the pattern; the LLM does the reading; the graph supplies the evidence.
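
The Graph → Context → Insight steps can be sketched as plain functions. The example below assumes retrieved rows are dictionaries with the triple fields plus `evidence_text`, and that the LLM is any callable taking a prompt string; returning "I don't know" without calling the model when retrieval comes back empty is one possible way to get that behaviour and is an assumption, not a description of the notebook.

```python
def format_context(rows):
    """Render retrieved triples and their evidence as one fact per line."""
    lines = []
    for row in rows:
        fact = f"({row['subject']})-[{row['relation']}]->({row['object']})"
        lines.append(f"{fact} | evidence: {row['evidence_text']}")
    return "\n".join(lines)


def build_prompt(question, rows):
    """Build the grounded prompt, or return None when there is no evidence."""
    if not rows:
        return None
    return (
        "You are a customer support assistant.\n"
        "Answer ONLY using these facts:\n\n"
        f"{format_context(rows)}\n\n"
        f"Question: {question}\n"
    )


def answer(question, rows, llm):
    """`llm` is any callable mapping a prompt string to a reply string."""
    prompt = build_prompt(question, rows)
    if prompt is None:
        return "I don't know"
    return llm(prompt)


rows = [{
    "subject": "customer", "relation": "REPORTED", "object": "billing issue",
    "evidence_text": "The customer reported a billing issue on their account.",
}]
grounded_prompt = build_prompt("What did the customer report?", rows)
fallback = answer("Which city has the most outages?", [], llm=lambda p: "unused")
```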


🧠 OWL Ontology

The graph is backed by a formal OWL ontology (ontology.owl) that defines its semantic structure:

  • Classes: Customer, Service, Issue, Device, Action, Account, Ticket, Agent, Team
  • Object Properties (14 in total), including hasIssue, requestedAction, usesService, reportedBy, escalatedTo, memberOf
  • Data Properties: confidence scores, timestamps, source dataset, evidence text

The ontology ensures the graph is semantically consistent, extensible, and interoperable with other knowledge systems, and it gives the LLM a stable vocabulary to reason about.


🚀 Quick Start

Run the demo locally

git clone https://github.com/houdhoudGH/Customer_Service_ChatBot.git
cd Customer_Service_ChatBot

python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate

pip install -r requirements.txt
streamlit run app.py

The Streamlit demo runs without any external services: no Neo4j credentials or LLM tokens are required. It replays the question/answer pairs from the original Neo4j Aura pipeline.

Reproduce the full pipeline

To run the full extraction and validation chain against your own data:

export NEO4J_URI="neo4j+s://your-instance.databases.neo4j.io"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="your-password"
export HUGGINGFACEHUB_API_TOKEN="hf_..."

Run the notebooks in order: 1_load_data → 2_explore_data → 3_preprocessing → 4_ner_extraction → 5_relation_extraction → 6_relation_cleaning → 7_neo4j_RAG
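
Before starting the notebooks, it can help to check that all four variables are set. The helper below is a hypothetical convenience, not part of the repository; it only reads the environment variable names listed above.

```python
import os

REQUIRED_VARS = (
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "HUGGINGFACEHUB_API_TOKEN",
)


def load_settings(environ=os.environ):
    """Return the required settings, or fail early listing what is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "NEO4J_URI": "neo4j+s://your-instance.databases.neo4j.io",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "your-password",
    "HUGGINGFACEHUB_API_TOKEN": "hf_...",
}
settings = load_settings(example_env)
```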


⚙️ Tech Stack

| Layer | Technology |
|---|---|
| NLP & NER | spaCy en_core_web_lg, EntityRuler, custom regex |
| LLM Validation | Llama-3.1-8B-Instruct (HuggingFace Inference API) |
| LLM Orchestration | LangChain (LLMChain, SequentialChain, PromptTemplate) |
| Graph Database | Neo4j Aura + neo4j Python driver |
| Graph Analysis | NetworkX, Matplotlib |
| Data Processing | Pandas, NumPy |
| GPU Acceleration | CUDA via thinc / PyTorch |
| Ontology | OWL / Protégé |
| Demo UI | Streamlit |

📁 Repository Structure

Customer_Service_ChatBot/
├── app.py                                # Streamlit demo
├── ontology.owl                          # Formal OWL knowledge graph ontology
├── requirements.txt
├── data/
│   ├── raw/                              # Original datasets (git-ignored)
│   └── processed/
│       ├── all_clean_for_ner.csv
│       ├── relations_extraction.csv      # 98K extracted triples
│       └── relations_llm_validated.csv   # 1.1K LLM-validated triples
├── notebooks/
│   ├── 1_load_data.ipynb                 # Multi-source ingestion + stratified sampling
│   ├── 2_explore_data.ipynb              # EDA + noise profiling across 3 datasets
│   ├── 3_preprocessing.ipynb             # 9-step custom cleaning pipeline
│   ├── 4_ner_extraction.ipynb            # Hybrid NER (spaCy + EntityRuler + regex)
│   ├── 5_relation_extraction.ipynb       # Relation mining → typed triples
│   ├── 6_relation_cleaning.ipynb         # Dedup + Llama-3.1 LLM validation loop
│   └── 7_neo4j_RAG.ipynb                 # Graph loading + GraphRAG analyst
└── docs/
    └── images/                           # Diagrams and demo screenshots

🔮 Roadmap

The pipeline is production-ready for static knowledge graphs. Five directions for extension:

  • Domain-adapted NER: fine-tune spaCy on annotated telecom data for higher entity recall
  • Temporal reasoning: extend the ontology with time-indexed relations (e.g. issue resolved on date) to enable trend analysis
  • Community detection: Louvain clustering to surface natural issue groups and hidden subnetworks the LLM can summarize
  • Graph embeddings: Node2Vec embeddings for similarity-based retrieval, enabling "find tickets like this one" queries
  • Retrieval benchmarking: head-to-head comparison of GraphRAG vs. vanilla vector RAG on a held-out Q&A evaluation set

📄 License

MIT. See LICENSE for details.


Built by Gheffari Nour El Houda

Master 2 Data Science & NLP · AI Engineer

This project explores knowledge graphs, LLM-in-the-loop validation, and faithful RAG, demonstrating how structured extraction can surface hidden patterns from unstructured text, and how an LLM can act as an analyst that reads the graph so you don't have to.

spaCy · Neo4j · LangChain · HuggingFace · Llama 3.1 · NetworkX · Streamlit · OWL



⭐ If you found this useful, consider giving the repo a star
