Mining hidden relationships from 300K customer conversations, and asking an LLM to explain what it sees.
Buried inside 300,000+ raw support conversations are answers no one has time to read: which devices fail most, which teams really handle escalations, which issues recur in which cities. This project pulls those hidden relationships out into a typed knowledge graph, then puts an LLM on top of it as an analyst, so you can ask the graph questions in plain English and get grounded, evidence-backed answers. You don't read the graph. The graph reads itself for you.
Hidden dependencies surfaced from 300K+ telecom conversations: typed entities, labeled relations, every edge validated by an LLM with confidence ≥ 0.90.
Ask the graph anything. It retrieves the relevant triples, hands them to an LLM, and the LLM gives you a synthesized answer, with the source evidence shown below so you can verify the grounding yourself.
Aggregation across thousands of triples: the LLM surfaces which device is most associated with reported issues, with raw graph counts as verifiable evidence.
Hidden pattern discovery: the LLM reads the retrieved subgraph and reports back three patterns a human would have needed hours of EDA to find: device concentration, recurring use case, geographic neutrality.
End-to-end process reconstruction: the LLM traces an entire escalation workflow from disconnected graph fragments, and explicitly flags what the schema does not yet capture.
Raw text is unstructured. Sales meetings, customer calls, support tickets, complaint logs: all of it sits in massive piles that nobody reads end-to-end. Inside those piles are real signals:
- Which products keep breaking together?
- Which agents handle which kinds of escalations?
- Are issues geographic or device-specific?
- Which billing problems map to which service tiers?
Traditional analytics requires you to already know what to look for. Vanilla RAG with embeddings can find similar passages but it cannot tell you how things connect. This project takes a different route:
Step 1: Convert text into a graph. Every entity (customer, device, issue, agent, account) becomes a typed node. Every relationship found in the text becomes a labeled edge with the original source sentence preserved as evidence. This turns scattered language into queryable structure.

Step 2: Validate the graph with an LLM. A rule-based extractor will pick up garbage. So every candidate triple is reviewed by Llama-3.1 before insertion: the LLM can fix it, normalize it, or reject it. Only triples passing valid=True AND confidence ≥ 0.90 enter the final graph.

Step 3: Use the LLM as the analyst. When you ask a question, the system retrieves the relevant subgraph and hands it to the LLM along with strict grounding instructions. The LLM reads the graph, finds the pattern, and explains it in plain English. You never have to look at the graph itself.

The result is a system that discovers relationships you didn't know were there and surfaces them as readable insights.
| | Most RAG projects | This project |
|---|---|---|
| Knowledge representation | Opaque embeddings | Typed nodes + labeled relations |
| What it surfaces | Semantically similar passages | Structural relationships, dependencies, recurring patterns |
| Provenance | Lost after chunking | Every fact traces back to a source sentence |
| Hallucination handling | "Best-effort" answers | Explicit "I don't know" when graph evidence is missing |
| Validation | None | LLM reviews every triple before it enters the graph |
| Schema | Ad-hoc | Formal OWL ontology: 9 classes, 14 object properties |
| Data sources | Single dataset | 3 heterogeneous telecom datasets unified into one graph |
The novelty isn't any single component; it's the end-to-end discipline: every triple is typed, validated, scored, and traceable. The LLM is constrained to act as a reader of graph evidence, not a free-form text generator.
┌───────────────────────────────────────────────────┐
│            300K+ Telecom Conversations            │
│     (Talkmap 200K · Comcast 5K · Bitext 27K)      │
└─────────────────────────┬─────────────────────────┘
                          ▼
              ┌───────────────────────┐
              │  1. Preprocessing     │  9-step custom cleaner
              │                       │  emoji · url · noise · dedup
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  2. NER               │  spaCy en_core_web_lg
              │                       │  + EntityRuler + regex
              │                       │  → 6 typed entities
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  3. Relation Mining   │  dependency-parse rules
              │                       │  → (subject, relation, object)
              │                       │  with confidence scores
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  4. LLM Validation ★  │  Llama-3.1-8B reviewer
              │                       │  validate · fix · reject
              │                       │  → conf ≥ 0.90 gate
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  5. Neo4j Graph       │  typed nodes + weighted
              │                       │  edges in Neo4j Aura
              └───────────┬───────────┘
                          ▼
              ┌───────────────────────┐
              │  6. GraphRAG Analyst  │  Cypher retrieval +
              │                       │  LLM reads the subgraph
              │                       │  and reports insights
              └───────────────────────┘
★ The LLM-in-the-loop validation stage is the project's novel contribution. The rule-based extractor produces ~98K candidate triples; the validator prunes them to ~1.1K high-confidence facts. Aggressive precision was a deliberate design choice: better a small graph the system can trust than a large graph full of noise.
Three heterogeneous telecom datasets, unified into one coherent corpus:
| Dataset | Records | Type | Content |
|---|---|---|---|
| Talkmap Telecom | 200,000 | Multi-turn dialogues | Customer-agent conversation transcripts |
| Comcast Complaints | ~5,000 | Free-form text | Customer complaints with category labels |
| Bitext Customer Q&A | 27,000 | Intent-labeled pairs | Instruction · response · intent · category |
| Total | ~300K | Mixed | |
A stratified 100K sample of Talkmap was used during development; the full 200K is available for production runs.
Systematic noise profiling across all three datasets revealed dataset-specific quirks that drove custom cleaning per source:
| Noise type | Talkmap | Comcast | Bitext |
|---|---|---|---|
| URLs / links | frequent | rare | rare |
| Emojis | frequent | rare | none |
| Repeated tokens ("to to to") | frequent | rare | none |
| Parenthetical asides ("(to self)") | frequent | none | none |
| Non-ASCII garbage | frequent | occasional | rare |
A reusable filter_by_pattern() utility was built for fast regex-based inspection across millions of rows.
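The source does not show filter_by_pattern() itself. A minimal sketch of what such a helper could look like, assuming rows are plain strings and using only the standard library (the signature, the default flags, and the sample rows are illustrative assumptions, not the project's code):

```python
import re
from typing import Iterable, Iterator, Tuple


def filter_by_pattern(
    rows: Iterable[str], pattern: str, flags: int = re.IGNORECASE
) -> Iterator[Tuple[int, str]]:
    """Yield (row_index, text) for every row that matches the regex.

    Hypothetical signature: the original utility is not shown in the source.
    A generator keeps memory flat when scanning very large inputs.
    """
    compiled = re.compile(pattern, flags)  # compile once, reuse for every row
    for index, text in enumerate(rows):
        if compiled.search(text):
            yield index, text


# Illustrative rows, made up for this sketch.
sample_rows = [
    "my router keeps dropping, see https://example.com/help",
    "i want to to to cancel my plan",
    "thanks, issue resolved",
]

# Rows containing a link.
url_hits = list(filter_by_pattern(sample_rows, r"https?://\S+|www\.\S+"))
# Rows containing an immediately repeated word such as "to to to".
repeat_hits = list(filter_by_pattern(sample_rows, r"\b(\w+)(\s+\1\b)+"))
```

Running one pattern per noise type and counting the hits per dataset is one way a profile like the table above could be produced.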
Custom 9-step text cleaner built from scratch, with no off-the-shelf cleaner used:

clean_pipeline = [
    remove_emojis,               # Unicode emoji stripping
    remove_urls,                 # http/www links
    mask_emails,                 # → [EMAIL] token
    remove_mentions_hashtags,    # @user, #tag
    remove_parenthetical_noise,  # "(To self)", "(Thumbs up)"
    remove_repeated_tokens,      # "to to to" → "to"
    normalize_punctuation,       # "!!!" → "!"
    remove_garbage_symbols,      # non-ASCII noise
    clean_whitespace,            # canonical spacing
]

All three sources unified into one clean CSV, the foundation for everything downstream.
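The step functions themselves are not shown in the source. As a rough sketch under that caveat, four of the nine steps could be written with plain regular expressions and applied in list order; the regexes, the run_pipeline helper, and the sample sentence are assumptions made for illustration:

```python
import re
from functools import reduce
from typing import Callable, List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)
PUNCT_RE = re.compile(r"([!?.,])\1+")
SPACE_RE = re.compile(r"\s+")


def remove_urls(text: str) -> str:
    # Replace links with a space so neighbouring words do not fuse.
    return URL_RE.sub(" ", text)


def remove_repeated_tokens(text: str) -> str:
    # "to to to" -> "to"
    return REPEAT_RE.sub(r"\1", text)


def normalize_punctuation(text: str) -> str:
    # "!!!" -> "!"
    return PUNCT_RE.sub(r"\1", text)


def clean_whitespace(text: str) -> str:
    # Collapse runs of whitespace and trim both ends.
    return SPACE_RE.sub(" ", text).strip()


# A shortened pipeline; the real one has nine steps.
demo_pipeline: List[Callable[[str], str]] = [
    remove_urls,
    remove_repeated_tokens,
    normalize_punctuation,
    clean_whitespace,
]


def run_pipeline(text: str, steps: List[Callable[[str], str]]) -> str:
    """Apply each cleaning step in order, feeding the output to the next."""
    return reduce(lambda acc, step: step(acc), steps, text)


cleaned = run_pipeline(
    "I want to to to cancel!!!  see www.example.com now", demo_pipeline
)
```

Order matters in a design like this: whitespace normalization runs last so that gaps left behind by earlier removals are collapsed in one pass.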
The first step toward turning text into a graph: find the entities. Hybrid NER combines spaCy's statistical model with hand-crafted deterministic rules for telecom-specific concepts:
| Entity Type | Examples | Method |
|---|---|---|
| SERVICE | data plan, broadband, roaming, voicemail | EntityRuler patterns |
| PRODUCT | router, SIM card, iPhone, Android, modem | EntityRuler patterns |
| ISSUE | no signal, billing issue, network outage | EntityRuler patterns |
| ACTION | refund, reset, escalate, cancel, upgrade | EntityRuler patterns |
| ACCOUNT_ID | AB-12345, JKL87654321 | Regex [A-Z]{2,5}-?\d{4,} |
| PHONE_NUMBER | +1-800-XXX-XXXX | Regex (international) |
GPU acceleration via thinc.prefer_gpu() for efficient batch processing at scale.
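To make the regex side of the hybrid approach concrete, here is a small sketch that applies the ACCOUNT_ID pattern from the table above with the standard library. The find_account_ids helper and the span tuple layout are assumptions for illustration; the project's actual integration with spaCy is not shown in the source.

```python
import re
from typing import List, Tuple

# Pattern copied from the entity table: 2 to 5 capital letters,
# an optional hyphen, then at least 4 digits.
ACCOUNT_ID_RE = re.compile(r"[A-Z]{2,5}-?\d{4,}")


def find_account_ids(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, matched_text) for each account ID found."""
    return [
        (m.start(), m.end(), "ACCOUNT_ID", m.group())
        for m in ACCOUNT_ID_RE.finditer(text)
    ]


spans = find_account_ids("Accounts AB-12345 and JKL87654321 were both escalated.")
```

The pattern as written has no word boundaries, so it can also match inside a longer token; wrapping it in \b anchors would be a stricter variant.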
The crucial step: finding how entities connect. Dependency-parse-based rules walk every sentence looking for typed relationships between extracted entities. Each candidate is produced as a fully-traceable triple:
{
  "subject": "customer",
  "subject_type": "PERSON",
  "relation": "REPORTED",
  "object": "billing issue",
  "object_type": "ISSUE",
  "evidence_text": "The customer reported a billing issue on their account.",
  "confidence": 0.87,
  "extraction_method": "dep_rule",
  "source": "telecom_200k"
}

Typed entities, relation label, source sentence, confidence score, dataset origin: every triple in the graph is fully auditable. This is what makes the system trustworthy: every claim points back to a real sentence in real data.
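One way to keep such records consistent in code is a small typed container that refuses triples without evidence or with an out-of-range score. This is a sketch only: the Triple class and its checks are assumptions, with field names taken from the JSON record above.

```python
import json
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical container mirroring the fields of the JSON record."""

    subject: str
    subject_type: str
    relation: str
    object: str
    object_type: str
    evidence_text: str
    confidence: float
    extraction_method: str
    source: str

    def __post_init__(self) -> None:
        # Reject records that could not be audited later.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
        if not self.evidence_text.strip():
            raise ValueError("every triple must carry its source sentence")


raw = """
{
  "subject": "customer",
  "subject_type": "PERSON",
  "relation": "REPORTED",
  "object": "billing issue",
  "object_type": "ISSUE",
  "evidence_text": "The customer reported a billing issue on their account.",
  "confidence": 0.87,
  "extraction_method": "dep_rule",
  "source": "telecom_200k"
}
"""

triple = Triple(**json.loads(raw))
```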
The novel stage. Rule-based extraction is fast but noisy. So every candidate triple gets reviewed by Llama-3.1-8B-Instruct via a LangChain validation chain:
You are a knowledge graph validation assistant.
Your task:
- Decide if the relation is correct
- If incorrect, fix it
- If meaningless, reject it
- Normalize entity names
- Normalize relation label
Return ONLY valid JSON.
The LLM acts as a quality gate, catching errors the rule-based extractor cannot:
- Semantic nonsense (e.g. "ticket REPORTED_VIA ticket")
- Mis-typed arguments (e.g. VPN labeled as a device when it's a service)
- Ambiguous references and partial entity captures
Quality gate: only triples where valid=True AND llm_confidence ≥ 0.90 survive. The threshold is deliberately strict: the goal is a trustworthy graph, not a large one.
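The gate itself can be sketched in a few lines. The exact JSON schema returned by the validator is not shown in the source, so the reply shape below (keys valid and llm_confidence) and the helper names are assumptions; replies that are not valid JSON are treated as rejections.

```python
import json
from typing import Optional

CONFIDENCE_THRESHOLD = 0.90


def parse_verdict(raw_response: str) -> Optional[dict]:
    """Parse the validator's reply; return None if it is not a JSON object."""
    try:
        verdict = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    return verdict if isinstance(verdict, dict) else None


def passes_gate(
    verdict: Optional[dict], threshold: float = CONFIDENCE_THRESHOLD
) -> bool:
    """Keep a triple only if it is marked valid AND confident enough."""
    if verdict is None:
        return False
    confidence = verdict.get("llm_confidence")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return verdict.get("valid") is True and confidence >= threshold


# Made-up replies covering the accept, low-confidence, invalid, and non-JSON cases.
responses = [
    '{"valid": true, "llm_confidence": 0.95, "relation": "REPORTED"}',
    '{"valid": true, "llm_confidence": 0.72, "relation": "REPORTED"}',
    '{"valid": false, "llm_confidence": 0.99, "relation": "REPORTED_VIA"}',
    "Sure! Here is the JSON you asked for...",
]
kept = [r for r in responses if passes_gate(parse_verdict(r))]
```

Failing closed on unparseable replies matches the precision-first stance described above: a triple the validator could not clearly approve never reaches the graph.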
Validated triples are loaded into Neo4j Aura with typed node labels and confidence-weighted edges:

MERGE (a:Entity {name: $subject})
MERGE (b:Entity {name: $object})
MERGE (a)-[rel:REPORTED {confidence: $conf}]->(b)

Once loaded, the graph reveals patterns that were invisible in the raw text: which devices co-occur with which issues, which agents handle which escalations, how billing amounts cluster across accounts. The graph is also exported to NetworkX for local visualization and statistical analysis.
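The MERGE statement above hard-codes the REPORTED relation. In classic Cypher a relationship type cannot be supplied as a plain $parameter, so one possible approach is to build the query text per relation after checking the label against a strict whitelist pattern. The build_merge helper below is an assumed sketch, not the project's loader; it only constructs the query and parameters and needs no database connection.

```python
import re
from typing import Any, Dict, Tuple

# Conservative whitelist: upper-case letters, digits, underscores.
SAFE_RELATION = re.compile(r"[A-Z][A-Z0-9_]*")


def build_merge(triple: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (cypher_query, parameters) for one validated triple."""
    relation = str(triple["relation"])
    # The label is interpolated into the query text, so validate it first.
    if not SAFE_RELATION.fullmatch(relation):
        raise ValueError(f"unsafe relation label: {relation!r}")
    query = (
        "MERGE (a:Entity {name: $subject}) "
        "MERGE (b:Entity {name: $object}) "
        f"MERGE (a)-[rel:{relation}]->(b) "
        "SET rel.confidence = $conf, rel.evidence = $evidence"
    )
    params = {
        "subject": triple["subject"],
        "object": triple["object"],
        "conf": triple["confidence"],
        "evidence": triple["evidence_text"],
    }
    return query, params


example_query, example_params = build_merge(
    {
        "subject": "customer",
        "relation": "REPORTED",
        "object": "billing issue",
        "confidence": 0.95,
        "evidence_text": "The customer reported a billing issue on their account.",
    }
)
```

The resulting pair could then be handed to the neo4j Python driver, for example via session.run(example_query, example_params). This sketch sets the confidence with SET rather than inside the MERGE pattern, a deliberate difference from the snippet above: it keeps one edge per subject, relation, and object even if the same fact is loaded again with a different score.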
The chatbot doesn't generate answers; it reads the graph and reports what it finds. Three steps:
- Question → Cypher: natural language mapped to graph traversal
- Graph → Context: Neo4j returns relevant (subject)-[relation]→(object) triples + evidence sentences
- Context → Insight: the LLM is strictly constrained to answer from those triples
prompt = f"""
You are a customer support assistant.
Answer ONLY using these facts:
{context}
Question: {question}
"""

The "Answer ONLY using these facts" constraint is what keeps answers faithful to the retrieved triples, including the explicit "I don't know" when the graph lacks evidence. The user gets the pattern; the LLM does the reading; the graph supplies the evidence.
The graph is backed by a formal OWL ontology (ontology.owl) that defines its semantic structure:
- Classes: Customer, Service, Issue, Device, Action, Account, Ticket, Agent, Team
- Object Properties: hasIssue, requestedAction, usesService, reportedBy, escalatedTo, memberOf
- Data Properties: confidence scores, timestamps, source dataset, evidence text

The ontology ensures the graph is semantically consistent, extensible, and interoperable with other knowledge systems, and it gives the LLM a stable vocabulary to reason about.
git clone https://github.com/houdhoudGH/Customer_Service_ChatBot.git
cd Customer_Service_ChatBot
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
streamlit run app.py

The Streamlit demo runs without any external services: no Neo4j credentials or LLM tokens are required. It replays the question/answer pairs from the original Neo4j Aura pipeline.
To run the full extraction and validation chain against your own data:
export NEO4J_URI="neo4j+s://your-instance.databases.neo4j.io"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="your-password"
export HUGGINGFACEHUB_API_TOKEN="hf_..."

Run the notebooks in order:
1_load_data → 2_explore_data → 3_preprocessing → 4_ner_extraction → 5_relation_extraction → 6_relation_cleaning → 7_neo4j_RAG
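Inside a notebook, the four environment variables exported above could be read and checked up front, so a missing credential fails early instead of halfway through a run. The load_settings helper and the placeholder values are assumptions for illustration:

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = (
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "HUGGINGFACEHUB_API_TOKEN",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the required variables, failing fast if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Placeholder values so the sketch runs without real credentials.
example_env = {
    "NEO4J_URI": "neo4j+s://your-instance.databases.neo4j.io",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "your-password",
    "HUGGINGFACEHUB_API_TOKEN": "placeholder-token",
}
settings = load_settings(example_env)
```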
| Layer | Technology |
|---|---|
| NLP & NER | spaCy en_core_web_lg, EntityRuler, custom regex |
| LLM Validation | Llama-3.1-8B-Instruct (HuggingFace Inference API) |
| LLM Orchestration | LangChain (LLMChain, SequentialChain, PromptTemplate) |
| Graph Database | Neo4j Aura + neo4j Python driver |
| Graph Analysis | NetworkX, Matplotlib |
| Data Processing | Pandas, NumPy |
| GPU Acceleration | CUDA via thinc / PyTorch |
| Ontology | OWL / Protégé |
| Demo UI | Streamlit |
Customer_Service_ChatBot/
├── app.py                               # Streamlit demo
├── ontology.owl                         # Formal OWL knowledge graph ontology
├── requirements.txt
├── data/
│   ├── raw/                             # Original datasets (git-ignored)
│   └── processed/
│       ├── all_clean_for_ner.csv
│       ├── relations_extraction.csv     # 98K extracted triples
│       └── relations_llm_validated.csv  # 1.1K LLM-validated triples
├── notebooks/
│   ├── 1_load_data.ipynb                # Multi-source ingestion + stratified sampling
│   ├── 2_explore_data.ipynb             # EDA + noise profiling across 3 datasets
│   ├── 3_preprocessing.ipynb            # 9-step custom cleaning pipeline
│   ├── 4_ner_extraction.ipynb           # Hybrid NER (spaCy + EntityRuler + regex)
│   ├── 5_relation_extraction.ipynb      # Relation mining → typed triples
│   ├── 6_relation_cleaning.ipynb        # Dedup + Llama-3.1 LLM validation loop
│   └── 7_neo4j_RAG.ipynb                # Graph loading + GraphRAG analyst
└── docs/
    └── images/                          # Diagrams and demo screenshots
The pipeline is production-ready for static knowledge graphs. Five directions for extension:
- Domain-adapted NER: fine-tune spaCy on annotated telecom data for higher entity recall
- Temporal reasoning: extend the ontology with time-indexed relations (e.g. issue resolved on date) to enable trend analysis
- Community detection: Louvain clustering to surface natural issue groups and hidden subnetworks the LLM can summarize
- Graph embeddings: Node2Vec embeddings for similarity-based retrieval, enabling "find tickets like this one" queries
- Retrieval benchmarking: head-to-head comparison of GraphRAG vs. vanilla vector RAG on a held-out Q&A evaluation set
MIT. See LICENSE for details.
Master 2 Data Science & NLP · AI Engineer
This project explores knowledge graphs, LLM-in-the-loop validation, and faithful RAG,
demonstrating how structured extraction can surface hidden patterns from unstructured text,
and how an LLM can act as an analyst that reads the graph so you don't have to.
spaCy · Neo4j · LangChain · HuggingFace · Llama 3.1 · NetworkX · Streamlit · OWL
⭐ If you found this useful, consider giving the repo a star
