Welcome to the GNN-based Fake News Detection Challenge! This competition focuses on detecting fake news propagation on Twitter using Graph Neural Networks (GNNs).
⭐📊 Live Leaderboard 📊⭐
Real_Or_Fake/
├── data/
│ ├── public/
│ │ ├── A.txt
│ │ ├── new_bert_feature.npz
│ │ ├── new_spacy_feature.npz
│ │ ├── new_profile_feature.npz
│ │ ├── node_graph_id.npy
│ │ ├── train_idx.npy
│ │ ├── train_labels.csv
│ │ ├── val_idx.npy
│ │ ├── val_labels.csv
│ │ ├── test_idx.csv
│ │ └── test_idx.npy
│ └── test_labels.csv (kept private, restored via GitHub Secret)
│
├── submissions/
│ ├── sample_submission/
│ │ └── predictions.csv
│ └── inbox/
│ └── team_name/
│ └── run_name/
│ ├── metadata.json
│ └── predictions.csv.gpg
│
├── competition/
│ ├── metrics.py
│ └── scoring_script.py
│
├── docs/
│ ├── leaderboard.css
│ ├── leaderboard.csv
│ ├── leaderboard.html
│ └── leaderboard.js
│
├── key/
│ └── competition_public_key.asc
│
├── models/
│ ├── model.py
│ └── saved_model.model
│
├── dataloader.py
├── evaluate.py
├── test.py
├── train.py
├── update_leaderboard.py
├── validate_submission.py
├── requirements.txt
├── README.md
└── LICENSE
The dataset used in this repository is from the paper:
Dou, Y., Shu, K., Xia, C., Yu, P. S., & Sun, L. (2021). User Preference-aware Fake News Detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), pp. 2051–2055. https://doi.org/10.1145/3404835.3462990
This competition uses the GossipCop dataset, which contains Twitter news propagation graphs. Each graph represents the spread of a single news article: the article is the root node, and the users who engaged with it are the child nodes.
- Nodes: represent either the news article or a user who interacted with it.
- Edges: represent interactions or retweets between nodes. Only edges connecting nodes in the same graph are used for that graph.
- The root node corresponds to the news article itself.
- Child nodes correspond to users who retweeted or engaged with the news.
Graphs are used as input to Graph Neural Networks (GNNs) to classify news as Real (0) or Fake (1).
The dataset is split into public and private parts:
- Public: Available to participants for training, validation, and testing.
- Private: Hidden labels used for submission and leaderboard evaluation.
Graph visualization generated using ChatGPT
The graph connectivity and graph assignment information are stored in the following files:
- `A.txt`
  Contains all edges in the dataset. Each row is an edge represented by two node IDs (source and target).
  Type: Integer array, shape `(num_edges, 2)`
- `node_graph_id.npy`
  Maps each node to its corresponding graph. The value at index `i` indicates the graph ID of node `i`.
  Type: Integer array, shape `(num_nodes,)`
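As a rough illustration of how these two files fit together, the sketch below groups a global edge list into per-graph edge lists using the node-to-graph mapping. It runs on plain Python lists so that it works without the data files; in practice `edges` would be read from `A.txt` and `node_graph_id` from `node_graph_id.npy`. The helper name `group_edges_by_graph`, the 0-based node IDs, and the toy values are assumptions made for illustration, not part of the repository.

```python
from collections import defaultdict


def group_edges_by_graph(edges, node_graph_id):
    """Group (source, target) edges by the graph both endpoints belong to.

    Edges whose endpoints fall in different graphs are skipped.
    """
    per_graph = defaultdict(list)
    for src, dst in edges:
        g_src, g_dst = node_graph_id[src], node_graph_id[dst]
        if g_src == g_dst:
            per_graph[g_src].append((src, dst))
    return dict(per_graph)


# Toy data (assumed): nodes 0-2 belong to graph 0, nodes 3-4 to graph 1.
node_graph_id = [0, 0, 0, 1, 1]
edges = [(0, 1), (0, 2), (3, 4)]
edges_by_graph = group_edges_by_graph(edges, node_graph_id)
print(edges_by_graph)
```

Grouping once up front avoids rescanning the full edge list for every graph when building per-graph inputs.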
Each node in the graph has text embeddings and optionally user profile features.
- BERT embeddings: 768-dim vectors representing the content of the news or user historical tweets.
  File: `new_bert_feature.npz`
- spaCy embeddings: 300-dim vectors representing the content of the news or user historical tweets.
  File: `new_spacy_feature.npz`
The user profile features are derived from the Twitter user object using the Twitter API:
- Verified? (`0` or `1`)
- Geo-spatial enabled? (`0` or `1`)
- Number of followers
- Number of friends
- Status/tweet count
- Number of favorites
- Number of lists the user is part of
- Account age (months since Twitter launch)
- Number of words in the user’s name
- Number of words in the user’s description
File: new_profile_feature.npz
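The profile features mix binary flags with counts that can span several orders of magnitude, so some rescaling is usually helpful before feeding them to a model. The following is a minimal sketch of one common option (log-transform the count columns, then standardize each column). The function name `scale_profile_features`, the column layout of the toy rows, and the choice of transform are assumptions; the baseline does not use these features, and the repository may handle them differently.

```python
import math


def scale_profile_features(rows, count_columns):
    """Apply log1p to count-like columns, then standardize every column."""
    transformed = [
        [math.log1p(v) if j in count_columns else float(v) for j, v in enumerate(row)]
        for row in rows
    ]
    n_rows = len(transformed)
    n_cols = len(transformed[0])
    scaled = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [row[j] for row in transformed]
        mean = sum(column) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_rows)
        for i in range(n_rows):
            # Constant columns are mapped to 0 to avoid division by zero.
            scaled[i][j] = (column[i] - mean) / std if std > 0 else 0.0
    return scaled


# Toy rows with an assumed layout: [verified, followers, friends]
rows = [[0, 10, 5], [1, 1000, 50], [0, 100000, 500]]
scaled = scale_profile_features(rows, count_columns={1, 2})
```

Statistics should be computed on training nodes only if the scaled features are later used for validation or test graphs.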
- `train_idx.npy`
  Contains the list of 3826 graph IDs used for training.
  Type: Integer array, shape `(num_train_graphs,)`
- `val_idx.npy`
  Contains the list of 546 graph IDs used for validation.
  Type: Integer array, shape `(num_val_graphs,)`
- `test_idx.npy`
  Contains the list of 1092 graph IDs used for testing. Labels are hidden in the private folder for competition evaluation.
  Type: Integer array, shape `(num_test_graphs,)`
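Before training, it can be worth confirming that the three splits do not share graph IDs and that their sizes match the numbers above. The sketch below shows such a check; `check_splits` is a hypothetical helper, and the contiguous ID ranges are stand-ins with the documented sizes (the real IDs come from the `.npy` files and are not necessarily contiguous).

```python
def check_splits(train_ids, val_ids, test_ids):
    """Return the split sizes after verifying that no graph ID is shared."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    if train & val or train & test or val & test:
        raise ValueError("splits overlap")
    return {"train": len(train), "val": len(val), "test": len(test)}


# Stand-in IDs with the documented split sizes (assumed to be contiguous here).
train_ids = range(0, 3826)
val_ids = range(3826, 3826 + 546)
test_ids = range(3826 + 546, 3826 + 546 + 1092)

sizes = check_splits(train_ids, val_ids, test_ids)
total = sum(sizes.values())
print(sizes, total)
```

The three sizes add up to 5,464, which matches the total number of graphs reported for GossipCop.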
Each graph in the dataset has a label indicating whether the news is real or fake:
- `0` → Real news
- `1` → Fake news
Graph labels are stored separately for different splits:
- Training labels: `train_labels.csv`
- Validation labels: `val_labels.csv`
- Test labels (hidden for competition evaluation): `test_labels.csv`

Each CSV file contains two columns:
- `id` → Graph ID
- `y_true` → Label (0 or 1)
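A label file with these two columns can be read into a dictionary keyed by graph ID. The sketch below uses an in-memory stand-in for `train_labels.csv` so that it runs without the data; it assumes a comma-separated file with a header row naming the columns `id` and `y_true`. The helper name `read_labels` and the sample rows are illustrative only.

```python
import csv
import io


def read_labels(csv_file):
    """Read a labels CSV with columns id and y_true into {graph_id: label}."""
    reader = csv.DictReader(csv_file)
    return {int(row["id"]): int(row["y_true"]) for row in reader}


# In-memory stand-in for train_labels.csv (made-up rows).
sample = io.StringIO("id,y_true\n12,0\n57,1\n103,1\n")
labels = read_labels(sample)
fake_ratio = sum(labels.values()) / len(labels)
print(labels, fake_ratio)
```

With the real file, the same function could be called as `read_labels(open("data/public/train_labels.csv", newline=""))`.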
Here are some key statistics for the news propagation graphs in the competition datasets:
| Dataset | #Graphs (Fake) | #Total Nodes | #Total Edges | Avg. Nodes per Graph |
|---|---|---|---|---|
| GossipCop (GOS) | 5,464 (2,732) | 314,262 | 308,798 | 58 |
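The figures in the table can be cross-checked with a few lines of arithmetic. The snippet only uses the numbers reported above. The observation that the edge count equals nodes minus graphs is consistent with each propagation graph being a tree, but that interpretation is an inference and is not stated by the source.

```python
num_graphs, num_fake = 5464, 2732
num_nodes, num_edges = 314262, 308798

avg_nodes = num_nodes / num_graphs        # about 57.5, reported as 58
fake_share = num_fake / num_graphs        # share of graphs labeled fake
edges_if_trees = num_nodes - num_graphs   # a tree with n nodes has n - 1 edges

print(round(avg_nodes, 1), fake_share, edges_if_trees == num_edges)
```

The classes are balanced (half of the graphs are fake), so accuracy is a reasonable headline metric for this task.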
Task: Classify each news propagation graph as real or fake.
The baseline model is a Graph Neural Network (GNN) for fake news detection implemented in model.py. Its main components:
- Graph Attention Layers (GAT): 3 layers to learn node embeddings from the propagation graph.
- Global Max Pooling: Aggregates node embeddings to a single graph-level representation.
- Root Node Transformation: Linear layer processes the root node (news article) features.
- Concatenation & Output: Combines graph representation and root node features, then passes through a linear layer with sigmoid to predict fake/real news.
Features used in baseline:
- spaCy Text embeddings of news and historical user tweets
Output:
- Probability that a news graph is fake.
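To make the readout stage concrete, the sketch below walks through global max pooling, concatenation with the root node vector, and a sigmoid output on a toy graph. It is written in plain Python for readability and is not the code in model.py: the GAT layers and the learned root transformation are omitted, the weights are untrained placeholders, and all function names are assumptions.

```python
import math


def global_max_pool(node_embeddings):
    """Element-wise maximum over all node embeddings of one graph."""
    return [max(column) for column in zip(*node_embeddings)]


def graph_representation(node_embeddings, root_features):
    """Concatenate the pooled graph vector with the root node vector."""
    return global_max_pool(node_embeddings) + list(root_features)


def predict_fake_probability(vector, weights, bias):
    """Linear layer followed by a sigmoid, giving P(fake)."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy graph: 3 nodes with 2-dim embeddings (stand-ins for GAT outputs).
node_embeddings = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
root_features = [0.7, 0.0]

pooled = global_max_pool(node_embeddings)
combined = graph_representation(node_embeddings, root_features)
probability = predict_fake_probability(combined, weights=[0.0] * 4, bias=0.0)
print(pooled, combined, probability)
```

Max pooling keeps the strongest activation per dimension regardless of graph size, which is convenient when graphs vary widely in node count.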
Follow these steps to replicate the baseline results and build your own implementation.
git clone https://github.com/TugaAhmed/Real_Or_Fake.git
cd Real_Or_Fake
Create a virtual environment and install the required dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Download the public dataset ZIP file from this link.
Extract the contents into the data/public folder so that the folder structure looks like this:
├── data/
│ ├── public/
│ │ ├── A.txt
│ │ ├── new_bert_feature.npz
│ │ ├── new_spacy_feature.npz
│ │ ├── new_profile_feature.npz
│ │ ├── node_graph_id.npy
│ │ ├── train_idx.npy
│ │ ├── train_labels.csv
│ │ ├── val_idx.npy
│ │ ├── val_labels.csv
│ │ ├── test_idx.csv
│ │ └── test_idx.npy
Run the training script to train the GNN on the dataset:
python train.py
This will train the model and generate saved_model.model in the models/ folder.
The saved model corresponds to the one with the best validation accuracy.
Metrics tracked during training: Accuracy and F1 score.
After training, run the test script to generate predictions:
python test.py
This will create a predictions.csv file inside the submissions/ folder.
The CSV contains two columns: id, y_pred
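For participants writing their own inference code, a predictions file with these two columns can be produced with the standard `csv` module. This is a minimal sketch that writes to an in-memory buffer; the helper name `write_predictions`, the graph IDs, and the scores are made up, and test.py may format its output differently.

```python
import csv
import io


def write_predictions(out_file, graph_ids, scores):
    """Write one row per test graph with the columns id and y_pred."""
    if len(graph_ids) != len(scores):
        raise ValueError("graph_ids and scores must have the same length")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["id", "y_pred"])
    for graph_id, score in zip(graph_ids, scores):
        writer.writerow([graph_id, score])


# Made-up IDs and scores, written to an in-memory buffer.
buffer = io.StringIO()
write_predictions(buffer, [4372, 4373], [0.91, 0.08])
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass `open("predictions.csv", "w", newline="")` instead of the buffer.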
You can evaluate your predictions using the evaluation script:
python evaluate.py
- The file test_labels.csv is not publicly available.
- The evaluation script is provided only to demonstrate how scoring works.
- The script will run successfully only when the ground-truth labels are available.
- Final scoring is performed on the competition server after submission.
Metrics reported include:
- Accuracy
- F1 Score
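Both metrics can be computed locally on the validation set to estimate leaderboard performance. The sketch below assumes predictions are probabilities thresholded at 0.5 and that F1 is computed for the fake class (label 1); the scoring in competition/metrics.py may use different conventions. The function name `accuracy_and_f1` and the toy values are illustrative.

```python
def accuracy_and_f1(y_true, y_score, threshold=0.5):
    """Return (accuracy, F1 for the positive class) after thresholding scores."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    # F1 is defined as 0 here when there are no positives at all.
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy validation labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.3, 0.8, 0.6]
accuracy, f1 = accuracy_and_f1(y_true, y_score)
print(accuracy, f1)
```

In this toy case three of five predictions are correct, with two true positives, one false positive, and one false negative.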
Your goal is to beat the baseline accuracy on the fake news detection task using the provided Twitter news propagation graphs.
- Uses only spaCy text embeddings of news articles and user historical tweets.
- Achieves:
- Accuracy: 0.7216
- F1 score: 0.7071
- Build a Graph Neural Network (GNN) based pipeline.
- Use any combination of available features:
  - spaCy text embeddings (baseline feature)
  - BERT embeddings (`new_bert_feature.npz`)
  - User profile features (`new_profile_feature.npz`)
- Train your model on the public training data and validate on the validation set.
- Generate predictions for the test set as `predictions.csv`.
- Your model must use a GNN; other models alone will not be accepted.
- You may combine features in any way to improve performance.
- The objective is to maximize accuracy on the hidden test set.
Follow these steps to participate in the competition and submit your results.
- Use the public dataset in `data/public/`.
- Train your model using your own implementation.
- Generate predictions for the test set.
- Create a `predictions.csv` file locally.
Each submission must include:
predictions.csv must contain exactly two columns:
| Column | Description |
|---|---|
| `id` | Graph identifier (must exactly match public test IDs) |
| `y_pred` | Predicted probability or score |
Incorrect formatting will cause automatic validation failure.
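A quick local check before encrypting can catch most formatting problems. The sketch below verifies the header, the column count, numeric values, and that the IDs match the expected test IDs. It is not the repository's validate_submission.py, whose exact rules are not shown here; `validate_predictions` and the sample inputs are assumptions.

```python
import csv
import io


def validate_predictions(csv_file, expected_ids):
    """Return a list of problems; an empty list means the file looks valid."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header != ["id", "y_pred"]:
        return [f"unexpected header: {header}"]
    problems, seen = [], []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns")
            continue
        try:
            seen.append(int(row[0]))
            float(row[1])
        except ValueError:
            problems.append(f"line {line_no}: non-numeric value")
    if sorted(seen) != sorted(expected_ids):
        problems.append("ids do not match the test IDs")
    return problems


# Made-up test IDs; real ones come from test_idx.npy or test_idx.csv.
good = io.StringIO("id,y_pred\n7,0.2\n9,0.8\n")
bad = io.StringIO("id,y_pred\n7,0.2\n")
good_result = validate_predictions(good, [7, 9])
bad_result = validate_predictions(bad, [7, 9])
print(good_result, bad_result)
```

Comparing sorted ID lists also flags duplicated or missing rows, not only unknown IDs.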
🚫 Do NOT upload predictions.csv to the repository.
Before submission, you must encrypt your predictions.csv using the competition public key located in:
key/competition_public_key.asc
Run the following commands in bash:
# Import the public key
gpg --import competition_public_key.asc
# Encrypt predictions file
gpg --output predictions.csv.gpg \
--encrypt \
--recipient "GNN competition (Real or Fake) " \
predictions.csv
This will generate:
predictions.csv.gpg
✅ Only this encrypted .gpg file is allowed for submission.
{
"team": "example_team",
"run_id": "example_run_id",
"type": "human",
"model": "GAT",
"notes": "Additional notes"
}
In metadata.json, type must be one of:
- "human"
- "llm-only"
- "human+llm"
Your Pull Request must add files in the following structure:
submissions/inbox/<team_name>/<run_id>/
├── predictions.csv.gpg
└── metadata.json
Example:
submissions/inbox/team_alpha/run_01/
├── predictions.csv.gpg
└── metadata.json
🚫 Uploading predictions.csv will result in automatic rejection.
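The metadata rules and the directory layout can be checked together before opening a Pull Request. The sketch below validates a metadata.json string and derives the expected submission folder from it. It assumes that all five keys from the example are required and that the folder names match the `team` and `run_id` values; neither is confirmed by the source, and the helper names are hypothetical.

```python
import json
from pathlib import PurePosixPath

REQUIRED_KEYS = {"team", "run_id", "type", "model", "notes"}  # assumed required
ALLOWED_TYPES = {"human", "llm-only", "human+llm"}


def check_metadata(text):
    """Return a list of problems found in a metadata.json string."""
    meta = json.loads(text)
    problems = []
    missing = REQUIRED_KEYS - set(meta)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if meta.get("type") not in ALLOWED_TYPES:
        problems.append(f"invalid type: {meta.get('type')!r}")
    return problems


def submission_dir(meta):
    """Build submissions/inbox/<team>/<run_id> from the metadata fields."""
    return PurePosixPath("submissions/inbox") / meta["team"] / meta["run_id"]


example = '{"team": "team_alpha", "run_id": "run_01", "type": "human", "model": "GAT", "notes": ""}'
problems = check_metadata(example)
target = submission_dir(json.loads(example))
print(problems, target)
```

The resulting folder should contain only `predictions.csv.gpg` and `metadata.json`.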
- Fork the repository.
- Add your metadata and encrypted submission files in the correct directory.
- Open a Pull Request (PR) to the main repository.
When the Pull Request is opened:
- Submission format is validated
- The encrypted file is securely decrypted by the competition server
- Predictions are scored using hidden test labels
- Score is posted automatically as a PR comment
- Invalid submissions fail automatically
- Your score is appended to `docs/leaderboard.csv`
- The leaderboard page is automatically updated
This competition is part of the BASIRA-LAB (GNNs) for Rising Stars Mentorship Program.
The lab’s tutorials on Deep Graph Learning served as guidance for preparing this challenge: Tutorials Link
