Welcome to the Sanad Citation Network Challenge! This challenge focuses on node classification in a citation network using Graph Neural Networks (GNNs).
Participants are asked to predict the category of scientific papers based solely on the citation graph structure and node features.
Dataset used: PubMed Citation Network (standard benchmark in GNN literature)
Scientific papers cite other papers, forming a graph where:
| Element | Description |
|---|---|
| Nodes | represent papers |
| Edges | represent citation links |
| Node features | represent paper content (bag-of-words) |
| Node labels | represent disease categories |
Your Task
Given a paper and its citation neighborhood, predict the disease category of the paper.
This is a semi-supervised node classification task:
- Only a subset of nodes is labeled for training.
- The model must generalize to unseen nodes.
Each paper belongs to one of 3 categories:
| Label | Category |
|---|---|
| 0 | Diabetes Mellitus Experimental |
| 1 | Diabetes Mellitus Type 1 |
| 2 | Diabetes Mellitus Type 2 |
- Graph Dependency: Papers are not independent; predictions rely on citation neighborhoods.
- Homophily Bias: Papers often cite papers from the same domain, but not always.
- Limited Labels: Only a portion of nodes are labeled for training (60%).
- Over-smoothing Risk: Deeper GNNs may degrade performance if not carefully designed.
- No External Information: Only graph structure and node features are allowed.
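The homophily point above can be made concrete by measuring how often a citation link connects two papers with the same label. The following is a minimal sketch on made-up toy data; the function name `edge_homophily` and the toy graph are illustrative assumptions, not part of the starter code.

```python
from typing import List, Tuple


def edge_homophily(edges: List[Tuple[int, int]], labels: List[int]) -> float:
    """Return the fraction of edges whose two endpoints share the same label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for src, dst in edges if labels[src] == labels[dst])
    return same / len(edges)


# Toy graph (made-up data): 5 papers, 5 citation links.
toy_labels = [0, 0, 1, 1, 2]
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]

homophily = edge_homophily(toy_edges, toy_labels)
print(f"edge homophily: {homophily:.2f}")
```

A value close to 1 means neighbors mostly share a label, which is the regime where neighborhood averaging in a GCN tends to help. On the real data, the edge list and labels would come from edges.pt and labels.pt.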
PubMed Citation Network Statistics
| Property | Value |
|---|---|
| Nodes | 19,717 papers |
| Edges | 88,648 citation links |
| Node Features | 500 |
| Classes | 3 |
| Graph Type | Homogeneous |
The dataset is provided as preprocessed PyTorch tensors:

    import torch

    # Load raw data
    features = torch.load('data/raw/features.pt')  # [19717, 500]
    edges = torch.load('data/raw/edges.pt')        # [2, 88648]
    labels = torch.load('data/raw/labels.pt')      # [19717]

    # Load masks
    train_mask = torch.load('data/splits/train_mask.pt')
    val_mask = torch.load('data/splits/val_mask.pt')
    test_mask = torch.load('data/splits/test_mask.pt')

Using features.pt, edges.pt, labels.pt, and masks:
1. Makes your GNN easy to implement
2. Efficient in memory
3. Compatible with PyG functions
4. Handles semi-supervised masking cleanly
5. Supports batching / inductive learning
6. Improves reproducibility and standardization
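After loading, a quick sanity check can catch shape or index mismatches early. This is a sketch under assumptions: the helper name `check_graph_tensors` is hypothetical, the expected shapes are the ones given in the comments above, and the `int64` edge dtype is assumed because PyG layers expect integer indices. The toy tensors stand in for the real files.

```python
import torch


def check_graph_tensors(features, edges, labels, num_classes=3):
    """Run basic shape and range checks; return (num_nodes, num_edges)."""
    num_nodes = features.shape[0]
    assert features.dim() == 2, "features should be [num_nodes, num_features]"
    assert labels.shape == (num_nodes,), "expected one label per node"
    assert edges.dim() == 2 and edges.shape[0] == 2, "edges should be [2, num_edges]"
    assert edges.dtype == torch.long, "edge indices are assumed to be int64"
    # Every edge endpoint must be a valid node index.
    assert int(edges.min()) >= 0 and int(edges.max()) < num_nodes
    # Every label must be a valid class id.
    assert int(labels.min()) >= 0 and int(labels.max()) < num_classes
    return num_nodes, edges.shape[1]


# Made-up toy tensors with the same layout as features.pt, edges.pt, labels.pt.
toy_features = torch.rand(6, 4)
toy_edges = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
toy_labels = torch.tensor([0, 1, 2, 0, 1, 2])

num_nodes, num_edges = check_graph_tensors(toy_features, toy_edges, toy_labels)
print(num_nodes, num_edges)
```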
| Split | Nodes | Ratio |
|---|---|---|
| Training | 11,829 | 60% |
| Validation | 3,942 | 20% |
| Test | 3,946 | 20% |
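The three split sizes in the table add up to 19,717, which suggests that every node belongs to exactly one split. A small check like the one below can confirm this for a set of masks. It is a sketch: `summarize_splits` is a hypothetical helper, and the toy masks are made up. Real masks loaded with `torch.load` could be passed in after calling `.tolist()`.

```python
from typing import Dict, Sequence


def summarize_splits(
    train: Sequence[bool], val: Sequence[bool], test: Sequence[bool]
) -> Dict[str, float]:
    """Verify each node is in exactly one split, then return the split ratios."""
    n = len(train)
    if len(val) != n or len(test) != n:
        raise ValueError("all masks must have one entry per node")
    for i in range(n):
        if int(train[i]) + int(val[i]) + int(test[i]) != 1:
            raise ValueError(f"node {i} must belong to exactly one split")
    return {
        "train": sum(map(int, train)) / n,
        "val": sum(map(int, val)) / n,
        "test": sum(map(int, test)) / n,
    }


# Made-up masks for 10 nodes with a 60/20/20 split.
toy_train = [True] * 6 + [False] * 4
toy_val = [False] * 6 + [True] * 2 + [False] * 2
toy_test = [False] * 8 + [True] * 2

ratios = summarize_splits(toy_train, toy_val, toy_test)
print(ratios)
```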
The challenge uses private files for evaluation:
- Submission file: submission.private.csv
- Test labels: test_labels_hidden.private.pt
These files are hosted on Hugging Face and linked to this repository.
Private files are used internally for scoring and leaderboard computation; they are not included in the repository.
- Data Processing: Load node features, adjacency information, and graph structure.
- Model Training:
  - GCN model with hidden layers (ReLU + dropout)
  - Output layer computes log softmax for node classification
- Submissions are scored using the private test labels.
- The scoring workflow uses Hugging Face to download the private files securely.
Private File Handling (Hugging Face)
The private files are downloaded automatically using a Hugging Face token (HF_TOKEN) in the CI workflow:
- submission.private.csv → private submission for evaluation
- test_labels_hidden.private.pt → private labels for scoring
This ensures reproducible evaluation while keeping sensitive data private.
| Metric | Split | Purpose | Definition |
|---|---|---|---|
| Accuracy | train, val, test | Overall node prediction correctness | Fraction of correctly predicted nodes out of all nodes in the split: Accuracy = #correct_predictions / total_nodes |
| Weighted F1 | train, val, test | Node performance adjusted for class sizes | Per-class F1 = 2 * (Precision * Recall) / (Precision + Recall), the harmonic mean of precision and recall, averaged across classes with each class weighted by its number of true nodes (support) |
| Loss (NLL) | train | Optimization, early stopping | Negative log-likelihood loss for classification: NLL Loss = - (1/N) * sum_i log(y_hat_i[y_i]), where y_hat_i[y_i] is the predicted probability of the true class y_i of node i; applied only on training nodes |
| Classification report | test | Detailed per-class performance | Shows precision, recall, F1-score, and support for each class separately; provides fine-grained insight for imbalanced classes. |
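The definitions in the table can be written out in plain Python to make them concrete. This is only a sketch for illustration: the function names are hypothetical, the toy labels are made up, and it is not necessarily how scoring_script.py computes its metrics.

```python
import math
from collections import Counter
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of nodes whose predicted class equals the true class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_cls * f1
    return total / len(y_true)


def nll_loss(y_true: Sequence[int], log_probs: List[List[float]]) -> float:
    """Mean negative log-probability assigned to each node's true class."""
    return -sum(row[t] for t, row in zip(y_true, log_probs)) / len(y_true)


# Made-up labels and predictions for 6 nodes and 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
uniform = [[math.log(1 / 3)] * 3 for _ in y_true]  # a model that guesses uniformly

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
loss = nll_loss(y_true, uniform)
print(f"accuracy={acc:.4f} weighted_f1={wf1:.4f} nll={loss:.4f}")
```

For the uniform model the loss equals log(3), which is a useful reference point: a trained model should reach a lower training loss than that.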
Baseline Performance
| Model | Accuracy |
|---|---|
| Random Guess | ~33% |
| 2-layer GCN | ~78% |
| Target | 80%+ |
To ensure fairness and pedagogical value:
- No External Data: Use only the PubMed dataset.
- Graph Features Only: No handcrafted or text-based features beyond provided node features.
- CPU Training Only: Models must be trainable on CPU.
- Must Use GNNs: At least one message-passing layer (GCN, GraphSAGE, GAT, etc.).

Install the dependencies and train the baseline:

    pip install -r starter_code/requirements.txt
    python starter_code/baseline.py --epochs 200

This trains a 2-layer GCN model, saves the best checkpoint to results/best_model.pt, and generates submissions/submission.private.csv.

Generate a submission from a saved checkpoint and score it:

    python starter_code/generate_submission.py --model-path results/best_model.pt
    python scoring_script.py submissions/submission.private.csv

Baseline GCN model:

    import torch.nn as nn
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv


    class GCN(nn.Module):
        """Multi-layer GCN for node classification (expects num_layers >= 2)."""

        def __init__(self, num_features, hidden_channels, num_classes, num_layers=2, dropout=0.5):
            super().__init__()
            self.convs = nn.ModuleList()
            # Input layer: node features -> hidden representation
            self.convs.append(GCNConv(num_features, hidden_channels))
            # Optional intermediate hidden layers
            for _ in range(num_layers - 2):
                self.convs.append(GCNConv(hidden_channels, hidden_channels))
            # Output layer: hidden representation -> class scores
            self.convs.append(GCNConv(hidden_channels, num_classes))
            self.dropout = dropout
            self.num_layers = num_layers

        def forward(self, x, edge_index):
            # Hidden layers: graph convolution, ReLU, then dropout
            for i in range(self.num_layers - 1):
                x = self.convs[i](x, edge_index)
                x = F.relu(x)
                x = F.dropout(x, p=self.dropout, training=self.training)
            # Output layer: log-probabilities over classes
            x = self.convs[-1](x, edge_index)
            return F.log_softmax(x, dim=1)

Repository structure:

    gnn-challenge/
    ├── data/
    │   ├── raw/                    # features.pt, edges.pt, labels.pt, metadata.json
    │   ├── splits/                 # train_mask.pt, val_mask.pt, test_mask.pt
    │   └── processed/              # train_labels.pt, val_labels.pt, test_labels_hidden.private.pt
    ├── starter_code/
    │   ├── baseline.py             # Baseline GCN training script
    │   ├── generate_submission.py  # Generate submission CSV from trained model
    │   └── requirements.txt        # Python dependencies
    ├── submissions/                # Submission CSV files
    ├── scoring_script.py           # Score a submission
    ├── update_leaderboard.py       # Update leaderboard.md from leaderboard.json
    ├── prepare_dataset.py          # Script to regenerate the dataset
    ├── leaderboard.json
    ├── leaderboard.md
    └── README.md
Overview
The baseline script (starter_code/baseline.py) implements a Graph Convolutional Network (GCN) for node classification on a graph dataset (here, the PubMed dataset). It handles data loading, training, evaluation, and test prediction, and produces a CSV submission file. The design follows standard geometric deep learning practices.
- Data Input
  - Dataset: Nodes represent entities (e.g., PubMed papers), edges represent connections (e.g., citation links).
  - Node features: One numeric feature vector per node (for PubMed, 500 bag-of-words features).
  - Masks: Boolean masks for training, validation, and test splits.
  - Data loader: PubMedDataLoader loads all tensors (features, edges, labels) and moves them to the selected device (GPU or CPU).
- Model Architecture
  - GCN layers: Default 2 layers (num_layers=2), each using GCNConv.
  - Forward pass:
    - Hidden layers: ReLU activation + dropout.
    - Output layer: Log softmax over classes.
  - Output: Node-level log probabilities for classification.
  - Training settings:
    - Optimizer: Adam
    - Loss: Negative log likelihood (F.nll_loss) over training nodes only
    - Regularization: Dropout + weight decay
    - Early stopping monitored via validation accuracy.
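The training settings above (Adam, weight decay, and `F.nll_loss` restricted to training nodes) can be sketched as a single optimization step. To keep the example free of a PyG dependency, a plain linear layer stands in for the GCN, so no graph structure is used here. The data is made up, and the `lr` and `weight_decay` values are assumptions rather than the baseline's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up toy data: 20 nodes, 8 features, 3 classes; the first 12 nodes are training nodes.
x = torch.rand(20, 8)
y = torch.randint(0, 3, (20,))
train_mask = torch.zeros(20, dtype=torch.bool)
train_mask[:12] = True

# Stand-in for the GCN: a linear layer followed by log_softmax.
model = nn.Linear(8, 3)
# lr and weight_decay are assumed values for illustration only.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)


def train_step() -> float:
    """Run one full-batch step; the loss only sees nodes selected by train_mask."""
    model.train()
    optimizer.zero_grad()
    log_probs = F.log_softmax(model(x), dim=1)
    loss = F.nll_loss(log_probs[train_mask], y[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()


losses = [train_step() for _ in range(50)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

With the real model, `model(x)` would become `model(x, edge_index)` and already return log-probabilities. The forward pass still runs on all nodes, and only the loss is masked.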
- Training & Evaluation
  - Epoch loop:
    - Forward + backward pass on training nodes.
    - Evaluate on training and validation masks.
    - Save best model based on validation accuracy.
    - Early stopping if validation accuracy does not improve.
  - Metrics:
    - Accuracy
    - Weighted F1
    - Classification report per class
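The "save best model" and "early stopping" steps of the epoch loop can be sketched as a small helper that tracks validation accuracy. The class name `EarlyStopper` and the `patience` parameter are assumptions for illustration; the baseline script may implement this differently. The accuracy history below is made up.

```python
class EarlyStopper:
    """Track the best validation accuracy and signal when to stop training."""

    def __init__(self, patience: int = 20):
        self.patience = patience  # epochs to wait without improvement
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_acc: float) -> bool:
        """Record val_acc for this epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            # New best: this is the point where a checkpoint would be saved.
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies, one per epoch.
val_history = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_acc in enumerate(val_history):
    if stopper.step(epoch, val_acc):
        stopped_at = epoch
        break

print(f"best epoch {stopper.best_epoch}, stopped at epoch {stopped_at}")
```

Ties do not count as an improvement in this sketch, so the earliest epoch that reaches the best accuracy is the one kept.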
- Submission
  - After training, predictions on test nodes are saved in a CSV (submissions/submission.private.csv) with columns: node_id, target.
  - The helper function save_submission_csv handles formatting and preview.
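A submission file with the two columns named above could be produced along these lines. This is a sketch, not the actual `save_submission_csv` helper, whose signature is not shown here. The function `write_submission` is hypothetical, and it assumes that `node_id` is the node's index in the full graph.

```python
import csv
import io
from typing import Sequence, TextIO


def write_submission(node_ids: Sequence[int], targets: Sequence[int], out: TextIO) -> None:
    """Write one row per test node with the columns node_id and target."""
    if len(node_ids) != len(targets):
        raise ValueError("node_ids and targets must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["node_id", "target"])
    for node_id, target in zip(node_ids, targets):
        writer.writerow([int(node_id), int(target)])


# Made-up example: 5 nodes, 3 of them in the test split.
test_mask = [False, True, False, True, True]
predicted = [2, 0, 1, 1, 2]  # predicted class for every node
test_ids = [i for i, is_test in enumerate(test_mask) if is_test]

buffer = io.StringIO()  # an open file handle would be used for a real submission
write_submission(test_ids, [predicted[i] for i in test_ids], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```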
- Key Points / Features
  - Modular design: GCN class separated from data loading.
  - Device agnostic: Automatically uses GPU if available.
  - Configurable via CLI arguments (--hidden-channels, --num-layers, --dropout, --epochs, etc.).
  - Provides detailed logs of loss, train/val accuracy, best epoch, and final test results.
  - Ready for extension to other datasets or graph-level tasks.
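The CLI flags listed above could be declared with `argparse` roughly as follows. The flag names come from the list; the defaults are partly assumptions (2 layers and dropout 0.5 match the GCN class defaults, 200 epochs matches the example command, and 64 hidden channels is a guess).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags named above (defaults are illustrative)."""
    parser = argparse.ArgumentParser(description="GCN baseline options (sketch)")
    parser.add_argument("--hidden-channels", type=int, default=64)  # assumed default
    parser.add_argument("--num-layers", type=int, default=2)
    parser.add_argument("--dropout", type=float, default=0.5)
    parser.add_argument("--epochs", type=int, default=200)
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--epochs", "50", "--hidden-channels", "32"])
print(args.hidden_channels, args.num_layers, args.dropout, args.epochs)
```

`argparse` converts dashes in flag names to underscores, so `--hidden-channels` is read back as `args.hidden_channels`.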
- features.pt → initial embeddings (x)
- edges.pt → adjacency information as an edge list (edge_index)

x, edge_index → GCNConv layer 1 → hidden embedding → GCNConv layer 2 → final embedding → log_softmax → predictions
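To show what a single graph convolution in this pipeline does to the node features, here is a dependency-free sketch of the propagation rule from Kipf & Welling (2017), with self-loops and symmetric degree normalization. It leaves out the learned weight matrix and treats every link as undirected, which is a simplifying assumption. `gcn_aggregate` is a hypothetical helper, not a PyG function.

```python
import math
from typing import List, Tuple


def gcn_aggregate(
    num_nodes: int, edges: List[Tuple[int, int]], features: List[List[float]]
) -> List[List[float]]:
    """Compute D^{-1/2} (A + I) D^{-1/2} X for an undirected graph."""
    # Each node starts with a self-loop, then gains its neighbors.
    neighbors = [{i} for i in range(num_nodes)]
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    degree = [len(nbrs) for nbrs in neighbors]  # degree including the self-loop

    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in neighbors[i]:
            coeff = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                out[i][k] += coeff * features[j][k]
    return out


# Two connected nodes with one feature each: both end up with the average value.
toy_out = gcn_aggregate(2, [(0, 1)], [[1.0], [3.0]])
print(toy_out)
```

Applying this step repeatedly pulls connected nodes toward the same representation, which is the mechanism behind the over-smoothing risk in deeper GNNs.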
- Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907
- GitHub repository for the paper (PyTorch implementation): https://github.com/tkipf/pygcn
- GNNs: BASIRA Lab YouTube channel