GNN Challenge: Paper Topic Prediction in Citation Networks

Challenge Overview

Welcome to the Sanad Citation Network Challenge! This challenge focuses on node classification in a citation network using Graph Neural Networks (GNNs).

Participants are asked to predict the category of scientific papers based solely on the citation graph structure and node features.

Dataset used: PubMed Citation Network (standard benchmark in GNN literature)

Problem Description

Scientific papers cite other papers, forming a graph where:

Element	Description
Nodes	represent papers
Edges	represent citation links
Node features	represent paper content (bag-of-words)
Node labels	represent disease categories

Your Task

Given a paper and its citation neighborhood, predict the disease category of the paper.

This is a semi-supervised node classification task:

Only a subset of nodes are labeled for training

The model must generalize to unseen nodes

Classes (Disease Categories)

Each paper belongs to one of 3 categories:

Label	Category
0	Diabetes Mellitus Experimental
1	Diabetes Mellitus Type 1
2	Diabetes Mellitus Type 2

What Makes This Challenging?

Graph Dependency: Papers are not independent — predictions rely on citation neighborhoods.

Homophily Bias: Papers often cite papers from the same domain, but not always.

Limited Labels: Only 60% of the nodes are labeled for training; the remaining nodes are held out for validation and testing.

Over-smoothing Risk: Deeper GNNs may degrade performance if not carefully designed.

No External Information: Only graph structure and node features are allowed.
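The homophily point above can be quantified with the edge homophily ratio: the fraction of edges whose two endpoints share a label. The sketch below is a minimal pure-Python illustration on a toy graph. The function name `edge_homophily` and the toy data are assumptions for illustration and are not part of the starter code.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints carry the same label.

    edges  -- two parallel lists [sources, targets], mirroring a [2, E] edge tensor
    labels -- list where labels[i] is the class of node i
    """
    sources, targets = edges
    if not sources:
        raise ValueError("graph has no edges")
    same = sum(1 for s, t in zip(sources, targets) if labels[s] == labels[t])
    return same / len(sources)


# Toy graph (hypothetical): 5 nodes, 4 citation links.
toy_edges = [[0, 1, 2, 3], [1, 2, 3, 4]]
toy_labels = [0, 0, 1, 1, 2]
ratio = edge_homophily(toy_edges, toy_labels)
print(f"edge homophily: {ratio:.2f}")
```

A ratio close to 1 means neighbors usually agree, which suits GCN-style neighborhood averaging. A lower ratio suggests the model should lean more on each node's own features.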

Dataset

PubMed Citation Network Statistics

Property	Value
Nodes	19,717 papers
Edges	88,648 citation links
Node Features	500
Classes	3
Graph Type	Homogeneous

Data Format

The dataset is provided as preprocessed PyTorch tensors:

import torch

# Load raw data
features = torch.load('data/raw/features.pt')    # [19717, 500]
edges = torch.load('data/raw/edges.pt')           # [2, 88648]
labels = torch.load('data/raw/labels.pt')         # [19717]

# Load masks
train_mask = torch.load('data/splits/train_mask.pt')
val_mask = torch.load('data/splits/val_mask.pt')
test_mask = torch.load('data/splits/test_mask.pt')

Using features.pt, edges.pt, labels.pt, and the mask files:

1. Makes your GNN easy to implement

2. Is efficient in memory

3. Is compatible with PyG functions

4. Handles semi-supervised masking cleanly

5. Supports batching / inductive learning

6. Improves reproducibility and standardization
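To make the `[2, E]` edge layout and the boolean masks concrete, here is a small pure-Python sketch on toy data. Plain lists stand in for the tensors, and the helper names `neighbors_from_edges` and `masks_are_disjoint` are hypothetical.

```python
def neighbors_from_edges(edges, num_nodes):
    """Turn a [2, E] edge list into per-node neighbor lists (source -> target)."""
    sources, targets = edges
    neighbors = [[] for _ in range(num_nodes)]
    for s, t in zip(sources, targets):
        neighbors[s].append(t)
    return neighbors


def masks_are_disjoint(*masks):
    """True if every node is selected by at most one mask."""
    return all(sum(flags) <= 1 for flags in zip(*masks))


# Toy path graph 0 - 1 - 2, with each undirected link stored in both directions.
toy_edges = [[0, 1, 1, 2], [1, 0, 2, 1]]
toy_neighbors = neighbors_from_edges(toy_edges, num_nodes=3)

# One node per split, as boolean masks over all nodes.
toy_train = [True, False, False]
toy_val = [False, True, False]
toy_test = [False, False, True]
print(toy_neighbors, masks_are_disjoint(toy_train, toy_val, toy_test))
```

Row 0 of the edge list holds source nodes and row 1 holds target nodes, so column `k` describes one link. A quick disjointness check on the masks guards against label leakage between splits.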

Split Information

Split	Nodes	Ratio
Training	11,829	60%
Validation	3,942	20%
Test	3,946	20%
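The ratios in the table follow from the node counts, and the three splits add up to the full graph. The sketch below checks that arithmetic and then shows one way such a split could be produced. The per-class shuffle in `stratified_split` is an assumption for illustration; this README does not say how `prepare_dataset.py` builds the masks.

```python
import random

# Counts taken from the split table.
split_sizes = {"train": 11829, "val": 3942, "test": 3946}
total_nodes = sum(split_sizes.values())
ratios = {name: size / total_nodes for name, size in split_sizes.items()}


def stratified_split(labels, train_frac=0.6, val_frac=0.2, seed=0):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for node, label in enumerate(labels):
        by_class.setdefault(label, []).append(node)
    train, val, test = [], [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        n_train = int(train_frac * len(nodes))
        n_val = int(val_frac * len(nodes))
        train += nodes[:n_train]
        val += nodes[n_train:n_train + n_val]
        test += nodes[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Toy labels (hypothetical): 50 nodes in three classes.
toy_labels = [0] * 10 + [1] * 20 + [2] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(total_nodes, {k: round(v, 3) for k, v in ratios.items()})
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, and assigning the remainder to the test set guarantees that every node lands in exactly one split.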

Dataset & Private Files

The challenge uses private files for evaluation:

Submission file: submission.private.csv

Test labels: test_labels_hidden.private.pt

These files are hosted on Hugging Face and linked to this repository.

Private files are used internally for scoring and leaderboard computation; they are not included in the repository.

Workflow

Data Processing: Load node features, adjacency information, and graph structure.

Model Training:

GCN model with hidden layers (ReLU + dropout)

Output layer computes log softmax for node classification

Evaluation:

Submissions are scored using the private test labels

The scoring workflow uses Hugging Face to download the private files securely

Private File Handling (Hugging Face)

The private files are downloaded automatically using a Hugging Face token (HF_TOKEN) in the CI workflow:

submission.private.csv → private submission for evaluation

test_labels_hidden.private.pt → private labels for scoring

This ensures reproducible evaluation while keeping sensitive data private.
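A minimal sketch of how a CI step might fetch one private file, assuming the `huggingface_hub` package and its `hf_hub_download` function. The helper name `download_private_file`, the `repo_id` value, and the `repo_type` default are placeholders, since the actual Hugging Face repository is not named here.

```python
import os


def download_private_file(filename, repo_id, repo_type="dataset"):
    """Download one private file from the Hugging Face Hub using HF_TOKEN.

    repo_id and repo_type are placeholders; the real repository is not named
    in this README.
    """
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; cannot access private files")
    # Imported lazily so the token check works even without the package installed.
    from huggingface_hub import hf_hub_download
    # Returns the local path of the cached file.
    return hf_hub_download(repo_id=repo_id, filename=filename,
                           repo_type=repo_type, token=token)


if __name__ == "__main__" and os.environ.get("HF_TOKEN"):
    # Hypothetical repository id; replace with the real one.
    path = download_private_file("test_labels_hidden.private.pt",
                                 repo_id="your-org/your-private-repo")
    print(path)
```

Reading the token from the environment keeps it out of the repository, and failing early with a clear message makes a missing CI secret easy to diagnose.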

Evaluation Metric

Metric	Split	Purpose	Definition
Accuracy	train, val, test	Overall node prediction correctness	Fraction of correctly predicted nodes out of all nodes in the split: `Accuracy = #correct_predictions / total_nodes`
Weighted F1	train, val, test	Per-class performance, weighted by class size	Per-class F1 (harmonic mean of precision and recall), averaged across classes with weights proportional to the number of true nodes in each class: `F1 = 2 * (Precision * Recall) / (Precision + Recall)`
Loss (NLL)	train	Optimization, early stopping	Negative log-likelihood loss for classification: `NLL Loss = - (1/N) * sum_i y_i * log(y_hat_i)`, applied only on training nodes
Classification report	test	Detailed per-class performance	Shows precision, recall, F1-score, and support for each class separately; provides fine-grained insight for imbalanced classes.
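The definitions in the table can be written out in a few lines of plain Python. This is an illustrative sketch, not the code in `scoring_script.py`; the function names and toy labels are assumptions.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * (tp + fn)  # tp + fn is the number of true nodes in this class
    return total / len(y_true)


def nll_loss(log_probs, y_true):
    """Mean negative log-likelihood; log_probs[i][c] is log P(class c | node i)."""
    return -sum(row[t] for row, t in zip(log_probs, y_true)) / len(y_true)


# Toy predictions (hypothetical).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
toy_log_probs = [[math.log(0.5), math.log(0.25), math.log(0.25)]]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred), nll_loss(toy_log_probs, [0]))
```

Because the weights are class sizes, weighted F1 is dominated by the larger classes; the per-class report is the place to look for weak minority-class performance.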

Baseline Performance

Model	Accuracy
Random Guess	~33%
2-layer GCN	~78%
Target	80%+

Constraints

To ensure fairness and pedagogical value:

No External Data: Use only the PubMed dataset.

Graph Features Only: No handcrafted or text-based features beyond provided node features.

CPU Training Only: Models must be trainable on CPU.

Must Use GNNs: At least one message-passing layer (GCN, GraphSAGE, GAT, etc.).

Quick Start

Install Dependencies

pip install -r starter_code/requirements.txt

Train the Baseline

python starter_code/baseline.py --epochs 200

This trains a 2-layer GCN model, saves the best checkpoint to results/best_model.pt, and generates submissions/submission.private.csv.

Generate a Submission

python starter_code/generate_submission.py --model-path results/best_model.pt

Score a Submission

python scoring_script.py submissions/submission.private.csv
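For intuition, scoring amounts to matching each submitted `node_id` against its hidden label. The sketch below is a simplified stand-in and not the contents of `scoring_script.py`; the function name `score_submission` and the toy ids are hypothetical, and the columns `node_id` and `target` are the ones the baseline writes.

```python
import csv
import io


def score_submission(csv_file, true_labels):
    """Accuracy of a submission over the node ids listed in true_labels.

    csv_file    -- open text file with a header row and columns node_id, target
    true_labels -- dict mapping node_id (int) to the true class (int)
    """
    predictions = {int(row["node_id"]): int(row["target"])
                   for row in csv.DictReader(csv_file)}
    missing = set(true_labels) - set(predictions)
    if missing:
        raise ValueError(f"submission is missing {len(missing)} test node(s)")
    correct = sum(predictions[n] == y for n, y in true_labels.items())
    return correct / len(true_labels)


# Toy example (hypothetical ids and labels); io.StringIO stands in for a file on disk.
toy_csv = io.StringIO("node_id,target\n10,0\n11,2\n12,1\n13,1\n")
toy_truth = {10: 0, 11: 2, 12: 2, 13: 1}
toy_accuracy = score_submission(toy_csv, toy_truth)
print(f"accuracy: {toy_accuracy:.2f}")
```

Rejecting submissions with missing nodes avoids silently scoring a partial file.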

Starter Model (GCN Baseline)

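import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(nn.Module):
    """Stack of GCNConv layers that outputs per-node log-probabilities."""

    def __init__(self, num_features, hidden_channels, num_classes, num_layers=2, dropout=0.5):
        super().__init__()
        # The first and last layers are always created, so fewer than 2 layers is not supported
        if num_layers < 2:
            raise ValueError("num_layers must be at least 2")
        self.convs = nn.ModuleList()
        # Input layer: node features -> hidden representation
        self.convs.append(GCNConv(num_features, hidden_channels))
        # Optional extra hidden layers (none when num_layers == 2)
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
        # Output layer: hidden representation -> class scores
        self.convs.append(GCNConv(hidden_channels, num_classes))
        self.dropout = dropout
        self.num_layers = num_layers

    def forward(self, x, edge_index):
        # All layers except the last: convolution, ReLU, then dropout
        for i in range(self.num_layers - 1):
            x = self.convs[i](x, edge_index)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        # Last layer: class scores, converted to log-probabilities for F.nll_loss
        x = self.convs[-1](x, edge_index)
        return F.log_softmax(x, dim=1)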
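To get a feel for the model's size, the number of trainable parameters can be worked out from the layer dimensions alone. The sketch assumes each `GCNConv` layer holds one weight matrix and one bias vector (its default configuration); `hidden_channels=64` is a hypothetical value, since the default is not stated here, and `gcn_parameter_count` is an illustrative helper.

```python
def gcn_parameter_count(num_features, hidden_channels, num_classes, num_layers=2):
    """Count weights and biases of the stacked GCNConv layers in the GCN class above."""
    if num_layers < 2:
        raise ValueError("the starter architecture needs at least 2 layers")
    dims = [num_features] + [hidden_channels] * (num_layers - 1) + [num_classes]
    # Each layer has an (in x out) weight matrix plus an out-sized bias vector.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))


# 500 input features and 3 classes come from the dataset table;
# hidden_channels=64 is hypothetical.
params = gcn_parameter_count(num_features=500, hidden_channels=64, num_classes=3)
print(params)
```

Almost all parameters sit in the first layer, because the 500-dimensional input dominates the small hidden and output sizes. This is one reason the baseline trains comfortably on CPU.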

Repository Structure

gnn-challenge/
├── data/
│   ├── raw/              # features.pt, edges.pt, labels.pt, metadata.json
│   ├── splits/           # train_mask.pt, val_mask.pt, test_mask.pt
│   └── processed/        # train_labels.pt, val_labels.pt, test_labels_hidden.private.pt
├── starter_code/
│   ├── baseline.py       # Baseline GCN training script
│   ├── generate_submission.py  # Generate submission CSV from trained model
│   └── requirements.txt  # Python dependencies
├── submissions/          # Submission CSV files
├── scoring_script.py     # Score a submission
├── update_leaderboard.py # Update leaderboard.md from leaderboard.json
├── prepare_dataset.py    # Script to regenerate the dataset
├── leaderboard.json
├── leaderboard.md
└── README.md

Baseline Model (GCN) — Details

Overview

The script implements a Graph Convolutional Network (GCN) baseline for node classification on the PubMed dataset. It handles data loading, training, evaluation, and test prediction, and produces a CSV submission file. The design follows standard geometric deep learning practices.

Data Input

Dataset: Nodes represent PubMed papers; edges represent citation links.

Node features: A 500-dimensional bag-of-words vector for each node.

Masks: Boolean masks for training, validation, and test splits.

Data loader (PubMedDataLoader) loads all tensors (features, edges, labels) and moves them to the selected device (GPU if available, otherwise CPU).

Model Architecture

GCN layers: Default 2 layers (num_layers=2), each using GCNConv.

Forward pass:

Hidden layers: ReLU activation + dropout.

Output layer: Log softmax over classes.

Output: Node-level log probabilities for classification.

Training settings:

Optimizer: Adam

Loss: Negative log likelihood (F.nll_loss) over training nodes only

Regularization: Dropout + weight decay

Early stopping monitored via validation accuracy.

Training & Evaluation

Epoch loop:

Forward + backward pass on training nodes.

Evaluate on training and validation masks.

Save best model based on validation accuracy.

Early stopping if validation accuracy does not improve.
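The early-stopping logic in this loop can be sketched as a small helper that tracks the best validation accuracy. The class name `EarlyStopper`, the `patience` parameter, and the toy accuracy curve are assumptions for illustration; the baseline script may implement this differently.

```python
class EarlyStopper:
    """Track the best validation accuracy and signal when to stop."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch; return True if training should stop."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0  # a real loop would save the checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation curve (hypothetical numbers).
val_curve = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, acc in enumerate(val_curve):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopper.best_acc, stopped_at)
```

Only strict improvements reset the counter, so a plateau at the best value still counts toward the patience budget.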

Metrics:

Accuracy

Weighted F1

Classification report per class

Submission

After training, predictions on test nodes are saved in a CSV (submissions/submission.private.csv) with columns: node_id, target.

Helper function save_submission_csv handles formatting and preview.
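A minimal sketch of writing a file in this format with the standard `csv` module. The function `write_submission` is a hypothetical stand-in and not the actual `save_submission_csv` helper, whose signature is not shown here.

```python
import csv
import io


def write_submission(out_file, node_ids, predictions):
    """Write one row per test node with the columns node_id and target."""
    if len(node_ids) != len(predictions):
        raise ValueError("node_ids and predictions must have the same length")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["node_id", "target"])
    for node_id, target in zip(node_ids, predictions):
        # Cast to int so tensor or NumPy scalars are written as plain integers.
        writer.writerow([int(node_id), int(target)])


# Toy example (hypothetical ids); io.StringIO stands in for a file on disk.
buffer = io.StringIO()
write_submission(buffer, node_ids=[3, 7, 9], predictions=[0, 2, 1])
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the target path with `open(path, "w", newline="")` and pass the file object in place of the buffer.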

Key Points / Features

Modular design: GCN class separated from data loading.

Device agnostic: Automatically uses GPU if available.

Configurable via CLI arguments (--hidden-channels, --num-layers, --dropout, --epochs, etc.).

Provides detailed logs of loss, train/val accuracy, best epoch, and final test results.

Ready for extension to other datasets or graph-level tasks.
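The command-line options listed above could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `baseline.py`: the `hidden-channels` default is assumed, and the `epochs` default simply reuses the value from the Quick Start command.

```python
import argparse


def build_parser():
    """CLI parser exposing the flags listed above; defaults are partly assumed."""
    parser = argparse.ArgumentParser(description="Train a GCN baseline (sketch)")
    parser.add_argument("--hidden-channels", type=int, default=64)  # assumed default
    parser.add_argument("--num-layers", type=int, default=2)        # matches the GCN class
    parser.add_argument("--dropout", type=float, default=0.5)       # matches the GCN class
    parser.add_argument("--epochs", type=int, default=200)          # Quick Start value; real default not stated
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--num-layers", "3", "--dropout", "0.3"])
print(args.hidden_channels, args.num_layers, args.dropout, args.epochs)
```

`argparse` converts dashes in flag names to underscores, so `--hidden-channels` is read back as `args.hidden_channels`.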

Mental map of the GCN code

features.pt → initial embeddings (x)
edges.pt → adjacency matrix (edge_index)

x, edge_index
↓
GCNConv layer 1
↓
hidden embedding
↓
GCNConv layer 2
↓
final embedding
↓
log_softmax → predictions
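The same flow can be written with the propagation rule from Kipf and Welling (2017). Here $A$ is the adjacency matrix described by `edge_index`, $X$ is the feature matrix, and $W^{(0)}$, $W^{(1)}$ are the weights of the two layers. Bias terms and dropout are left out for brevity, so this is a simplified view of the two-layer baseline rather than an exact transcription of the code.

```latex
\tilde{A} = A + I, \qquad
\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}, \qquad
\hat{A} = \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2}

Z = \log \operatorname{softmax}\!\left( \hat{A} \, \operatorname{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right)
```

With $X \in \mathbb{R}^{19717 \times 500}$ and three classes, $Z$ has one row of three log-probabilities per paper. Each multiplication by $\hat{A}$ mixes a node's representation with its neighbors', so two layers let a prediction depend on papers up to two citation hops away.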

References

Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907
PyTorch implementation by the paper's first author: https://github.com/tkipf/pygcn
GNN lectures: Basira Lab YouTube channel
