This project contains implementations for various imputation strategies to reconstruct missing node features in graph data. This is done to enhance the quality of downstream tasks such as node classification or link prediction.
Three imputation strategies are implemented:
- Baseline Strategy - kNN: Impute the missing feature of a given node using the average of this feature among neighbouring nodes.
- Topology-Based Strategy - Community Average: Detects communities using the Louvain method and imputes missing values with the average computed within each community.
- Learning-Based Strategy - Attributed DeepWalk: Extends the DeepWalk algorithm by incorporating node features into the random walk process. The resulting embeddings capture both topology and feature information and are used for imputation. For more details, see our Wiki.
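To illustrate the baseline, neighbour-average imputation can be sketched in a few lines. The `adjacency` and `features` structures below are hypothetical stand-ins (plain dicts, with `None` marking a missing value), not the project's actual graph API:

```python
def knn_impute(adjacency, features, node, dim):
    """Impute features[node][dim] as the mean of that feature over
    neighbours where it is present (None marks a missing value)."""
    observed = [
        features[nbr][dim]
        for nbr in adjacency[node]
        if features[nbr][dim] is not None
    ]
    if not observed:
        return None  # no neighbour has this feature observed
    return sum(observed) / len(observed)
```

The same averaging idea carries over to the community strategy, with the neighbourhood replaced by the node's Louvain community.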
Requirements:
- Python 3.x and pip.
```shell
git clone https://github.com/axkoro/graph-impute.git
cd graph-impute
git submodule init
git submodule update
```
Install the package:

```shell
pip install .
```

Install additional functionalities (e.g., training, plotting, testing):

```shell
pip install .[extra]
```

Extra options include: `train_evaluate`, `plot`, `test`.
To run one of the implemented strategies on a graph dataset (see Input Format for requirements), use either the CLI tool or the Python API.
```shell
graph-impute --strategy knn --dataset amazon
```

Loads file paths from datasets.yaml for the "amazon" dataset and applies kNN imputation.
```shell
graph-impute --strategy deepwalk --edges path/to/edges.txt --features path/to/features.txt
```

Specifies custom paths and applies DeepWalk imputation.
```shell
graph-impute --strategy community --dataset amazon --evaluate --time --train
```

Uses the "amazon" dataset, performs community-based imputation, evaluates quality, prints timing details, and trains a neural network for downstream evaluation.
You can also use the imputation tool as a Python library. For example:
```python
# Import path is assumed here; adjust it to match the installed package.
from graph_impute import Graph, create_imputer, evaluate_imputed_features_file

edges_path = "path/to/edges.txt"
features_path = "path/to/features.txt"
reference_path = "path/to/reference_features.txt"
output_path = "path/to/imputed_features.txt"

# Load the graph, impute missing features with kNN, and save the result
graph = Graph.load(edges_path, features_path, feature_type=float)
imputer = create_imputer("knn", graph, k=150, use_k_hops=False)
imputer.impute()
graph.save(output_path)

# Compare the imputed features against a reference file
metrics = evaluate_imputed_features_file(
    imputed_features_path=output_path,
    reference_path=reference_path,
    input_features_path=features_path
)
print(metrics)
```

- --strategy (required): Imputation strategy. Options: knn, community, deepwalk.
- --dataset: Pre-configured dataset name (e.g., amazon). When provided, file paths for edges and features are loaded from datasets.yaml.
- --edges and --features: Custom file paths for edge and feature data.
- --output: Output file path for imputed features (Default: ./imputed_features.txt)
- --feature-type: Data type for features. Options: bool, int, float (Default: float)
- --config: YAML file mapping dataset names to file paths (Default: ./datasets.yaml)
Tip: See the datasets.yaml file in the root folder of the repo for an example.
The datasets.yaml file maps dataset names to their corresponding file paths and other properties. For each dataset, it specifies:
- feature_type: The data type for features (bool, float, int).
- edges: Path to the file containing edge data.
- features: Path to the file containing feature data.
- reference: (Optional) Path to a reference features file for evaluation purposes.
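A minimal entry could look like the following sketch (dataset name matches the CLI examples; the paths are illustrative, not taken from the repo):

```yaml
amazon:
  feature_type: float
  edges: data/input/amazon/edges.txt
  features: data/input/amazon/features.txt
  reference: data/input/amazon/reference_features.txt
```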
- --evaluate: Evaluate the quality of imputed features.
- --reference: Reference file for evaluation.
- --train: Train a neural network using the imputed data (requires --evaluate).
- --time: Print timing details for major processing steps.
- --k: Number of topological neighbours to consider. (Default: 100)
- --use-k-hop: Use neighbours within k hops rather than simply the k closest nodes (Default: False)
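The distinction between "k closest nodes" and "neighbours within k hops" can be illustrated with a small BFS sketch (`adjacency` is a hypothetical dict-of-lists, not the project's graph type):

```python
from collections import deque

def k_hop_neighbours(adjacency, start, k):
    """Return all nodes reachable from `start` within k hops,
    excluding `start` itself."""
    visited = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in adjacency[node]:
            if nbr not in visited:
                visited.add(nbr)
                result.add(nbr)
                frontier.append((nbr, depth + 1))
    return result
```

With --use-k-hop, the imputation average would run over such a set rather than over a fixed count of nearest nodes.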
- --community-algorithm: Community detection algorithm. (Default: "louvain")
- --max-levels: Maximum coarsening levels for Louvain (Default: 10)
- --max-iterations: Maximum iterations per coarsening level (Default: 50)
- --tolerance: Minimum modularity improvement to continue iterations (Default: 1e-4)
- --no-edge-weights: Skip edge weight preprocessing (Default: False)
- --fusion-coefficient: Fusion coefficient value (Default: 0.6)
- --walk-length: Length of each random walk (Default: 80)
- --num-walks: Walks per node (Default: 10)
- --embedding-size: Size of embedding vectors (Default: 128)
- --context-window: Context window size (Default: 10)
- --num-negative-samples: Negative samples per positive sample (Default: 10)
- --smoothing-exponent: Smoothing exponent (Default: 0.75)
- --num-epochs: Training epochs (Default: 1)
- --learning-rate: Optimizer learning rate (Default: 0.025)
- --top-similar: Most similar nodes to use for imputation (Default: 10)
- --similarity-metric: Metric for comparing embeddings (Default: "cosine")
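One plausible reading of the fusion coefficient is a convex mix of topological and feature-based transition scores when a walk picks its next node. The sketch below is an assumption about the mechanism for illustration only, not the repository's actual implementation (`similarity` is a hypothetical feature-similarity function returning non-negative scores):

```python
import random

def next_step(adjacency, similarity, current, fusion_coefficient=0.6):
    """Choose the next node of a walk: each neighbour is scored by a
    convex combination of uniform topological weight and feature
    similarity, then sampled proportionally to its score."""
    neighbours = adjacency[current]
    topo = 1.0 / len(neighbours)  # uniform weight over neighbours
    scores = [
        fusion_coefficient * topo
        + (1.0 - fusion_coefficient) * similarity(current, nbr)
        for nbr in neighbours
    ]
    total = sum(scores)
    return random.choices(neighbours, weights=[s / total for s in scores])[0]
```

Embeddings trained on such walks would then drive imputation via the --top-similar most similar nodes under the chosen --similarity-metric.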
Tip: Refer to the example datasets in data/input/ for both edge file and feature file formatting.
- Graph Type: Undirected/symmetric with no self-loops.
- Node Numbering: Nodes are numbered consecutively starting at 0. All nodes in the range must be present.
- Each line represents an undirected edge.
- The two nodes of each edge are ordered descending (i.e., in a line `a b` representing an edge, it holds that b < a).
- All edges are sorted in ascending order (e.g., `2 0` comes before `3 0` and `2 1`).
- Example:

```
2 0
3 2
5 2
5 3
6 0
6 5
```
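These ordering rules can be checked with a short sketch (assuming the edge file has already been read into a list of lines; the function name is ours, not part of the package):

```python
def validate_edge_lines(lines):
    """Check edge-file ordering: within a line `a b` it must hold
    that b < a, and lines must be sorted ascending as (a, b) pairs."""
    previous = None
    for line in lines:
        a, b = (int(x) for x in line.split())
        if not b < a:
            raise ValueError(f"nodes not in descending order: {line!r}")
        if previous is not None and (a, b) <= previous:
            raise ValueError(f"edges not sorted ascending: {line!r}")
        previous = (a, b)
    return True
```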
Each row corresponds to a node in the graph. Follow these guidelines:
- File Format & Newlines
  - Plain text file using LF for newlines.
  - The last line must end with a newline.
- Row Ordering & Node IDs
  - Rows are in strictly ascending order by node ID (node 0 first).
- Column Structure
  - Node ID: the first field (integer).
  - Feature List: a comma-separated list, with each feature separated by a comma and a space (`, `).
  - Missing features are indicated by a single `#`.
  - All nodes must have the same number of feature values.
  - Label: the last field (integer); required and cannot be missing.
- Punctuation & Spacing
  - Do not include a trailing comma after the last feature value.
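Under these rules, a row such as `0 1.5, #, 3.2 1` would describe node 0 with three feature slots (the second missing) and label 1. A parsing sketch, assuming single spaces separate the node ID, the feature list, and the label (the example line and function are illustrative, not from the repo):

```python
def parse_feature_line(line):
    """Parse one feature-file row into (node_id, features, label).
    Missing features ('#') become None."""
    head, _, rest = line.strip().partition(" ")   # node ID
    body, _, tail = rest.rpartition(" ")          # features / label
    features = [
        None if token == "#" else float(token)
        for token in body.split(", ")
    ]
    return int(head), features, int(tail)
```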
The directory data/ contains modified versions of several datasets originally provided by PyTorch Geometric and DGL. The modifications include reformatting and the simulation of missing features to test the imputation strategies. Below are the datasets included:
- Twitch: Modified version of the dataset as described in the Twitch documentation.
- Cora: Adapted from the CitationFull dataset and referred to as "cora" in this repository. More details can be found in the CitationFull documentation.
- Amazon: Modified version based on the original dataset described in the Amazon documentation.
- GitHub: Adaptation of the dataset as detailed in the GitHub documentation.
- Genius: Modified version of the dataset from LINKXDataset, here referred to as "genius". Refer to the LINKXDataset documentation for the original dataset details.
- FraudAmazon (DGL): Modified version of the dataset based on DGL's FraudAmazonDataset.
For each dataset, please consult the original documentation for complete licensing information and usage rights. The modifications in this repo (e.g., reformatting and simulating missing features) are provided for research and experimental purposes.