ML Dataset Discovery

This repository contains the code and indicative results for exploring machine learning dataset similarity and retrieval.

Table of Contents:

Knowledge Graph Embedding-Based Dataset Similarity
Dataset Retrieval Evaluation
Pipeline-specific meta-model analysis
Indicative Results: Link Prediction Models for Pipeline Performance Prediction

1. Knowledge Graph Embedding-Based Dataset Similarity

Located in the kge-based-dataset-similarity folder, this component focuses on:

Preparing data.
Training RDF2Vec-based Knowledge Graph Embedding (KGE) models.
Calculating KGE-based similarities.
Computing Graph Edit Distances (GEDs).
Predicting ML pipeline performance using embeddings.
Analyzing retrieval evaluation results and generating figures and tables.

Refer to the README.md in this folder for detailed instructions on setup, usage, and commands.
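
As a rough illustration of the "Calculating KGE-based similarities" step, the sketch below ranks datasets by the cosine similarity of their embedding vectors. This README does not state which similarity measure the component uses, so cosine similarity is an assumption here, and the dataset names and three-dimensional vectors are toy values rather than real RDF2Vec output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def rank_similar(query, embeddings):
    """Return (name, similarity) pairs for all other datasets, most similar first."""
    q = embeddings[query]
    scores = [
        (name, cosine_similarity(q, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real KGE vectors would have many more dimensions.
toy_embeddings = {
    "dataset_a": [1.0, 0.0, 0.0],
    "dataset_b": [0.9, 0.1, 0.0],
    "dataset_c": [0.0, 1.0, 0.0],
}

ranking = rank_similar("dataset_a", toy_embeddings)
print(ranking)
```

The resulting ranking is the kind of output that the retrieval evaluation in the next component would compare against a ground truth.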

2. Dataset Retrieval Evaluation

Located in the dataset-retrieval-evaluations folder, this component focuses on:

Calculating the ground truth from the performance of ML pipelines applied to the available datasets.
Calculating rank-based metrics for both baseline methods and KGE-based similarity.
Implementation of baseline methods.

Refer to the README.md in this folder for detailed instructions on setup, usage, and commands.
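
To make "rank-based metrics" concrete, here is a minimal sketch of two common ones, NDCG@k and reciprocal rank, computed for a retrieved ranking against ground-truth relevance scores. This README does not name the metrics the component actually reports, so the choice of metrics, the function names, and the toy relevance values are all assumptions.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k of a ranking, given a dict of ground-truth relevance scores."""
    gains = [relevance.get(item, 0.0) for item in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / position
    return 0.0


# Hypothetical ground truth: higher score means a more relevant dataset.
relevance = {"d1": 3.0, "d2": 1.0, "d3": 0.0}

perfect = ndcg_at_k(["d1", "d2", "d3"], relevance, k=3)
reversed_order = ndcg_at_k(["d3", "d2", "d1"], relevance, k=3)
first_hit = reciprocal_rank(["d3", "d2", "d1"], {"d1"})
print(perfect, reversed_order, first_hit)
```

The same functions can score a baseline ranking and a KGE-based ranking against one ground truth, which gives a like-for-like comparison.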

3. Pipeline-specific meta-model analysis

Located in the pipeline-specific-meta-model-analysis folder, this component focuses on:

Meta-feature retrieval from OpenML for the investigated datasets
Training of pipeline-specific meta-models for performance prediction
Aggregation and analysis of pipeline-specific meta-model evaluations

Refer to the README.md in this folder for detailed instructions on setup, usage, and commands.

4. Indicative Results: Link Prediction Models for Pipeline Performance Prediction

We conducted additional experiments using link prediction (LP)-based KGE models within KGmetaSP for the pipeline performance prediction (PPE) task. Across settings, KGmetaSP (RDF2Vec) consistently outperforms KGmetaSP (LP), aligning with prior findings that walk-based methods are better suited for sparse and structurally complex KGs such as MetaExe-KG. In settings where KG-based approaches are beneficial, KGmetaSP (LP) improves over baselines, supporting the value of KG-based modeling in KGmetaSP.

Experimental Setup

We trained and evaluated three LP models: TransE, DistMult, and ComplEx. We used the PyKEEN library with the following configuration:

Model Configuration

Training: Self-adversarial negative sampling, 1500 epochs
Embedding dimension: 128
Batch size: 2048
Learning rate: 0.0005
Negatives per positive: 3
Loss margin: 50
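
The configuration above could be expressed as keyword arguments for PyKEEN's `pykeen.pipeline.pipeline` function, roughly as sketched below. The helper names `build_lp_config` and `train_lp_model` are hypothetical, and mapping "self-adversarial negative sampling" to PyKEEN's `NSSALoss` is an assumption about how the experiments were set up. Settings not listed above (for example, the optimizer) are left at PyKEEN defaults here, which is also an assumption.

```python
SUPPORTED_MODELS = {"TransE", "DistMult", "ComplEx"}


def build_lp_config(model_name):
    """Collect the listed hyperparameters as keyword arguments.

    The *_kwargs grouping mirrors the parameters of pykeen.pipeline.pipeline;
    whether the original experiments passed them this way is an assumption.
    """
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model_name}")
    return {
        "model": model_name,
        "model_kwargs": {"embedding_dim": 128},
        "loss": "NSSALoss",  # self-adversarial negative sampling loss (assumed mapping)
        "loss_kwargs": {"margin": 50.0},
        "optimizer_kwargs": {"lr": 0.0005},
        "negative_sampler_kwargs": {"num_negs_per_pos": 3},
        "training_kwargs": {"num_epochs": 1500, "batch_size": 2048},
    }


def train_lp_model(model_name, training, testing):
    """Train one LP model; `training` and `testing` are PyKEEN TriplesFactory objects."""
    # Imported lazily so the configuration helper works without PyKEEN installed.
    from pykeen.pipeline import pipeline

    return pipeline(training=training, testing=testing, **build_lp_config(model_name))


config = build_lp_config("TransE")
print(config)
```

Keeping the configuration in one function makes it easy to train TransE, DistMult, and ComplEx under identical settings by changing only the model name.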

Pipeline Performance Prediction Results

Scenario 1: Unseen Datasets

Meta-Classification (Target: Accuracy)

Dataset Emb.	Pipeline Strategy	Acc.	F1
MF All	Conf.-specific	0.7363	0.7358
MF All	KGmetaSP (LP)	0.7351	0.7368
MF All	KGmetaSP (RDF2Vec)	0.7413	0.7427

Meta-Regression (Target: Accuracy)

Dataset Emb.	Pipeline Strategy	MSE	R²
MF All	Conf.-specific	0.0081	0.6748
MF All	KGmetaSP (LP)	0.0105	0.6032
MF All	KGmetaSP (RDF2Vec)	0.0101	0.6181

Key Findings

KGmetaSP (RDF2Vec) outperforms KGmetaSP (LP) in both tasks (F1: 0.7427 vs. 0.7368; R²: 0.6181 vs. 0.6032).
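In meta-classification, KGmetaSP (LP) trails KGmetaSP (RDF2Vec) and edges past the non-KG Conf.-specific baseline in F1 (0.7368 vs. 0.7358), but not in accuracy (0.7351 vs. 0.7363).
In meta-regression, the Conf.-specific baseline achieves the best scores (MSE 0.0081, R² 0.6748), ahead of both KGmetaSP variants.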

Scenario 2: Unseen Pipelines

Meta-Classification (Target: Accuracy)

Method	Acc.	F1
Avg. Performance (Base)	0.3303	0.1640
Closest Embedding (Base)	0.7748	0.7747
MF All + KGmetaSP (LP)	0.8055	0.8045
MF All + KGmetaSP (RDF2Vec)	0.8250	0.8244

Meta-Regression (Target: Accuracy)

Method	MSE	R²
Avg. Performance (Base)	0.0267	-0.0005
Closest Embedding (Base)	0.0127	0.5241
MF All + KGmetaSP (LP)	0.0081	0.6976
MF All + KGmetaSP (RDF2Vec)	0.0070	0.7361
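
For reading the MSE and R² columns, the sketch below assumes the standard definitions of both metrics, which this README does not spell out. It also shows why a constant predictor, such as an average-performance baseline, lands at an R² of about zero: predicting the mean of the evaluated targets gives exactly zero, and predicting a slightly different constant (for example, a mean taken from training data) gives a small negative value. The target values are toy numbers.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical pipeline accuracies on four evaluation cases.
y_test = [0.60, 0.70, 0.80, 0.90]

# Constant prediction equal to the mean of the evaluated targets.
r2_test_mean = r2_score(y_test, [0.75] * len(y_test))

# Constant prediction that is slightly off, e.g. a mean estimated elsewhere.
r2_other_mean = r2_score(y_test, [0.74] * len(y_test))

print(mse(y_test, [0.75] * len(y_test)), r2_test_mean, r2_other_mean)
```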

Key Findings

Both KGmetaSP variants substantially outperform baselines across both tasks.
KGmetaSP (LP) improves over the closest-embedding baseline (e.g., F1 of 0.8045 vs. 0.7747 in meta-classification, about +3.9% relative), while KGmetaSP (RDF2Vec) further improves over KGmetaSP (LP) (F1 of 0.8244 vs. 0.8045, about +2.5% relative).
Across both tasks, the ordering is consistent: KGmetaSP (RDF2Vec) > KGmetaSP (LP) > baselines.
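
The F1 gains quoted above can be read either as absolute differences or as relative improvements. The short sketch below computes both from the rounded F1 values in the Scenario 2 meta-classification table; the function names and dictionary keys are illustrative.

```python
# F1 scores copied from the Scenario 2 meta-classification table.
f1 = {
    "closest_embedding": 0.7747,
    "kgmetasp_lp": 0.8045,
    "kgmetasp_rdf2vec": 0.8244,
}


def absolute_gain(new, old):
    """Difference in score."""
    return new - old


def relative_gain(new, old):
    """Difference as a fraction of the old score."""
    return (new - old) / old


lp_vs_base = relative_gain(f1["kgmetasp_lp"], f1["closest_embedding"])
rdf2vec_vs_lp = relative_gain(f1["kgmetasp_rdf2vec"], f1["kgmetasp_lp"])

print(f"LP vs. closest embedding: {lp_vs_base:.2%} relative, "
      f"{absolute_gain(f1['kgmetasp_lp'], f1['closest_embedding']):.4f} absolute")
print(f"RDF2Vec vs. LP: {rdf2vec_vs_lp:.2%} relative, "
      f"{absolute_gain(f1['kgmetasp_rdf2vec'], f1['kgmetasp_lp']):.4f} absolute")
```

Because the table values are rounded to four decimals, percentages derived from them can differ in the last digit from figures computed on unrounded scores.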

Conclusions

These indicative results show that KGmetaSP (RDF2Vec) achieves the strongest KG-based PPE performance. KGmetaSP (LP) also improves over the baselines in the settings where KG-based approaches help, most clearly for unseen pipelines. Together, the findings suggest that walk-based embeddings offer an advantage for sparse and complex KGs such as MetaExe-KG, and that the improvements stem from the KG-based modeling in KGmetaSP.
