This code is part of my PhD research at PPG-CC/DC/UFSCar. The aim is to compute similarities measures for multilabel classification using only the label space!
@misc{Gatto2023,
author = {Gatto, E. C.},
title = {Compute Similarities Measures for Multi-Label Classification},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/cissagatto/SimilaritiesMultiLabel}}}
The main objective of this project is to provide tools for computing and analyzing similarity measures between labels in multi-label classification problems. By focusing solely on the label space, the project offers a unique perspective on label relationships, independent of the attributes of the data instances.
Multiple Similarity Measures: Implements several well-known similarity measures for categorical data, such as Jaccard and Rogers.
Flexible: Easily add or remove similarity measures to customize the analysis for your specific needs.
Similarity Matrices: The results are stored in similarity matrices, which can be used for further analysis or as inputs for other tasks.
Documentation: Detailed roxygen documentation to help users understand the functionality and easily modify the code to suit their needs.
For more detailed documentation on each function, check out the ~/SimilaritiesMultiLabel/docsfolder
A complete example is available in ~/SimilaritiesMultiLabel/examples folder
# install.packages("devtools")
library("devtools")
devtools::install_github("https://github.com/cissagatto/SimilaritiesMultiLabel")
library(SimilaritiesMultiLabel)This source code is part of an R project to be used in the RStudio IDE and includes the following R scripts:
- libraries.R – Loads the required R packages.
- utils.R – Contains helper functions for processing.
- functions.R – Defines various functions used across the project to handle data and generate results.
- similarities.R – Contains functions for calculating similarity measures between datasets or labels.
- run.R – The main execution script that runs the core logic of the project.
- sml.R – Used to run the script via the terminal.
- config_files.R – Generates the configuration file needed to run the code.
- jobs.R – If running on a cluster, this script can be used to generate the necessary
.shfiles to run the job in parallel.
A file called datasets-original.csv must be located in the root project directory. This file contains information about the datasets used by the code. It includes 90 multi-label datasets. To add a new dataset, include the following information in the file:
| Parameter | Status | Description |
|---|---|---|
| Id | mandatory | Integer ID to identify the dataset |
| Name | mandatory | Dataset name (must follow the benchmark) |
| Domain | optional | Dataset domain |
| Instances | mandatory | Total number of dataset instances |
| Attributes | mandatory | Total number of dataset attributes |
| Labels | mandatory | Total number of labels in the label space |
| Inputs | mandatory | Total number of input attributes |
| Cardinality | optional | ** |
| Density | optional | ** |
| Labelsets | optional | ** |
| Single | optional | ** |
| Max.freq | optional | ** |
| Mean.IR | optional | ** |
| Scumble | optional | ** |
| TCS | optional | ** |
| AttStart | mandatory | Column number where the attribute space begins * 1 |
| AttEnd | mandatory | Column number where the attribute space ends |
| LabelStart | mandatory | Column number where the label space begins |
| LabelEnd | mandatory | Column number where the label space ends |
| Distinct | optional | ** |
| xn | mandatory | Value for Dimension X of the Kohonen map |
| yn | mandatory | Value for Dimension Y of the Kohonen map |
| gridn | mandatory | X times Y. Kohonen's map must be square |
| max.neighbors | mandatory | The maximum number of neighbors (given by LABELS - 1) |
| Label Dependency | optional | The dependency between labels in all dataset |
1 - Since it is the first column, the number is always 1.
2 - Click here for explanations of each property.
3 - Label Dependency can be calculated like in this paper: Luaces 2012
You need X-Fold Cross-Validation files in tar.gz format. You can download the pre-made 10-fold files for multi-label datasets here. For a new dataset, in addition to adding it to the datasets-original.csv file, you must run this code here to generate the necessary cross-validation files. The tar.gz file can be placed in any directory on your computer or server. The absolute path of the file must be passed as a parameter in the configuration file, which will be read by the mlsm.R script.
You need to install all required Java, Python, and R packages to run this code on your machine or server. The code does not automatically install the packages.
You can use the Conda Environment I created to run the experiment. To install it, use the following command to extract the environment to your machine:
conda env create -file AmbienteTeste.yamlFor more information about Conda environments, visit the official Conda documentation.
Alternatively, you can run the code using the AppTainer container I use for running the code on a SLURM cluster. Check this tutorial (in Portuguese) for more details.
You need a configuration file saved in csv format containing the following information:
| Config | Value |
|---|---|
| Dataset_Path | Absolute path to the directory where the dataset tar.gz is stored |
| Temporary_Path | Absolute path to the directory where temporary processing will be done |
| Similarity_Path | Absolute path to the directory where similarity matrices will be stored |
| Similarity | "jaccard", "rogers", or another similarity measure |
| Dataset_Name | Dataset name according to the datasets-original.csv file |
| Number_Dataset | Dataset number according to the datasets-original.csv file |
| Number_Folds | Number of folds used in cross-validation |
| Number_Cores | Number of cores for parallel processing |
| R_clone | If you want to upload the results to your nuvem |
| Save_csv_files | If you want to save the resulting csv files in your machine |
You can save the configuration file anywhere you want and pass the absolute path as a command-line argument.
This code was develop in RStudio Version 1.4.1106 © 2009-2021 RStudio, PBC "Tiger Daylily" (2389bc24, 2021-02-11) for Ubuntu Bionic Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.8 Chrome/69.0.3497.128 Safari/537.36. The R Language version was: R version 4.1.0 (2021-05-18) -- "Camp Pontanezen" Copyright (C) 2021 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit).
This code may or may not be executed in parallel, however, it is highly recommended that you run it in parallel. The number of cores can be configured via the command line (number_cores). If number_cores = 1 the code will run sequentially. In our experiments, we used 10 cores. For reproducibility, we recommend that you also use ten cores. This code was tested with the birds dataset in the following machine:
System:
Host: bionote | Kernel: 5.8.0-53-generic | x86_64 bits: 64 | Desktop: Gnome 3.36.7 | Distro: Ubuntu 20.04.2 LTS (Focal Fossa)
CPU:
Topology: 6-Core | model: Intel Core i7-10750H | bits: 64 | type: MT MCP | L2 cache: 12.0 MiB | Speed: 800 MHz | min/max: 800/5000 MHz Core speeds (MHz): | 1: 800 | 2: 800 | 3: 800 | 4: 800 | 5: 800 | 6: 800 | 7: 800 | 8: 800 | 9: 800 | 10: 800 | 11: 800 | 12: 800 |
Then the experiment was executed in a cluster at UFSCar.
The results stored in the folder Similarities it will be used in the next phase: BuildDataFrameGraphMLC. The result for a dataset must be put in the folder Similarities in the respective code. Also, must be in "tar.gz" format.
To run the code, open the terminal, enter the ~/SimilaritiesMultiLabel/examples folder, and type
Rscript sml.R [absolute_path_to_config_file]
Example:
Rscript slm.R "~/SimilaritiesMultiLabel/config-files/jaccard/smj-emotions.csv"
[Click here]
For any questions or support, please contact:
- Prof. Elaine Cecilia Gatto (elainececiliagatto@gmail.com)
[Click here]
- CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (Finance Code 001) 💼
- CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (Process number 200371/2022-3) 💡
- FAPESP - Financial support 💰
- Special thanks to UFSCar and other institutions for their support! 🙏
I’m looking for help to improve and optimize this code. The areas where I need assistance include:
- Add or remove similarity measures: If you know of other similarity measures that could be relevant to this work, your contribution would be greatly appreciated! 🌟
- Check if all 109 categorical data similarity measures are correctly implemented: I need to verify if all measures have been implemented properly and efficiently, with minimal processing and memory usage. 📊
- Documentation: Write roxygen documentation for all functions to make the code more understandable and easier to use. 📚
- Code Optimization: Explore if the code can be further optimized for performance and readability. ⚙️
If you're interested in collaborating, please feel free to reach out! ✉️
| Site | Post-Graduate Program in Computer Science | Computer Department | Biomal | CNPQ | Ku Leuven | Embarcados | Read Prensa | Linkedin Company | Linkedin Profile | Instagram | Facebook | Twitter | Twitch | Youtube |