IsNoobgrammer/MinHash-LSH-DeDup

MinHash-LSH

Batched MinHash-LSH deduplication for large datasets. This repo is heavily adapted from datasketch and text-dedup. By default it uses SHA-64 for hash signatures.
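For readers new to the technique, here is a minimal sketch of the MinHash idea itself, not this repo's implementation. It assumes word-level n-grams and uses salted SHA-1 digests truncated to 64 bits as a stand-in for the hash family; the helper names (`word_ngrams`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and the repo's actual tokenization and hashing may differ.

```python
import hashlib
import struct


def word_ngrams(text, n=5):
    """Return the set of word-level n-grams (shingles) of `text`."""
    tokens = text.split()
    if len(tokens) < n:
        # Short texts become a single shingle (assumption for this sketch).
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(shingles, num_perm=256):
    """Build a signature: for each seed, keep the minimum 64-bit hash value."""
    signature = []
    for seed in range(num_perm):
        salt = struct.pack("<I", seed)  # one salted hash function per seed
        min_val = min(
            (
                int.from_bytes(
                    hashlib.sha1(salt + s.encode("utf-8")).digest()[:8], "little"
                )
                for s in shingles
            ),
            default=2**64 - 1,  # empty input gets the maximum value
        )
        signature.append(min_val)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The fraction of positions where two signatures agree is an unbiased estimate of the Jaccard similarity of the two shingle sets, which is what makes near-duplicate detection possible without comparing full texts.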

To get started

git clone https://github.com/IsNoobgrammer/MinHash-LSH-DeDup.git
cd MinHash-LSH-DeDup
pip install scipy datasets torch hf_transfer -qU
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_DATASETS_IN_MEMORY_MAX_SIZE=<some_bytes_less_than_your_ram>

To use

from datasets import load_dataset
from LSH import deduplicate_dataset

ds=load_dataset("fhai50032/HINGLISH-LIMA",split="train")
column="Hinglish"

dedup_ds=deduplicate_dataset(ds,column)

Arguments of deduplicate_dataset (with defaults)

    ds: Dataset,
    column,
    threshold=0.8, 
    num_perm=256,
    batch_size=10_000,
    num_proc=1 if os.name == "nt" else os.cpu_count(),
    ngram_size=5,
    min_length=5,
    bands_rows=(None,None)
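The relationship between `threshold`, `num_perm`, and `bands_rows` can be illustrated with the standard LSH banding formula. This is a sketch of the general technique, not the repo's code: the function names are hypothetical, and how the repo picks bands and rows when `bands_rows=(None,None)` is not stated here (presumably they are derived from `threshold` and `num_perm`, as datasketch does, but that is an assumption).

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents with the given Jaccard similarity
    share at least one identical band, i.e. become a candidate pair."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands, rows):
    """Similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


# Example split of num_perm=256 into 32 bands of 8 rows (bands * rows <= num_perm).
for s in (0.3, 0.5, 0.8, 0.9):
    print(f"similarity={s:.1f} -> P(candidate)={candidate_probability(s, 32, 8):.4f}")
```

More rows per band make the filter stricter (fewer false positives), while more bands make it more permissive (fewer false negatives); a manual `bands_rows` setting trades these off against each other.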
