This repository was archived by the owner on Nov 1, 2024. It is now read-only.
Appreciate your sharing of the SemDedup implementation. After reading the source code and paper, I've encountered an issue concerning the generation of embeddings.
The paper states: "To perform SemDeDup, we pass documents through the open-sourced pre-trained 125M OPT model and save the last layer embedding for the last token in the document."
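To make the question concrete, here is a minimal sketch of what I understand "last layer embedding for the last token" to mean. It is only an assumption, not the repository's code: the function name `last_token_embedding` is hypothetical, the inputs stand in for a model's last hidden states and attention mask, and right-padding is assumed. NumPy is used instead of the actual OPT model so the sketch stays self-contained.

```python
import numpy as np


def last_token_embedding(last_hidden_state, attention_mask):
    """Return the hidden state of the last non-padding token per sequence.

    last_hidden_state: array of shape (batch, seq_len, dim), assumed to be
        the final-layer output of the language model.
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens
        and 0 for padding. Right-padding is assumed (an assumption, not
        something confirmed by the repository).
    """
    hidden = np.asarray(last_hidden_state)
    mask = np.asarray(attention_mask)

    # Number of real tokens in each sequence.
    lengths = mask.sum(axis=1)
    if np.any(lengths == 0):
        raise ValueError("every sequence needs at least one real token")

    # With right-padding, the last real token sits at index length - 1.
    last_idx = lengths - 1
    batch_idx = np.arange(hidden.shape[0])
    return hidden[batch_idx, last_idx]


# Toy input: 2 sequences, 4 positions, 3-dimensional hidden states.
hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens, 1 padding token
                 [1, 1, 1, 1]])  # 4 real tokens
emb = last_token_embedding(hidden, mask)
print(emb.shape)
```

With left-padding, the last real token would instead always be at the final position, so the indexing step would differ.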
In compute_pretrained_embeddings.py, the model output `encodings` is saved to `emd_memmap` directly. Could you please provide more information about how the last-layer embedding for the last token is implemented?

Additionally, how do you handle lengthy documents? Currently, I presume there are two approaches:
Thank you in advance for your assistance!