Inquiry on Generating Embeddings for Long Documents using 125M OPT Model

Hi,

Thank you for sharing the SemDeDup implementation. After reading the source code and the paper, I have a question about how the embeddings are generated.

> To perform SemDeDup, we pass documents through the open-sourced pre-trained 125M OPT model and save the last layer embedding for the last token in the document. 

In `compute_pretrained_embeddings.py`, the model output `encodings` is saved to `emd_memmap` directly. Could you please explain how the "last layer embedding for the last token" is obtained?

```python
with torch.no_grad():
    for data_batch, paths_batch, batch_indices in tqdm(dataloader):
        data_batch = data_batch.to(device)
        encodings = model(data_batch)
        emd_memmap[batch_indices] = normalize(encodings, dim=1)
```
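To make the question concrete, here is roughly what I expected the last-token step to look like. This is only a minimal sketch of my own reading of the paper, not code from the repository: the function name `last_token_embedding` is hypothetical, and it assumes the model exposes last-layer hidden states of shape `(batch, seq_len, dim)` together with a right-padded attention mask.

```python
import torch
from torch.nn.functional import normalize


def last_token_embedding(hidden_states: torch.Tensor,
                         attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the last-layer hidden state of the last real (non-padding) token.

    hidden_states:  (batch, seq_len, dim) last-layer outputs.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Assumes right padding (an assumption, not confirmed by the repository).
    """
    # Index of the last real token in each sequence.
    last_idx = attention_mask.long().sum(dim=1) - 1            # (batch,)
    batch_idx = torch.arange(hidden_states.size(0),
                             device=hidden_states.device)
    pooled = hidden_states[batch_idx, last_idx]                 # (batch, dim)
    # L2-normalize, as the quoted snippet does before writing to emd_memmap.
    return normalize(pooled, dim=1)


# Toy check: random tensors stand in for real model output.
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])
emb = last_token_embedding(hidden, mask)
```

If `model` already performs this selection internally and returns a `(batch, dim)` tensor, then the snippet above is consistent with the paper and my question is only where that happens.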

Additionally, how do you handle documents that exceed the model's context length? I can think of two approaches:
1. Truncate the document to a maximum sequence length of 2048 during tokenization, which aligns with the OPT model's sequence limit.
2. Divide the document into smaller chunks, feed each chunk into the model, and average the resulting chunk embeddings.
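The two approaches could be sketched as follows. This is only an illustration of what I mean, not the repository's behavior: `truncate`, `split_into_chunks`, `chunked_mean_embedding`, and the `embed_fn` callback (which maps a list of token ids to a single `(dim,)` embedding) are hypothetical names, and the toy embedding function stands in for the real model.

```python
from typing import Callable, List

import torch
from torch.nn.functional import normalize

MAX_LEN = 2048  # sequence limit mentioned above for OPT


def truncate(token_ids: List[int], max_len: int = MAX_LEN) -> List[int]:
    """Approach 1: keep only the first max_len tokens."""
    return token_ids[:max_len]


def split_into_chunks(token_ids: List[int],
                      max_len: int = MAX_LEN) -> List[List[int]]:
    """Split token ids into consecutive, non-overlapping chunks."""
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), max_len)]


def chunked_mean_embedding(token_ids: List[int],
                           embed_fn: Callable[[List[int]], torch.Tensor],
                           max_len: int = MAX_LEN) -> torch.Tensor:
    """Approach 2: embed each chunk, average, then L2-normalize."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    chunks = split_into_chunks(token_ids, max_len)
    chunk_embs = torch.stack([embed_fn(c) for c in chunks])  # (n_chunks, dim)
    return normalize(chunk_embs.mean(dim=0), dim=0)


# Toy embedding function standing in for the real model (hypothetical).
def toy_embed(ids: List[int]) -> torch.Tensor:
    return torch.tensor([float(len(ids)), 1.0])


doc = list(range(5000))           # pretend these are 5000 token ids
truncated = truncate(doc)
chunks = split_into_chunks(doc)
doc_emb = chunked_mean_embedding(doc, toy_embed)
```

Note that the unweighted mean gives a short final chunk the same weight as a full one; weighting by chunk length would be another option.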

Thank you in advance for your help!
