Skip to content
This repository was archived by the owner on Nov 1, 2024. It is now read-only.
This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Inquiry on Generating Embeddings for Long Documents using 125M OPT Model #1

@zhchbin

Description

@zhchbin

Hi,

Appreciate your sharing of the SemDedup implementation. After reading the source code and paper, I've encountered an issue concerning the generation of embeddings.

To perform SemDeDup, we pass documents through the open-sourced pre-trained 125M OPT model and save the last layer embedding for the last token in the document.

In the compute_pretrained_embeddings.py, the model output encodings is save to emd_memmap directly. Could you please provide more information about the implementation of last layer embedding for the last token?

with torch.no_grad():
    for data_batch, paths_batch, batch_indices in tqdm(dataloader):
        data_batch = data_batch.to(device)
        encodings = model(data_batch)
        emd_memmap[batch_indices] = normalize(encodings, dim=1)

Additionally, how do you handle lengthy documents? Currently, I presume there are two approaches:

  1. Truncate the document to a maximum sequence length of 2048 during tokenization, which aligns with the OPT model's sequence limit.
  2. Divide the document into smaller chunks and feed them into the model, then compute the mean embeddings.

In advance, I appreciate your assistance!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions